Hi, I am in desperate need of help with a "screen scraping" script. I'm currently using a web service that returns information in XML. This service will be switched off shortly, and I have to revert to the "old technique" of reading information off an HTML page (all legal under the terms and conditions of the information provider).

I think I know what I want to achieve, just don't know how... Can you point me to a tutorial or some examples on how to do it? This is what I understand I need to do:

  1. read the page from the external site (cURL?)
  2. dump the content to a "buffer" (?)
  3. extract the text that falls between <pre>...</pre> tags as rows
  4. remove column headers (always the same rows, spread around the file)
  5. break individual rows into fields (space as delimiter)
  6. join with another bit of info from an external txt file, based on a common name
  7. write the lot into a MySQL database
  8. repeat the process at 20 min past each hour (cgi script?)

Phew, and it used to be so simple... I'm running my site on PHP 4.4.4. My knowledge of PHP is very limited.

ANY HELP MUCH APPRECIATED!

    I was hoping for a pointer to a good tutorial since I can't find anything that would go "all the way"... I'll struggle bit by bit, if you can assist, much appreciated.

    My first hurdle - I can read the file but can't extract what's between <pre></pre> tags... here is what I've got so far:

    <?php
    // function to extract what's between the <pre></pre> tags
    function find($start, $end, $data) 
    { 
        if(false !== ($pos = strpos(strtolower($data), strtolower($start)))) 
        { 
            if($remove = stristr(($cropped = substr($data, $pos + strlen($start))), $end)) 
            { 
                return trim(str_replace($remove, '', $cropped)); 
            } 
        } 
        return false; 
    } 
    
    // create a new cURL resource
    $ch = curl_init();
    
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, "http://othersite.com/thisfile.txt");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    
    // grab URL and pass it to the browser
    //curl_exec($ch);
    // grab URL and pass it to a variable
    $data = curl_exec($ch);
    
    // close cURL resource, and free up system resources
    curl_close($ch);
    
    // print to browser for check
    echo(find("<pre>", "</pre>", $data));
    
    
    ?> 
    

    The find() function doesn't seem to work: I get everything written to the browser. But it works if I just read a local file:

    <?
    function find($start, $end, $data) 
    { 
        if(false !== ($pos = strpos(strtolower($data), strtolower($start)))) 
        { 
            if($remove = stristr(($cropped = substr($data, $pos + strlen($start))), $end)) 
            { 
                return trim(str_replace($remove, '', $cropped)); 
            } 
        } 
        return false; 
    } 
    
    // set file to read
    $file = 'localdata.txt' or die('Could not open file!'); 
    // open file 
    $fh = fopen($file, 'r') or die('Could not open file!'); 
    // read file contents 
    $data = fread($fh, filesize($file)) or die('Could not read file!'); 
    // close file 
    fclose($fh); 
    
    
    echo(find("<pre>", "</pre>", $data));
    
    
    ?>
    

    Any suggestions? Thanks in advance.

      #3) use explode(), using <pre> as the splitting point. Then set a variable equal to the $array_name[1] to get all data after the <pre>. Then explode() it again, using </pre> as the splitter. Set variable equal to $array_name[0] to get the data before the </pre>

      #4) How are header lines identified? Depending on how your data is laid out, you might be able to be sneaky and do an explode(), based on the headers as separators, followed by an implode() of the array...

      #5) Sounds like another explode() job.

      #6) This is actually 2 parts. First, open and read a file. I suggest you do a search on fread() to get a feel for what commands you are going to need to do this.

      Second is concatenation, which basically means using a period between items to join them together. Example: echo "I want a " . $present; will concatenate a text phrase with a variable called $present.

      #7) Somewhere in my posts, I wrote a basic MySQL/PHP "cheat sheet". If you can locate that, it will probably take care of this one...

      #8) http://www.dwalker.co.uk/phpjobscheduler/
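
A note on the "everything is written to the browser" symptom from the earlier post: by default curl_exec() prints the response itself and returns true, so $data never holds the page. Setting CURLOPT_RETURNTRANSFER makes it return the body as a string instead. The sketch below combines that with the explode() idea from #3. fetch_page() and extract_between() are hypothetical helper names, the sample HTML is invented, and unlike the original find() this version is case-sensitive.

```php
<?php
// Hypothetical helper: fetch a URL and return its body as a string,
// or false on failure.
function fetch_page($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Without this option curl_exec() echoes the response and returns true.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// Hypothetical helper: return the text between the first $start and the
// following $end, or false if either marker is missing.
function extract_between($start, $end, $data)
{
    $after_start = explode($start, $data, 2);
    if (count($after_start) < 2) {
        return false;
    }
    $before_end = explode($end, $after_start[1], 2);
    if (count($before_end) < 2) {
        return false;
    }
    return trim($before_end[0]);
}

// Usage with an inline sample instead of a live request:
$sample = "<html><body><pre>\nName1 1 2 3\nName2 4 5 6\n</pre></body></html>";
$rows = explode("\n", extract_between("<pre>", "</pre>", $sample));
print_r($rows);
```

With a live page, the call would look like extract_between("<pre>", "</pre>", fetch_page($url)), after checking that fetch_page() did not return false.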

        Hi jkurrle, thanks for your suggestions! Indeed, explode() is a handy function, but I could use it only once. In the end I settled for extracting the info with the find() function defined in my first post, then explode() to turn it into an array of rows, then removing headers with array_slice() and merging the individual parts with array_merge(), as per this example:

        $text1=  find("ID2", "ID3", $data);
        $rows1=explode("\n",$text1); 
        $output1 = array_slice($rows1, 22);
        
        $text2=  find("ID3", "ID4", $data);
        $rows2=explode("\n",$text2); 
        $output2 = array_slice($rows2, 22);
        
        $result = array_merge($output1, $output2);
        

        Now I need to split it further to output to a MySQL database...

        My array is of this pattern:

        [0] Name1 val1 val2 val3
        [1] Name2 val1 val2 val3

        I tried explode() with space as a delimiter but it doesn't work on arrays...

        I have yet to find your cheat sheet; I hope there will be some pointers there. I much appreciate your assistance!

        Ed

          ad_aus,

          If you set a variable equal to the array item, you can explode again on the variable. Example:

          $counter=0;
          $loader_array=array();
          foreach ($result as $result_item)
          {
              $new_array=explode(" ",$result_item);
              $loader_array[$counter]["name"]=$new_array[0];
              $loader_array[$counter]["val1"]=$new_array[1];
              $loader_array[$counter]["val2"]=$new_array[2];
              $loader_array[$counter]["val3"]=$new_array[3];
              $counter=$counter+1;
          }
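
One caveat with explode(" ", ...): if the columns are padded with more than one space, every extra space produces an empty array element and the values shift position. A possible alternative, sketched here on invented sample rows, is preg_split() on runs of whitespace. It still assumes that names contain no spaces.

```php
<?php
// Sample rows are invented for illustration; note the uneven spacing.
$result = array(
    "Name1   10  20 30",
    "Name2 40    50 60",
);

$loader_array = array();
foreach ($result as $result_item) {
    // Split on one or more whitespace characters, dropping empty pieces.
    $parts = preg_split('/\s+/', trim($result_item), -1, PREG_SPLIT_NO_EMPTY);
    if (count($parts) < 4) {
        continue; // skip rows that do not have all four columns
    }
    $loader_array[] = array(
        "name" => $parts[0],
        "val1" => $parts[1],
        "val2" => $parts[2],
        "val3" => $parts[3],
    );
}
print_r($loader_array);
```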

            Thanks guys, very handy info! I really appreciate it.

            It turns out I was unable to use either regex or explode(); rather, I had to revert to the substr() function to extract fixed-width strings. Here is an example:

            $counter=0;
            $loader_array=array();
            foreach ($result as $result_item)
            {
                //$new_array=explode(" ",$result_item); - can't use due to irregular naming conventions

                $new_array[0]=substr($result_item, 0, 10);
                $new_array[1]=substr($result_item, 12, 4);
                $new_array[2]=substr($result_item, 17, 5);
                $new_array[3]=substr($result_item, 23, 2);

                $loader_array[$counter]["name"]=$new_array[0];
                $loader_array[$counter]["var1"]=$new_array[1];
                $loader_array[$counter]["var2"]=$new_array[2];
                $loader_array[$counter]["var3"]=$new_array[3];
                $counter=$counter+1;
            }
            
            

            The next challenge is to do some more splits/reformats of the $new_array variables, e.g.:
            - remove extra spaces (e.g. "name " to "name")
            - convert a string to a number (e.g. "3500" to -35.00)

            Only then will I have the array in a format ready for import into MySQL. You've been most helpful so far, so any further pointers are much appreciated.

            Cheers!

              Removing the extra spaces is easy with trim().

              You could always use sprintf() or printf() to get the numbers in the format you want 😉
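
To make the two hints concrete: trim() strips the padding from a fixed-width field, and a cast plus a division handles an implied decimal point. This sketch assumes, based only on the "3500" to -35.00 example in the question, that the value has two implied decimals and should always be negated. The helper name and the sample row are hypothetical. sprintf() is only needed for display, since the plain number can be stored as it is.

```php
<?php
// Hypothetical helper: "3500" becomes -35 (two implied decimals, sign flipped).
function to_negative_decimal($raw)
{
    return -((int) trim($raw)) / 100;
}

// Invented fixed-width row matching the offsets from the substr() example:
// the name is padded to 12 characters, so the first value starts at offset 12.
$result_item = str_pad("Name1", 12) . "3500 12345 ab";

$name = trim(substr($result_item, 0, 10));
$var1 = to_negative_decimal(substr($result_item, 12, 4));

// Format with two decimals for display only.
echo $name . " " . sprintf("%.2f", $var1) . "\n";
```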

                5 days later

                I put this puzzle away only for a moment and... a week is gone!

                Thanks bpat, I'll explore these functions further. In the meantime, a question on extracting specific info from the array (as per the code above).

                If I want to loop through only a given set of names, not the entire list, how do I set up the foreach() loop?

                foreach ($loader_array as $counter) will loop through the whole list, but I want to limit it to, say, ["name"] = (name1, name5, name45, etc...);

                Any suggestions much appreciated. Thanks!

                  Use a for() loop instead. Just do something like:

                  for($i=0, $max=count($array); $i<$max; $i++)
                  {
                    // Reference "name" like so: $array[$i]['name']
                    // e.g.:
                    echo $array[$i]['name'];
                  }

                    Thanks again for pointing me in the right direction!

                    As a follow-up, I only want to output records matching specific names. If there are a few, I can use if() with an "or" condition, but what should I do if there are many? Can I somehow refer to an array of names in the if() condition?

                    for($i=0, $max=count($loader_array); $i<$max; $i++)
                    { 
                    
                    if($loader_array[$i]["name"]=="name1" || $loader_array[$i]["name"]=="name2"){
                    
                    //  output the right stuff...
                    
                    }
                    }
                    

                    I can't get my head around this array stuff and I could not find relevant examples either. Again, any pointers much appreciated!

                      You would want to specify an array of matching names:

                      $allowed = array('name1','name2');

                      Then use the in_array() function to see if $loader_array[$i]['name'] is in the array $allowed 😉
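
Put together with the for() loop from earlier in the thread, that could look roughly like this. The sample data and the $matches variable are made up for illustration, and the trim() call is an extra precaution in case the names still carry padding from the fixed-width extraction.

```php
<?php
// Invented sample data in the same shape as $loader_array in the thread.
$loader_array = array(
    array("name" => "name1", "val1" => "10"),
    array("name" => "name2", "val1" => "20"),
    array("name" => "name3", "val1" => "30"),
);

// Names we want to keep.
$allowed = array("name1", "name3");

$matches = array();
for ($i = 0, $max = count($loader_array); $i < $max; $i++) {
    // Keep the record only if its name is in the allowed list.
    if (in_array(trim($loader_array[$i]["name"]), $allowed)) {
        $matches[] = $loader_array[$i];
    }
}
print_r($matches);
```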

                        Easy 🆒

                        Thanks bpat, I'm rolling again! I looked at in_array() function before but totally missed its relevance here... now to some "trimming and pruning" and one part of the puzzle will be solved (writing into a database yet to come...).

                        In the meantime, another question: how does the server handle requests for a PHP file which is being opened (fopen) and written to? Does it wait for the file to be freed and then respond, or does it throw an error?

                          PHP will process it. If it can't open the file, it will raise an error in your script, and if you catch that error, the script will keep going. Otherwise a warning is printed to the screen and the script keeps rolling.

                          So PHP will wait for the current command to complete before moving on to the next. It's very much procedural in that it executes in a top-down fashion, one line at a time.
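
On the concurrency side of the question: as far as I know, fopen() by itself neither waits for nor blocks other processes, so a request that reads a file while another script is rewriting it can see partial content. flock() offers advisory locking, which only helps if every script touching the file uses it. A minimal sketch follows, with a temp file standing in for the real data file. It assumes a PHP version newer than 4.4.4, since the "c" open mode, sys_get_temp_dir() and stream_get_contents() are not available there.

```php
<?php
// Temp file used only so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), "scrape");

// Writer: open without truncating, then take an exclusive lock
// before replacing the contents.
$fh = fopen($path, "c");
if ($fh === false) {
    die("Could not open file for writing");
}
if (flock($fh, LOCK_EX)) {
    ftruncate($fh, 0);
    fwrite($fh, "Name1 10 20 30\n");
    fflush($fh);           // push data to disk before releasing the lock
    flock($fh, LOCK_UN);
}
fclose($fh);

// Reader: a shared lock waits until no writer holds the exclusive lock.
$data = "";
$fh = fopen($path, "r");
if ($fh !== false) {
    if (flock($fh, LOCK_SH)) {
        $data = stream_get_contents($fh);
        flock($fh, LOCK_UN);
    }
    fclose($fh);
}
echo $data;

unlink($path); // clean up the temp file
```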
