[RESOLVED] In need of a Function

Assorro

I'm in the middle of rewriting an old ASP script in PHP, and I've hit a part that is beyond me and need some assistance.

What the script does is scrape a table from a webpage and store it in a variable, but I only want certain records from that table displayed.

I've set up an example to make it easier to understand. Here's the example table.

Example Table After Scraping

You will note that it's a student table and that only 5 of the 10 records shown have an image of a football. Those are the records that I want to display. Specifically, I want the name located between the two images, so that the final result looks like this.

Final Result

If someone could provide me with at least a starting point to research the function required to do this, it would be much appreciated. Here's what I am using to get the table itself. Thanks.

$scrape_table = GetBetween(file_get_contents('http://users6.nofeehost.com/truangel/example.html'), '<h2>Student List</h2>', '<h3>School News</h3>');

bradgrafelman

One method of scraping information from HTML is by using regular expressions, e.g.:

$pattern = '/<img [^>]*src=["\']image1.jpg["\'][^>]*>([^>]+)<img [^>]*src=["\']football.jpg["\']/i';
preg_match_all($pattern, $data, $matches);

where the above code would yield an array of $matches[1] like so:

[1] => Array
(
    [0] => Name Number 1
    [1] => Name Number 2
    [2] => Name Number 4
    [3] => Name Number 5
    [4] => Name Number 8
)

Benefits are that it's simple (provided you know regular expression syntax, of course :p) and compact: one function call and the data is extracted. Drawbacks are that it's simple and compact. 🙂 It's nothing more than flexible pattern matching; if the HTML changes very much, the regexp pattern can break and will need to be updated.

Another approach is to use DOM to actually parse the HTML document and traverse its structure. This does have the benefit of sometimes being more flexible than regexp (although it still requires you to know something about the structure of the data that you're looking for), but it takes a little more code.
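For illustration, here is a minimal sketch of the DOM approach. It assumes each table cell holds an image1.jpg image, the name, and (for some rows) a football.jpg image; the $html sample is made up for the example and is not the real page, so the XPath query would likely need adjusting.

```php
<?php
// Hypothetical sample markup; the real page's structure may differ.
$html = '<table>'
      . '<tr><td><img src="image1.jpg">Name Number 1<img src="football.jpg"></td></tr>'
      . '<tr><td><img src="image1.jpg">Name Number 2</td></tr>'
      . '<tr><td><img src="image1.jpg">Name Number 3<img src="football.jpg"></td></tr>'
      . '</table>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is often sloppy.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every cell that directly contains a football image.
$cells = $xpath->query('//td[img[@src="football.jpg"]]');

$names = array();
foreach ($cells as $cell) {
    // textContent is the cell's text with all tags stripped out.
    $names[] = trim($cell->textContent);
}

print_r($names);
```

The query only depends on which cell contains the football image, so small changes in attribute order or whitespace should not break it the way they can break a regexp.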

Assorro

That's a good start and I thank you for the speedy response, but I think I'm in over my head with this one. This is what I tried in order to get the result.

$pattern = '/<img [^>]*src=["\']image1.jpg["\'][^>]*>([^>]+)<img [^>]*src=["\']football.jpg["\']/i';
$result = preg_match_all($pattern, $scrape_table, $matches); 
echo $result;

Which returns the value 5, but what I need is the list of names between the two images so I can insert those names into a database. I believe it will need a loop.

The table won't change so this seems to be the ideal solution but I'm stuck as to how to return the data properly.

bradgrafelman

Read the manual for preg_match_all; as I noted above, the actual data you want is in an array, namely $matches[1] (where $matches is a multi-dimensional array).
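To make that concrete, here is a small sketch using the pattern from earlier in the thread. The $data string is a made-up sample shaped like the example table, not the real page. The return value is only the number of matches; the names themselves land in $matches[1], and the full matched text in $matches[0].

```php
<?php
// Hypothetical sample data shaped like the example table.
$data = '<img src="image1.jpg">Name Number 1<img src="football.jpg">'
      . '<img src="image1.jpg">Name Number 2<img src="other.jpg">'
      . '<img src="image1.jpg">Name Number 3<img src="football.jpg">';

$pattern = '/<img [^>]*src=["\']image1.jpg["\'][^>]*>([^>]+)<img [^>]*src=["\']football.jpg["\']/i';

// $count is the number of full matches, not the matched text.
$count = preg_match_all($pattern, $data, $matches);

echo "Matches found: $count\n";

// $matches[0] holds each full match; $matches[1] holds what the
// first (and only) parenthesized group captured in each match.
print_r($matches[1]);
```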

Assorro

Now that made sense. Thanks for the help. So all I needed was this:

$pattern = '/<img [^>]*src=["\']image1.jpg["\'][^>]*>([^>]+)<img [^>]*src=["\']football.jpg["\']/i';

$result = preg_match_all($pattern, $scrape_table, $matches); 

// Use < rather than <= so the loop stops at the last valid index.
for ($counter = 0; $counter < $result; $counter += 1) {
    echo $matches[1][$counter];
    echo "<br />";
}

I am not familiar with regular expressions at all, and to finish this I would ask to be shown how to write the following as a regular expression. They're the two pieces that are actually on either side of the data I need to grab. Thanks again.

class="smallflag" alt="Flag" title="Flag"> data I need </a><img src="/images/football.jpg"

bradgrafelman

Note that if you simply want to display all entries in the array, you could do something like this:

echo implode("<br />\n", $matches[1]);

However, if you wanted to do some processing on each entry, it'd probably be simpler to use a foreach loop instead of a for loop, e.g.:

foreach ($matches[1] as $name) {
    echo "<div>$name</div>\n";
    do_something($name);
    // etc.
}

EDIT: Also, don't forget to mark this thread resolved (if it is) using the link on the Thread Tools menu above.

Assorro

Aah, that's even better. Thanks for that as well. I will have some processing to do as well.

I'm still stuck, however, on how to write the following as a regular expression. Sorry for being a nag, but regular expressions confuse me to no end. The name I need is between the two.

class="smallflag" alt="Flag" title="Flag"> the data I need is here </a><img src="/images/football.jpg"

bradgrafelman

In other words, there's an <A> element in between the two images (making the name a link)?

It's really sounding like you might want to start looking at some DOM tutorials, but for a regular expression I think this should work:

$pattern = '/<img [^>]*src=["\']image1.jpg["\'][^>]*><a[^>]*>([^<]+)</a><img [^>]*src=["\']football.jpg["\']/i';

EDIT: Also, for a great regular expressions reference, I always keep this website handy.

Assorro

Well here's the expression I can't seem to get to function.

$pattern = '/title=["\']Flag["\'][^>]>([^>]+)</a><img [^>]*src=["\']/images/football.gif["\']/i';

and here's the error I keep getting.

Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'a' in /home/assorro/conquestofabsolution.com/test1.php

I've tried several different things but to be honest, I'm a fish out of water where regular expressions are concerned.

The code above is different than the original code I posted.

bradgrafelman

That would be my fault, since I didn't escape (or change) the character I used for a delimiter ('/').

Try this:

$pattern = '@<img [^>]*src=["\']image1.jpg["\'][^>]*><a[^>]*>([^<]+)</a><img [^>]*src=["\']football.jpg["\']@i';
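As a rough sketch of the two ways around the delimiter problem: either keep '/' as the delimiter and escape every literal slash in the pattern, or pick a delimiter that never appears in the pattern. The $data string below is a made-up sample, and the patterns are shortened versions of the one above.

```php
<?php
// Hypothetical sample data with an <a> element around the name.
$data = '<img src="image1.jpg"><a href="#">Name Number 1</a><img src="football.jpg">';

// Option 1: keep "/" as the delimiter and escape the slash in "</a>".
$escaped = '/<a[^>]*>([^<]+)<\/a><img [^>]*src=["\']football.jpg["\']/i';

// Option 2: use a delimiter that does not occur in the pattern, such as "@".
$alternate = '@<a[^>]*>([^<]+)</a><img [^>]*src=["\']football.jpg["\']@i';

preg_match_all($escaped, $data, $first);
preg_match_all($alternate, $data, $second);

// Both patterns capture the same name.
print_r($first[1]);
print_r($second[1]);
```

With an unescaped '/', PCRE treats the slash in '</a>' as the end of the pattern and reads the following 'a' as a modifier, which is where the "Unknown modifier 'a'" warning comes from.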

Assorro

Thanks for all your help, Brad, but the HTML from the example table isn't actually useful to me since the actual HTML is different.

The actual html involved that needs to be in the pattern is as follows. The names I need to grab lie where "data I need is here" is written. Thanks again for all your assistance.

class="smallimage" alt="Football" title="Football"> data I need is here <img src="/images/football.gif

bradgrafelman

What type of tag is that at the beginning that got cut off? There's a lot of room for programmer's preference here - I'm just picking some random identifying part of the HTML. Try taking the example patterns I gave above and making your own based on what the data actually looks like. If it doesn't work, post what pattern you tried as well as what data you tried to match it against.

Assorro

I've run through 25 pages and tried 25 different solutions and it still escapes me. I've come to believe that regular expression just isn't my thing and I can't move forward until I've resolved this issue unfortunately.

The name is encased between two images.

bradgrafelman

Well here's a basic pattern that will match what you want based on the data you gave:

$pattern = '@class="smallimage" alt="Football" title="Football"> (.*?) <img src="/images/football.gif@';

You could also drop the capture group by complicating the pattern (:p) a bit with lookarounds, so that the names are the full matches in $matches[0] (note that $matches is still a multi-dimensional array, just with a single entry):

$pattern = '@(?<=class="smallimage" alt="Football" title="Football"> ).*?(?= <img src="/images/football.gif)@';
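Here is a quick side-by-side sketch of the two patterns. The $data string is a guess at what the surrounding markup might look like (the opening tag was cut off in the snippet, so the <img at the start is an assumption).

```php
<?php
// Hypothetical sample data; the leading "<img" is assumed, since the
// real opening tag was cut off in the posted snippet.
$data = '<img class="smallimage" alt="Football" title="Football"> Name One <img src="/images/football.gif">'
      . '<img class="smallimage" alt="Football" title="Football"> Name Two <img src="/images/football.gif">';

// Capture-group version: names end up in $a[1].
$capture = '@class="smallimage" alt="Football" title="Football"> (.*?) <img src="/images/football.gif@';

// Lookaround version: the surrounding text is only asserted, not
// consumed, so each full match in $b[0] is just the name.
$lookaround = '@(?<=class="smallimage" alt="Football" title="Football"> ).*?(?= <img src="/images/football.gif)@';

preg_match_all($capture, $data, $a);
preg_match_all($lookaround, $data, $b);

print_r($a[1]);
print_r($b[0]);
```

Either way works; the lookaround version just saves you from picking the right sub-array when there is only one thing to extract.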

Assorro

That did the trick.

This entire experience has opened my eyes to an entirely different realm within program development, and I have bookmarked the resources you have listed and fully intend to research this more thoroughly.

Thanks heaps for the assistance. This has been marked as resolved.