Link extraction from HTML page

multimediavt

Hello,

I am working on an automated link generator for a men's basketball site. Basically, there are stats that exist and can be linked to from a remote page. I want to read that page in (as a string), locate the correct stat URL for the game being sought from that page, and parse it out into another string.

I'm having a devil of a time trying to figure out exactly how to attack it. I know there are multiple ways of doing this, but the Keep It Simple Stupid rule has to apply, as I won't be the primary keeper of this code once it's handed over to the client. So, I was looking at using the substr, stripos, and strlen functions to whittle the massive full-page string down to the piece I need.

I do have the following pieces of data, and know the structure of the page being parsed.

$gameDate - the date of the game I need the stats for and how it appears in the page
$linkText - the text of the link the anchor tag is tied to

Unfortunately, the link itself (i.e., <a href='link.html'>) varies in length so I will need to parse out everything between the quotes. That's not too hard, but figuring out how to isolate that specific link on a page full of them is.

Some thoughts on my direction and any example code would be appreciated. I'm sure someone has done something like this before.

Jason

multimediavt

Never mind... I figured it out. Here's the script if anyone wants it.

<?php

// set the game we're looking for; this will be passed by the update page
$gameDate = '12/04/05';

// this retrieves the page's raw HTML and puts it into a string variable; trim only strips whitespace (including linebreaks) from the start and end of the string
$statPage = trim(file_get_contents('http://www.hokiesports.com/mbasketball/stats/index.html'));

// this defines the search string we're looking for in the text
$linkText = 'Box score';

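// strpos finds the first occurrence of the game date within the page; the second strpos then finds the first link text after that point
$offset = strpos($statPage, $gameDate);
if ($offset === false) {
    exit('Game date not found on the stats page.');
}
$where = strpos($statPage, $linkText, $offset);
if ($where === false) {
    exit('Link text not found after the game date.');
}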

// this backs up 24 characters from the link text in order to grab the URL that precedes it with the next function, substr; the window is fixed at 24 characters, so a longer URL would be cut off
$where = $where - 24;
$what = substr($statPage, $where, 24);

// this splits the substring on the double quotes in the link code, then prints each piece that contains ".html" (the link)
$linkArray = explode("\"", $what);
foreach ($linkArray as $value) {
    if (strpos($value, ".html") !== false) {
        echo $value . "<br />";
    }
}
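
Since the link length varies, here is a rough sketch of a variation that doesn't depend on the fixed 24-character window in the script above. It searches backwards from the link text for the nearest href= and reads everything up to the matching closing quote. The function name findStatLink and the sample HTML are made up for illustration, and it assumes the href attribute sits between the game date and the link text and is wrapped in single or double quotes.

```php
<?php

// Hypothetical helper: returns the href of the first link whose text is
// $linkText and that appears after $gameDate, or null if nothing matches.
function findStatLink(string $html, string $gameDate, string $linkText): ?string
{
    // Find the game date so only links after it are considered.
    $datePos = strpos($html, $gameDate);
    if ($datePos === false) {
        return null;
    }

    // Find the first occurrence of the link text after the date.
    $textPos = strpos($html, $linkText, $datePos);
    if ($textPos === false) {
        return null;
    }

    // Keep only the markup between the date and the link text.
    $before = substr($html, $datePos, $textPos - $datePos);

    // Search backwards (case-insensitively) for the nearest "href=".
    $hrefPos = strripos($before, 'href=');
    if ($hrefPos === false) {
        return null;
    }

    // The character right after "href=" should be the opening quote.
    $quote = $before[$hrefPos + 5] ?? '';
    if ($quote !== '"' && $quote !== "'") {
        return null;
    }

    // Read everything up to the matching closing quote, whatever its length.
    $start = $hrefPos + 6;
    $end = strpos($before, $quote, $start);
    if ($end === false) {
        return null;
    }

    return substr($before, $start, $end - $start);
}

// Made-up sample markup standing in for the real stats page.
$sample = '<tr><td>12/03/05</td><td><a href="g1.html">Box score</a></td></tr>'
        . '<tr><td>12/04/05</td><td><a href=\'stats/game-two.html\'>Box score</a></td></tr>';

echo findStatLink($sample, '12/04/05', 'Box score') . PHP_EOL; // prints: stats/game-two.html
```

In the real script you would pass $statPage, $gameDate, and $linkText instead of the sample string, and check for null before echoing the result.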