Hi there everyone!

I'm helping a guy who accidentally lost his informational website while hospitalized. The site stored about 35,000 links, organized by category, pointing to technical sites and pages.

I've got the entire site snapshot from the Wayback Machine and have begun writing a script to go through the files, but I'm stuck at the point where I need to extract the links. My main issue is that the original site creator was a minimalist when it came to tags, classes and whatnot, so content I don't want often looks just like the content I do want. Here's a representation of what I'll be working with:

<tr>
<td width="/100%.jpg" align=left style="font-weight: bold; font-size: 14px; color: #CD3301; font-family: Arial;">Select A Category:</td>
</tr>
<tr bgcolor="#ebebeb">
<td valign=top style="font-family: arial; font-size: 12px;" width="/100%.jpg">
<a href="index.php?index=1021">
Lubricant Leaking from Rear Hub Seals Full Float Hub w/10.25" FULL-FLOATER Rear Axle TSB 94-19-24 for 86-94 F-250, F-350</a>
</td>
</tr>
</table>
<table width="/100%.jpg" cellpadding=8 cellspacing=0>
<tr>
<td width="/100%.jpg" align=left style="font-weight: bold; font-size: 14px; color: #CD3301; font-family: Arial;">Select A Link:</td>
</tr>
<tr>
<td width="/100%.jpg" height=2 bgcolor="#ffffff" align=left></td>
</tr>
<tr bgcolor="#ebebeb">
<td align=left style="font-size: 12px; font-family: Arial;" valign=top width="/100%.jpg">
<a href="http://broncozone.com/topic/20082-help-me-need-to-know-specs-88-bronco-xlt-58l-351w/">
 Identification Based on VIN, Door Jamb Label, Build Sheet (Ford 999 Report), Paint Color Code, VECI Label, Transmission/Differential Pan & Gasket Sizes/Shapes, etc.; "... made a mistake 15 years ago by telling someone to use the Driver's side label to ID their Rear Differential (axle, pumpkin type, etc.); turned out that a previous owner had swapped a Dana 60 in place of the stock 8.8..."</a>
<br>
Source: by miesk5 at Ford Bronco Zone Forums 
</td>
</tr>
<tr bgcolor="#ffffff">
<td align=left style="font-size: 12px; font-family: Arial;" valign=top width="/100%.jpg">
<a href="http://broncozone.com/topic/21803-new-member/page__gopid__113959#entry113959">
"...Ford built our Broncos & other 4x4 trucks & vans with a numerically lower front gear ratio in the front Dana 44 than the rear so that off-road steering is enhanced. A Bronco built with 3.55 rear ratio would have a 3.54 ration in the front Dana 44; or; 3.08 in the 8.8 & 3.07 in the Dana 44; or 4.11 in the 8.8 & 4.10 in the Dana 44, etc..."; Following was in my MS WORD Notes and the source, Randy's Ring & Pinion has removed it from their current web site; The gear ratio in the front of a four wheel drive has to be different from the front so the front wheels will pull more. There have been many different ratio combinations used in four-wheel drive vehicles, but not so that the front will pull more. Gear manufactures use different ratios for many different reasons. Some of those reasons are: strength, gear life, noise (or lack of it), geometric constraints, or simply because of the tooling they have available. I have seen Ford use a 3.50 ratio in the rear with a 3.54 in the front, or a 4.11 in the rear with a 4.09 in the front. As long as the front and rear ratios are within 1%, the vehicle works just fine on the road, and can even be as different as 2% for off-road use with no side effects. point difference in ratio is equal to 1%. To find the percentage difference in ratios it is necessary to divide, not subtract. In order to find the difference, divide one ratio by the other and look at the numbers to the right of the decimal point to see how far they vary from 1.00. For example: 3.54 ÷ 3.50 = 1.01, or 1%, not 4% different. And likewise 4.11 ÷ 4.09 = 1.005, or only a 1/2% difference. These differences are about the same as a 1/3" variation in front to rear tire height, which probably happens more often than we realize. A difference in the ratio will damage the transfer case. Any extreme difference in front and rear ratios or front and rear tire height will put undue force on the drive train. 
However, any difference will put strain on all parts of the drivetrain. The forces generated from the difference have to travel through the axle assemblies and the driveshafts to get to the transfer case. These excessive forces can just as easily break a front u-joint or rear spider gear as well as parts in the transfer case.</a>
<br>
Source: by miesk5 at Ford Bronco Zone Forums
</td>
</tr>

I need only the links and not the categories. Since he used the same table info for the categories section, I need to retrieve information only after:

<td width="/100%.jpg" align=left style="font-weight: bold; font-size: 14px; color: #CD3301; font-family: Arial;">Select A Link:</td>

Then once I'm there, I need to do the following with a table row:

<tr bgcolor="#ffffff">
<td align=left style="font-size: 12px; font-family: Arial;" valign=top width="/100%.jpg">
<a href="http://broncozone.com/topic/21803-new-member/page__gopid__113959#entry113959">
"...Ford built our Broncos & other 4x4 trucks & vans with a numerically lower front gear ratio in the front Dana 44 than the rear so that off-road steering is enhanced. A Bronco built with 3.55 rear ratio would have a 3.54 ration in the front Dana 44; or; 3.08 in the 8.8 & 3.07 in the Dana 44; or 4.11 in the 8.8 & 4.10 in the Dana 44, etc..."; Following was in my MS WORD Notes and the source, Randy's Ring & Pinion has removed it from their current web site; The gear ratio in the front of a four wheel drive has to be different from the front so the front wheels will pull more. There have been many different ratio combinations used in four-wheel drive vehicles, but not so that the front will pull more. Gear manufactures use different ratios for many different reasons. Some of those reasons are: strength, gear life, noise (or lack of it), geometric constraints, or simply because of the tooling they have available. I have seen Ford use a 3.50 ratio in the rear with a 3.54 in the front, or a 4.11 in the rear with a 4.09 in the front. As long as the front and rear ratios are within 1%, the vehicle works just fine on the road, and can even be as different as 2% for off-road use with no side effects. point difference in ratio is equal to 1%. To find the percentage difference in ratios it is necessary to divide, not subtract. In order to find the difference, divide one ratio by the other and look at the numbers to the right of the decimal point to see how far they vary from 1.00. For example: 3.54 ÷ 3.50 = 1.01, or 1%, not 4% different. And likewise 4.11 ÷ 4.09 = 1.005, or only a 1/2% difference. These differences are about the same as a 1/3" variation in front to rear tire height, which probably happens more often than we realize. A difference in the ratio will damage the transfer case. Any extreme difference in front and rear ratios or front and rear tire height will put undue force on the drive train. 
However, any difference will put strain on all parts of the drivetrain. The forces generated from the difference have to travel through the axle assemblies and the driveshafts to get to the transfer case. These excessive forces can just as easily break a front u-joint or rear spider gear as well as parts in the transfer case.</a>
<br>
Source: by miesk5 at Ford Bronco Zone Forums
</td>
</tr>

The link inside the a href needs to become my $link. The content between the opening and closing a tags needs to become my $title, and the content that follows "Source: by " on that line needs to become my $submitter.

My googling is leading me towards DOM parsers, but there are a lot to choose from, and I'm afraid to tie my horse to the wrong cart and invest a lot of time learning something that isn't suited to what I need.

Could someone suggest a class, function, script, method, etc. for me to begin working toward solving this issue? I really would like to be able to help him but I just don't know in which direction to go and Googling is just offering too many directions.

Thanks for your time!

    Well, PHP has its DOM extension.

      I have used regex for this before. I put this together for your specific problem:

      // this is all of your source that you want to search, get this however you want
      $all_src = file_get_contents("/tmp/foo");
      
      // this first pattern isolates the section of the document from after 'select a link' to the end
      $pattern = '/<td[^>]+>Select A Link:<\/td>(.+)$/ms';
      $matches = NULL;
      $match_count = preg_match_all($pattern, $all_src, $matches);
      if ($match_count != 1) {
          throw new Exception("match count was ". $match_count);
      }
      $src_to_search = $matches[1][0];
      $this->pre($src_to_search);
      
      // now get all the links/data from it:
      $pattern = '/<tr[^>]+bgcolor[^>]+>[^<]+<td[^>]*>[^<]*<a[^>]+href="([^"]+)"[^>]*>([^<]+)<\/a>[^<]*<br>\s*Source: by([^<]+)</msU';
      $matches = NULL;
      $match_count = preg_match_all($pattern, $src_to_search, $matches);
      echo $match_count . " matches found";
      if ($match_count < 1) {
          throw new Exception("No matches found");
      }
      // loop through matches, trimming the whitespace captured around the title and submitter
      $results = array();
      for ($i = 0; $i < $match_count; $i++) {
          $results[$i] = array(
              "link" => $matches[1][$i],
              "title" => trim($matches[2][$i]),
              "submitter" => trim($matches[3][$i])
          );
      }
      
      
      
        NogDog wrote:

        Well, PHP has its DOM extension.

        Yup.

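        $dom = new DOMDocument();
        
        // loadHTML() is an instance method; the @ hides warnings about sloppy markup.
        // crapout() stands in for whatever error handling you prefer.
        if (!@$dom->loadHTML(file_get_contents($somefile))) crapout();
        
        $links = $dom->getElementsByTagName("a");
        
        $urls = array();
        
        foreach ($links as $link) {
           // anchors keep their URL in href, not src
           if ($link->hasAttribute('href')) {
              $urls[] = $link->getAttribute('href');
           }
        }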
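If you want to go a step further with DOM and pull out the link, title and submitter in one pass, something along these lines might work. It's only a rough sketch based on the sample markup posted above: extract_links() is a made-up name, and it assumes the "Source: by" text always sits in the same td as its link and that the cell text is exactly "Select A Link:".

```php
<?php
// Hypothetical helper: returns an array of link/title/submitter rows
// for every anchor that appears after the "Select A Link:" cell.
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them;
    // the archived markup is old and not well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Anchors that are direct children of any <td> that comes later in the
    // document than the "Select A Link:" cell. The category links come
    // earlier, so they are skipped.
    $query = '//td[normalize-space(.)="Select A Link:"]/following::td/a[@href]';

    $results = array();
    foreach ($xpath->query($query) as $a) {
        $submitter = '';

        // Walk the siblings after the <a> until a text node
        // containing "Source: by" turns up.
        for ($node = $a->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType === XML_TEXT_NODE
                && preg_match('/Source: by\s+(.+)/s', $node->nodeValue, $m)) {
                $submitter = trim($m[1]);
                break;
            }
        }

        $results[] = array(
            'link'      => $a->getAttribute('href'),
            'title'     => trim($a->textContent),
            'submitter' => $submitter,
        );
    }

    return $results;
}
```

You'd call it as $results = extract_links(file_get_contents($somefile)); and get back the same link/title/submitter shape as the regex version. Rows that have no "Source: by" line just come back with an empty submitter.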
          sneakyimp wrote:

          I have used regex for this before. I put this together for your specific problem:

          Sneaky, thank you for this. While I'm not savvy enough to write it myself, I can at least understand what I'm looking at with a regex like this. I've spent three days reading through the various DOM documentation I've found and I still don't have a single iota of understanding of it.

          It's throwing the following error when I run it:

          Fatal error: Using $this when not in object context on line 56

          on this code:

          $src_to_search = $matches[1][0];
          $this->pre($src_to_search); <-- Line 56

          In my Google searches, I'm finding a lot of SE posts stating that I need to declare something as public, but I'm not sure how. I tried "public $src_to_search", but that threw a new error.

          What should I do to resolve it?

            Fatal error: Using $this when not in object context on line 56

            This error happens when you use $this somewhere other than as part of an object method.

              Would the resolution be to convert what's below that line into a function and then apply it to $src_to_search?

                What's the object that $this refers to?

                $this won't work in a static method, in a class that hasn't been instantiated, or in the global scope (procedural code). It does work inside an abstract class's non-static methods, though, as long as they're called on an object of a concrete subclass.

                  maxxd wrote:

                  What's the object that $this refers to?

                  I'm having problems figuring that out and still haven't managed to turn it into a function. Gotta be pretty close though.

                    Open the file that contains the $this that's throwing the error. Scroll up until you see "class XXX{". That's the class that $this refers to.

                    If you don't see that, then that's the problem - '$this' is a pseudo-variable that refers to the object whose method is currently running, so it only exists inside a class. If you're not programming OOP-style and pre() is a function that exists, remove '$this->'.
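To make the difference concrete, here's a small sketch of the same helper written both ways. The LinkDumper class and this pre() are made up for illustration; I'm only guessing that the original pre() wrapped debug output in pre tags.

```php
<?php
// Hypothetical class, for illustration only.
class LinkDumper
{
    // Wraps a value in <pre> tags so it is readable in a browser.
    public function pre($value)
    {
        return '<pre>' . htmlspecialchars(print_r($value, true)) . '</pre>';
    }

    public function dump($value)
    {
        // Inside a non-static method, $this is the object dump() was called on.
        return $this->pre($value);
    }
}

// Procedural equivalent: a plain function, called without $this->.
function pre($value)
{
    return '<pre>' . htmlspecialchars(print_r($value, true)) . '</pre>';
}

$dumper = new LinkDumper();
echo $dumper->dump('a < b'), "\n"; // fine: dump() runs on an object
echo pre('a < b'), "\n";           // fine: ordinary function call
// echo $this->pre('a < b');       // fatal error: no object context at file level
```

So in a plain script you either define the function yourself and call it directly, or drop the call altogether if it was only there for debugging.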

                      maxxd wrote:

                      Open the file that contains the $this that's throwing the error. Scroll up until you see "class XXX{". That's the class that $this refers to.

                      If you don't see that, then that's the problem - '$this' is a pseudo-variable that refers to the object whose method is currently running, so it only exists inside a class. If you're not programming OOP-style and pre() is a function that exists, remove '$this->'.

                      Thanks very much maxxd. I tried removing $this->, but then got an error about pre() being an undefined function, so I commented the line out completely and it looks like it's working as it should. I'll play around some more, but it looks like sneaky was using it in his own environment and it isn't needed here.
