Parsing a webpage for information

iceman42

There really is no easy way of titling this topic. I have a script that I wrote that takes a name and a URL from the database. The URL is used to grab a webpage, and the content is placed into a string and then parsed for the name. Once the name is found, the script attempts to strip out most of the HTML tags, does a crappy job of that, and leaves in a few tags that separate the various pieces of data that pertain to that name. I then replace those few tags with a | sign, which should leave me with name|1|23|4| ....

This works for most names, though it is probably not the best way to do it, and I have noticed two things. For some of the names, even though they are pulled from the same URL, the script does not get all of the data and then seems to leave in the rest of the HTML after the point where it should have been stripped. So rather than getting a nice name|1|23|4| I get name|1|23|4|<TD WIDTH="9" NOWRAP></TD><TD ALIGN="LEFT" VALIGN="TOP"><h1><span class="specheading">... to infinity or the end of the page. On one occasion I got just name|1|23, even though it was pulling from the same page as the others.
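
For comparison, here is a rough sketch of the same extraction done with a real HTML parser instead of string stripping. It is written in Python purely for illustration (my script's language is not given above), and it assumes the name sits in the first cell of a table row with its values in the following cells. The names RowCollector and extract_record are made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each table cell, grouped by table row.

    Hypothetical helper for illustration; assumes a flat (non-nested) table.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <td> both arrive as "td".
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        # Text outside a cell (headings, later page content) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()  # flush a row left open at end of input

    def _close_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def extract_record(html, name):
    """Return 'name|v1|v2|...' for the row whose first non-empty cell is name."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        cells = [c for c in row if c]  # drop empty spacer cells
        if cells and cells[0] == name:
            return "|".join(cells)
    return None
```

Because only text inside cells is collected, and each row is cut off at its own closing tag (or at the next row), markup after the row cannot leak into the result the way it does with tag stripping. Whether this fits depends on how the real page is laid out.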

I really wouldn't mind some advice or tips on pulling this crap in better.