I'm new to screen scraping and need some help. My ultimate goal is to grab text from a webpage and store it in a MySQL database so I can search through it.
I'm confused about writing regular expression patterns. I've looked at examples in online screen scraping tutorials, but none of them really explain how to write the patterns.
In an online example of how to put a weather script on your webpage, this is the HTML from weather.com:
<TD WIDTH="50%" VALIGN=MIDDLE ALIGN=CENTER STYLE="padding:0px 0px 10px 0px;">
<IMG SRC=http://image.weather.com/web/common/wxicons/52/28.gif WIDTH=52 HEIGHT=52 BORDER=0 ALT=><BR>
<B CLASS=obsTextA>Mostly Cloudy</B>
</TD>
<TD WIDTH="50%" VALIGN=MIDDLE ALIGN=CENTER>
<DIV STYLE="padding: 10px 0px 3px 5px;">
<B CLASS=obsTempTextA>83°F</B><BR>
<B CLASS=obsTextA>Feels Like<BR> 82°F</B>
</DIV>
</TD>
Since you can actually see "Mostly Cloudy" and "83°F" in the source, it is fairly easy to see how the author created his regular expression patterns:
"/<TD WIDTH=\"50%\" VALIGN=MIDDLE ALIGN=CENTER><DIV STYLE=\"padding: 10px 0px 3px 5px;\">
<B CLASS=obsTempTextA>([0-9]+)°F<\/B><BR>
<B CLASS=obsTextA>Feels Like<BR> ([0-9]+)°F<\/B><\/DIV><\/TD>/"
"/<TD width=\"290\" ALIGN=\"left\"><H2 CLASS=\"moduleTitleBar\">
<B>Right Now for<\/B><BR>([a-zA-Z0-9,()\s]*)<BR>/"
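To make the idea behind these patterns concrete, here is a minimal sketch in Python (the quoted script appears to use PHP-style patterns, so this is a translation of the idea, not the tutorial author's code). The literal HTML around the value anchors the match, and each parenthesized group captures the part that changes. The function name `extract_temperatures` is made up for this illustration, and the sample HTML is the weather.com snippet quoted above.

```python
import re

# Sample input: the weather.com snippet quoted in this post.
SAMPLE_HTML = """<TD WIDTH="50%" VALIGN=MIDDLE ALIGN=CENTER>
<DIV STYLE="padding: 10px 0px 3px 5px;">
<B CLASS=obsTempTextA>83°F</B><BR>
<B CLASS=obsTextA>Feels Like<BR> 82°F</B>
</DIV>
</TD>"""

# Fixed text anchors the match; each (...) group captures a changing value.
# [0-9]+ means "one or more digits"; \s* tolerates spaces and line breaks
# between tags.
TEMP_PATTERN = re.compile(
    r"<B CLASS=obsTempTextA>([0-9]+)°F</B><BR>\s*"
    r"<B CLASS=obsTextA>Feels Like<BR>\s*([0-9]+)°F</B>"
)


def extract_temperatures(html):
    """Return (temperature, feels_like) as strings, or None if nothing matches."""
    match = TEMP_PATTERN.search(html)
    if match is None:
        return None
    return match.groups()


print(extract_temperatures(SAMPLE_HTML))
```

A pattern like this only works while the page keeps exactly this markup, so it has to be rewritten for every site, which is why it cannot be reused as-is elsewhere.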
However, I can't simply reuse this script on the website I need to get data from.
This is its source code:
<html>
<head>
<title>This Week's Menus</title>
</head>
<frameset rows="115,72%" frameborder="No" cols="*" framespacing="0">
<frame name="topframe" src="top_frame.html?naFlag=1&sName=HARVARD+UNIVERSITY+DINING+SERVICES&locationNum=05&locationName=Hot+Entrees%2C+Starches%2C+Bean%2FGrain+and+Vegetables%3C%2Ffont%3E%3C%2Fa%3E%3Cbr%3E%3Cfont%3E%26nbsp%3C%2Ffont%3E%3Ca%3E%3Cfont%3E" SCROLLING="NO" NORESIZE>
<frameset cols="163,76%" rows="*" frameborder="NO" framespacing="0">
<frame name="leftframe" src="left_frame.asp?naFlag=1&sName=HARVARD+UNIVERSITY+DINING+SERVICES&locationNum=05&locationName=Hot+Entrees%2C+Starches%2C+Bean%2FGrain+and+Vegetables%3C%2Ffont%3E%3C%2Fa%3E%3Cbr%3E%3Cfont%3E%26nbsp%3C%2Ffont%3E%3Ca%3E%3Cfont%3E" SCROLLING="NO" NORESIZE>
<frame name="centerframe" src="center_frame.asp?naFlag=1&sName=HARVARD+UNIVERSITY+DINING+SERVICES&locationNum=05&locationName=Hot+Entrees%2C+Starches%2C+Bean%2FGrain+and+Vegetables%3C%2Ffont%3E%3C%2Fa%3E%3Cbr%3E%3Cfont%3E%26nbsp%3C%2Ffont%3E%3Ca%3E%3Cfont%3E" NORESIZE>
</frameset>
</frameset>
<noframes>
</noframes>
</html>
This is the actual website: http://www.huds.harvard.edu/foodpro/frameset.asp?sName=HARVARD+UNIVERSITY+DINING+SERVICES&locationNum=05&locationName=Hot+Entrees%2C+Starches%2C+Bean%2FGrain+and+Vegetables%3C%2Ffont%3E%3C%2Fa%3E%3Cbr%3E%3Cfont%3E%26nbsp%3C%2Ffont%3E%3Ca%3E%3Cfont%3E&naFlag=1
Now I want to be able to grab the text of the meals, say "Meatloaf", and store that data. Unlike on weather.com, these phrases do not appear in the source code for this webpage.
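One likely reason the meal names are missing is that the source above is only a frameset: it contains no content of its own and just points at three other documents through the `src` attributes of its `<frame>` tags. The sketch below, in Python with only the standard library, pulls those `src` values out and resolves them against the page address, so that each frame document could then be fetched and searched separately. Which frame actually holds the menu text is an assumption I have not verified. The query strings are shortened here for readability, and `FrameCollector` and `find_frame_urls` are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Address of the frameset page from this post, with the query string trimmed.
BASE_URL = "http://www.huds.harvard.edu/foodpro/frameset.asp"

# Shortened copy of the frameset source quoted above (query strings trimmed).
FRAMESET_HTML = """<frameset rows="115,72%" frameborder="No" cols="*" framespacing="0">
<frame name="topframe" src="top_frame.html?locationNum=05" SCROLLING="NO" NORESIZE>
<frameset cols="163,76%" rows="*" frameborder="NO" framespacing="0">
<frame name="leftframe" src="left_frame.asp?locationNum=05" SCROLLING="NO" NORESIZE>
<frame name="centerframe" src="center_frame.asp?locationNum=05" NORESIZE>
</frameset>
</frameset>"""


class FrameCollector(HTMLParser):
    """Collects the name and src of every <frame> tag it sees."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "frame":
            attributes = dict(attrs)
            if attributes.get("src"):
                self.frames[attributes.get("name")] = attributes["src"]


def find_frame_urls(html, base_url):
    """Map each frame name to the absolute URL of the document it loads."""
    collector = FrameCollector()
    collector.feed(html)
    return {name: urljoin(base_url, src) for name, src in collector.frames.items()}


for name, url in find_frame_urls(FRAMESET_HTML, BASE_URL).items():
    print(name, url)
```

Each printed address is an ordinary page that can be downloaded and inspected on its own; if the meal names show up in one of them, that is the page to write the extraction pattern against.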
I don't need to show any of this information on a website; I just need to store the data so I can search it. Can anyone help me? Thanks in advance!
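For the storing and searching part, here is a rough sketch of the shape it could take. To keep it runnable without a database server it uses Python's built-in `sqlite3` module as a stand-in for MySQL, so this is not MySQL code; the SQL statements would look much the same there, though the placeholder syntax depends on the driver. The table name `menu_items`, the function names, and every sample item other than "Meatloaf" are made up for illustration.

```python
import sqlite3


def store_items(conn, items):
    """Create the table if needed and insert one row per scraped item."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS menu_items ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # Placeholders keep scraped text from being interpreted as SQL.
    conn.executemany(
        "INSERT INTO menu_items (name) VALUES (?)",
        [(item,) for item in items],
    )
    conn.commit()


def search_items(conn, term):
    """Return the names that contain the search term, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM menu_items WHERE name LIKE ? ORDER BY name",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]


# In-memory database so the sketch leaves no file behind.
connection = sqlite3.connect(":memory:")
store_items(connection, ["Meatloaf", "Mashed Potatoes", "Green Beans"])
print(search_items(connection, "loaf"))
```

The scraping step only has to produce a list of strings; everything after that is ordinary insert and select work.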