I have tons of experience with this sort of thing, but I'd need a more specific question to go into detail. Generally:
- Get the page source with file_get_contents() or cURL (curl_init()/curl_exec()). If you need to send a custom User-Agent, use a stream context or the cURL options; there's a sketch right after this list.
- If you want, save it locally with file_put_contents() so you can re-parse without re-fetching.
- At this point, you must decide how to parse it. As NogDog states, DOMDocument is often your friend. Other methods might include SimpleXML or even preg_match(), but those are specialized, potentially nasty-hackish approaches depending on what the HTML you've grabbed looks like.
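The snippet below builds a stream context that sends a custom User-Agent and then fetches and caches a page. It's a minimal sketch, not code from the project above: the bot name, URL, and cache path are placeholders.

$context = stream_context_create(array(
    'http' => array(
        'method'     => 'GET',
        'user_agent' => 'MyScraperBot/1.0', // placeholder UA string, use your own
        'timeout'    => 30,
    ),
));

$html = file_get_contents('http://www.example.com/page.php', false, $context);
if ($html === false) {
    die('Fetch failed');
}

// Cache locally so later runs can parse the same HTML without hitting the site again.
file_put_contents('cache/page.html', $html);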
Here's some obfuscated code:
// My scraper saves pages so we can re-run it against the same HTML.
if (!$use_cache) {
    $file = "http://www.booyah.com/page.php";
    // $context is the stream context carrying the User-Agent string for our bot.
    $data = file_get_contents($file, false, $context);
    file_put_contents("cache/page.html", $data);
} else {
    $data = file_get_contents("cache/page.html");
}
// We were only interested in the contents of one <table> on this page,
// so we split on its exact opening tag and kept everything after it.
$divider = '<table width="100%" border="0" cellspacing="0" cellpadding="2">';
$ex = explode($divider, $data);
$data = $ex[1];
// DOMDocument was our friend here. The @ suppresses the warnings loadHTML()
// throws on malformed markup and fragments; there is no DOMDocument::loadFragment(),
// so either the @ or libxml_use_internal_errors(true) is the usual workaround.
$dom = new DOMDocument();
@$dom->loadHTML($data);
$cells = $dom->getElementsByTagName("td");
foreach ($cells as $cell) {
    // parsing logic here
}
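If the explode() trick feels too fragile (it breaks the moment the markup changes), you can let DOMXPath find the table inside the full page instead. This is a hedged sketch, not from the original project: the attribute test just reuses the width/cellpadding values from the $divider string above, and $data here is assumed to still hold the full page HTML (i.e., skip the explode() step).

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$table = $xpath->query('//table[@width="100%" and @cellpadding="2"]')->item(0);
if ($table !== null) {
    foreach ($table->getElementsByTagName('td') as $cell) {
        // your parsing logic here, e.g.:
        echo trim($cell->textContent), "\n";
    }
}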