I have tons of experience with this sort of thing, but I'd need a more specific question to go into detail. Generally:
- Get the page source with file_get_contents() or cURL (curl_init()/curl_exec()). If you need to send a custom User-Agent, use a stream context or the cURL options; there's a sketch right after this list.
- If you want, save it locally with file_put_contents() so you can re-parse without re-fetching.
- At this point, you must decide how to parse it. As NogDog states, DOMDocument is often your friend. Other methods might include SimpleXML or even preg_match(), but those are specialized, potentially nasty-hackish approaches depending on what the HTML you've grabbed looks like.
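The snippet below builds a stream context that sends a custom User-Agent and then fetches and caches a page. It's a minimal sketch, not code from the project above: the bot name, URL, and cache path are placeholders.

$context = stream_context_create(array(
    'http' => array(
        'method'     => 'GET',
        'user_agent' => 'MyScraperBot/1.0', // placeholder UA string, use your own
        'timeout'    => 30,
    ),
));

$html = file_get_contents('http://www.example.com/page.php', false, $context);
if ($html === false) {
    die('Fetch failed');
}

// Cache locally so later runs can parse the same HTML without hitting the site again.
file_put_contents('cache/page.html', $html);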
Here's some obfuscated code:
// My scraper saves pages so we can re-run it against the same HTML.
if (!$use_cache) {
    $file = "http://www.booyah.com/page.php";
    // $context is the stream context carrying the User-Agent string for our bot.
    $data = file_get_contents($file, false, $context);
    file_put_contents("cache/page.html", $data);
} else {
    $data = file_get_contents("cache/page.html");
}
// We were only interested in the contents of one <table> on this page,
// so we split on its exact opening tag and kept everything after it.
$divider = '<table width="100%" border="0" cellspacing="0" cellpadding="2">';
$ex = explode($divider, $data);
$data = $ex[1];
// DOMDocument was our friend here. The @ suppresses the warnings loadHTML()
// throws on malformed markup and fragments; there is no DOMDocument::loadFragment(),
// so either the @ or libxml_use_internal_errors(true) is the usual workaround.
$dom = new DOMDocument();
@$dom->loadHTML($data);
$cells = $dom->getElementsByTagName("td");
foreach ($cells as $cell) {
    // parsing logic here
}
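If the explode() trick feels too fragile (it breaks the moment the markup changes), you can let DOMXPath find the table inside the full page instead. This is a hedged sketch, not from the original project: the attribute test just reuses the width/cellpadding values from the $divider string above, and $data here is assumed to still hold the full page HTML (i.e., skip the explode() step).

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$table = $xpath->query('//table[@width="100%" and @cellpadding="2"]')->item(0);
if ($table !== null) {
    foreach ($table->getElementsByTagName('td') as $cell) {
        // your parsing logic here, e.g.:
        echo trim($cell->textContent), "\n";
    }
}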