Hi there,
I'm trying to do some 'scraping', as I guess it's called.
First thing I want to do is to get the title of a web page.
It should be simple. I guess I'm having trouble figuring out DOMDocument or something, but I've been stuck here for 5 hours.
So check this code. You can see I'm fetching the phpbuilder home page and can extract all the links and all the images from it, but I can't figure out how to just get the darned title. Down at the end you can see I've done it with some borrowed string-matching code, but it really seems like I should already have it from the DOM somehow...
$url = 'http://phpbuilder.com';
// Fetch the page once; CURLOPT_RETURNTRANSFER makes curl_exec() return the body as a string.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$str = curl_exec($ch);
curl_close($ch);
//echo $str;
$doc = new DOMDocument();
$doc->loadHTML($str);
$hrefs = $doc->getElementsByTagName('a');
$srcs = $doc->getElementsByTagName('img');
$titles = $doc->getElementsByTagName('title');
echo 'Number of links found: '.$hrefs->length.'<br/>';
echo 'Number of image sources found: '.$srcs->length.'<br/>';
$matches_href = array();
$matches_src = array();
foreach ($hrefs as $href) $matches_href[] = $href->getAttribute('href');
foreach ($srcs as $src) $matches_src[] = $src->getAttribute('src');
//foreach($titles as $title) $matches_title[] = $title->??? ; //how can I just get the content of the title tag?
echo nl2br(var_export($matches_href,true)).'<br/>';
echo nl2br(var_export($matches_src,true)).'<br/>';
$str = stristr($str, '<title>');
$rest = substr($str, 7);
$extra = stristr($str, '</title>');
$titlelen = strlen($rest) - strlen($extra);
$gettitle = trim(substr($rest, 0, $titlelen));
echo "<br>Coming from: " . $gettitle . "<p>";
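For comparison, here is a rough sketch of reading the title straight from the DOM instead of slicing the string. It assumes the `textContent` property of a DOM node holds the text inside the element, and it uses a made-up sample string (`$html`) in place of the fetched page, so treat it as an illustration rather than a tested fix for phpbuilder.com.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><head><title> Example Page </title></head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList; item(0) is the first <title>, or null if there is none.
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? trim($titleNode->textContent) : '';

echo $title, "\n";
```

The null check matters because a page without a title tag would otherwise trigger an error when reading a property of null.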
So any help is appreciated. I'm definitely in the early stages of picking up PHP. Should I be using a SAX parser or some other HTML parser? Is this a reasonable approach? I guess I'll also want to grab things like the link text along with the href, and I may need to sift through some of the text on the page as well. I'm just using phpbuilder.com for this example. Where should I have gone to find the answer to this question on my own?
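To make the "link text along with the href" part concrete, this is roughly the result I have in mind. It is only a sketch: the sample markup in `$html` is invented, and it assumes `textContent` returns the combined text of everything nested inside the link.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<a href="/forum">Forum <b>home</b></a>'
      . '<a name="top">No href</a>'
      . '<a href="/articles">Articles</a>'
      . '</body></html>';

$doc = new DOMDocument();
// Suppress warnings from imperfect markup while parsing.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // Skip anchors that have no href attribute (e.g. named anchors).
    if (!$a->hasAttribute('href')) {
        continue;
    }
    $links[] = array(
        'href' => $a->getAttribute('href'),
        // textContent concatenates the text of all descendant nodes.
        'text' => trim($a->textContent),
    );
}

print_r($links);
```

Each entry pairs one href with its visible text, which seems easier to work with than two separate arrays.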
Thanks,
Bagus