Hi all,
I have been trying to write a simple web crawler to harvest metadata from a site.
Before I start: the owner of the site has given me permission to scrape it.
$baseurl = 'https://library.comicsplusapp.com/';

// Fetch the categories page
$categoriesHtml = fetch($baseurl . 'categories.php');

// Pull the category IDs out of the page
preg_match_all('/<a href="category_comics.php\?id=(\d+?)" class="category">/', $categoriesHtml, $categories);

foreach ($categories[1] as $cId) {
    $catData = fetch($baseurl . 'comics.php?page=1&page_size=10&category_id=' . $cId);

    // Use the DOM parser to navigate the HTML (@ suppresses parser warnings)
    $dom = new DOMDocument();
    @$dom->loadHTML($catData);
    $xpath = new DOMXPath($dom);

    $rows = $xpath->query('/html/body/div[@class="container"]/div[@class="comic_list_wide"]/div[@class="row"]');
    foreach ($rows as $row) {
        // Second argument scopes the query to the current row
        $thumbImg = $xpath->query('descendant::div[@class="col-sm-3"]/a/img', $row);
        print_r($thumbImg->item(0));
        var_dump($thumbImg->item(0)->getAttribute('src'));
        die; // stop after the first row while debugging
    }
}
Above is the code I have written so far. First, I just want to vent some frustration with DOMXPath: performing nested XPath queries was a massive PITA. For some reason I had to use "descendant::". Without it, passing the second parameter to scope the query did absolutely nothing, and the query kept returning elements outside the scope until I prepended "descendant::" to it. It's weird that none of the examples of nested XPath queries I could find used this.
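A minimal sketch of the scoping behaviour described above. The earlier query is not shown, so this assumes it began with "//". In XPath, a path that starts with "/" or "//" is evaluated from the document root, even when a context node is passed, while "descendant::" and ".//" are relative to that node. The markup and file names below are hypothetical.

```php
<?php
// Hypothetical markup: two rows, each holding one thumbnail.
$html = '<html><body>'
      . '<div class="row"><div class="col-sm-3"><a href="#"><img src="a.png"></a></div></div>'
      . '<div class="row"><div class="col-sm-3"><a href="#"><img src="b.png"></a></div></div>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$firstRow = $xpath->query('//div[@class="row"]')->item(0);

// Absolute path: starts at the document root, so the context node is ignored.
$absolute = $xpath->query('//div[@class="col-sm-3"]/a/img', $firstRow);

// Relative paths: both are evaluated from the context node.
$dotRelative = $xpath->query('.//div[@class="col-sm-3"]/a/img', $firstRow);
$descendant  = $xpath->query('descendant::div[@class="col-sm-3"]/a/img', $firstRow);

echo $absolute->length, "\n";    // 2 (images from both rows)
echo $dotRelative->length, "\n"; // 1
echo $descendant->length, "\n";  // 1
```

Under that assumption, ".//" and "descendant::" are two spellings of a relative query, which may be why some examples look different while behaving the same way.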
Here is the response from my script:
DOMElement Object
(
    [tagName] => img
    [schemaTypeInfo] =>
    [nodeName] => img
    [nodeValue] =>
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] =>
    [lastChild] =>
    [previousSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] =>
    [prefix] =>
    [localName] => img
    [baseURI] =>
    [textContent] =>
)
string(0) ""
I can't read any attributes from the DOMElement: item(0)->getAttribute('src') returns an empty string.
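One way to narrow this down is to list the attributes the parser actually attached to the element, since getAttribute() returns an empty string both for an empty attribute and for a missing one. The sketch below assumes, purely for illustration, an image that carries a "data-src" attribute instead of "src"; the real page's markup is not shown here and may differ.

```php
<?php
// Hypothetical markup: an <img> with no src attribute at all.
$html = '<html><body><div class="col-sm-3"><a href="#">'
      . '<img class="thumb" data-src="cover.jpg" alt="Cover">'
      . '</a></div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$img = $xpath->query('//div[@class="col-sm-3"]/a/img')->item(0);

// hasAttribute() distinguishes "absent" from "present but empty".
var_dump($img->hasAttribute('src')); // bool(false)
var_dump($img->getAttribute('src')); // string(0) ""

// Collect every attribute present on the element.
$attrs = [];
foreach ($img->attributes as $attr) {
    $attrs[$attr->nodeName] = $attr->nodeValue;
}
print_r($attrs);

// Serialise only this node to see its markup as the parser understood it.
echo $dom->saveHTML($img), "\n";
```

Running the same loop on $thumbImg->item(0) would show whether "src" exists on the matched element or whether the URL lives under another attribute name.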
PS: fetch() is just my cURL function. I am behind a proxy that requires authentication, so I didn't bother including the code for it; file_get_contents would work as a direct replacement.