PHP DOMDocument nightmares

Finnsk3

Hi all,

I have been trying to write this simple web crawler to harvest metadata from a site.

Before I start: the owner of the site has given me permission to scrape it.

$baseurl = 'https://library.comicsplusapp.com/';

// Fetch the category listing page
$categoriesHtml = fetch($baseurl . 'categories.php');
// Extract the category IDs from the page
preg_match_all('/<a href="category_comics\.php\?id=(\d+?)" class="category">/', $categoriesHtml, $categories);

foreach ($categories[1] as $cId) {
	$catData = fetch($baseurl . 'comics.php?page=1&page_size=10&category_id=' . $cId);

	// Use the DOM parser to navigate the HTML; @ suppresses warnings about malformed markup
	$dom = new DOMDocument();
	@$dom->loadHTML($catData);
	$xpath = new DOMXPath($dom);

	$rows = $xpath->query('/html/body/div[@class="container"]/div[@class="comic_list_wide"]/div[@class="row"]');
	foreach ($rows as $row) {
		// The second argument is the context node, so this query is relative to the current row
		$thumbImg = $xpath->query('descendant::div[@class="col-sm-3"]/a/img', $row);
		print_r($thumbImg->item(0));
		var_dump($thumbImg->item(0)->getAttribute('src'));
		die; // stop after the first row while debugging
	}
}

Here is the code I have written so far. First I just want to vent some frustration with DOMXPath: performing nested XPath queries was a massive pita. I had to use "descendant::" for some reason; passing the second parameter to scope the query did absolutely nothing without it. It was returning elements out of scope until I prepended "descendant::" to my query. It's weird that every example of nested XPath queries I could find didn't use this.
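
For reference, here is a minimal sketch of how the context node interacts with the path expression in DOMXPath::query(). The markup is made up for illustration and is not the real page. In XPath, a path beginning with "//" is absolute and searches from the document root even when a context node is passed, while ".//" and "descendant::" are relative to the context node. That may be what caused the out-of-scope results described above, although the earlier queries are not shown, so this is only a guess.

```php
<?php
// Hypothetical markup: two rows, each containing one thumbnail image.
$html = '<html><body>'
      . '<div class="row"><a href="#"><img data-original="a.jpg"></a></div>'
      . '<div class="row"><a href="#"><img data-original="b.jpg"></a></div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Use the first row as the context node for the nested queries.
$firstRow = $xpath->query('//div[@class="row"]')->item(0);

// Absolute path: "//" starts at the document root, so the context node has no effect.
$absolute = $xpath->query('//img', $firstRow);

// Relative paths: both forms search only inside $firstRow.
$relativeDot  = $xpath->query('.//img', $firstRow);
$relativeAxis = $xpath->query('descendant::img', $firstRow);

echo $absolute->length, "\n";     // 2 (images from both rows)
echo $relativeDot->length, "\n";  // 1
echo $relativeAxis->length, "\n"; // 1
```

So ".//div[...]" and "descendant::div[...]" should behave the same when a context node is supplied; the leading dot is simply the abbreviated way to anchor the path at that node.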

Here is the response from my script:

DOMElement Object
(
    [tagName] => img
    [schemaTypeInfo] =>
    [nodeName] => img
    [nodeValue] =>
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] =>
    [lastChild] =>
    [previousSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] =>
    [prefix] =>
    [localName] => img
    [baseURI] =>
    [textContent] =>
)
string(0) ""

I can't return any attributes from the DOMElement... item(0)->getAttribute('src') returns a blank string.

PS. fetch is just my cURL function. I am behind a proxy that requires authentication, so I didn't bother including the code for that function; file_get_contents would work as a direct replacement.
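
Since the helper itself was not posted, the following is only a rough sketch of what a cURL-based fetch() behind an authenticating proxy might look like. The proxy address, the credentials, the timeout and the error handling are all placeholder assumptions, not the original function.

```php
<?php
// Hypothetical stand-in for the unposted fetch() helper.
// The proxy host and credentials below are placeholders.
function fetch(
    string $url,
    string $proxy = 'proxy.example.com:8080',
    string $proxyAuth = 'username:password'
): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_PROXY          => $proxy,     // route the request through the proxy
        CURLOPT_PROXYUSERPWD   => $proxyAuth, // "user:password" for proxy authentication
        CURLOPT_TIMEOUT        => 30,         // give up after 30 seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("Request for $url failed: " . curl_error($ch));
    }

    return $body;
}
```

Throwing on failure is a design choice for the sketch: it avoids passing false into loadHTML() later on.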

Finnsk3

/facepalm... I found the issue.

The image doesn't have a src in the original response; JavaScript takes the "data-original" attribute and sets it as src, which is really odd. Possibly it's there to stop robots, no idea.
I was inspecting the element in Chrome and didn't think to look at the raw HTML.
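
For anyone hitting the same thing, here is a minimal sketch of reading the attribute that is actually present in the raw HTML. The markup is a made-up example, not the real page; this pattern (an empty src plus a data attribute filled in by script) is commonly used for lazy-loading images, though that is an assumption about this particular site.

```php
<?php
// Hypothetical markup mimicking the raw response: no src, only data-original.
$html = '<div class="col-sm-3"><a href="#"><img class="lazy" data-original="/covers/1.jpg"></a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$img = $xpath->query('//div[@class="col-sm-3"]/a/img')->item(0);

// getAttribute() returns an empty string when the attribute is missing,
// so prefer data-original and fall back to src only if it is absent.
$thumbUrl = $img->getAttribute('data-original');
if ($thumbUrl === '') {
    $thumbUrl = $img->getAttribute('src');
}

var_dump($img->getAttribute('src')); // string(0) ""
echo $thumbUrl, "\n";                // /covers/1.jpg
```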

Weedpacket

Didn't look at the AUP either.

Finnsk3

You obviously didn't read my thread. I have explicit authorisation in writing to crawl their site.
We are a client of theirs. They promised to give us all their metadata as part of our licence agreement with them, but it was going to take a long time for them to produce the export, so I just asked if I could harvest it from their site.

Weedpacket

My apologies; I did read that line and then forgot it. Too many people post screen-scraping questions without having permission.
Having permission and yet having to scrape is odd - no doubt when they were building the site they never anticipated the possibility that a machine-readable version of their information would be useful to someone.
Again, my bad.

Finnsk3

No worries,

I have to deal with a lot of subscription services like this, and what tends to happen is that they get a company to develop their system for them and then they maintain it. Organisations like us then say: OK, we want a bulk licence and we want to make all the resources available via our discovery mechanism. Then the companies go... ohh... we will have to go back to the development company and get them to write enhancements, which can take a long time.