Parse and process HTML in PHP: fetching WordPress plugin metadata with a scraper

dilbert2010 · May 8, 2020

How do I parse and process HTML in PHP? A simple scraper

I'm currently working on a parser in PHP that builds a small preview of a page from a URL given by the user. I'd like to retrieve only the title of the page and a small chunk of information (a bit of text).

The project: build a list of metadata for popular WordPress plugins (cf. https://de.wordpress.org/plugins/browse/popular/), starting by gathering the first 50 URLs, i.e. the 50 plugins that are of interest. The challenge: I want to fetch the metadata of all the existing plugins. What I subsequently want to filter out after the fetch is those plugins that have the newest timestamp, i.e. the ones that were updated most recently. It is all about how current the plugins are.
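The filtering step described above (keep the most recently updated plugins) can be sketched independently of how the data is fetched. This is a minimal sketch: the record layout, the `last_updated` key, the date format and the helper name `newestPlugins()` are all assumptions made for illustration.

```php
<?php
// Hypothetical sample records; in the real project these would come from the fetch step.
$plugins = [
    ['slug' => 'plugin-a', 'last_updated' => '2020-05-01 10:00:00'],
    ['slug' => 'plugin-b', 'last_updated' => '2020-05-07 21:10:00'],
    ['slug' => 'plugin-c', 'last_updated' => '2020-04-20 08:30:00'],
];

/**
 * Return the $limit most recently updated plugins, newest first.
 */
function newestPlugins(array $plugins, int $limit): array
{
    usort($plugins, function (array $a, array $b): int {
        // Compare as Unix timestamps; putting $b first sorts in descending order.
        return strtotime($b['last_updated']) <=> strtotime($a['last_updated']);
    });

    return array_slice($plugins, 0, $limit);
}

foreach (newestPlugins($plugins, 2) as $plugin) {
    echo $plugin['slug'], ' ', $plugin['last_updated'], "\n";
}
```

Relative strings such as "19 hours ago" would have to be converted to absolute timestamps before a comparison like this is meaningful.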

[upl-image-preview url=https://board.phpbuilder.com/assets/files/2020-05-08/1588951028-41772-image.png]

So, to take one page into consideration, i.e. fetching the metadata of one WordPress plugin: so far I've tried the general DOMDocument class (http://docs.php.net/manual/en/domdocument.loadhtml.php), loading the HTML and displaying it on the screen, and now I am musing about the proper way to do it. I am considering simple_html_dom (http://simplehtmldom.sourceforge.net/) for this. I guess it offers an appropriate method without any other external libraries or classes, and it should make this very easy. Here is an example of how to pull the title and the meta text (description).

<?php
require 'simple_html_dom.php';

$html = file_get_html('https://wordpress.org/plugins/wp-job-manager/');
$title = $html->find ("h1", class_="plugin-title").text];
$text  = $html->find(class_="entry-meta").text];

echo $title->plaintext."<br>\n";
echo $texte->text;
?>
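For comparison, the same kind of extraction can be tried with PHP's built-in DOMDocument and DOMXPath, without any external library. This is only a sketch: it runs against a small stand-in HTML string, the class names `plugin-title` and `entry-meta` are taken from the snippet above and may not match the live page, and `extractTitleAndMeta()` is a hypothetical helper name.

```php
<?php
// Stand-in markup; the class names mirror the snippet above and are
// assumptions about the real plugin page.
$sampleHtml = <<<HTML
<html><body>
  <h1 class="plugin-title">WP Job Manager</h1>
  <div class="entry-meta">Version: 1.9.5.12</div>
</body></html>
HTML;

function extractTitleAndMeta(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so it is found even among several classes.
    $byClass = function (string $tag, string $class): string {
        return "//{$tag}[contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')]";
    };

    $titleNode = $xpath->query($byClass('h1', 'plugin-title'))->item(0);
    $metaNode  = $xpath->query($byClass('*', 'entry-meta'))->item(0);

    return [
        'title' => $titleNode ? trim($titleNode->textContent) : null,
        'meta'  => $metaNode ? trim($metaNode->textContent) : null,
    ];
}

$result = extractTitleAndMeta($sampleHtml);
echo $result['title'], "\n", $result['meta'], "\n";
```

Returning null for a missing node keeps the caller from dereferencing an element that is not on the page.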

See the source at https://wordpress.org/plugins/wp-job-manager/. We have the following set of metadata for each WordPress plugin:

Version: 1.9.5.12
Installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database member sign-up form volunteer
Last updated: 19 hours ago
Plugin ratings

The project consists of two parts. The looping part: looping over https://de.wordpress.org/plugins/browse/popular/ and gathering approx. 50 to 80 URLs (which seems to be pretty straightforward). The parser part, where I have some issues: properly getting the data for the tags and the plugin rating.
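The looping part could be sketched roughly as follows. The listing markup here is a stand-in, and the assumption that plugin links have the shape `/plugins/<slug>/` is mine, so the pattern would need checking against the real page. `collectPluginUrls()` is a hypothetical helper name.

```php
<?php
// Stand-in for the listing page; the real markup may differ.
$listingHtml = <<<HTML
<html><body>
  <a href="https://de.wordpress.org/plugins/wp-job-manager/">WP Job Manager</a>
  <a href="https://de.wordpress.org/plugins/wp-job-manager/">Duplicate link</a>
  <a href="https://de.wordpress.org/plugins/browse/popular/page/2/">Next</a>
  <a href="https://de.wordpress.org/plugins/example-plugin/">Example</a>
</body></html>
HTML;

function collectPluginUrls(string $html, int $limit = 50): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Keep only links shaped like /plugins/<slug>/ ; deeper paths such as
        // /plugins/browse/popular/page/2/ do not match.
        if (preg_match('#^https?://[^/]+/plugins/([a-z0-9-]+)/?$#', $href, $m)) {
            $urls[$m[1]] = $href; // keyed by slug, which removes duplicates
        }
    }

    return array_slice(array_values($urls), 0, $limit);
}

print_r(collectPluginUrls($listingHtml));
```

To reach 50 to 80 URLs, the same function would be called once per listing page and the results merged before applying the limit.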

Weedpacket · May 8, 2020

So what are the problems you're having? Apart from the invalid syntax in $title = $html->find ("h1", class_="plugin-title").text]; and the next line, I presume that code works (I don't use simplehtmldom). It just looks like your selections need to be a bit more refined than they are.

All that said, WordPress does provide an API so that you can get plugin information already in a machine-readable form. That way you don't have to muck around with screenscraping.

Over the last few weeks I have been wondering on how to possibly pull data about my plugins hosted on WordPress.org and display it on my website. The first thing that came to mind was "Web Scraping" but quite frankly this is a lot of work, feels like going back in time, and is not something a good web citizen should do. In some cases, it could be illegal.
https://code.tutsplus.com/tutorials/communicating-with-the-wordpressorg-plugin-api--wp-33069
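As a rough illustration of the API route, the sketch below builds a request URL and reduces a JSON response to the fields this thread is after (version, last update, rating, tags). The endpoint pattern and the field names reflect my understanding of the WordPress.org plugin API and should be checked against the official documentation. The sample response is hypothetical so that the code runs offline, and `pluginInfoUrl()` and `summarisePlugin()` are made-up helper names.

```php
<?php
// Endpoint pattern is an assumption; verify it before relying on it.
function pluginInfoUrl(string $slug): string
{
    return 'https://api.wordpress.org/plugins/info/1.0/' . rawurlencode($slug) . '.json';
}

// Reduce a JSON response to the fields of interest.
// The field names are assumptions about the response shape.
function summarisePlugin(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new InvalidArgumentException('Response is not a JSON object');
    }

    return [
        'name'         => $data['name'] ?? null,
        'version'      => $data['version'] ?? null,
        'last_updated' => $data['last_updated'] ?? null,
        'rating'       => $data['rating'] ?? null,
        'tags'         => array_values((array) ($data['tags'] ?? [])),
    ];
}

// Hypothetical response, used so the sketch runs without network access.
$sampleJson = '{"name":"WP Job Manager","version":"1.9.5.12",'
    . '"last_updated":"2020-05-07 9:10pm GMT","rating":90,'
    . '"tags":{"job-board":"job board","jobs":"jobs"}}';

print_r(summarisePlugin($sampleJson));

// Live call (needs allow_url_fopen and network access):
// $json = file_get_contents(pluginInfoUrl('wp-job-manager'));
// print_r(summarisePlugin($json));
```

If the response carries these fields, the tags and rating that were awkward to scrape come back as plain data with no selectors involved.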