How do I parse and process HTML in PHP? A simple scraper
I'm currently working on a parser in PHP that builds a small preview of a page from a URL given by the user. I'd like to retrieve only the title of the page and a small chunk of information (a bit of text).
The project: I want to build a list of meta-data for popular WordPress plugins (cf. https://de.wordpress.org/plugins/browse/popular/), starting by gathering the first 50 URLs, i.e. the 50 plugins that are of interest. The challenge: I want to fetch the meta-data of all the existing plugins. What I subsequently want to filter out after the fetch are those plugins that have the newest timestamp, i.e. the ones that were updated most recently. It is all about how up to date a plugin is.
[upl-image-preview url=https://board.phpbuilder.com/assets/files/2020-05-08/1588951028-41772-image.png]
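To make the "most recently updated" filter concrete, here is a minimal sketch of the sorting step. The record layout (`name`, `last_updated`) and the helper name `sortByMostRecent` are assumptions for illustration, not something the WordPress site provides. It relies on `strtotime()` understanding English relative strings such as "19 hours ago"; the localized German listing would probably need its dates translated first.

```php
<?php
// Hypothetical records, shaped like the output of a per-plugin parser.
$plugins = [
    ['name' => 'plugin-a', 'last_updated' => '3 weeks ago'],
    ['name' => 'plugin-b', 'last_updated' => '19 hours ago'],
    ['name' => 'plugin-c', 'last_updated' => '2 days ago'],
];

/**
 * Return the records ordered from most to least recently updated.
 * $now is the reference time for the relative strings (defaults to time()).
 */
function sortByMostRecent(array $plugins, ?int $now = null): array
{
    $now = $now ?? time();
    usort($plugins, function (array $a, array $b) use ($now): int {
        // strtotime() returns false for strings it cannot parse;
        // those records are treated as the oldest.
        $ta = strtotime($a['last_updated'], $now) ?: 0;
        $tb = strtotime($b['last_updated'], $now) ?: 0;
        return $tb <=> $ta; // newest first
    });
    return $plugins;
}

$sorted = sortByMostRecent($plugins);
foreach ($sorted as $plugin) {
    echo $plugin['name'], ': ', $plugin['last_updated'], "\n";
}
```

Keeping the raw "last updated" string in the record and converting it only when sorting means the fetch step does not have to decide on a date format up front.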
So, to take one page into consideration, i.e. fetching the meta-data of a single WordPress plugin: I guess that simple_html_dom ( http://simplehtmldom.sourceforge.net/ ) offers an appropriate way to do this without any other external libraries or classes. So far I've also tried the generic DOMDocument class ( http://docs.php.net/manual/en/domdocument.loadhtml.php ), loading the HTML and displaying it on the screen, and now I am musing about the proper way to do it. I am leaning towards simple_html_dom because it should make this very easy. Here is an example of how to pull the title and the meta-text (description).
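<?php
require 'simple_html_dom.php';

// Fetch and parse the plugin page; file_get_html() returns false on failure.
$html = file_get_html('https://wordpress.org/plugins/wp-job-manager/');
if (!$html) {
    exit('Could not load the page.');
}
// find($selector, 0) returns the first matching element, or null if none.
$title = $html->find('h1.plugin-title', 0);
$text  = $html->find('.entry-meta', 0);

echo ($title ? $title->plaintext : 'n/a') . "<br>\n";
echo ($text ? $text->plaintext : 'n/a');
?>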
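For comparison, the same lookup can be sketched with the built-in DOMDocument and DOMXPath classes, so no extra library is needed. The sample markup and the helper `firstTextByClass` are assumptions for illustration; the class names `plugin-title` and `entry-meta` are simply taken over from the snippet above and may not match the live page.

```php
<?php
// Sample markup standing in for a fetched plugin page (assumed structure).
$sampleHtml = <<<HTML
<html><body>
  <h1 class="plugin-title">WP Job Manager</h1>
  <div class="entry-meta">Version: 1.9.5.12 Last updated: 19 hours ago</div>
</body></html>
HTML;

/**
 * Return the trimmed text of the first element carrying the given class,
 * or null when nothing matches.
 */
function firstTextByClass(DOMXPath $xpath, string $class): ?string
{
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$title = firstTextByClass($xpath, 'plugin-title');
$meta  = firstTextByClass($xpath, 'entry-meta');

echo $title, "\n";
echo $meta, "\n";
```

For a live page, the sample string would be replaced by the fetched HTML, e.g. the result of `file_get_contents()` on the plugin URL.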
See the source of https://wordpress.org/plugins/wp-job-manager/ - each WordPress plugin page offers the following set of meta-data:
Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags (3): database member sign-up form volunteer
Last updated: 19 hours ago
plugin-ratings
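Once the meta block is available as plain text, the "Label: value" lines above can be turned into an array. This is a minimal sketch; the function name `parseMetaLines` and the lower-cased keys are assumptions, and lines without a colon (such as the ratings heading) are simply skipped.

```php
<?php
/**
 * Turn "Label: value" lines into an associative array.
 * Lines without a colon (e.g. a bare heading) are skipped.
 */
function parseMetaLines(string $text): array
{
    $meta = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // Split on the first colon only, so values may contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue;
        }
        $label = strtolower(trim($parts[0]));
        $meta[$label] = trim($parts[1]);
    }
    return $meta;
}

$sample = "Version: 1.9.5.12\nWordPress Version: 5.0 or higher\nLast updated: 19 hours ago\nplugin-ratings";
$meta = parseMetaLines($sample);
print_r($meta);
```

Tags and ratings are not covered here, since they are not simple "Label: value" pairs and depend on the page's actual markup.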
The project consists of two parts. The looping part: looping over https://de.wordpress.org/plugins/browse/popular/ and gathering approx. 50 to 80 URLs (which seems to be pretty straightforward). The parser part, where I have some issues: getting the data for the tags and the plugin rating properly.
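The looping part could be sketched roughly as follows: collect every link that points to a single plugin page, de-duplicate, and stop at a limit. The sample markup, the helper name `collectPluginUrls` and the URL pattern are assumptions; the real listing page may be structured differently.

```php
<?php
// Sample markup standing in for the listing page (assumed structure).
$listingHtml = <<<HTML
<html><body>
  <a href="https://de.wordpress.org/plugins/wp-job-manager/">WP Job Manager</a>
  <a href="https://de.wordpress.org/plugins/wp-job-manager/">Details</a>
  <a href="https://de.wordpress.org/plugins/browse/popular/page/2/">Next</a>
  <a href="https://de.wordpress.org/plugins/contact-form-7/">Contact Form 7</a>
</body></html>
HTML;

/**
 * Return up to $limit unique plugin page URLs found in the given HTML,
 * in the order they first appear.
 */
function collectPluginUrls(string $html, int $limit = 50): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $pattern = '#^https?://(?:[a-z]{2}\.)?wordpress\.org/plugins/([a-z0-9-]+)/?$#';
    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Keep only links of the form .../plugins/<slug>/ (one path segment).
        if (!preg_match($pattern, $href, $m) || $m[1] === 'browse') {
            continue;
        }
        $urls[$href] = true; // array keys de-duplicate repeated links
        if (count($urls) >= $limit) {
            break;
        }
    }
    return array_keys($urls);
}

$pluginUrls = collectPluginUrls($listingHtml);
print_r($pluginUrls);
```

Each collected URL can then be handed to the per-plugin parser. When fetching 50 to 80 pages in a row, a short pause between requests is a reasonable courtesy to the server.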