OK, I got php-xml loaded and running.
Now this little script is working OK,
but I also want to sift out the https
URLs from the normal http ones.
So I need to find: <a href="https://
My script so far is:
require("my_functions.php");
$target_url = "http://www.support-focus.com/customer-service-software.html";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
echo "<br>Starting<br>Target_url: $target_url<br><br>";
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$page = curl_exec($ch);
// curl_exec() returns false on failure; an empty page body is not an error
if ($page === false) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}
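// parse the html into a DOMDocument
// (real-world HTML is rarely valid, so collect libxml warnings instead of printing them)
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();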
//echo $doc->saveHTML();
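$links = $doc->getElementsByTagName('a'); // find all <a> elements
foreach ($links as $link) // visit each link one by one
{
    $the_link = $link->getAttribute('href');
    // escape the values before building the SQL string;
    // "out" is a reserved word in MySQL, so the column name needs backticks
    $safe_page = mysql_real_escape_string($target_url);
    $safe_link = mysql_real_escape_string($the_link);
    $query = "INSERT INTO pages (page_url, link, `out`, secure)
              VALUES ('$safe_page', '$safe_link', 0, 0)";
    mysql_query($query) or die('Error, insert query failed: ' . mysql_error());
    echo "<br>4) Link stored: " . htmlspecialchars($the_link);
}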
The field labeled "out" should show whether the link is outbound
or internal (not sure how to do that yet).
The "secure" field should show whether the link is https or not.
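
One possible way to fill both fields is to compare the scheme and host of each href against the page URL with parse_url(). This is only a sketch: classify_link() is a hypothetical helper name, not something from my_functions.php, and it assumes that a link without a host (a relative link, or something like mailto:) counts as internal, and that www.example.com and example.com count as different hosts.

```php
<?php
// Hypothetical helper: classify one href relative to the page it was found on.
// Returns array('secure' => 0 or 1, 'out' => 0 or 1).
function classify_link($href, $page_url)
{
    $link_scheme = parse_url($href, PHP_URL_SCHEME);
    $link_host   = parse_url($href, PHP_URL_HOST);
    $page_host   = parse_url($page_url, PHP_URL_HOST);

    // secure: the href explicitly uses the https scheme
    $secure = (is_string($link_scheme) && strtolower($link_scheme) === 'https') ? 1 : 0;

    // out: the href names a host and that host differs from the page's host.
    // Relative links have no host, so they are treated as internal (assumption).
    $out = (is_string($link_host) && is_string($page_host)
            && strcasecmp($link_host, $page_host) !== 0) ? 1 : 0;

    return array('secure' => $secure, 'out' => $out);
}

// Example call using the target URL from the script
$flags = classify_link(
    'https://secure.example.com/login',
    'http://www.support-focus.com/customer-service-software.html'
);
echo "secure={$flags['secure']} out={$flags['out']}";
```

The two values could then replace the hard-coded 0,0 in the INSERT query.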
So what would you recommend I do in order to find
these <a href="https:// links?
Thanks for helping.
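
For reference, one approach that might work for picking out only the https anchors is an XPath query on the same DOMDocument. The sketch below uses DOMXPath with the XPath 1.0 starts-with() function; the $html string is a made-up stand-in for the $page fetched by cURL, and the match is case-sensitive, so an href written as "HTTPS://" would be missed.

```php
<?php
// Stand-in HTML for illustration; in the script above this would be $page.
$html = '<html><body>'
      . '<a href="https://example.com/login">Login</a>'
      . '<a href="http://example.com/about">About</a>'
      . '<a href="/contact.html">Contact</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select only <a> elements whose href begins with "https://"
$secure_links = array();
foreach ($xpath->query('//a[starts-with(@href, "https://")]') as $node) {
    $secure_links[] = $node->getAttribute('href');
}

print_r($secure_links);
```

With the sample HTML, only the first link ends up in $secure_links. Looping over every <a> and checking the scheme per link is an alternative when the non-https links need to be stored as well.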