I have been reading about cURL, and it seems that using
cURL may be faster or "better" than using
fopen()/fread() or file_get_contents($site).

I have written some code that should grab the web page and list the links,
but it is outputting nothing.

Here is my code:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.my-site.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($ch);
curl_close($ch);

if (preg_match_all("/<a href=\"http:\"(.*?)\".*?>(.*?)<\/a>/i",$page,$matches) ) { 
  print_r($matches); 
  }
?>

I am not sure that I have done the pattern correctly.

Is it OK to use the string output $page in this way?

When I ran this script I got zero output.

Any ideas what I have done wrong?

    Start by verifying that you have anything in $page:

    echo $page;

    Also, why not make life easier for yourself (and, incidentally, for anyone reading your code)? This way you don't need to escape a single character:

    // Don't use / as opening and closing delimiter for the pattern if you need to use / inside the pattern.
    // Also, use single quotes to enclose the pattern if it needs to contain double quotes.
    $pattern = '#<a href="http:"(.*?)".*?>(.*?)</a>#i';
    

    And you most likely don't want the second " in your pattern, i.e.:

    $pattern = '#<a href="http:(.*?)".*?>(.*?)</a>#i';
    

      Your title and question imply that you don't know if curl is returning the web page or not. So why not just echo $page (or even just "echo strlen($page);")? That would clear that up right away.

      curl is an excellent tool, especially since it has many configuration/action options, can be used for multiple simultaneous downloads (curl_multi_*), can easily be used with several protocols, and so forth. For simply downloading the content of one or a couple of web pages, file_get_contents() works just as well; the difference in speed in that case, if any, is negligible.
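
      For comparison, the file_get_contents() version of the same fetch is a one-liner (a sketch, assuming allow_url_fopen is enabled in php.ini, which it is by default):

      $page = file_get_contents('http://www.example.com/');
      if ($page === false) {
          echo 'Download failed';
      }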

      Your regex looks like (I may be wrong) it's meant to grab links from the page. If so, using the DOM extension may be a better fit.

      BTW, if "my-site.com" is meant to be fictitious, it isn't; "example.com" is the standard "example" website.

        Thanks for your input.

        I have changed my code to this:

        require("my_functions.php");

        <?php
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, 'http://www.expert-world.com');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $page = curl_exec($ch);
        curl_close($ch);
        
        echo $page;
        
        $pattern = '#<a href="http:(.*?)".*?>(.*?)</a>#i'; 
        
        if (preg_match_all($pattern,$page,$matches) ) { 
          print_r($matches); 
          }
         

        I get no output from $page, so I guess my cURL call is wrong?

        I have used the pattern variable as suggested.

          Your curl code works fine for me. Try file_get_contents() and see what happens.

            OK

            I accidentally had two "<?php" tags in the script!

            It returns the page ok now.

            I now get this as the output from my
            print_r($matches);

            Array
            (
                [0] => Array
                    (
                        [0] => Site map
                        [1] => Bookmark and Share
                        [2] => Expert-World.com
                    )

                [1] => Array
                    (
                        [0] => //www.expert-world.com/sitemap.html
                        [1] => //www.addthis.com/bookmark.php?v=20
                        [2] => //www.expert-world.com/index.php
                    )

                [2] => Array
                    (
                        [0] => Site map
                        [1] => Bookmark and Share
                        [2] => Expert-World.com
                    )

            )

            The links are there, but so are the link names. I now want to insert just the links into a database table.

            I guess I use explode() to extract them?
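
            As an aside, explode() shouldn't be needed here: preg_match_all() has already separated the captures, so $matches[1] on its own is the list of URLs:

            foreach ($matches[1] as $url) {
                echo $url, '<br>'; // each entry is a bare link, ready for the database
            }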

            Your regex looks like (I may be wrong) it's meant to grab links from the page. If so, using the DOM extension may be a better fit.

            That sounds interesting. How would I use the DOM extension to get the links?

              Try this:

              $page = strstr($page, '<!');   // trim anything before the doctype
              $dom = new DOMDocument();
              $dom->loadHTML($page);         // parse the fetched HTML
              foreach ($dom->getElementsByTagName('a') as $node) {
                  // collect the href attribute of every anchor
                  $link_array[] = $node->attributes->getNamedItem('href')->value;
              }
              
              echo '<pre>';
              print_r($link_array);
              echo '</pre>';

              Edit: To get just the non-relative links, change the foreach loop to something like this:

              foreach ($dom->getElementsByTagName('a') as $node) {
                  $link = $node->attributes->getNamedItem('href')->value;
                  if (strpos($link, 'http') === 0) {
                      $link_array[] = $link;
                  }
              }
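
              One caveat worth adding (a general DOMDocument gotcha rather than anything specific to this page): loadHTML() emits warnings on the sloppy markup most real sites serve. They can be collected quietly like this:

              libxml_use_internal_errors(true); // keep parse warnings out of the output
              $dom = new DOMDocument();
              $dom->loadHTML($page);
              libxml_clear_errors();            // discard the collected warnings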

                Thanks for replying,

                I tried your suggestion but I get no output. :o

                This is what I have

                $ch = curl_init();
                curl_setopt($ch, CURLOPT_URL, 'http://www.expert-world.com');
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                $page = curl_exec($ch);
                curl_close($ch);
                
                //echo $page;
                
                
                $page = strstr($page, '<!');
                $dom = new DOMDocument();
                $dom->loadHTML($page);
                
                foreach ($dom->getElementsByTagName('a') as $node) {
                    $link = $node->attributes->getNamedItem('href')->value;
                    if (strpos($link, 'http') === 0) {
                        $link_array[] = $link;
                    }
                }
                
                echo '<pre>';
                print_r($link_array);
                echo '</pre>'; 
                

                With the echo $page line not commented out, I get the page displayed
                but nothing more. With the code as above, I get nothing.

                  Well, it works perfectly for me. With that exact code I get this output:

                  Array
                  (
                      [0] => http://www.addthis.com/bookmark.php?v=20
                      [1] => http://www.expert-world.com/index.php
                  )

                  BTW, for this I'd use file_get_contents(), curl being a bit of overkill.

                    Thanks for your comment.

                    I don't know why I do not see any result 🙁

                    What do you mean by it being overkill?

                    Am I using up unnecessary resources?

                    As I am new to curl, I wanted to get
                    things working using it. But I still don't know when
                    I should use fopen() plus fread(),
                    when file_get_contents(), and when curl 🙁
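
                    For reference, the fopen() plus fread() route is just a longer spelling of the same GET request. A sketch, again assuming allow_url_fopen is enabled in php.ini:

                    $fp = fopen('http://www.example.com/', 'r');
                    if ($fp) {
                        $page = '';
                        while (!feof($fp)) {
                            $page .= fread($fp, 8192); // read 8 KB at a time
                        }
                        fclose($fp);
                    }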

                      I just turned the error reporting on, and
                      from this code:

                      require("my_functions.php");
                      
                      echo "Starting:<br>";
                      
                      $ch = curl_init();
                      curl_setopt($ch, CURLOPT_URL, 'http://www.expert-world.com');
                      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                      $page = curl_exec($ch);
                      curl_close($ch);
                      
                      //echo $page;
                      
                      
                      $page = strstr($page, '<!');
                      $dom = new DOMDocument();
                      $dom->loadHTML($page);
                      
                      foreach ($dom->getElementsByTagName('a') as $node) {
                          $link = $node->attributes->getNamedItem('href')->value;
                          if (strpos($link, 'http') === 0) {
                              $link_array[] = $link;
                          }
                      }
                      
                      echo '<pre>';
                      print_r($link_array);
                      echo '</pre>'; 
                      
                      echo "Finished<br>";
                      
                      ?>
                      

                      I get this output:

                      Starting:

                      Fatal error: Class 'DOMDocument' not found in /home/tolly/public_html/con1.php on line 16

                      The manual says this:

                      There is no installation needed to use these functions; they are part of the PHP core.

                      Maybe my php.ini file needs changing in order to allow use of the DOMDocument class?

                        My calling it overkill was itself overkill. What I meant was that you're just making a simple GET request, no more than what file_get_contents() does (if you don't pass the optional parameters). One's as good as the other here, I guess.

                        The only reason I can think of for DOMDocument not being found is that you are using PHP 4.x, but I'm sure you wouldn't do that. So I'm at a loss.
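
                        One quick way to narrow it down is to ask PHP itself which version is running and whether the dom extension is loaded:

                        echo 'PHP ', phpversion(), '<br>';
                        var_dump(extension_loaded('dom')); // false means the dom (php-xml) extension isn't enabled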

                          OK - I got php-xml loaded and running.

                          Now this little script is working OK,
                          but I want to use the same method to sift out the https
                          URLs from the normal http ones.

                          So I need to find: <a href="https://

                          My script so far is :

                          require("my_functions.php");
                          
                          $target_url = "http://www.support-focus.com/customer-service-software.html";
                          $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
                          
                          echo "<br>Starting<br>Target_url: $target_url<br><br>";
                          
                          // make the cURL request to $target_url
                          $ch = curl_init();
                          curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
                          curl_setopt($ch, CURLOPT_URL,$target_url);
                          curl_setopt($ch, CURLOPT_FAILONERROR, true);
                          curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                          curl_setopt($ch, CURLOPT_AUTOREFERER, true);
                          curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
                          curl_setopt($ch, CURLOPT_TIMEOUT, 10);
                          $page= curl_exec($ch);
                          if (!$page) {
                          	echo "<br />cURL error number:" .curl_errno($ch);
                          	echo "<br />cURL error:" . curl_error($ch);
                          	exit;
                          }
                          
                          // parse the html into a DOMDocument
                          $doc = new DOMDocument();
                          $doc->loadHTML($page);
                          
                          //echo $doc->saveHTML();
                          
                          $links = $doc->getElementsByTagName('a'); // find the a hrefs
                          $k = 0;
                          foreach ($links as $link) // go through each anchor one by one
                          {
                              $the_link = $links->item($k)->getAttribute('href');
                          
                              $query = "INSERT INTO pages ( page_url, link, out, secure )
                                        VALUES ('$target_url', '$the_link', 0, 0 )";
                              mysql_query($query) or die('Error, insert query failed');
                          
                              echo "<br>4) Link stored: $the_link";
                              $k++;
                          }

                          The field labeled out should show whether the link is out-bound
                          or internal (not sure how to do that yet).

                          And the secure field is to show whether it is https or not.

                          So what would you recommend in order to find
                          these: <a href="https://

                          Thanks for helping.
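
                          Not a definitive recipe, but one way to sketch it with parse_url(): treat a scheme of https as secure, and any host that differs from the target's as out-bound (an href with no host at all is relative, hence internal). Two assumptions here: mysql_real_escape_string() needs the connection presumably opened in my_functions.php, and `out` is a reserved word in MySQL, so the column names are backticked to be safe.

                          $target_host = parse_url($target_url, PHP_URL_HOST);

                          foreach ($doc->getElementsByTagName('a') as $link) {
                              $the_link = $link->getAttribute('href');
                              $scheme   = parse_url($the_link, PHP_URL_SCHEME);
                              $host     = parse_url($the_link, PHP_URL_HOST);

                              $secure = ($scheme === 'https') ? 1 : 0;             // https link?
                              $out    = ($host && $host !== $target_host) ? 1 : 0; // different host, so out-bound

                              $query = "INSERT INTO pages ( `page_url`, `link`, `out`, `secure` )
                                        VALUES ('" . mysql_real_escape_string($target_url) . "',
                                                '" . mysql_real_escape_string($the_link) . "',
                                                $out, $secure )";
                              mysql_query($query) or die('Error, insert query failed');
                          }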
