I have been reading about cURL, and it seems that using
cURL may be faster or "better" than using
fopen()/fread() or file_get_contents($site).

I have written some code that should grab the web page and list the links,
but it is outputting nothing.

Here is my code:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.my-site.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($ch);
curl_close($ch);

if (preg_match_all("/<a href=\"http:\"(.*?)\".*?>(.*?)<\/a>/i",$page,$matches) ) { 
  print_r($matches); 
  }
?>

I am not sure that I have done the pattern correctly.

Is it OK to use the string output $page in this way?

When I ran this script I got zero output.

Any ideas what I have done wrong?

    Start by verifying that you have anything in $page:

    echo $page;

    Also, why not make life easier for yourself (and, incidentally, for anyone reading your code)? This way you don't need to escape a single character:

    // Don't use / as opening and closing delimiter for the pattern if you need to use / inside the pattern.
    // Also, use single quotes to enclose the pattern if it needs to contain double quotes.
    $pattern = '#<a href="http:"(.*?)".*?>(.*?)</a>#i';
    

    And you most likely don't want the second " in your pattern, i.e.:

    $pattern = '#<a href="http:(.*?)".*?>(.*?)</a>#i';
    

      Your title and question imply that you don't know if curl is returning the web page or not. So why not just echo $page (or even just "echo strlen($page);")? That would clear that up right away.

      curl is an excellent tool, especially since it has many configuration/action options, can be used for multiple simultaneous downloads (curl_multi_*), can easily be used with several protocols, and so forth. For simply downloading the content of one or a couple of web pages, file_get_contents() works just as well; the difference in speed in that case, if any, is negligible.
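
      For comparison, the file_get_contents() version of the same fetch is a one-liner (a sketch, assuming allow_url_fopen is enabled in php.ini, which it is by default):

      $page = file_get_contents('http://www.example.com/');
      if ($page === false) {
          echo 'Download failed';
      }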

      Your regex looks like (I may be wrong) it's meant to grab links from the page. If so, using the DOM extension may be a better fit.

      BTW, if "my-site.com" is meant to be fictitious, it isn't; "example.com" is the standard "example" website.

        Thanks for your input.

        I have changed my code to this:

        require("my_functions.php");

        <?php
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, 'http://www.expert-world.com');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $page = curl_exec($ch);
        curl_close($ch);
        
        echo $page;
        
        $pattern = '#<a href="http:(.*?)".*?>(.*?)</a>#i'; 
        
        if (preg_match_all($pattern,$page,$matches) ) { 
          print_r($matches); 
          }
         

        I get no output from $page, so I guess my cURL call is wrong?

        I have used the pattern variable as suggested.

          Your curl code works fine for me. Try file_get_contents() and see what happens.

            OK

            I accidentally had two "<?php" tags in the script!

            It returns the page ok now.

            I now get this as the output from my
            print_r($matches);

            Array
            (
                [0] => Array
                    (
                        [0] => Site map
                        [1] => Bookmark and Share
                        [2] => Expert-World.com
                    )

                [1] => Array
                    (
                        [0] => //www.expert-world.com/sitemap.html
                        [1] => //www.addthis.com/bookmark.php?v=20
                        [2] => //www.expert-world.com/index.php
                    )

                [2] => Array
                    (
                        [0] => Site map
                        [1] => Bookmark and Share
                        [2] => Expert-World.com
                    )

            )

            The links are there, but so are the link names. I now want to insert just the links into a database table.

            I guess I use explode() to extract them?
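
            As an aside, explode() shouldn't be needed here: preg_match_all() has already separated the captures, so $matches[1] on its own is the list of URLs:

            foreach ($matches[1] as $url) {
                echo $url, '<br>'; // each entry is a bare link, ready for the database
            }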

            Your regex looks like (I may be wrong) it's meant to grab links from the page. If so, using the DOM extension may be a better fit.

            That sounds interesting. How would I use the DOM extension to get the links?

              Try this:

              $page = strstr($page, '<!');   // trim anything before the doctype
              $dom = new DOMDocument();
              $dom->loadHTML($page);         // parse the fetched HTML
              foreach ($dom->getElementsByTagName('a') as $node) {
                  // collect the href attribute of every anchor
                  $link_array[] = $node->attributes->getNamedItem('href')->value;
              }
              
              echo '<pre>';
              print_r($link_array);
              echo '</pre>';

              Edit: To get just the non-relative links, change the foreach loop to something like this:

              foreach ($dom->getElementsByTagName('a') as $node) {
                  $link = $node->attributes->getNamedItem('href')->value;
                  if (strpos($link, 'http') === 0) {
                      $link_array[] = $link;
                  }
              }
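
              One caveat worth adding (a general DOMDocument gotcha rather than anything specific to this page): loadHTML() emits warnings on the sloppy markup most real sites serve. They can be collected quietly like this:

              libxml_use_internal_errors(true); // keep parse warnings out of the output
              $dom = new DOMDocument();
              $dom->loadHTML($page);
              libxml_clear_errors();            // discard the collected warnings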

                Thanks for replying,

                I tried your suggestion but I get no output. :o

                This is what I have

                $ch = curl_init();
                curl_setopt($ch, CURLOPT_URL, 'http://www.expert-world.com');
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                $page = curl_exec($ch);
                curl_close($ch);
                
                //echo $page;
                
                
                $page = strstr($page, '<!');
                $dom = new DOMDocument();
                $dom->loadHTML($page);
                
                foreach ($dom->getElementsByTagName('a') as $node) {
                    $link = $node->attributes->getNamedItem('href')->value;
                    if (strpos($link, 'http') === 0) {
                        $link_array[] = $link;
                    }
                }
                
                echo '<pre>';
                print_r($link_array);
                echo '</pre>'; 
                

                With the echo $page line not commented out, I get the page displayed
                but nothing more. With the code as above, I get nothing.

                  Well, it works perfectly for me. With that exact code I get this output:

                  Array
                  (
                      [0] => http://www.addthis.com/bookmark.php?v=20
                      [1] => http://www.expert-world.com/index.php
                  )

                  BTW, for this I'd use file_get_contents(), curl being a bit of overkill.

                    Thanks for your comment.

                    I don't know why I do not see any result 🙁

                    What do you mean by it being overkill?

                    Am I using up unnecessary resources?

                    As I am new to curl, I wanted to get
                    things working using it. But I still don't know when
                    I should use fopen() plus fread(),
                    when file_get_contents(), and when curl 🙁
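
                    For reference, the fopen() plus fread() route is just a longer spelling of the same GET request. A sketch, again assuming allow_url_fopen is enabled in php.ini:

                    $fp = fopen('http://www.example.com/', 'r');
                    if ($fp) {
                        $page = '';
                        while (!feof($fp)) {
                            $page .= fread($fp, 8192); // read 8 KB at a time
                        }
                        fclose($fp);
                    }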

                      I just turned the error reporting on, and
                      from this code:

                      require("my_functions.php");
                      
                      echo "Starting:<br>";
                      
                      $ch = curl_init();
                      curl_setopt($ch, CURLOPT_URL, 'http://www.expert-world.com');
                      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                      $page = curl_exec($ch);
                      curl_close($ch);
                      
                      //echo $page;
                      
                      
                      $page = strstr($page, '<!');
                      $dom = new DOMDocument();
                      $dom->loadHTML($page);
                      
                      foreach ($dom->getElementsByTagName('a') as $node) {
                          $link = $node->attributes->getNamedItem('href')->value;
                          if (strpos($link, 'http') === 0) {
                              $link_array[] = $link;
                          }
                      }
                      
                      echo '<pre>';
                      print_r($link_array);
                      echo '</pre>'; 
                      
                      echo "Finished<br>";
                      
                      ?>
                      

                      I get this output:

                      Starting:

                      Fatal error: Class 'DOMDocument' not found in /home/tolly/public_html/con1.php on line 16

                      The manual says this:

                      There is no installation needed to use these functions; they are part of the PHP core.

                      Maybe my php.ini file needs changing in order to allow use of the DOMDocument class?

                        My calling it overkill was itself overkill. What I meant was that you're just making a simple GET request, no more than what file_get_contents() does (if you don't pass the optional parameters). One's as good as the other here, I guess.

                        The only reason I can think of for DOMDocument not being found is that you are using PHP 4.x, but I'm sure you wouldn't do that. So I'm at a loss.
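
                        One quick way to narrow it down is to ask PHP itself which version is running and whether the dom extension is loaded:

                        echo 'PHP ', phpversion(), '<br>';
                        var_dump(extension_loaded('dom')); // false means the dom (php-xml) extension isn't enabled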

                          OK - I got php-xml loaded and running.

                          Now this little script is working OK,
                          but I want to use the same method to sift out the https
                          URLs from the normal http ones.

                          So I need to find: <a href="https://

                          My script so far is :

                          require("my_functions.php");
                          
                          $target_url = "http://www.support-focus.com/customer-service-software.html";
                          $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
                          
                          echo "<br>Starting<br>Target_url: $target_url<br><br>";
                          
                          // make the cURL request to $target_url
                          $ch = curl_init();
                          curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
                          curl_setopt($ch, CURLOPT_URL,$target_url);
                          curl_setopt($ch, CURLOPT_FAILONERROR, true);
                          curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                          curl_setopt($ch, CURLOPT_AUTOREFERER, true);
                          curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
                          curl_setopt($ch, CURLOPT_TIMEOUT, 10);
                          $page= curl_exec($ch);
                          if (!$page) {
                          	echo "<br />cURL error number:" .curl_errno($ch);
                          	echo "<br />cURL error:" . curl_error($ch);
                          	exit;
                          }
                          
                          // parse the html into a DOMDocument
                          $doc = new DOMDocument();
                          $doc->loadHTML($page);
                          
                          //echo $doc->saveHTML();
                          
                          $links = $doc->getElementsByTagName('a'); // find the a hrefs
                          $k = 0;
                          foreach ($links as $link) // go through each anchor one by one
                          {
                              $the_link = $links->item($k)->getAttribute('href');
                          
                              $query = "INSERT INTO pages ( page_url, link, out, secure )
                                        VALUES ('$target_url', '$the_link', 0, 0 )";
                              mysql_query($query) or die('Error, insert query failed');
                          
                              echo "<br>4) Link stored: $the_link";
                              $k++;
                          }

                          The field labeled out should show whether the link is out-bound
                          or internal (not sure how to do that yet).

                          And the secure field is to show whether it is https or not.

                          So what would you recommend in order to find
                          these: <a href="https://

                          Thanks for helping.
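
                          Not a definitive recipe, but one way to sketch it with parse_url(): treat a scheme of https as secure, and any host that differs from the target's as out-bound (an href with no host at all is relative, hence internal). Two assumptions here: mysql_real_escape_string() needs the connection presumably opened in my_functions.php, and `out` is a reserved word in MySQL, so the column names are backticked to be safe.

                          $target_host = parse_url($target_url, PHP_URL_HOST);

                          foreach ($doc->getElementsByTagName('a') as $link) {
                              $the_link = $link->getAttribute('href');
                              $scheme   = parse_url($the_link, PHP_URL_SCHEME);
                              $host     = parse_url($the_link, PHP_URL_HOST);

                              $secure = ($scheme === 'https') ? 1 : 0;             // https link?
                              $out    = ($host && $host !== $target_host) ? 1 : 0; // different host, so out-bound

                              $query = "INSERT INTO pages ( `page_url`, `link`, `out`, `secure` )
                                        VALUES ('" . mysql_real_escape_string($target_url) . "',
                                                '" . mysql_real_escape_string($the_link) . "',
                                                $out, $secure )";
                              mysql_query($query) or die('Error, insert query failed');
                          }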
