I want to do a simple extraction of the
three key header elements from a web page, namely these:

<title>This the Title</title>
<meta name="keywords" content="PHP, javascript, other keywords" />
<meta name="description" content="This is the description." />

Is the preg_match() function the best way to find them and put them
into variables?

If they are not found on the web page, I would like to fill the relevant variable
with "Not found".

I have written this code, but I am not sure if it is the best approach
or if the logic is correct.

 
$title = preg_match("/<title>(.*?)</title>/",$text,$matches);
if ($title === false) {
   $title = "None found";
   }

$descrip = preg_match("/<meta name=\"description\" content=\"(.*?)\"/",$text,$matches);
if ($descrip === false) {
   $descrip = "None found";
   }

$keys = preg_match("/<meta name=\"keywords\" content=\"(.*?)\"/",$text,$matches);
if ($keys === false) {
   $keys = "None found";
   }

Any suggestions or corrections are most welcome. 🙂

    Could use some error-checking and such, but should give you an alternative idea:

    $text = file_get_contents('url or path to web page');
    $data = array(); // initialize so print_r() below works even if nothing is found
    $dom = new DOMDocument();
    $dom->loadHTML($text);
    $title = $dom->getElementsByTagName('title');
    if($title->length > 0)
    {
       $data['title'] = $title->item(0)->textContent;
    }
    $metaTags = $dom->getElementsByTagName('meta');
    foreach($metaTags as $meta)
    {
       if($name = $meta->getAttribute('name'))
       {
          switch($name)
          {
             case 'keywords':
                $data['keywords'] = $meta->getAttribute('content');
                break;
             case 'description':
                $data['description'] = $meta->getAttribute('content');
                break;
          }
       }
    }
    echo "<pre>" . print_r($data, 1) . "</pre>";
    

      Yes, that's a good idea.

      Is DOMDocument likely to work a bit faster than
      a regex?

      And I guess if I initialize my values with "Not found", then I
      won't have to worry about their possible non-existence.

      Can I make a more direct connection like:

      $title = $dom->getElementsByTagName('title');
      $title =$title[0]->textContent;
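A minimal sketch of both ideas, not the original poster's code: pre-filling the values with "Not found" does remove the need for later existence checks, and `item(0)` is the documented way to read the first node of a `DOMNodeList` (it returns `null` when the list is empty). The function name `extractHeaderInfo` and the sample markup are invented for illustration, and PHP 7 or later is assumed because of the type declarations.

```php
<?php
// Hypothetical helper: returns title, keywords and description,
// each defaulting to "Not found" when the page lacks that element.
function extractHeaderInfo(string $html): array
{
    $data = [
        'title'       => 'Not found',
        'keywords'    => 'Not found',
        'description' => 'Not found',
    ];
    if ($html === '') {
        return $data; // loadHTML() rejects an empty string
    }

    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $data;
    }

    // item(0) is null when there is no <title> element.
    $first = $dom->getElementsByTagName('title')->item(0);
    if ($first !== null) {
        $data['title'] = trim($first->textContent);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'keywords' || $name === 'description') {
            $data[$name] = $meta->getAttribute('content');
        }
    }
    return $data;
}

// Invented sample page with no description tag.
$sample = '<html><head><title>This is the Title</title>'
        . '<meta name="keywords" content="PHP, javascript" />'
        . '</head><body></body></html>';
$info = extractHeaderInfo($sample);
print_r($info);
```

With this sample, the description entry keeps its "Not found" default because the page has no matching meta tag.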
      

        I changed my original code and got rid of the errors,
        but I am still not picking up any content.

        This is what I have:

        $title = "None found";
        $descrip = "None found";
        $keys = "None found";
        
        $flag = preg_match("/<title>(.*?)<\/title>/",$text,$matches);
         if ($flag == 1) {
            $title = $matches[0];
            }
        
        $flag = preg_match("/<meta name=\"description\" content=\"(.*?)\"/",$text,$matches);
         if ($flag == 1) {
            $descrip = $matches[0];
            }
        
        $flag = preg_match("/<meta name=\"keywords\" content=\"(.*?)\"/",$text,$matches);
         if ($flag == 1) {
            $keys = $matches[0];
            }
        
        echo "<br>Title: $title<br>Descrip: $descrip<br>Keys: $keys";
        

        Of course my output is:

        Title: None found
        Descrip: None found
        Keys: None found

        Any ideas?
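Two things stand out in the code above, offered as a sketch rather than a definitive diagnosis. `$matches[0]` holds the whole matched string, tags included, while the captured text is in `$matches[1]`. And when all three patterns fail, plausible causes are that `$text` is empty, or that the page's markup differs from the literal pattern in letter case, whitespace or line breaks. The version below uses `#` as the delimiter and the `i` and `s` modifiers to tolerate those differences; the sample `$text` is invented.

```php
<?php
// Invented sample page; in practice $text would come from file_get_contents().
$text = "<HTML><head>\n<TITLE>\nThis is the Title</TITLE>\n"
      . "<meta name=\"description\"  content=\"This is the description.\" />\n"
      . "</head></HTML>";

$title = $descrip = $keys = "None found";

// '#' as delimiter avoids escaping '/';
// i = case-insensitive, s = '.' also matches newlines.
if (preg_match('#<title[^>]*>(.*?)</title>#is', $text, $matches) === 1) {
    $title = trim($matches[1]);   // [1] is the captured group, [0] the whole match
}

if (preg_match('#<meta\s+name="description"\s+content="(.*?)"#is', $text, $matches) === 1) {
    $descrip = $matches[1];
}

if (preg_match('#<meta\s+name="keywords"\s+content="(.*?)"#is', $text, $matches) === 1) {
    $keys = $matches[1];
}

echo "Title: $title\nDescrip: $descrip\nKeys: $keys\n";
```

These patterns still assume that `name` comes before `content` and that both use double quotes, so a page written differently will fall through to "None found".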

          habbardone;10933189 wrote:

          Yes,
          Is the DOMDocument() likely to work a bit faster than
          a regex ?

          I think the DOM method might be slower, but for extracting HTML tag info it really is a better choice than regex, IMO.
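For anyone who wants to measure the difference rather than guess, here is a rough timing sketch. The sample markup, the run count and the variable names are all invented, and the numbers will vary with the machine and the size of the page, so treat the output as indicative only.

```php
<?php
// Invented sample markup; substitute a real page for meaningful numbers.
$html = '<html><head><title>This is the Title</title></head><body></body></html>';
$runs = 1000;

// Keep libxml parser warnings out of the output during the DOM loop.
libxml_use_internal_errors(true);

// Time the regex approach.
$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    $m = [];
    preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m);
    $regexTitle = $m[1] ?? 'Not found';
}
$regexSeconds = microtime(true) - $start;

// Time the DOMDocument approach, including parsing the page each run.
$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $node = $dom->getElementsByTagName('title')->item(0);
    $domTitle = $node !== null ? $node->textContent : 'Not found';
}
$domSeconds = microtime(true) - $start;
libxml_clear_errors();

printf("regex: %.4f s, DOM: %.4f s for %d runs\n", $regexSeconds, $domSeconds, $runs);
```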
