Need a little help with this preg_match()

habbardone

I want to do a simple extraction of the
three key header elements from a web page, namely these:

<title>This the Title</title>
<meta name="keywords" content="PHP, javascript, other keywords" />
<meta name="description" content="This is the description." />

Is the preg_match() function the best way to find them and put them
into variables?

If they are not found on the web page I would like to fill the relevant variable
with "Not found".

I have written this code but I am not sure if it is the best approach
or if the logic is correct.

 
$title = preg_match("/<title>(.*?)</title>/",$text,$matches);
if ($title === false) {
   $title = "None found";
   }

$descrip = preg_match("/<meta name=\"description\" content=\"(.*?)\"/",$text,$matches);
if ($descrip === false) {
   $descrip = "None found";
   }

$keys = preg_match("/<meta name=\"keywords\" content=\"(.*?)\"/",$text,$matches);
if ($keys === false) {
   $keys = "None found";
   }

Any suggestions or corrections are most welcome. 🙂

NogDog

Could use some error-checking and such, but should give you an alternative idea:

$text = file_get_contents('url or path to web page');
$data = array(); // collected values; stays empty if nothing is found
$dom = new DOMDocument();
$dom->loadHTML($text);
$title = $dom->getElementsByTagName('title');
if($title->length > 0)
{
   $data['title'] = $title->item(0)->textContent;
}
$metaTags = $dom->getElementsByTagName('meta');
foreach($metaTags as $meta)
{
   if($name = $meta->getAttribute('name'))
   {
      switch($name)
      {
         case 'keywords':
            $data['keywords'] = $meta->getAttribute('content');
            break;
         case 'description':
            $data['description'] = $meta->getAttribute('content');
            break;
      }
   }
}
echo "<pre>" . print_r($data, 1) . "</pre>";
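
For what it's worth, here is a rough sketch of the same idea with some of that error-checking filled in. The function name extractHeadInfo, the "Not found" defaults, and the case-insensitive comparison of meta names are my own assumptions for illustration, not anything required by DOMDocument:

```php
<?php
// Hypothetical helper: the name and the "Not found" defaults are illustrative only.
function extractHeadInfo($html)
{
    // Start with defaults so every key exists even when a tag is missing.
    $data = array(
        'title'       => 'Not found',
        'keywords'    => 'Not found',
        'description' => 'Not found',
    );

    if (trim($html) === '') {
        return $data; // loadHTML() rejects empty input
    }

    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return $data;
    }

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // Meta names are compared case-insensitively here (an assumption about the input).
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'keywords' || $name === 'description') {
            $data[$name] = $meta->getAttribute('content');
        }
    }

    return $data;
}

// Sample input based on the tags in the first post (description left out on purpose).
$sample = '<html><head><title>This the Title</title>'
        . '<meta name="keywords" content="PHP, javascript, other keywords" />'
        . '</head><body></body></html>';

print_r(extractHeadInfo($sample));
```

Because the defaults are set up front, the calling code never has to check whether a key exists; a missing tag simply leaves its "Not found" value in place.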

habbardone

Yes, that's a good idea.

Is the DOMDocument() likely to work a bit faster than
a regex?

And I guess if I initialize my values with "Not found" then I
won't have to worry about their possible non-existence.

Can I make a more direct connection like:

$title = $dom->getElementsByTagName('title');
$title =$title[0]->textContent;

habbardone

I changed my original code and got rid of the errors,
but I am still not picking up any content.

This is what I have:

$title = "None found";
$descrip = "None found";
$keys = "None found";

$flag = preg_match("/<title>(.*?)<\/title>/",$text,$matches);
 if ($flag == 1) {
    $title = $matches[0];
    }

$flag = preg_match("/<meta name=\"description\" content=\"(.*?)\"/",$text,$matches);
 if ($flag == 1) {
    $descrip = $matches[0];
    }

$flag = preg_match("/<meta name=\"keywords\" content=\"(.*?)\"/",$text,$matches);
 if ($flag == 1) {
    $keys = $matches[0];
    }

echo "<br>Title: $title<br>Descrip: $descrip<br>Keys: $keys";

Of course my output is:

Title: None found
Descrip: None found
Keys: None found

Any ideas?

nrg_alpha

habbardone;10933189 wrote:
Yes,
Is the DOMDocument() likely to work a bit faster than
a regex ?

I think the DOM method might be slower, but for extracting HTML tag info it really is a better choice than regex, IMO.
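
That said, if you do want to stay with regex, a minimal sketch might look like the one below. The helper name firstCapture is hypothetical, and the patterns assume double-quoted attributes with name appearing before content, as in your sample tags; pages that order or quote their attributes differently would not match.

```php
<?php
// Hypothetical helper, for illustration only: returns capture group 1 or a default.
function firstCapture($pattern, $text, $default = 'None found')
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    if (preg_match($pattern, $text, $matches) === 1) {
        return trim($matches[1]); // [0] is the whole match, [1] the first (...) group
    }
    return $default;
}

// Sample input based on the tags in the first post (description left out on purpose).
$text = '<title>This the Title</title>'
      . '<meta name="keywords" content="PHP, javascript, other keywords" />';

// "#" as the delimiter avoids escaping the "/" in "</title>";
// "i" ignores case and "s" lets "." span newlines.
$title   = firstCapture('#<title[^>]*>(.*?)</title>#is', $text);
$keys    = firstCapture('#<meta\s+name="keywords"\s+content="(.*?)"#is', $text);
$descrip = firstCapture('#<meta\s+name="description"\s+content="(.*?)"#is', $text);

echo "Title: $title\nDescrip: $descrip\nKeys: $keys\n";
```

The main things to notice are that the captured text lives in $matches[1] rather than $matches[0], and that the return value of preg_match() is only a match count, not the matched text.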