please help me preg_replace!

tgmsocal

I have a document database that is essentially just a tree of directories that holds .htm and .html files. The search script opens up all files in a certain directory and searches each one for keywords, then returns scored results.

I want the script to display results like the following:

X. TITLE
DESCRIPTION

I need help pulling out the title and description from the .htm files.

Here is my current code, which isn't working:

$fp = fopen($file, 'r'); // the mode must be a quoted string
$fread = fread($fp, filesize($file));
fclose($fp);

$fc = preg_replace("/(<\/?)(\w+)([^>]*>)/e", "'\\1'.strtolower('\\2').'\\3'", $fread); // make html tags lowercase
$title = preg_replace("!.*?<title>(.*?)</title>.*?!is", "$1", $fc);  // get the title
$raw_body = preg_replace("!.*?<body.*?>(.*?)</body>.*?!is", "$1", $fc); // get body from file
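
Two things seem to stand in the way of the code above; treat the following as a hedged sketch, not a definitive fix. First, the /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so on current versions the tag-lowercasing step is normally written with preg_replace_callback. Second, the trailing lazy `.*?` in the title and body patterns is allowed to match nothing, so preg_replace leaves everything after `</title>` (or `</body>`) in the result. preg_match with a capture group returns only the captured text. The sample string in $fread below is made up for illustration.

```php
<?php
// Hypothetical sample input standing in for the file contents ($fread in the post).
$fread = '<HTML><HEAD><TITLE>My Page</TITLE></HEAD><BODY BGCOLOR="white"><P>Hello</P></BODY></HTML>';

// Lowercase the tag names only; attributes and text are left untouched.
$fc = preg_replace_callback(
    '/(<\/?)(\w+)([^>]*>)/',
    function ($m) {
        return $m[1] . strtolower($m[2]) . $m[3];
    },
    $fread
);

// preg_match captures just the text between the tags.
$title = preg_match('!<title>(.*?)</title>!is', $fc, $m) ? trim($m[1]) : '';
$raw_body = preg_match('!<body[^>]*>(.*?)</body>!is', $fc, $m) ? $m[1] : '';

echo $title, "\n";    // My Page
echo $raw_body, "\n"; // <p>Hello</p>
```

Since both preg_match patterns carry the i flag, the lowercasing step is optional for extraction and is only kept here to mirror the original code.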

I'm almost positive that the code is not in a logical order. I am teaching myself PHP and am having a bit of trouble with preg_match (is there a good tutorial somewhere that covers all the switches, etc.?)

Here is what I'm trying to get the code to do:

1) Open the file and read the entire contents - make all <> tags lowercase

2) Extract the text between <title> and </title> into $title

3) Extract the text between <body *> and </body> into $raw_body

4) Remove the HTML tags from $raw_body so it is now just plain text ($body)

After that I go ahead and search everything and format the output.
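
The four steps above can be sketched as one small function. This is a minimal sketch under stated assumptions: the function name extract_title_and_body and the sample file content are hypothetical, and file_get_contents stands in for the fopen/fread/fclose sequence. Because the patterns use the i flag, step 1's lowercasing is not needed for the extraction itself.

```php
<?php
// Hypothetical helper: returns array(title, plain-text body) for one HTML file.
function extract_title_and_body($file)
{
    // Step 1: read the whole file in one call.
    $html = file_get_contents($file);
    if ($html === false) {
        return array('', '');
    }

    // Step 2: title text (case-insensitive, may span lines).
    $title = preg_match('!<title[^>]*>(.*?)</title>!is', $html, $m) ? trim($m[1]) : '';

    // Step 3: everything between <body ...> and </body>.
    $raw_body = preg_match('!<body[^>]*>(.*?)</body>!is', $html, $m) ? $m[1] : '';

    // Step 4: drop the tags, decode entities, collapse runs of whitespace.
    $body = html_entity_decode(strip_tags($raw_body));
    $body = trim(preg_replace('/\s+/', ' ', $body));

    return array($title, $body);
}

// Usage with a throwaway file (made-up content).
$tmp = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($tmp, "<HTML><HEAD><TITLE>Help Page</TITLE></HEAD>\n<BODY BGCOLOR=\"white\"><P>Need <A HREF=\"x.htm\">more help</A> here.</P></BODY></HTML>");
list($title, $body) = extract_title_and_body($tmp);
unlink($tmp);

echo $title, "\n"; // Help Page
echo $body, "\n";  // Need more help here.
```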

If possible, I also need to know how to grab 75 characters on either side of a found term (for a description of about 150 characters). For example, if someone searches for "help" and "help" is found in $body, I need the 75 characters before "help" and the 75 characters after it for the description.
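
One way to do this, sketched under assumptions: find the term with stripos, then cut a window around it with substr. The function name excerpt_around and the $radius parameter are hypothetical, and the code counts bytes, so multi-byte text would need the mb_ equivalents.

```php
<?php
// Hypothetical helper: returns up to $radius characters on each side of the
// first case-insensitive occurrence of $term in $body, or '' if not found.
function excerpt_around($body, $term, $radius = 75)
{
    $pos = stripos($body, $term);
    if ($pos === false) {
        return '';
    }
    // Clamp the start so a match near the beginning does not go negative.
    $start = max(0, $pos - $radius);
    $length = ($pos - $start) + strlen($term) + $radius;
    // substr stops at the end of the string if the window runs past it.
    return substr($body, $start, $length);
}

$body = str_repeat('a', 100) . 'help' . str_repeat('b', 100);
$desc = excerpt_around($body, 'help');
echo strlen($desc), "\n"; // 154: 75 before + 4 for "help" + 75 after
```

The description includes the term itself, so it comes out slightly longer than 150 characters; trim the radius if the limit is strict.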

Thanks for any and all help anyone can provide!

tgmsocal

Kudose

What about:

$title = eregi("<title>(.*)</title>", $source);
$title = strip_tags($title);

tgmsocal

That code will work for the title, but what about the body?

Thanks,

tgmsocal

Kudose

Do the same thing, but use the body tags instead of the title tags.

$body = eregi("<body(.*)</body>", $source);
$body = strip_tags($body); // if you don't want tags

tgmsocal

That didn't work. The title works, but all I can get the body eregi call to return is "1".

I tried both of the following:

$body = eregi("<body[.*]>(.*)</body>", $fc);
$body = strip_tags($body);

$body = eregi("<body(.*)</body>", $fc);
$body = strip_tags($body);

Neither one works.
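
The "1" is most likely eregi's return value, not the matched text: eregi reports whether (and how long) a match was, and only fills in the captured groups when an array is passed as its third argument. In addition, `[.*]` is a character class that matches a single "." or "*", not "any characters". The ereg family was deprecated in PHP 5.3 and removed in PHP 7, so a hedged sketch with preg_match is shown instead; the sample string is made up.

```php
<?php
// Made-up sample standing in for $fc from the post.
$fc = '<html><head><title>Demo</title></head><body bgcolor="white"><p>Some <b>text</b></p></body></html>';

// preg_match returns 1 or 0; the captured text lands in $matches.
$found = preg_match('!<body[^>]*>(.*)</body>!is', $fc, $matches);
$body = $found ? strip_tags($matches[1]) : '';

echo $found, "\n"; // 1
echo $body, "\n";  // Some text
```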

The following code works for the body, but wherever there is a link, it strips both the link tag and the linked text. Where did I go wrong?

$body_start = strpos($fc, "<body ");
$body_end = strpos($fc, "</body>");
$body = substr($fc, $body_start, ($body_end - $body_start));
$body = preg_replace("[<body .*>]", "", $body);
$body = preg_replace("[<.*>]", "", $body); // get rid of the html tags in body

I know the error is somewhere in the preg_replace patterns. I'm not good with Perl-style regular expressions, so perhaps that's where it is?
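
A likely cause, sketched with a made-up sample line: `.*` is greedy, so `<.*>` matches from the first "<" on a line to the last ">" on that line and takes the link text along with the tags. Using `[^>]*` makes each match stop at the end of its own tag; strip_tags does the same job without a regex.

```php
<?php
// Made-up one-line sample containing a link.
$body = '<p>See the <a href="faq.htm">FAQ page</a> for details.</p>';

// Greedy: .* runs from the first "<" to the last ">" on the line.
$greedy = preg_replace('[<.*>]', '', $body);

// Negated class: each match stops at the first ">" after its "<".
$per_tag = preg_replace('/<[^>]*>/', '', $body);

echo var_export($greedy, true), "\n"; // ''
echo $per_tag, "\n";                  // See the FAQ page for details.
```

Note that the square brackets in "[<.*>]" act as the pattern delimiters here, so the pattern that actually runs is `<.*>`.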

Thanks for the help so far! Is there a good tutorial somewhere on all the options for preg_replace as well as regular expressions? The php.net manual has very limited info on preg_replace's options and uses.

-- tgmsocal

bodzan

regexp tutorial

I've found it very handy for understanding the basics of regular expressions and how they work. You might find the solution to your problem there...