creating a global site search script

Anon

Suppose I have a website composed of static pages, CGI/Perl scripts, and two MySQL databases, both accessed through PHP.

I know how to create a search engine for a single database, but how would I create a site search that covers a combination of all of these?

Should I create a new database that catalogs everything on the site? If I do that, I would have a difficult time cataloging data from databases with totally different structures. And what if I decide to add a third database to the site?

Any ideas would be greatly appreciated. =)

Anon

There are two open source site search engines I know of: htdig (ht://Dig) and SWISH-E. I use htdig at work and am very happy with it. It indexes every part of my site fairly quickly: about 15 minutes for the whole site, which is about 2 GB of data.

Htdig lets you match on soundex, metaphone, prefix, endings, and a few other cool fuzzy-matching methods, and you can change the weighting of each method very easily in the config file (the search_algorithm attribute).

Anon

My approach would be this:

Index your static content, DB content, and Perl content all separately. Then, in your btree index, put a type on each document id. For example, document 1111 with type 3 means you look in a table of static pages for id 1111, which maps to a static page you have indexed, such as your_faq.html. Type 2 could be DB content, so you just do a join against the database using the info you have indexed. Perl content works the same way.
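Here is a rough sketch of that type-tagged lookup. The type constants, the resolve_hit() function, and the in-memory arrays standing in for the lookup tables are all made up for illustration; in a real setup each branch would query the matching table instead.

```php
<?php
// Hypothetical type codes; the post only fixes 3 = static page and 2 = DB content.
const TYPE_PERL   = 1;
const TYPE_DB     = 2;
const TYPE_STATIC = 3;

// Stand-ins for the real lookup tables (assumed sample data, keyed by document id).
$static_pages = [1111 => 'your_faq.html'];
$db_rows      = [42 => 'products row 42'];
$perl_scripts = [7 => '/cgi-bin/guestbook.pl'];

// Turn an index hit (type, id) into something the results page can link to.
function resolve_hit(int $type, int $id, array $static, array $db, array $perl): ?string
{
    switch ($type) {
        case TYPE_STATIC:
            return $static[$id] ?? null;
        case TYPE_DB:
            return $db[$id] ?? null;
        case TYPE_PERL:
            return $perl[$id] ?? null;
        default:
            return null; // unknown type, e.g. a source that is not registered yet
    }
}

echo resolve_hit(TYPE_STATIC, 1111, $static_pages, $db_rows, $perl_scripts), "\n";
```

With this layout, adding a third database should only mean picking a new type code and adding one more branch; the index itself does not need to know how that database is structured.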

In order to index static pages just do something like this:

<pre>
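<?php

// List the static HTML files in the document root (the path is a placeholder).
$static_files = glob('/path/to/htdocs/*.html');
foreach ($static_files as $id => $path) {
    // read it in
    $html = file_get_contents($path);
    if ($html === false) {
        continue; // skip files that cannot be read
    }
    // grab between <title></title> and throw it in the DB
    $title = preg_match('/<title>(.*?)<\/title>/is', $html, $m) ? trim($m[1]) : '';
    // throw the name in the DB and give it an id
    $name = basename($path);
    // throw the first 50 words in the db to be indexed
    $words = preg_split('/\s+/', trim(strip_tags($html)), -1, PREG_SPLIT_NO_EMPTY);
    $excerpt = implode(' ', array_slice($words, 0, 50));
    // INSERT ($id, $name, $title, $excerpt) into your table of static pages here
}

?>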
</pre>

Then you index the static pages from that table. I have been thinking about doing this on my company's site (TONS of static content, shitloads of DB content, etc.)
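As a rough idea of what indexing from that table could look like, here is a sketch that builds a word-to-document map from the stored excerpts. It uses a plain PHP array in place of a btree, and the sample rows, the TYPE_STATIC constant, and build_index() are assumptions for illustration only.

```php
<?php
// Assumed rows, shaped like what the static-page loop might have stored.
$rows = [
    ['id' => 1111, 'name' => 'your_faq.html', 'excerpt' => 'Frequently asked questions about shipping'],
    ['id' => 1112, 'name' => 'contact.html',  'excerpt' => 'Contact us with questions'],
];

const TYPE_STATIC = 3; // hypothetical type code for static pages

// Build a map of word => list of [type, id] postings.
function build_index(array $rows, int $type): array
{
    $index = [];
    foreach ($rows as $row) {
        // Lowercase the excerpt and split it on runs of non-word characters.
        $words = preg_split('/\W+/', strtolower($row['excerpt']), -1, PREG_SPLIT_NO_EMPTY);
        // Record each distinct word once per document.
        foreach (array_unique($words) as $word) {
            $index[$word][] = [$type, $row['id']];
        }
    }
    return $index;
}

$index = build_index($rows, TYPE_STATIC);
print_r($index['questions'] ?? []);
```

Running the same function over the DB and Perl content with their own type codes and merging the results would give one index that covers the whole site.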

--Joe
http://www.miester.org/Joe/Stump/Joe_Stump_resume.html