Hi vincent,
To me, the engine discussed in the "slapping together..." article simply looks like a re-implementation, at the SQL level, of what the fulltext feature already does internally.
Subwords are not matched and a lot of duplicate data is generated in the search_table. At least that's how it looks to me.
Why not implement it like this (just a rough idea, be gentle):
1) Find a fast algorithm that fits your needs and returns the relevancy of a text for a given (sub)word or phrase. (Phonetic search, Levenshtein distance, something custom; I don't know which.)
2) Now add two more tables (indexes not mentioned):
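CREATE TABLE searchwords (
  wordID BIGINT(20),       -- id of a search phrase processed in the past
  word VARCHAR(255)        -- the phrase itself
);
CREATE TABLE relevancy (
  wordID BIGINT(20),       -- references searchwords.wordID
  textID BIGINT(20),       -- id of the text being scored
  relevancy FLOAT(3,2)     -- score returned by the algorithm from 1)
);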
The table searchwords contains all search phrases that have been processed in the past.
The table relevancy links every text (which must be identified by a numeric id; otherwise adjust the table) to the searchwords it is relevant for.
3) When a new search is performed, check the phrase against table searchwords. If it exists (with $phrase properly escaped):
SELECT r.textID
FROM relevancy AS r
JOIN searchwords AS s ON s.wordID = r.wordID
WHERE s.word = '$phrase'
ORDER BY r.relevancy DESC
and you get all relevant textIDs ordered by relevance.
If the phrase does NOT exist apply the algorithm from 1) to every text and INSERT all relevancies into table relevancy and the phrase itself in searchwords.
4) When a new text is inserted match it against every entry in table searchwords using the algorithm from 1) and INSERT the appropriate entries in table relevancy.
5) When a given text is updated, delete all of its entries from table relevancy and treat it as a new text.
(end)
I'm aware that this is nothing new, and not even very different from the method in the article. But I think it eliminates some of the difficulties/problems of the discussed method.
The key to success appears to be the algorithm. If it is fast enough to handle a million texts (I'm talking newspaper level now) in a few seconds AND still fulfills the relevancy requirements, I believe this method could actually be good...
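Just to illustrate what such an algorithm could look like, here is a rough sketch of a Levenshtein-based scorer in Python. It is only one of the options named in step 1), not a recommendation; the function names and the "best matching word" scoring rule are my own assumptions. Note that the edit distance costs time proportional to the product of the two word lengths, and it runs once per word of every text, which is exactly where the speed question comes from.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

def fuzzy_relevancy(word, text):
    # Hypothetical scoring rule: similarity (0..1) between the search word
    # and the closest word found in the text.
    word = word.lower()
    best = 0.0
    for candidate in text.lower().split():
        longest = max(len(word), len(candidate))
        similarity = 1.0 - levenshtein(word, candidate) / longest
        best = max(best, similarity)
    return best
```

With this rule a text containing the exact word scores 1.0, while a near miss such as "box" for "fox" still gets a partial score, so misspellings are found at the price of more work per text.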
What do you (everyone, not just vincent) think?
Greets,
Dominique
P.S.: vincent: Thanks for the hint in the register_shutdown_function() thread. I've got a working script. It'll be on zend.com this weekend.