Config ft_min_word_len var in MySQL

Anon · Apr 5, 2002

I have tried many different ways to change the FULLTEXT variable ft_min_word_len but even when I get the message that 0 rows have been affected, new installation of MySQL 4, the variable is still the same. I have run out of ideas, have tried to follow every tip in the manual but nothing seems to work.

I am positioned in the root.

[deleted] · Apr 5, 2002

Did you do what the manual says; add it as a startup parameter to mysqld?

and did the new value show up in 'show variables'

and why are you using crappy-old FULLTEXT that only finds whole words anyway? It's very easy to write your own searchengine thingy, look at the 'slapping together a search engine' article.

A forum, a FAQ, email notification, what else do you need?

Time to YAPF!: http://www.hvt-automation.nl/yapf/

Anon · Apr 5, 2002

Eventually I got it changed:
shell> vim my.cnf
added:
set variable = ft_min_word_len=1
under [mysqld]
saved the file and restarted MySQL with:
shell> /etc/init.d/mysql stop
shell> /etc/init.d/mysql start

Well Vincent, Fulltext isn't that old in MySQL. And I wanted to test the binary Fulltext MySQL 4.
Will take a look at the article though, Thanks. :-)

saloon12yrd · Apr 5, 2002

Hi vincent,

to me the engine discussed in the "slapping together..." article simply looks like a re-implementation on sql level what the fulltext feature does internally.
Subwords are not matched and a lot of duplicate data is generated in the search_table. At least that's how it looks to me.

Why not implementing like this (just a rough idea, be gentle):

1) Find a fast algorithm that fits your needs and returns the relevancy of a text for a given (sub)word or phrase. (phonetic search, levenshtein, custom, don't know)

2) Now add two more tables (indexes not mentioned):
CREATE TABLE searchwords (
wordID BIGINT(20),
word VARCHAR(255)
);

CREATE TABLE relevancy (
wordID BIGINT(20),
textID BIGINT(20),
relevancy FLOAT(3,2)
)

The table searchwords contains all search phrases that have been processed in the past.
The table relevancy references every text (must be identified by id, otherwise tune table) with relevant searchwords.

3) When a new search is performed check the phrase agains table searchwords. If it exists:

SELECT r.textID FROM relevancy AS r, searchwords AS s WHERE s.word='$phrase' AND s.wordID=r.wordID GROUP BY r.textID ORDER BY r.relevancy DESC

and you get all relevant textIDs ordered by relevance.
If the phrase does NOT exist apply the algorithm from 1) to every text and INSERT all relevancies into table relevancy and the phrase itself in searchwords.

4) When a new text is inserted match it against every entry in table searchwords using the algorithm from 1) and INSERT the appropriate entries in table relevancy.

5) When a given text is updated, delete all entries from table relevancy and treat it as a new text.

(end)

I'm aware that this is nothing new, even not very different from the method in the article. But I think it eliminates some of the difficulties/problems of the discussed method.
The key to success appeares to be in the algorithm. If that one is fast enough to handle a million texts (I'm talking newspaper level now) in a few seconds AND still fulfills the needs of relevancy I believe this method could actually be good...

What do you (everyone, not just vincent) think?

Greets,

Dominique

P.S.: vincent: Thanks for the hint in the register_shutdown_function() thread. I've got a working script. It'll be on zend.com this weekend.

[deleted] · Apr 5, 2002

Well the search engine in the article does have it's shortcomings, but in functionality it's equal to/surpasses FULLTEXT.

For example, fulltext does not do sub-words. Ever. In the 'slapped' search engine you can decide wether you want to search whole words or subwords or both (if a user wants to do a sub-word search and it's slow, tough cheese, he wanted it!)

With a small tweak, the script in the article can incorporate a hitcount, giving you not only the number of words that matched on your keywords, but also how many times the word occurs on the found page.
The current version just stores every word as found, but with a peanuts little query you can do a group by and get a wordcount.
Then you'd also be storing much less data than the real article does.

Apart from that a good alogrithm for deciding what is a match is a must. Trouble is, I usually don't agree with what my search-engine tells me is a good match.

PS: zend.com? ZEND.COM? why not on YAPF? hmm? :-)

A forum, a FAQ, email notification, what else do you need?

Time to YAPF!: http://www.hvt-automation.nl/yapf/

saloon12yrd · Apr 5, 2002

My main problem is that I don't know how good this method performs on the said million texts. Do you (or anybody else) have any experience?
Besides that I don't really get the subword issue but maybe I should read the comments or get more coffee. Maybe both. I'll take a look...

Apart from that a good alogrithm for
deciding what is a match is a must.
Trouble is, I usually don't agree with
what my search-engine tells me is a good
match.

Hm yeah I know what you mean. For me Google is the only 'good' one by now. I (okay, we) implemented a phonetic search for german language texts that by now is quite accurate, but it's like you say - everybody expects different things. Thus the "custom" option for the algorithm.

PS: zend.com? ZEND.COM? why not on YAPF?
hmm? :-)
Hehe, I got no problem with YAPF. Can you handle a .zip file with source, README and LICENCE? I'll send you a copy by weekend, you decide

Dominique

[deleted] · Apr 6, 2002

"My main problem is that I don't know how good this method performs on the said million texts."

never tried it but it's always better than a LIKE search.

You can always normalize further, creating one table for basically all words, and one table to link the word to the article. Then you'd have a small (few kRows) table with words to match against, and you can get the attached article-id's from the keyword-article table. That would minimize the number of string-string comparisons, and increase the efficiency of the word-article lookups. You'd still have one row per word-article relation, but it would be indexed numerically, and thus be smaller and faster than when the table includes the word too.

hmm.. I feel a project coming on...

PS: I can handle anything, I'm the YAPF master!