I think the answer is no, but will adding a FULLTEXT index to a VARCHAR/CHAR/TEXT field in MySQL improve performance for a LIKE or REGEXP query?

SELECT * FROM mytable WHERE col LIKE 'foo%';
SELECT * FROM mytable WHERE col REGEXP '^[0-9]';

I've googled and don't really see any reputable-looking sources. The MySQL docs are not especially helpful in answering this question, either, saying only:

Full-text searching is performed using MATCH() ... AGAINST syntax.

I suspect the answer to my question is that FULLTEXT indexes do NOT help those queries, but would appreciate knowing more.
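For reference, the MATCH() ... AGAINST form the docs mention looks like the query below; comparing EXPLAIN output for the two styles should show whether a FULLTEXT index actually gets used (this sketch assumes a FULLTEXT index named `ft_col` already exists on `col`):

```sql
-- Fulltext query: EXPLAIN should report the fulltext index (ft_col) as the key
EXPLAIN SELECT * FROM mytable
WHERE MATCH(col) AGAINST('foo' IN NATURAL LANGUAGE MODE);

-- LIKE query on the same column: if the FULLTEXT index is the only index
-- on col, I'd expect type: ALL here, i.e. a full table scan
EXPLAIN SELECT * FROM mytable WHERE col LIKE 'foo%';
```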

    Yeah, from what I see, that index only gets used for MATCH/AGAINST. However, I don't think anything would stop you from also assigning other index types to that column if it would be advantageous for other queries, e.g. a LIKE can benefit from a regular index as long as the % is not at the start of the string.

    NogDog Thanks for the response. Someone in the #MySQL IRC channel also confirmed that a FULLTEXT index only provides a performance improvement for MATCH. Of course, I don't know who that person is, so I can't vouch for their expertise. Surely there's some authoritative source on the ins and outs of indexing?

      I'm no MySQL expert, but I kinda doubt that a fulltext index would speed up a LIKE condition (I'm not in the mood for a dive into the source code). Two reasons for thinking that: (1) LIKE is language-agnostic, but fulltext matching expects the text to be in an actual language; and (2) if there were some aspect of FULLTEXT search that sped up LIKE, that part would have been factored out and applied to the implementation of LIKE to begin with, and FULLTEXT would then be taking advantage of it.

      I wonder if a more specific index would be more useful: in particular, you can add a generated column that captures hints about what the real search is supposed to match but is much cheaper to match tentatively (the first three characters; the leading digits), and then put an index on that column. Apparently MySQL's query planner is smart enough to recognise cases where such an index would be useful even when the generated column isn't referenced directly (it's considered whenever the real column it's generated from is used), but I don't know if it's smart enough in the general case to recognise that, say, an index on the length-three prefixes of column "col" is useful when trying to match 'col LIKE "foo%"'. It would be something to experiment with: build some queries and see if the index is invoked.
      https://dev.mysql.com/doc/refman/8.0/en/generated-column-index-optimizations.html
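      A rough sketch of that idea for MySQL 5.7+ (column and index names are invented for illustration):

      ```sql
      -- Stored generated column holding the first three characters of col
      ALTER TABLE mytable
        ADD COLUMN col_prefix3 CHAR(3)
          GENERATED ALWAYS AS (LEFT(col, 3)) STORED,
        ADD INDEX idx_col_prefix3 (col_prefix3);

      -- Filter on the generated column (which can use the index), then let
      -- the LIKE enforce exactness against the full column
      SELECT * FROM mytable
      WHERE col_prefix3 = 'foo' AND col LIKE 'foo%';
      ```

      For a plain prefix LIKE, a prefix index such as `INDEX (col(3))` would be simpler; the generated-column trick seems more interesting for patterns an ordinary index can't serve, e.g. backing `col REGEXP '^[0-9]'` with an indexed generated column on the first character.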

        Is this for a specific application? My back-of-the-napkin calculations a few years back for a product search indicated that dedicated search software (Elasticsearch, Lucene/Solr, Sphinx, et al.) gave superior performance by at least an order of magnitude over anything I could write using LIKE or REGEXP. We ended up using Sphinx for that job and it really performed well for us.


        I was trying to understand Weedpacket's thinking, and couldn't really understand how fulltext would have any language-specific aspects. I do recall there's a NATURAL LANGUAGE MODE syntax for MySQL, but that hasn't yet clarified much for me. I am still trying to absorb it but haven't had much time.

        I stumbled across myisam_ftdump, a command which allows you to get a look inside a MyISAM fulltext index. I did a quick experiment, creating this SQL table:

        CREATE TABLE `foo` (
          `id` int NOT NULL AUTO_INCREMENT,
          `col1` varchar(255) NOT NULL,
          PRIMARY KEY (`id`),
          FULLTEXT KEY `col1` (`col1`)
        ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
        
        INSERT INTO `foo` (`id`, `col1`) VALUES
        (1, ' Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'),
        (2, 'アメリカ合衆国の著作権法 (英語: Copyright law of the United States) は、文芸・映像・音楽・美術・ソフトウェアなどの著作物と、その著作者などの権利を保護するアメリカ合衆国の法律である。米国民の創作した著作物だけでなく、米国内に流通す');

        The output is fairly informative. The index appears to work by splitting the field on whitespace and punctuation and indexing the resulting tokens:

        # myisam_ftdump -d foo 1
                0            0.8787346 brought
                0            0.8787346 conceived
                0            0.8787346 continent
               bc            0.9157509 copyright
                0            0.8787346 created
                0            0.8787346 dedicated
                0            0.8787346 equal
                0            0.8787346 fathers
                0            0.8787346 liberty
                0            0.8787346 nation
                0            0.8787346 proposition
                0            0.8787346 score
               bc            0.9157509 states
               bc            0.9157509 united
                0            0.8787346 years
               bc            0.9157509 その著作者などの権利を保護するアメリカ合衆国の法律である
               bc            0.9157509 アメリカ合衆国の著作権法
               bc            0.9157509 ソフトウェアなどの著作物と
               bc            0.9157509 米国内に流通す
               bc            0.9157509 米国民の創作した著作物だけでなく

        I don't speak Japanese -- and don't know whether it uses spaces the way English does, whether those clusters are meaningful the way words are, or whether fulltext search would be useless for a typical Japanese search -- but I thought this might provide meaningful detail here.

        dalecosp Is this for a specific application?

        I'm migrating a site we built 16 years ago to new tech. A lot of the coding and database decisions were less than masterful and I'm wondering whether we need all those indexes on certain tables or if they are useless. There are about 100 tables so it's a fairly involved process.

          Those five blocks of Japanese text in the dump translate according to Google (which in this instance turns out to be fairly good, since the writing is more formal and less idiomatic) as:
          "It is a law of the United States that protects the rights of its authors, etc. "
          "Copyright law of the United States "
          "With copyrighted works such as software "
          "Distributed in the United States "
          "Not only copyrighted works created by Americans "

          So, entire sentence fragments, rather than words – probably because the fulltext indexer doesn't understand Japanese as well as it understands English (fulltext searching is language-sensitive). I don't know if it's because you only posted part of the dump, but the portion you've shown doesn't include "英語", meaning "English language". It also misses "seven" and "forth". Words like "are", "and", and "to" are discarded from fulltext indexing because they are so common in English text that using them as a filter would be comparatively useless. Other languages have different collections of such "stopwords".
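          You can poke at some of this from SQL. InnoDB exposes its short default stopword list as a table, and MyISAM's minimum indexed word length is a server variable; MyISAM's much larger default stopword list is compiled into the server, which may be why words like "seven" and "forth" are missing from the dump:

          ```sql
          -- MyISAM won't index words shorter than this (default 4), which
          -- would explain why e.g. "ago", "all", and "men" are absent above
          SHOW VARIABLES LIKE 'ft_min_word_len';

          -- InnoDB's built-in default stopword list (MyISAM uses a separate,
          -- much larger list built into the server binary)
          SELECT value FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
          ```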

          Further, a fulltext search for "dedicate" ought to find a hit on "dedicated". PostgreSQL has several text parsers and dictionaries to choose from, depending on the kind of text being searched and what is being searched for, including ones based on the Snowball stemmer for the languages that project currently supports.
