In support of my theory, I created a few entries in my tasks table where the task column contains just the word "tiger". I made four records:

99999  - tiger
100000 - tiger tiger
100001 - tiger tiger tiger tiger 
100002 - tiger tiger tiger tiger tiger tiger tiger tiger 

and ran this query:

SELECT ts.task_id, MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE) AS score
FROM onet_task_statements ts
WHERE MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE)
ORDER BY score ASC

Sure enough, MySQL natural language search (MNLS) assigns higher scores to the records that have more instances of the word. In fact, the score is precisely proportional to the number of occurrences:

99999 	13.566825866699219
100000 	27.133651733398438
100001 	54.267303466796875
100002 	108.53460693359375

More precisely, score = number_of_occurrences * 13.566825866699219.

    I hate to jump into a highly technical discussion, especially with a less-than-technical point; but is there any reason why you couldn't bolt a real search engine (Sphinx, Elasticsearch, etc.) onto your app?

    MySQL's relatively poor search methods were part of the reason we decided to do this a few years back.

    dalecosp A welcome suggestion! Can you elaborate? The data is from a third party and arrives in a DB. I'm wondering how an outside app would get at the data?

      Two additional observations.

      First, assuming I've got those tiger*n entries just mentioned above, a search for just "tiger" yields different relevance scores than a search for "tiger tiger", but the latter search also matches the record with just a single "tiger" in it. The latter search yields all the same results as the former, except the scores are all reduced by a factor of 1.4289674259:

      // search results for 'tiger tiger':
      99999 	9.494146347045898
      100000 	18.988292694091797
      100001 	37.976585388183594
      100002 	75.95317077636719

      Second observation: MNLS recognizes quotes in your search string. This search:

      SELECT ts.task_id, MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE) AS score
      FROM onet_task_statements ts
      WHERE MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE)
      ORDER BY score DESC

      excludes the record with only one "tiger" in it.

        It looks like Sphinx can suck text directly from a MySQL db.
        I presume Elasticsearch (based on Lucene) can as well, but since it's a Java application there's a shipload of configuration involved, and I didn't wave my machete enough to find the source of the data.

          We use Elasticsearch (along with the entire ELK stack) at work for a number of things, and it seems quite good -- though I'm far from an expert, mostly just using the Kibana part of the stack for my needs. I dabbled with Sphinx several years ago, and it seemed quite performant and relatively easy to use in conjunction with MySQL, if my memory serves. Either is almost certainly going to perform much better than full table scans in MySQL against regexes if the data reaches any significant size (for an undefined value of "significant").

          PS: I suppose somewhere in between might be to leverage "full text search" in MySQL? https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

            Sphinx is written in C++, but most distros and FreeBSD, etc., have packages available, or you can build from a source tarball. It requires an SQL server; unlike Weed, we're into MySQL (not necessarily completely or by choice), and Sphinx is good at that -- off the top of my head I'm not sure what else they support.

            Sphinx runs its own "searchd" on a different port (9306 is the default, as I recall) and the syntax for querying it is pretty much standard MySQL. So, we have a "searchDB" object that manages its own connection to localhost 9306, and our search objects do something like "select foo, bar, baz from ombe_index where match('$term_string')" to get data to send back. IIRC, I clean $term_string with a must-match regexp, which is pretty much what you said you wanted to do originally.
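
            In PHP terms that connection is just an ordinary MySQL client connection pointed at searchd's port. A rough sketch of the idea (the index and column names are made up, and the cleanup regex is only illustrative, not our production code):

            <?php
            // Sphinx's searchd speaks the MySQL wire protocol, so PDO's mysql
            // driver can talk to it directly. Emulated prepares keep placeholder
            // interpolation on the client side, which suits SphinxQL.
            $sphinx = new PDO(
                'mysql:host=127.0.0.1;port=9306', null, null,
                [PDO::ATTR_EMULATE_PREPARES => true]
            );

            // Hypothetical must-match cleanup: allow only word characters and spaces.
            $term = preg_replace('/[^\w ]+/', ' ', $_GET['q'] ?? '');

            $stmt = $sphinx->prepare(
                'SELECT foo, bar, baz FROM ombe_index WHERE MATCH(:term) LIMIT 20'
            );
            $stmt->execute([':term' => $term]);
            $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);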

            A cronjob runs "indexer" just after 3 AM daily to update ombe_index, and a secondary "delta" indexer runs a few times during the day to catch up on new product listings, etc. --- which probably means my example query above is too simple ... there's probably a UNION statement so the delta index is also included in search results.

            Hope that's helpful.

              Hopefully it'd be more performant than if you happen to enter a common word into this forum's search box....

                I've been fairly pleased with it in terms of performance. The "commercial package" that once ran our search sometimes took tens of seconds to produce a results page. Now it takes a couple seconds at most including the redirect-after-POST page load.

                I hope I've not oversimplified it in my comment above. I should probably mention that the "SphinxQL" is not exactly MySQL and lacks a few language features you might expect in a 'modern' MySQL; it feels to me a little like talking to a rather old MySQL server, which might really annoy people who are used to NoSQL(?), Maria(?) or PostGres(?) or some other lingo ...

                As for stopwords, you define them yourself in text files (referenced in sphinx.conf) in grand UNIX tradition.

                  These 3rd party things look verrrrrry interesting. Another person had mentioned SOLR, but it's also written in Java apparently.

                  sneakyimp Another person had mentioned SOLR

                  One of our projects uses that for caching/indexing provider data to optimize searches: not sure how effective it is for fuzzy matches, but definitely fast for explicit matches.

                    So I've written a first-in-first-out cache class to help speed up search results. My data rarely changes, so this caching class limits the size of the cache by rotating out the oldest cache items. The cache creates files on the file system whose names match the key used. That being the case, I need to map each user-supplied search string to a cache key, so I need to hash them. MD5 is vulnerable to collisions (which are apparently pretty rare), but I've read that it's faster than SHA1. Any thoughts on which hash function to use?

                    E.g., if a user searches for 'my dream career', I'd like to hash that into a filename so I can locate the cache entry the next time someone searches for that same string.
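
                    For reference, the class is shaped roughly like this hypothetical PHP 8 sketch (the class name, the md5() choice, and the 1000-entry limit are placeholders, not my actual code):

                    <?php
                    // One file per cached search; when the cache grows past the
                    // limit, the oldest files (by mtime) are rotated out.
                    class FifoFileCache
                    {
                        public function __construct(
                            private string $dir,
                            private int $maxEntries = 1000
                        ) {}

                        private function path(string $search): string
                        {
                            // Hashing here only needs a stable, filesystem-safe name,
                            // not cryptographic strength.
                            return $this->dir . '/' . md5($search) . '.cache';
                        }

                        public function get(string $search): ?string
                        {
                            $file = $this->path($search);
                            return is_file($file) ? file_get_contents($file) : null;
                        }

                        public function set(string $search, string $payload): void
                        {
                            file_put_contents($this->path($search), $payload);
                            $this->evictOldest();
                        }

                        private function evictOldest(): void
                        {
                            $files = glob($this->dir . '/*.cache') ?: [];
                            if (count($files) <= $this->maxEntries) {
                                return;
                            }
                            // Oldest first, then delete the overflow.
                            usort($files, fn($a, $b) => filemtime($a) <=> filemtime($b));
                            foreach (array_slice($files, 0, count($files) - $this->maxEntries) as $f) {
                                unlink($f);
                            }
                        }
                    }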

                    sneakyimp
                    Do you even need to hash them if it's just to make a filename? All that gets you is that all the names will be the same length. Just use the words themselves. If you don't want to do that, base64 encode them. Unless the problem you have is the risk of Excessively Long Filenames.

                    All hash functions are vulnerable to collisions - it's in their nature. But if you find two different lists of search terms (modulo the order in which the words are written: is "tiger human" a different search than "human tiger"?) that have the same hash under MD5 or SHA1 or SHA3 then you'll have something to post on sci.crypt - the first hash collision under those algorithms found naturally occurring in the wild (as opposed to hunted down as has been the case for MD5 and SHA1).

                    An MD5 hash is designed to have an entropy of 128 bits. Grabbing a handy wordlist (110573 words) I find I need to pick about eight words at random from it to generate that much entropy. Even more if order doesn't matter. So here are my search terms:

                    ascospore, subinterval, lederhosen, trichinous, soukhouma, unacceptable, frat, biped

                    And you could restrict the saved search terms to words that actually appear in the text being searched, so your wordlist would probably be a fair bit smaller.

                    And other hash algorithms have even higher entropies. For SHA3-224 I'd need at least fourteen words.
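
                    (As a quick PHP sanity check of those word counts -- each word drawn uniformly from the list contributes log2(110573) bits:)

                    <?php
                    $wordlistSize = 110573;
                    $bitsPerWord  = log($wordlistSize, 2);   // ~16.75 bits per word

                    printf("MD5 (128 bits): %d words\n", ceil(128 / $bitsPerWord));      // 8
                    printf("SHA3-224 (224 bits): %d words\n", ceil(224 / $bitsPerWord)); // 14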

                    I suppose what is needed is to determine, given a probability p, the number of words that need to be in two randomly-selected search lists for the chance of them having the same hash to be p (see the birthday paradox). I tried working that out, but even with approximations my code kept either underflowing or overflowing when I tried to count how many words were needed to get p above 10^-30.
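
                    (One way around the under/overflow, for the curious: stay in log space with the usual birthday approximation p ≈ m^2 / 2N, here framed as the number of distinct search strings m rather than words per search -- a sketch, not a rigorous treatment:)

                    <?php
                    // For MD5, N = 2^128 possible hashes. From p ≈ m^2 / (2N):
                    // log10(m) = (log10(p) + log10(2) + log10(N)) / 2
                    $logN = 128 * log10(2);   // log10(2^128) ≈ 38.53
                    $logP = -30.0;            // target collision probability 10^-30

                    $logM = ($logP + log10(2) + $logN) / 2;
                    printf("m ≈ 10^%.2f (about %.0f distinct searches)\n", $logM, 10 ** $logM);
                    // Prints m ≈ 10^4.42, i.e. roughly 26,000 cached searches before
                    // the collision chance even reaches 10^-30.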

                    So if the alternative is worrying about search terms producing hash collisions your energies would be better spent trying to avoid being personally hit by a meteorite.

                    Now, speaking of unacceptable frat bipeds, I've got some housework that needs doing.

                    Weedpacket Just use the words themselves. If you don't want to do that, base64 encode them. Unless the problem you have is the risk of Excessively Long Filenames.

                    Unfortunately, I can't just use the search string for the file name. My latest version of the code, rather than screening out all the punctuation, takes the user's search string verbatim--except that whitespace is normalized to spaces and extra spaces are trimmed/removed--and feeds that search string directly into a MySQL natural language full-text search query. Quotes are meaningful to this search functionality in that you can use them to ask for exact phrase matches. A user might also enter slashes, and these would survive into the search string.
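
                    (The normalization is just something along these lines -- illustrative, with a hypothetical input variable:)

                    <?php
                    // Collapse runs of whitespace to single spaces and trim the ends,
                    // leaving quotes and slashes intact for the full-text query.
                    $raw    = $_POST['q'] ?? '';
                    $search = trim(preg_replace('/\s+/u', ' ', $raw));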

                    I very much like the idea of a conversion that is reversible, because this would make it easy to understand more about what people are searching for by looking at the cache filenames. Unfortunately, base64 uses slashes, which look like directory separators.

                    I'm not too concerned about long filenames because I'm limiting the search input to 50 characters at the moment.

                    If the search string uses quotes, then the order of the words matters.

                    "human tiger"

                    yields different results than

                    "tiger human"

                    But I believe if you don't quote the string, the search results will be identical:

                    human tiger
                    tiger human

                    I truly appreciate your entropy/probability analysis of the hash functions. I was under the impression that collisions are rare and figured it would be extremely unlikely for a collision to occur for any actual searches someone might perform.

                    I'm primarily interested in making sure a user-supplied string cannot be used to exploit my file system. These cache files will be outside the webroot. I just want to be certain that no illegal filename characters are used. Apparently ext3 and ext4 allow all chars except the null char and a slash. I'm thinking I might base64 encode the user input and replace any slashes with an underscore or some other char not used by base64.

                    sneakyimp Unfortunately, base64 uses slashes which looks like a directory separator.

                    The conventional "URL-safe" variant uses a strtr($base64, '+/', '-_') mapping. This also works for file paths. (See RFC4648 §5 and Table 2.)
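
                    In full, with the '=' padding stripped as well (base64_decode() tolerates the missing padding):

                    <?php
                    // Reversible, filesystem-safe encoding per RFC 4648's base64url.
                    function filenameEncode(string $search): string
                    {
                        return rtrim(strtr(base64_encode($search), '+/', '-_'), '=');
                    }

                    function filenameDecode(string $name): string
                    {
                        return base64_decode(strtr($name, '-_', '+/'));
                    }

                    // e.g. filenameEncode('my "dream" career') === 'bXkgImRyZWFtIiBjYXJlZXI'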

                    Weedpacket The conventional "URL-safe" variant uses a strtr($base64, '+/', '-_') mapping. This also works for file paths. (See RFC4648 §5 and Table 2.)

                    I'm delighted to see this as it is pretty much exactly what I was thinking. Always a relief to know that my instincts are matched by smarter people. And thanks for the tidy code snippet. I'll definitely use it.
