sneakyimp Are things really so bad? Docs explain that you can check default stopwords like so:

Well yeah, but that would mean installing MySQL.

The docs also offer fairly detailed recipes/instructions for defining your own stopword tables, so I'm not sure the source code (presumably containing the default stopwords) is so critical.

So you define a stopword table that does include "and".

The documentation also points out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

As is "The Who".

// clean spaces
$clean_string = trim(preg_replace('/\s+/', " ", $str));

Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here. Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

Weedpacket Well yeah, but that would mean installing MySQL.

I'm guessing you use PostgreSQL?

Weedpacket So you define a stopword table that does include "and".

I have certainly been considering this since realizing the nasty effect "and" has on my search performance. The reason I didn't simply do so is that I was worried the problem might happen with other words -- and I'm now more convinced that it would if such a word appeared commonly.

I feel like just adding stopwords is a bit like whack-a-mole. I believe that frequent, short words might cause performance problems, but I don't really understand why they would make things slow. Is it because the code has to munge larger volumes of data somehow? It doesn't seem like the issue arises at the UNION stage, because the individual subqueries -- which just calculate a floating-point score for each record -- are slow. I feel like something must be really inefficient, but don't really have the time (or coding chops) to get to the bottom of it.

Weedpacket As is "The Who".

I expect I will define an array of stopwords -- in code rather than in additional tables -- because the right stopwords clearly depend on the text being searched and the type of search we want to run against that text. I doubt anyone will search career-related data for "the who" or "to be or not to be", and I'll assert, as before, that "and" is totally unimportant to the types of searches to be performed on this data.

Weedpacket Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here.

Thanks for that suggestion. It occurs to me now that preg_split along \s+ is sufficient.

Weedpacket Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

Not sure what you mean by 'final array' but I'm guessing that you are referring to my array of search words. I want to check first if duplicate search words change the scores. E.g., if a search for "tiger" yields any different results than a search for "tiger tiger."
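In any case, something like this is roughly what I have in mind -- just a sketch, with a placeholder stopword list that I'd tune for this data:

// rough sketch: split on whitespace, drop in-code stopwords, de-duplicate
$stopwords = ['and', 'or', 'the', 'of', 'to', 'a', 'in']; // placeholder list, to be tuned for this data

$words = preg_split('/\s+/', trim($str), -1, PREG_SPLIT_NO_EMPTY);
$words = array_map('strtolower', $words);
$words = array_diff($words, $stopwords);  // stopword array in code rather than a stopword table
$words = array_unique($words);            // assuming duplicates don't change the scores -- still to verify
$clean_string = implode(' ', $words);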

Also, I might pick a nit and say that the appearance of a search term more than once in the record being searched does matter and, in fact, will yield a higher score. I want to say (but have no proof) that the additional effort to score multiple matches higher than a single match is very closely related to the performance problem with searching for "and." My old approach, using REGEXP and word-boundary markers, is actually faster than the MySQL natural language search for this query:

search term is "such using equipment"
old has 1011 matches
old_elapsed: 1.6434950828552
new has 1011 matches
new_elapsed: 4.6821620464325

I believe this is because the REGEXP just returns a binary true/false once a match is found, whereas the MySQL natural language search continues to munge the text looking for additional matches in order to finish calculating a relevance score.
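For reference, the timing comparison boils down to something like this. It's only a sketch: PDO, the DSN, and the credentials are placeholders, the table and column are the ones from the queries below, the word-boundary REGEXP assumes MySQL 8's ICU syntax (older servers use [[:<:]] and [[:>:]]), and counting matches stands in for however each query is actually consumed:

$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$term = 'such using equipment';

// old approach: boolean REGEXP with word boundaries -- match any of the words
$pattern = implode('|', array_map(
    fn($w) => '\\b' . preg_quote($w) . '\\b',
    preg_split('/\s+/', trim($term), -1, PREG_SPLIT_NO_EMPTY)
));
$start = microtime(true);
$old = $pdo->prepare('SELECT COUNT(*) FROM onet_task_statements WHERE task REGEXP ?');
$old->execute([$pattern]);
printf("old: %d matches, %.4f seconds\n", $old->fetchColumn(), microtime(true) - $start);

// new approach: natural-language full-text search, which scores every matching row
$start = microtime(true);
$new = $pdo->prepare('SELECT COUNT(*) FROM onet_task_statements WHERE MATCH(task) AGAINST (? IN NATURAL LANGUAGE MODE)');
$new->execute([$term]);
printf("new: %d matches, %.4f seconds\n", $new->fetchColumn(), microtime(true) - $start);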

    I've developed a theory about the slowness of the MySQL natural language search (MNLS): it must fully munge all of the text in any matching record, and if you get a lot of matches, that is a lot of munging. MNLS benefits from a full-text index, but that index only identifies which records contain a given word. My original REGEXP query apparently must do a full table scan every time, but as soon as it finds a match it can return TRUE and ignore the rest of the text. Because MNLS must calculate a relevance score, it has to munge all the text stored in that record's column to fully calculate that score.

      In support of my theory, I created a few entries in my tasks table where the task column just contains the word "tiger". I made four records:

      99999  - tiger
      100000 - tiger tiger
      100001 - tiger tiger tiger tiger 
      100002 - tiger tiger tiger tiger tiger tiger tiger tiger 

      and ran this query:

      SELECT ts.task_id, MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE) AS score
      FROM onet_task_statements ts
      WHERE MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE)
      ORDER BY score ASC

      Sure enough, MNLS scores the records that have more instances of the word higher. In fact, the score is precisely proportional to the number of occurrences:

      99999 	13.566825866699219
      100000 	27.133651733398438
      100001 	54.267303466796875
      100002 	108.53460693359375

      More precisely: score = number_of_occurrences * 13.566825866699219

        I hate to jump into a highly technical discussion, especially with a less-than-technical point, but is there any reason why you couldn't bolt a real search engine onto your app (Sphinx, Elasticsearch, etc.)?

        MySQL's relatively poor search methods were part of the reason we decided to do this a few years back.

        dalecosp A welcome suggestion! Can you elaborate? The data is from a third party and arrives in a DB. I'm wondering how an outside app would get at the data?

          Two additional observations.

          First, assuming I've got those tiger*n entries just mentioned above, a search for just "tiger" yields different relevance scores than a search for "tiger tiger", but the latter search also matches the record with just a single "tiger" in it. The latter search yields all the same results as the former, except the scores are all reduced by a factor of 1.4289674259:

          // search results for 'tiger tiger':
          99999 	9.494146347045898
          100000 	18.988292694091797
          100001 	37.976585388183594
          100002 	75.95317077636719

          Second observation. MNLS recognizes quotes in your search string. This search:

          SELECT ts.task_id, MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE) AS score
          FROM onet_task_statements ts
          WHERE MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE)
          ORDER BY score DESC

          excludes the record with only one "tiger" in it.

            It looks like Sphinx can suck text directly from a MySQL db.
            I presume Elasticsearch (based on Lucene) can as well, but being a Java application there's a shipload of configuration involved and I didn't wave my machete enough to find the source of the data.

              We use Elasticsearch (along with the entire ELK stack) at work for a number of things, and it seems quite good -- though I'm far from an expert, mostly just using the Kibana part of the stack for my needs. I dabbled with Sphinx several years ago, and it seemed quite performant and relatively easy to use in conjunction with MySQL, if my memory serves. Either is almost certainly going to perform much better than full table scans in MySQL against regexes if the data reaches any significant size (for an undefined value of "significant").

              PS: I suppose somewhere in between might be to leverage "full text search" in MySQL? https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

                Sphinxsearch is written in C++, but most distros and FreeBSD, etc., have packages available, or you can build from a source tarball. It requires an SQL server; unlike Weed we're into MySQL (not necessarily completely or by choice), and Sphinx is good at that -- off the top of my head I'm not sure what else they support.

                Sphinx runs its own "searchd" on a different port (9306 is default as I recall) and the syntax for querying it is pretty much standard/MySQL. So, we have a "searchDB" object that manages its own connection to localhost 9306, and our search objects do something like "select foo, bar, baz from ombe_index where match('$term_string')" to get data to send back. IIRC, I clean $term_string with a must-match regexp, which is pretty much what you said you wanted to do originally.
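                Very roughly, the connection side of that boils down to something like this (a sketch from memory, not our actual code, with names invented beyond the ones in my example above): SphinxQL speaks the MySQL wire protocol, so plain mysqli can talk to searchd on the SphinxQL port.

                $sphinx = mysqli_connect('127.0.0.1', '', '', '', 9306); // searchd's SphinxQL port

                // clean/escape the user's terms first (the must-match regexp step)
                $term   = mysqli_real_escape_string($sphinx, $term_string);
                $result = mysqli_query($sphinx, "SELECT id, foo, bar FROM ombe_index WHERE MATCH('$term') LIMIT 50");

                while ($row = mysqli_fetch_assoc($result)) {
                    // $row['id'] is the document id used to pull the full record from MySQL proper
                }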

                A cronjob runs "indexer" just after 3 AM daily to update ombe_index, and a secondary "delta" indexer runs a few times during the day to catch up on new product listings, etc. --- which probably means my example query above is too simple ... there's probably a UNION statement so the delta index is also included in search results.

                Hope that's helpful.

                  Hopefully it'd be more performant than if you happen to enter a common word into this forum's search box....

                    I've been fairly pleased with it in terms of performance. The "commercial package" that once ran our search sometimes took tens of seconds to produce a results page. Now it takes a couple seconds at most including the redirect-after-POST page load.

                    I hope I've not oversimplified it in my comment above. I should probably mention that the "SphinxQL" is not exactly MySQL and lacks a few language features you might expect in a 'modern' MySQL; it feels to me a little like talking to a rather old MySQL server, which might really annoy people who are used to NoSQL(?), Maria(?) or PostGres(?) or some other lingo ...

                    As for stopwords, you define them yourself in text files (referenced in sphinx.conf) in grand UNIX tradition.

                      These 3rd party things look verrrrrry interesting. Another person had mentioned SOLR, but it's also written in Java apparently.

                      sneakyimp Another person had mentioned SOLR

                      One of our projects uses that for caching/indexing provider data to optimize searches: not sure how effective it is for fuzzy matches, but definitely fast for explicit matches.

                        So I've written a first-in-first-out cache class to help speed up search results. My data rarely changes, so the class limits the size of the cache by rotating out the oldest cache items. The cache creates files on the file system whose names match the key used. That being the case, I'd like to be able to match cache items to the user-supplied search strings, so I need to hash them. MD5 is vulnerable to collisions (which are apparently pretty rare), but I've read that it's faster than SHA1. Any thoughts on which hash function to use?

                        E.g., if a user searches for 'my dream career,' I'd like to hash that into a filename so I can locate the cache entry the next time someone searches for that same string.
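                        For what it's worth, the class is roughly this shape -- a simplified sketch rather than the actual code, with error handling omitted, and the md5() in path() is exactly the choice I'm asking about:

                        // simplified sketch of the FIFO file cache described above
                        class FifoFileCache
                        {
                            private $dir;
                            private $maxItems;

                            public function __construct(string $dir, int $maxItems = 1000)
                            {
                                $this->dir = rtrim($dir, '/');
                                $this->maxItems = $maxItems;
                            }

                            // filename-safe key: md5 vs sha1 vs something else is the open question
                            private function path(string $key): string
                            {
                                return $this->dir . '/' . md5($key);
                            }

                            public function get(string $key): ?string
                            {
                                $file = $this->path($key);
                                return is_file($file) ? file_get_contents($file) : null;
                            }

                            public function set(string $key, string $value): void
                            {
                                file_put_contents($this->path($key), $value);
                                $this->evict();
                            }

                            // drop the oldest files once the cache grows past its limit
                            // (reads don't touch mtime, so mtime order is insertion order)
                            private function evict(): void
                            {
                                $files = glob($this->dir . '/*');
                                if (count($files) <= $this->maxItems) {
                                    return;
                                }
                                usort($files, fn($a, $b) => filemtime($a) <=> filemtime($b));
                                foreach (array_slice($files, 0, count($files) - $this->maxItems) as $old) {
                                    unlink($old);
                                }
                            }
                        }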

                        sneakyimp
                        Do you even need to hash them if it's just to make a filename? All that gets you is that all the names will be the same length. Just use the words themselves. If you don't want to do that, base64 encode them. Unless the problem you have is the risk of Excessively Long Filenames.

                        All hash functions are vulnerable to collisions - it's in their nature. But if you find two different lists of search terms (modulo the order in which the words are written: is "tiger human" a different search than "human tiger"?) that have the same hash under MD5 or SHA1 or SHA3 then you'll have something to post on sci.crypt - the first hash collision under those algorithms found naturally occurring in the wild (as opposed to hunted down as has been the case for MD5 and SHA1).

                        An MD5 hash is designed to have an entropy of 128 bits. Grabbing a handy wordlist (110573 words) I find I need to pick about eight words at random from it to generate that much entropy. Even more if order doesn't matter. So here are my search terms:

                        ascospore, subinterval, lederhosen, trichinous, soukhouma, unacceptable, frat, biped

                        And you can be restricting the saved search terms to words that actually appear in the text being searched, so your wordlist would probably be a fair bit smaller.

                        And other hash algorithms have even higher entropies. For SHA3-224 I'd need at least fourteen words.
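                        In code, that back-of-envelope check is just this (using the wordlist size quoted above):

                        $wordlistSize = 110573;
                        $bitsPerWord  = log($wordlistSize, 2);  // ~16.75 bits of entropy per uniformly-chosen word

                        printf("bits per word: %.2f\n", $bitsPerWord);
                        printf("words for 128-bit MD5: %d\n", ceil(128 / $bitsPerWord));       // 8
                        printf("words for 224-bit SHA3-224: %d\n", ceil(224 / $bitsPerWord));  // 14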

                        I suppose what is needed is to determine, given a probability p, how many words need to be in two randomly-selected search lists for the chance of them having the same hash to be p (see the birthday paradox). I tried working that out, but even with approximations my code kept either underflowing or overflowing when I tried to count how many words were needed to get p above 10^-30.
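                        Doing the arithmetic in log space sidesteps the under/overflow, at least for the simpler question of how many distinct search strings would have to be hashed before the collision probability reaches p (a sketch using the small-p birthday approximation p ~ n^2 / (2 * 2^bits)):

                        // n ~ sqrt(2 * 2^bits * p), computed entirely in log10 to avoid under/overflow
                        function searchesForCollisionProbability(int $hashBits, float $log10p): float
                        {
                            $log10n = 0.5 * (log10(2) * ($hashBits + 1) + $log10p);
                            return 10 ** $log10n;
                        }

                        // about 26,000 distinct hashed searches before MD5's collision probability even reaches 10^-30
                        printf("%.0f\n", searchesForCollisionProbability(128, -30));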

                        So if the alternative is worrying about search terms producing hash collisions your energies would be better spent trying to avoid being personally hit by a meteorite.

                        Now, speaking of unacceptable frat bipeds, I've got some housework that needs doing.

                        Weedpacket Just use the words themselves. If you don't want to do that, base64 encode them. Unless the problem you have is the risk of Excessively Long Filenames.

                        Unfortunately, I can't just use the search string for the file name. My latest version of the code, rather than screening out all the punctuation, takes the user's search string verbatim -- except that whitespace is normalized to spaces and extra spaces are trimmed/removed -- and feeds that search string directly into a MySQL Natural Language Full-Text Search query. Quotes are meaningful to this search functionality in that you can specify a search for exact string matches. A user might also enter slashes, and these would exist in the search string.

                        I very much like the idea of a conversion that is reversible, because this would make it easy to understand more about what people are searching for just by looking at the cache filenames. Unfortunately, base64 uses slashes, which look like directory separators.

                        I'm not especially concerned about especially long filenames because I'm limiting the search input to 50 characters at the moment.

                        If the search string uses quotes, then the order of the words matters.

                        "human tiger"

                        yields different results than

                        "tiger human"

                        But I believe if you don't quote the string, the search results will be identical:

                        human tiger
                        tiger human

                        I truly appreciate your entropy/probability analysis of the hash functions. I was under the impression that collisions are rare and figured it would be extremely unlikely for a collision to occur for any actual searches someone might perform.

                        I'm primarily interested in making sure a user-supplied string cannot be used to exploit my file system. These cache files will be outside the webroot. I just want to be certain that no illegal filename characters are used. Apparently ext3 and ext4 allow all chars except the null char and a slash. I'm thinking I might base64 encode the user input and replace any slashes with an underscore or some other char not used by base64.

                        sneakyimp Unfortunately, base64 uses slashes which looks like a directory separator.

                        The conventional "URL-safe" variant uses a strtr($base64, '+/', '-_') mapping. This also works for file paths. (See RFC4648 §5 and Table 2.)

                        Weedpacket The conventional "URL-safe" variant uses a strtr($base64, '+/', '-_') mapping. This also works for file paths. (See RFC4648 §5 and Table 2.)

                        I'm delighted to see this as it is pretty much exactly what I was thinking. Always a relief to know that my instincts are matched by smarter people. And thanks for the tidy code snippet. I'll definitely use it.
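                        Roughly what I'll do, then (function names are just placeholders), with the '=' padding stripped for tidier filenames:

                        // reversible, filename-safe cache key (RFC 4648 URL-safe alphabet)
                        function cacheKeyEncode(string $search): string
                        {
                            return rtrim(strtr(base64_encode($search), '+/', '-_'), '=');
                        }

                        function cacheKeyDecode(string $key): string
                        {
                            // base64_decode tolerates the missing '=' padding
                            return base64_decode(strtr($key, '-_', '+/'));
                        }

                        $key = cacheKeyEncode('"to be or not to be" and/or "The Who"');
                        echo $key, "\n";                 // safe to use as a filename
                        echo cacheKeyDecode($key), "\n"; // the original search string back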
