Weedpacket Because preg_quote itself doesn't know your intentions about where in the regular expression you're going to be inserting the string it was passed: for all it knows you might have been collecting (and quoting) a set of characters with the intention of putting it inside a [] pair. All it's able to assume is that the characters of the string you give it are not to be interpreted as regexp operators.
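
A quick illustration of what that conservatism buys you (throwaway strings, purely to show the point):

// preg_quote() escapes anything that could be a regexp operator anywhere,
// because it can't know where the result will end up.
$quoted = preg_quote('a-z.', '/');      // "a\-z\."

// Outside a character class the escapes are harmless: this matches the
// literal string "a-z." only.
var_dump(preg_match('/^' . $quoted . '$/', 'a-z.'));    // int(1)

// Inside [] the same string is still safe: it now means the four characters
// 'a', '-', 'z', '.' rather than the range a-z plus "any character".
var_dump(preg_match('/^[' . $quoted . ']+$/', 'za.-')); // int(1)
var_dump(preg_match('/^[' . $quoted . ']+$/', 'm'));    // int(0) -- 'm' is in a-z but not in the class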

I want to say it's noteworthy that the devs of preg have not bothered to make some kind of prepare_preg function that is context-aware when escaping some arbitrary piece of data to be merged into a regex. I expect there's probably little need or demand for such a function. I do wonder about three of the four regexes above, with varying numbers of backslashes, all returning the same matching results. I don't think I'm equipped to really ruminate on this and have some grand epiphany about how to generally solve the problem of escaping from PHP->SQL->REGEXP with deterministic results. Certainly not on this project.

I also believe that my experience with the MySQL Natural Language Full-Text Search functionality is reinforcing my instincts as far as stripping keywords goes, but I don't want to get too uppity. As previously mentioned, the presence of the word "and" is bogging down my searches. Having somewhat painstakingly examined various search strings and their results, various EXPLAIN statements, forum posts, documentation, and rants, I am starting to think this and-keyword slowness is not due to any functionally different treatment of the word as an operator, but rather to the fact that "and" is very common, yielding a very large number of matches, which forces a lot of data comparisons. I find it puzzling that "and" is missing from the default MySQL stopwords file, precisely because it is going to appear in pretty much any English string of sufficient length. I think my instincts were right to exclude this word from a full-text search. HOWEVER, I'd like to definitively resolve why the presence of "and" in any search bogs things down.

You don't even need to run the whole big UNIONized query to get slowness. My tasks subquery runs slow -- around 8 seconds:

SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title,
    AVG(MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE)) * 3 AS score,
    'q_ts' AS qid
FROM tasks ts, careers c
WHERE ts.id_code = c.id_code
    AND MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE)
GROUP BY c_i

If I EXPLAIN it I get this. Seems OK to me but I might be overlooking something? I'm not really sure what to make of it.

id | select_type | table | type     | possible_keys | key  | key_len | ref  | rows | Extra
1  | SIMPLE      | ts    | fulltext | id_code,task  | task | 0       | NULL | 1    | Using where; Using temporary; Using filesort
1  | SIMPLE      | c     | ALL      | id_code       | NULL | NULL    | NULL | 1110 | Using where

The query returns 87% of the careers (967 out of 1110). If I drop "and" from the search and just search "pig whistle" I get only 2 careers out of 1110 and the query runs in about 5 milliseconds.

Is there some way to inspect the words stored in a MySQL full-text search index for an InnoDB table? I see there's a myisam_ftdump utility but I don't see an InnoDB equivalent. I'm thinking it'd be informative to see what the most common words are. Perhaps then I could formulate some query using another common word and see if it's also slow.
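
One avenue I haven't verified on my own server yet: if I'm reading the docs right, you can point innodb_ft_aux_table at a table (the schema name below is a placeholder) and then read the tokenized words out of INFORMATION_SCHEMA, something like:

SET GLOBAL innodb_ft_aux_table = 'mydb/tasks';

SELECT word, MAX(doc_count) AS docs
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE
GROUP BY word
ORDER BY docs DESC
LIMIT 20;

(Apparently rows inserted since the last optimize live in INNODB_FT_INDEX_CACHE rather than INNODB_FT_INDEX_TABLE, so the picture may be incomplete.)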

If anyone has thoughts about why "and" makes a search slow and, more importantly, if this problem might also happen with other strings, then I'd very much like to hear your thoughts.

In the meantime, there seem to be some good reasons for excluding punctuation and words like "and" (a rough sketch of the pre-filtering I have in mind follows this list):
- for this application, users are probably unlikely to search for specific punctuation sequences or attempt to use regex patterns
- punctuation in particular introduces regex problems for my original search function.
- MySQL full-text search ignores words shorter than 3 characters by default
- common words like "and" appear to cause performance problems
- better devs than I (the mysql devs) have utilized a stopwords list (possibly for performance reasons?)
- search results get dramatically expanded with unhelpful matches when common words are used, clouding the utility of the results
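
Here's roughly the kind of pre-filter I'm picturing -- the stopword list and the 3-character minimum are placeholders, not settled values:

function filter_search_terms($input) {
	// candidate stopwords -- a guess for now, not a final list
	$stopwords = array('and', 'or', 'to', 'of', 'the', 'a', 'an', 'in', 'for', 'with');
	// keep only letters, numbers, and whitespace
	$clean = preg_replace('/[^\pL\pN\s]+/u', ' ', $input);
	// split on whitespace, then drop stopwords and very short words
	$words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
	$keep = array();
	foreach ($words as $word) {
		if (mb_strlen($word) >= 3 && !in_array(mb_strtolower($word), $stopwords, true)) {
			$keep[] = $word;
		}
	}
	return $keep;
}

// e.g. filter_search_terms('pig and whistle!') => array('pig', 'whistle')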

    It's more concerning than that. A quick squizz at the source code says "and" is a stopword.

    ...skip 23 words ... am among amongst an and another any anybody anyhow... skip 511 words

    Oh, wait, that's the myisam implementation: storage/myisam/ft_static.cc (not ft_stopwords.cc it turns out).

    Yeah.... different storage engines do full-text search differently. Let's see if they use different lists of stopwords, too. (Refactoring? What's that?)


    Let's look under InnoDB: um, that would be storage/innobase/fts, I suppose.

    /** InnoDB default stopword list:
    There are different versions of stopwords, the stop words listed
    below comes from "Google Stopword" list. Reference:
    http://meta.wikimedia.org/wiki/Stop_word_list/google_stop_word_list.
    The final version of InnoDB default stopword list is still pending
    for decision */
    const char *fts_default_stopword[] = {
        "a",    "about", "an",  "are",  "as",   "at",    "be",   "by",
        "com",  "de",    "en",  "for",  "from", "how",   "i",    "in",
        "is",   "it",    "la",  "of",   "on",   "or",    "that", "the",
        "this", "to",    "was", "what", "when", "where", "who",  "will",
        "with", "und",   "the", "www",  NULL};
    

    (from fts0fts.cc)

    For extra giggles, have a look at the list referenced in the comment. "Pending for decision". This list in its current form and that comment are at least eight years old.


    <Insomniak`> Stupid Google
    <Insomniak`> "The" is a common word, and was not included in your search
    <Insomniak`> "Who" is a common word, and was not included in your search

    http://bash.org/?514353

    I just wrote a lengthy response to this and firefox crashed 🤮

      lengthy response condensed:

      Weedpacket It's more concerning than that. A quick squizz at the source code says "and" is a stopword.

      Are things really so bad? Docs explain that you can check default stopwords like so:

      SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD

      On my workstation, "and" is not among them:

      a about an are as at be by com de en for from how i in is it la of on or that the this to was what when where

      Also offer fairly detailed recipes/instructions for defining your own stopword tables so I'm not sure the source code (presumably containing default stopwords) is so critical. Also points out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

      Mostly worried about:
      - When will I encounter a 20-second search? Is "and" the only culprit or are there others?
      - MySQL FTS only searches whole words. I.e., search for "soft" won't match "software." My slow legacy function would find these partial words.
      - Is there some way to inspect the contents of a MySQL InnoDB full-text index? The alternative is rolling my own code to compile word counts. Was hoping to check what the second-most-popular word is (and how popular it is compared to "and") and maybe search for that. Already tried a few candidates with no joy.

      I wrote a quick script to count the frequency of words in a column in my tasks table:

      public function words() {
      	$sql = "SELECT c.id_code AS ci, c.title as c_t, ts.task FROM tasks ts, careers c
      WHERE ts.id_code = c.id_code";
      	$query = $this->db->query($sql);
      	if (!$query) {
      		throw new Exception("Query failed");
      	}
      	$retval = array();
      	$i = 0;
      	while ($row = $query->unbuffered_row("array")) {
      		$i++;
      		$word_arr = self::parse_words($row["task"]);
      		foreach($word_arr as $word) {
      			if (array_key_exists($word, $retval)) {
      				$retval[$word]++;
      			} else {
      				$retval[$word] = 1;
      			}
      		}
      	}
      	echo "$i records<br>";
      	echo count($retval) . " distinct words encountered<br>";
      	arsort($retval);
      	var_dump($retval);
      }
      
      private static function parse_words($str) {
      	// clean spaces
      	$clean_string = trim(preg_replace('/\s+/', " ", $str));
      	// remove all but letters & numbers & spaces
      	$clean_string = preg_replace("/[^\pL\pN\pZ]/u", "", $clean_string);
      	// return an array of words
      	return preg_split('/\s+/', $clean_string);
      }

      As expected, "and" is the most common word. The first lines of output:

      19530 records
      13898 distinct words encountered
      
      array (size=13898)
        'and' => int 17498
        'or' => int 15182
        'to' => int 10451
        'of' => int 6098
        'as' => int 3492
        'for' => int 3413
        'such' => int 2896
        'in' => int 2745
        'the' => int 2479
        'with' => int 2350
        'using' => int 2153
        'equipment' => int 2125
      ...

        sneakyimp Are things really so bad? Docs explain that you can check default stopwords like so:

        Well yeah, but that would mean installing MySQL.

        Also offer fairly detailed recipes/instructions for defining your own stopword tables so I'm not sure the source code (presumably containing default stopwords) is so critical.

        So you define a stopword table that does include "and".
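
        Roughly like this -- the schema name is a placeholder, InnoDB wants the stopword table to be a single column called value, and the full-text index has to be rebuilt afterwards (the index name here is a guess based on your EXPLAIN):

        CREATE TABLE my_stopwords (value VARCHAR(30)) ENGINE = INNODB;
        INSERT INTO my_stopwords (value)
        VALUES ('a'), ('an'), ('and'), ('the'), ('of'), ('to'), ('or');

        SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';

        -- rebuild the full-text index so the new stopword list is applied
        ALTER TABLE tasks DROP INDEX task, ADD FULLTEXT INDEX task (task);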

        Also points out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

        As is "The Who".

        // clean spaces
        $clean_string = trim(preg_replace('/\s+/', " ", $str));

        Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here. Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

        Weedpacket Well yeah, but that would mean installing MySQL.

        I'm guessing you use PostgreSQL?

        Weedpacket So you define a stopword table that does include "and".

        I have certainly been considering this since realizing the nasty effect "and" has on my search performance. The reason I didn't simply do so is that I was worried the problem might happen with other words -- and I'm more convinced now that it would if such a word appeared commonly.

        I feel like just adding stop words is a bit like whack-a-mole. I believe that frequent, short words might cause performance problems, but I don't really understand why this would make things slow. Is it because the code has to munge larger volumes of data somehow? It doesn't seem like the issue arises at the UNION stage, because the individual subqueries -- which just calculate a floating-point score for each record -- are slow. I feel like something must be really inefficient, but I don't really have the time (or coding chops) to get to the bottom of it.

        Weedpacket As is "The Who".

        I expect I will define an array of stopwords -- in code rather than in additional tables -- because the right stopwords clearly depend on the text being searched and the type of search we want to run against that text. I doubt anyone will search career-related data for "the who" or "to be or not to be", and I'll assert, as before, that "and" is totally unimportant to the types of searches to be performed on this data.

        Weedpacket Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here.

        Thanks for that suggestion. It occurs to me now that preg_split on \s+ alone is sufficient.

        Weedpacket Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

        Not sure what you mean by 'final array' but I'm guessing that you are referring to my array of search words. I want to check first if duplicate search words change the scores. E.g., if a search for "tiger" yields any different results than a search for "tiger tiger."

        Also, I might pick a nit and say that the appearance of a search term more than once in the record being searched does matter and, in fact, will yield a higher score. I want to say (but have no proof) that the additional effort to score multiple matches higher than a single match is very closely related to the performance problem with searching for "and." My old approach, using REGEXP and the word-boundary markers, is actually faster than the MySQL natural language search for this query:

        search term is "such using equipment"
        old has 1011 matches
        old_elapsed: 1.6434950828552
        new has 1011 matches
        new_elapsed: 4.6821620464325

        I believe this is because the REGEXP just returns a binary true/false once a match is found, whereas the mysql natural language search continues to munge the text looking for additional matches to finish calculating a relevance score.

          I've developed a theory about the slowness of the mysql natural language search (MNLS). The slowness is because it must fully munge all of the text in any matching record. If you get a lot of matches, this is a lot of munging. The MNLS benefits from a full-text search index but that just identifies which records contain a given word. My original REGEXP query must do a full table scan every time apparently, but as soon as it finds a match, it can return TRUE and ignore the rest of the text. Because MNLS must calculate a relevance score, it has to munge all the text stored in that record's column to fully calculate the relevance score.

            In support of my theory, I created a few entries in my tasks table where the task column just contains the word "tiger". I made four records:

            99999  - tiger
            100000 - tiger tiger
            100001 - tiger tiger tiger tiger 
            100002 - tiger tiger tiger tiger tiger tiger tiger tiger 

            and ran this query:

            SELECT ts.task_id, MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE) AS score
            FROM onet_task_statements ts
            WHERE MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE)
            ORDER BY score ASC

            Sure enough, MNLS gives higher scores to the records that have more instances of the word. In fact, the score is precisely proportional to the number of occurrences:

            99999 	13.566825866699219
            100000 	27.133651733398438
            100001 	54.267303466796875
            100002 	108.53460693359375

            More precisely, score = number_of_occurrences * 13.566825866699219.

              I hate to jump into a highly technical discussion, especially with a less-than-technical point, but is there any reason why you couldn't bolt a real search engine into your app (Sphinx, Elasticsearch, etc.)?

              MySQL's relatively poor search methods were part of the reason we decided to do this a few years back.

              dalecosp A welcome suggestion! Can you elaborate? The data is from a third party and arrives in a DB. I'm wondering how an outside app would get at the data?

                Two additional observations.

                First, assuming I've got those tiger*n entries just mentioned above, a search for just "tiger" yields different relevance scores than a search for "tiger tiger", but the latter search also matches the record with just a single "tiger" in it. The latter search yields all the same results as the former, except the scores are all reduced by a factor of 1.4289674259:

                // search results for 'tiger tiger':
                99999 	9.494146347045898
                100000 	18.988292694091797
                100001 	37.976585388183594
                100002 	75.95317077636719

                Second observation. MNLS recognizes quotes in your search string. This search:

                SELECT ts.task_id, MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE) AS score
                FROM onet_task_statements ts
                WHERE MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE)
                ORDER BY score DESC

                excludes the record with only one "tiger" in it.

                  It looks like Sphinx can suck text directly from a MySQL db.
                  I presume Elasticsearch (based on Lucene) can as well, but being a Java application there's a shipload of configuration involved and I didn't wave my machete enough to find the source of the data.

                    We use Elasticsearch (along with the entire ELK stack) at work for a number of things, and it seems quite good -- though I'm far from an expert, mostly just using the Kibana part of the stack for my needs. I dabbled with Sphinx several years ago, and it seemed quite performant and relatively easy to use in conjunction with MySQL, if my memory serves. Either is almost certainly going to perform much better than full table scans in MySQL against regexes if the data reaches any significant size (for an undefined value of "significant").

                    PS: I suppose somewhere in between might be to leverage "full text search" in MySQL? https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

                      Sphinxsearch is written in C, but most distros and FreeBSD, etc., have packages available or you can build from a source tarball. It requires an SQL server; unlike Weed we're into MySQL (not necessarily completely or by choice), and Sphinx is good at that---off the top of my head I'm not sure what else they support.

                      Sphinx runs its own "searchd" on a different port (9306 is the default, as I recall) and the syntax for querying it is pretty much standard/MySQL. So, we have a "searchDB" object that manages its own connection to localhost 9306, and our search objects do something like "select foo, bar, baz from ombe_index where match('$term_string')" to get data to send back. IIRC, I clean $term_string with a must-match regexp, which is pretty much what you said you wanted to do originally.
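
                      Stripped way down, the plumbing looks something like this (ombe_index and the column names are from our setup; yours would obviously differ):

                      // searchd speaks the MySQL wire protocol, so plain mysqli works;
                      // no username/password/database needed -- just the SphinxQL port.
                      $sphinx = mysqli_connect('127.0.0.1', '', '', '', 9306);

                      // we sanitize $term_string before it gets here; escaping shown for safety
                      $term = mysqli_real_escape_string($sphinx, $term_string);
                      $result = mysqli_query($sphinx, "SELECT foo, bar, baz FROM ombe_index WHERE MATCH('$term')");

                      while ($row = mysqli_fetch_assoc($result)) {
                          // hand each hit back to the caller
                      }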

                      A cronjob runs "indexer" just after 3 AM daily to update ombe_index, and a secondary "delta" indexer runs a few times during the day to catch up on new product listings, etc. --- which probably means my example query above is too simple ... there's probably a UNION statement so the delta index is also included in search results.

                      Hope that's helpful.

                        Hopefully it'd be more performant than if you happen to enter a common word into this forum's search box....

                          I've been fairly pleased with it in terms of performance. The "commercial package" that once ran our search sometimes took tens of seconds to produce a results page. Now it takes a couple seconds at most including the redirect-after-POST page load.

                          I hope I've not oversimplified it in my comment above. I should probably mention that the "SphinxQL" is not exactly MySQL and lacks a few language features you might expect in a 'modern' MySQL; it feels to me a little like talking to a rather old MySQL server, which might really annoy people who are used to NoSQL(?), Maria(?) or PostGres(?) or some other lingo ...

                          As for stopwords, you define them yourself in text files (referenced in sphinx.conf) in grand UNIX tradition.

                            These 3rd party things look verrrrrry interesting. Another person had mentioned SOLR, but it's also written in Java apparently.

                            sneakyimp Another person had mentioned SOLR

                            One of our projects uses that for caching/indexing provider data to optimize searches: not sure how effective it is for fuzzy matches, but definitely fast for explicit matches.

                              So I've written a first-in-first-out cache class to help speed up search results. My data rarely changes, so the class limits the size of the cache by rotating out the oldest cache items. The cache creates files on the file system named after the key used. That being the case, I'd like to be able to match cache items to the user-supplied search strings, so I need to hash them. MD5 is vulnerable to collisions (which are apparently pretty rare), but I've read that it's faster than SHA-1. Any thoughts on which hash function to use?

                              E.g., if a user searches for 'my dream career,' I'd like to hash that into a filename so I can locate the cache entry the next time someone searches for that same string.
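
                              Something like this is what I'm picturing (the method name is made up for illustration):

                              // derive a filesystem-safe cache filename from the raw search string;
                              // normalizing first lets "My Dream Career" and "my  dream career" share an entry
                              private static function cache_key($search) {
                              	$normalized = mb_strtolower(trim(preg_replace('/\s+/', ' ', $search)));
                              	// sha1 gives a 40-char hex digest; md5 would work too -- the key only needs
                              	// to avoid accidental collisions, not be cryptographically strong
                              	return sha1($normalized) . '.cache';
                              }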