NogDog

If I understand laserlight's post correctly -- and I haven't bothered to examine any DBMS source code here -- the theoretical difference is that prepared statements deliver your query to the DBMS with the SQL as one string and the parameters as entirely separate data structures. Because of this separation, the absence of quotes in your SQL, or the presence of unmatched quotes in those separate data structures, no longer leads to an SQL injection vulnerability. An analogy might be parsing CSV data versus already having an array of distinct values. The data in prepared statements is more highly structured, such that the parsing engine needn't bother to check for escape sequences or quotes -- it has the separate data objects and treats them as such. Phrased another way, prepared statements provide better data integrity because your query and its data remain isolated in transit from your PHP code to the DBMS, and they are less vulnerable to exploit because they don't have to be translated into the less-structured format of an SQL string, which is in reality a mixture of data and instructions for manipulating that data. Phrased yet another way: prepared statements are a more highly structured client-server protocol.
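A minimal sketch of that separation in PHP's PDO (table and column names hypothetical). The SQL string the parser sees never changes; the data travels as a separate structure:

```php
<?php
// Hostile input that would break out of a quoted string literal
$userInput = "x' OR '1'='1";

// String interpolation: the data becomes part of the SQL text the parser sees
$unsafe = "SELECT * FROM users WHERE name = '" . $userInput . "'";

// Prepared statement: the SQL text contains only a placeholder; the value
// is shipped to the DBMS separately and is never parsed as SQL
$sql = "SELECT * FROM users WHERE name = :name";
// $stmt = $pdo->prepare($sql);
// $stmt->execute([':name' => $userInput]);
```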

CodeIgniter may be abstracting that away if you use their modelling paradigm, and thus be fine -- unless/until you override it with your own explicit SQL.

CodeIgniter's query builder apparently mimics the behavior of prepared statements, but does not use the actual PDO prepare & execute functions even if your config file sets dbdriver=pdo. That being the case, its success & security against SQL injection will depend on its own implementation of quoting/escaping functionality, which is probably NOT aware of column data types. I don't know for certain that it's not column-aware, but I suspect as much because all data retrieved from the DBMS comes back as string values.

In my search code above, you can see that I am in fact constructing explicit SQL. That code doesn't yet quote/escape the keywords supplied. I'm looking into using PDO and prepared statements first, but the fact that these keywords are fed into REGEXP expressions complicates matters.

    I admittedly have not followed this whole discussion, but the escaping does not actually end up in the DB, just in the SQL*. So when you either do subsequent queries using DB regex functions, or pull the data into your app and apply application regex functions, you should not have to worry about that escaping affecting it -- or I'm totally missing the context and you should just ignore me.

    ======================
    * Sort of like the backslashes do not actually get output in PHP:

    echo 'This here\'s a test, y\'all.';

    NogDog

    I'm familiar with how escape characters in one's code don't actually make their way into the string defined. Escaping them is necessary because, for instance, you need to distinguish quotes that delimit your string from the quotes you might want IN your string. Totally understand that.

    What concerns me is that I want my keywords to be part of a REGEXP expression inside an SQL expression, and escaping is different for SQL than it is for REGEXP. PHP has a preg_quote function for the latter, for example.

    Some example questions:
    - what if one of my keywords is wh*ee -- do I need to escape the asterisk? Is there a function or guidelines for this?
    - what if one of my keywords is [[:alnum:]]+ -- should I escape the square brackets?
    - what if the db escape (like PDO::quote) function returns quotes in its output? Does that mean I need to escape my entire regexp expression even though this expression is intended to be interpreted as SQL regexp?

    For that last example, consider this code:

    $db = new PDO("mysql:host=localhost;dbname=my_db", "user", "pass");
    $keyword = "foo";
    $sql = "SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]" . $db->quote($keyword) . "[[:>:]]'";
    echo $sql . "\n";

    The output is broken SQL because the PDO::quote function adds single quotes:

    SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]'foo'[[:>:]]'

    I can fix that particular query by quoting the surrounding regexp markers along with the keyword:

    $keyword = "[[:<:]]foo[[:>:]]";
    $sql = "SELECT * FROM my_table WHERE my_col REGEXP " . $db->quote($keyword);

    But is this really what I want to do for a general solution? Are there any chars that might be in my regex which I DON'T want quoted? Like what if my keyword itself contained a single or double quote? This code:

    $keyword = '[[:<:]]f"oo[[:>:]]';
    $sql = "SELECT * FROM my_table WHERE my_col REGEXP " . $db->quote($keyword);

    results in the double quote also being escaped:

    SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]f\"oo[[:>:]]'

    NOTE this does actually work -- if you run that query, it'll locate records containing the string f"oo -- but I think it illustrates my concern about escaping. I'd like to avoid crosstalk between escaping REGEXP chars and escaping SQL search keywords.

    I'd also point out that the MySQL docs on regex don't talk much about escaping anything, nor does the ICU reference they point to.

      sneakyimp wrote:

      I had seen this -- and I may be paranoid -- but I'm very suspicious of this claim of immunity and believe that relying on this automatic escaping would be far too credulous. I'm curious what sort of logic/mechanism might ensure this 'immunity' but dread the prospect of trying to read source code for MySQL, PostgreSQL, etc.

      I explained this in my elaboration. It isn't "automatic escaping", unless emulated. I'm more familiar with SQLite because they have a digestible newbie explanation for what's going on, but basically we're looking at the SQL statement getting parsed into byte code, and from there the data is processed by the byte code. Preparing the statement means that this parsing doesn't have to happen each time the statement is executed (without using a query cache), which can provide a small advantage in efficiency, but where we are concerned about SQL injection, it means that the parser has no chance to mistake data for SQL code and hence produce byte code that is different from what was intended by the author of the SQL code. That's what grants immunity from SQL injection.

      sneakyimp wrote:

      I'd also point out that PHP's PDO sometimes just emulates real PDO behavior, depending on how one has installed the MySQL client.

      It sounds like you didn't read my elaboration, otherwise you would have seen that I already mentioned it:

      laserlight wrote:

      The key idea is that unless the prepared statement's bound parameters lacks native support such that it is actually emulated by quoting and escaping, the data is separated from the SQL statement. Hence, security related bugs aside, it is impossible to construct an SQL injection because the data is always merely data; it will never be treated as SQL code. If you use PDO::quote, then at some point the data becomes part of the SQL statement, although it is after quoting and escaping.

      Of course, if it does fall back to emulation, then while it is no better than plain escaping, it's no worse either (and I guess that's where "automatic" as an advantage comes in, but it probably isn't much of an advantage when you have frameworks that take care of things for you).
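      In PHP, whether PDO falls back to emulation is controllable; a sketch of forcing native server-side prepares (DSN and credentials hypothetical):

```php
<?php
// With emulation off, PDO asks the driver for a real server-side prepare
// instead of client-side quoting/escaping of bound values.
$options = [
    PDO::ATTR_EMULATE_PREPARES => false,                 // native prepares
    PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
];
// $db = new PDO('mysql:host=localhost;dbname=my_db;charset=utf8mb4',
//               'user', 'pass', $options);
```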

      sneakyimp wrote:

      Furthermore, application-level validation can check if your data matches application-level requirements. E.g., the DBMS engine's value escaping won't check if the supplied email address is valid before cramming it into a VARCHAR field.

      Yes, of course: preventing SQL injection through the use of bound parameters, or some kind of escaping mechanism, is not the same thing as application-level validation, and typically they would both be applied. But do you have application-level requirements other than "stop SQL injection"? You didn't mention any other than "we exclude trivial words like and and or", and that's technically not validation either: you're cleaning the data through the removal of those words, but not validating it to reject it if it contains those words.

      laserlight
      I apologize if I did not make reference to your elaboration, which I appreciate very much. I did in fact read it several times and had difficulty understanding the first half. It was the latter half that helped me to comprehend the actual mechanism by which prepared statements prevent SQL injection. Only you calling me out here brought me to understand that I was paraphrasing what you already explained. Your further elucidation has made it even clearer and I am even deeper in your debt. Thank you for the additional detail about byte code et al. My reading comprehension skills are not especially sharp sometimes and I apologize if I seem obtuse or ungrateful. Also -- and this may be an irritating habit -- paraphrasing the descriptions of others really helps me to solidify my understanding of complicated concepts.

      Regarding my application-level validation, I'm stripping out everything (or trying to anyway) except letters, numbers, and spaces from the user search string. This serves two purposes (albeit in a ham-fisted way): 1) it should prevent SQL injection arguably more effectively than the dubious stack of PDO emulation and framework code I'm building on and 2) should eliminate the need to worry about escaping any special REGEX chars like asterisks or square brackets or $ or ^ or whatever.
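      A sketch of that ham-fisted cleanup (the exact character class is an assumption -- hyphens or other characters could be added to it as needed):

```php
<?php
// Replace any run of characters other than letters, digits, and spaces
// with a single space, then collapse whitespace and trim.
function strip_to_words(string $search): string
{
    $clean = preg_replace('/[^A-Za-z0-9 ]+/', ' ', $search);
    return trim(preg_replace('/\s+/', ' ', $clean));
}

echo strip_to_words('[video]--! games\\');  // video games
```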

      No one has commented much on the SQL-generating code I posted which makes use of the word-end boundary markers ([[:<:]] and [[:>:]]). My regex cleanup in PHP should still allow users to search for actual words and numbers so it still supports fundamentally useful functionality and at the same time I need not face the task of sorting out how to escape my keywords and/or regex expressions in order to preserve the original keywords while still providing safe, functional code.

      I guess you could say that one of my application requirements is "search the four different columns in the four different tables while preserving the original keywords, my word-boundary logic, and the relative scoring of the tables and full matches versus partial matches." I'm concerned that the word-boundary regex I have (which does a fair job on our production server) will just run into trouble if we start allowing more punctuation into the SQL. I'm entirely open to other methods of full text search, but have no experience with them. If anyone wants to suggest something, I'd be grateful. I appreciated Weedpacket's sound-alike suggestion very much but this sounds like a next-level enhancement. I'm mostly concerned about performance and precision at this point.

        As far as escaping for REGEXP goes: since (I'm assuming) you don't want users writing arbitrary regular expressions as their search terms, you'd need to quote anything your DBMS considers having significance in regular expressions (although you probably aren't interested in any of them: just keep whitespace, numberlike, and letterlike characters, and maybe the hyphen and there wouldn't be anything to disrupt a regexp).

        As far as my soundex suggestion goes, even without that there is still the literal index lookup idea: keep a table <word> | <record containing that word>: look in that table for each word in the search string, that gives you the records in which each word appears. That saves having to do potentially hairy searching of the records themselves. One thing I did in my code was score each record by how many of the search terms it contained, and sorted matches by that. (Note that this is basically a dumbed-down full-text-search index; if I had one at the time I'd've used that.) Of course it's necessary to keep that table up to date if/when the records being searched change.
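        A minimal sketch of that lookup table and its scoring query (table and column names hypothetical; MySQL dialect assumed):

```sql
-- One row per (word, record) pair; must be refreshed when records change.
CREATE TABLE word_index (
    word      VARCHAR(64) NOT NULL,
    record_id INT         NOT NULL,
    PRIMARY KEY (word, record_id)
);

-- Score each record by how many of the search terms it contains.
SELECT record_id, COUNT(*) AS score
FROM word_index
WHERE word IN ('pig', 'whistle')
GROUP BY record_id
ORDER BY score DESC;
```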

        (Oh, and I've finally remembered what the idea behind the extra "keyword lookup" table is called: KWIC indexing).

        Weedpacket

        As far as escaping for REGEXP goes: since (I'm assuming) you don't want users writing arbitrary regular expressions as their search terms, you'd need to quote anything your DBMS considers having significance in regular expressions (although you probably aren't interested in any of them: just keep whitespace, numberlike, and letterlike characters, and maybe the hyphen and there wouldn't be anything to disrupt a regexp).

        If only there were some function like preg_quote for escaping keywords to be fed into an SQL REGEXP expression. I expect this would be a fairly involved problem to solve in any general way -- especially considering that hyphens are meaningful within square brackets but not outside of the square brackets. I'm reminded of the complicated url validation issue, which is what led me to the decision of stripping all but numbers & letters & spaces (and maybe hyphens too).
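        For what it's worth, one could sketch such a helper for literal keywords destined for a spot outside any bracket expression -- hedged heavily, since this only covers POSIX-ERE-style metacharacters and MySQL 8's ICU engine may differ:

```php
<?php
// Backslash-escape characters that are special in a POSIX-style extended
// regexp outside of [...]. The hyphen is deliberately omitted: it is only
// special inside a bracket expression.
function regexp_quote_keyword(string $keyword): string
{
    return preg_replace('/[.\\\\+*?\\[\\]^$(){}|]/', '\\\\$0', $keyword);
}

echo regexp_quote_keyword('wh*ee');        // wh\*ee
echo "\n";
echo regexp_quote_keyword('[[:alnum:]]+'); // \[\[:alnum:\]\]\+
```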

        Weedpacket As far as my soundex suggestion goes, even without that there is still the literal index lookup idea...

        I remember seeing this approach used by phpBB back in the day and, while it seems useful and fairly clever, I expect that there's a fair amount of effort involved, both initially and ongoing, to munge the various tables' columns, generate the index tables, update those indexes when the data changes, etc. I cannot help but think that doing so would be reinventing the wheel.

        I looked into MySQL Natural Language Full-Text Searches and it's pretty exciting. PostgreSQL has something similar.

        With MySQL, I can simply feed the user's original search string into my SQL query and it handles all the detail:

        SELECT id_code, title, description, MATCH(description) AGAINST ('video games' IN NATURAL LANGUAGE MODE) AS score FROM other_data_table ORDER by score DESC

        I've only just done a cursory inspection of the search results but it would appear that this natural language search, like my suggested approach, ignores punctuation. This is hardly scientific or comprehensive, but this search yields essentially the same search results:

        SELECT id_code, title, description, MATCH(description) AGAINST ('[video]--! * ., ., .,  games\\\\ ' IN NATURAL LANGUAGE MODE) AS score FROM other_data_table ORDER by score DESC

        The scores returned in the second gobbledygook punctuation search are slightly lower (maybe 2-5%) than in the first query, but the ids and titles returned are the same and in the same order. I feel that, to some degree, this vindicates my earlier decision to strip out the punctuation.

        It is noteworthy that I must add a fulltext index on the columns to be searched, but it's a lot easier to run one SQL command or click a link in phpmyadmin than to create my own indexing scheme. Additionally, I can apparently feed the user's search query right into the query without massaging the keywords and constructing numerous queries of my own. This is a considerable improvement over my code which has 4 queries per keyword.

        sneakyimp especially considering that hyphens are meaningful within square brackets but not outside of the square brackets.

        Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.

        And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.

        Weedpacket Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.

        My (perhaps cowardly) point about the hyphen in a regex expression was to suggest dread at the prospect of writing some function, analogous to preg_quote, to escape keywords designed to be fed from PHP code into a REGEXP inside an SQL statement. I attempted to point out that a hyphen in a regex need not be escaped unless it is part of a square-bracketed character range in which you want to actually match a literal hyphen, e.g., this one that matches either a single digit or a hyphen: /[0-9\-]/

        Interestingly, this script has the same output for the first 3 regexes:

        $regexes = array(
          '/-/',      // a plain hyphen
          '/\-/',     // pattern /\-/ -- same match as above
          '/\\-/',    // PHP '\\' is one backslash, so this is also /\-/
          '/\\\\-/'   // pattern /\\-/ -- a literal backslash then a hyphen
        );
        
        foreach ($regexes as $r) {
          echo $r . "\n";
          $matches = null;
          // the double-quoted subject "\-" is a backslash followed by a hyphen
          if (!preg_match($r, "\-", $matches)) {
            echo "no match\n";
          } else {
            var_dump($matches);
          }
          echo "\n";
        }

        Also interesting is that preg_quote escapes a hyphen:

        // outputs:  string(2) "\-"
        var_dump(preg_quote("-", "/"));

        And this is only one of the many special regex characters. Check out this script:

        $matches = null;
        if (!preg_match('/[*]/', "*", $matches)) {
          echo "no match\n";
        } else {
          var_dump($matches);
        }

        The bracketed char range [*] matches the string with one asterisk. The output:

        array(1) {
          [0] =>
          string(1) "*"
        }

        Whereas preg_quote will definitely escape an asterisk:

        // outputs: string(2) "\*"
        var_dump(preg_quote("*", "/"));

        I see that [\*] will also match a string containing a single asterisk, but I'm definitely feeling uneasy about the prospect of writing something that will properly escape any keyword or char that I might want to feed into a partial REGEXP expression inside an SQL statement. The way that multiple regexes, with or without escaped special chars, identically match a given string seems especially complicated when I might be feeding the keyword into either of these expressions:

        SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]mychar'
        SELECT * FROM my_table WHERE my_col REGEXP '[mychar0-9]'

        Then also consider that I might be using a prepared statement where each value to be merged is represented by the bind character, ?.

        As I said before, it starts to remind me of that nasty url-validating issue.

        Weedpacket And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.

        I am always impressed with the depth to which you understand code. Such a thing would never have occurred to me. I will not be doing this. I'd much rather pose the question why allow punctuation or line breaks in a full text search at all? For coders and pedants like us, we might want to search for some peculiar series of characters. E.g., I frequently grep search for something->methodName or whatever, but I doubt the chuckleheads very nice people who use my site have any need at all for such a thing. I'm pretty comfortable depriving my users of punctuation search -- unless someone can give a good reason not to.

        MEANWHILE...

        I've had some good luck using MySQL's natural language search functionality. A couple of noteworthy points:
        - The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. This setting can apparently be configured and is applied at the time an index is generated. See caveats in the docs for more info.
        - There are numerous stopwords.
        - "and" is apparently not one of the default stop words and, for some mystifying reason, causes a modest fulltext search to run VERY slowly and return A LOT more results. I don't know why this happens.
        - overall, my old search approach where I generate the regex tends to return more results (unless "and" is present in the search string)
        - the old search that uses regexes is MUCH slower (unless "and" is present in the search terms).
        - I've not done a thorough test yet, but the sequence of the search keywords apparently doesn't make any difference

        ALSO: New code uses PDO

        I'll post the new code momentarily. This post is already a bit long.

        This new function uses PDO and MySQL natural language search. It is about 4 times faster than the prior function I posted unless the word "and" is in my search terms.

        /**
         * This function trims and cleans the search query so that it can be
         * safely fed directly to a mysql natural language search
         * @param string $search_query
         * @return string
         */
        private static function clean_search_string($search_query) {
        	// convert any newline chars or multiple spaces into single spaces
        	$clean_string = trim(preg_replace('/\s+/', " ", $search_query));
        	
        	// TODO: consider removing trivial or bad words like and, or, etc. Maybe scrubbing for hacker shit?
        	
        	return $clean_string;
        }
        
        /**
         * Updated search function that uses MySQL natural language searching; only requires 4 queries
         * @see https://dev.mysql.com/doc/refman/8.0/en/fulltext-natural-language.html
         * @param CI_DB $db CodeIgniter DB
         * @param string $search_query user-supplied search query MAY COME DIRECTLY FROM USER INPUT SO USE CAUTION
         * @throws Exception
         */
        public static function career_search_new($db, $search_query) {
        	$clean_string = self::clean_search_string($search_query);
        	if (mb_strlen($clean_string) == 0) {
        		throw new Exception("search_query empty after cleaning");
        	}
        	
        	$sql = array();
        	// careers.title
        	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, (MATCH(title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_title_factor AS score, 'q_ct' AS qid
        			FROM " . TABLE_1 . " c
        			WHERE MATCH(title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)";
        	
        	// career task statements
        	// takes an average of all rows matching the current c_i
        	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(ts.task) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_task_statement_factor AS score, 'q_ts' AS qid
        			FROM " . TASK_STATEMENTS_TABLE . " ts, " . TABLE_1 . " c
        			WHERE ts.id_code = c.id_code
        				AND MATCH(ts.task) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
        			GROUP BY c_i";
        	
        	//	alternate titles
        	// takes an average of all rows matching the current c_i
        	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(oat.alternate_title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_alternate_title_factor AS score, 'q_at' AS qid
        			FROM " . ALTERNATE_TITLES_TABLE . " oat, " . TABLE_1 . " c
        			WHERE oat.id_code = c.id_code
        				AND MATCH(oat.alternate_title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
        			GROUP BY c_i";
        
        	// occupation data
        	// takes an average of all rows matching the current c_i
        	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(od.description) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_occupation_data_factor AS score, 'q_od' AS qid
        			FROM " . OD_TABLE . " od, " . TABLE_1 . " c
        			WHERE od.id_code = c.id_code
        				AND MATCH(od.description) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
        			GROUP BY c_i";
        	
        	// aggregate the above queries using UNION into a single query for efficiency
        	$combined_sql = "SELECT c_i, c_t, seo_title, SUM(score) AS score
        			FROM (" . implode("\nUNION\n", $sql) . ") AS union_query
        			GROUP BY c_i";
        
        	// for testing/inspection
        // 		$combined_sql = implode("\nUNION\n", $sql);
        	
        	// params to supply to the PDO prepared statement
        	$sql_params = array(
        			":clean_string" => $clean_string,
        			":career_title_factor" => self::career_title_factor,
        			":career_task_statement_factor" => self::career_task_statement_factor,
        			":career_alternate_title_factor" => self::career_alternate_title_factor,
        			":career_occupation_data_factor" => self::career_occupation_data_factor
        	);
        	
        	$retval = self::pdo_fetch_all($db, $combined_sql, $sql_params);
        	
        	// for testing, remove for production
        // 		usort($retval, "self::sort_by_score");
        	
        	return $retval;
        	
        }
        
        /**
         * Bypasses CodeIgniter db and uses PDO object directly to prepare statement and execute with provided parameters
         * @param CI_DB $db Codeigniter DB object
         * @param string $sql SQL statement to be prepared for execution
         * @param array $params Either an associative array of named bindings or just an array of values for ? bindings
         * @throws Exception
         */
        private static function pdo_fetch_all($db, $sql, $params) {
        
        	// when using pdo for db connection, the PDO object is $db->conn_id;
        	// set PDO to throw exceptions for errors or you might have trouble figuring out problems
        	if (!$db->conn_id->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION)) {
        		throw new Exception("Unable to set PDO attribute");
        	}
        	
        	$stmt = $db->conn_id->prepare($sql);
        	if (!$stmt) {
        		throw new Exception("Statement prepare failed");
        	}
        	$query = $stmt->execute($params);
        	if (!$query) {
        		throw new Exception("query failed");
        	}
        	// this would be an array of arrays
        	$retval = $stmt->fetchAll(PDO::FETCH_ASSOC);
        	
        	return $retval;
        	
        }

          sneakyimp Also interesting is that preg_quote escapes a hyphen:

          Because preg_quote itself doesn't know your intentions about where in the regular expression you're going to be inserting the string it was passed: for all it knows you might have been collecting (and quoting) a set of characters with the intention of putting it inside a [] pair. All it's able to assume is that the characters of the string you give it are not to be interpreted as regexp operators.

          Weedpacket Because preg_quote itself doesn't know your intentions about where in the regular expression you're going to be inserting the string it was passed: for all it knows you might have been collecting (and quoting) a set of characters with the intention of putting it inside a [] pair. All it's able to assume is that the characters of the string you give it are not to be interpreted as regexp operators.

          I want to say it's noteworthy that the devs of preg have not bothered to make some kind of prepare_preg function that is context-aware when escaping some arbitrary piece of data to be merged into a regex. I expect there's probably little need or demand for such a function. I do wonder about 3 of the 4 regexes with varying numbers of backslashes above all returning the same matching results. I don't think I'm equipped to really ruminate on this and have some grand epiphany about how to generally solve the problem of escaping from PHP->SQL->REGEXP with deterministic results. Certainly not on this project.

          I also believe that my experience with the MySQL Natural Language Full-Text Search functionality is reinforcing my instincts as far as stripping keywords goes, but I don't want to get too uppity. As previously mentioned, the presence of the word "and" is bogging down my searches. Having somewhat painstakingly examined various search strings and their results, various EXPLAIN statements, forum posts, documentation, and rants I am starting to think this and-keyword-slowness problem is not due to any functionally different treatment of this word as an operator but rather the fact that the word "and" is very common, yielding a very large number of matches which forces a lot of data comparisons. I find it puzzling that "and" is missing from the default MySQL stopwords file precisely because it is going to appear in pretty much any English string of sufficient length. I think my instincts were right to exclude this word from a full text search. HOWEVER, I'd like to definitively resolve why the presence of "and" in any search will bog things down.

          You don't even need to run the whole big UNIONized query to get slowness. My tasks subquery runs slow -- around 8 seconds:

          SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE)) * 3 AS score, 'q_ts' AS qid FROM tasks ts, careers c WHERE ts.id_code = c.id_code AND MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE) GROUP BY c_i

          If I EXPLAIN it I get this. Seems OK to me but I might be overlooking something? I'm not really sure what to make of it.

          id  select_type  table  type      possible_keys  key   key_len  ref   rows  Extra
          1   SIMPLE       ts     fulltext  id_code,task   task  0        NULL  1     Using where; Using temporary; Using filesort
          1   SIMPLE       c      ALL       id_code        NULL  NULL     NULL  1110  Using where

          The query returns 87% of the careers (967 out of 1110). If I drop "and" from the search and just search "pig whistle" I get only 2 careers out of 1110 and the query runs in about 5 milliseconds.

          Is there some way to inspect the words stored in a MySQL full-text search index for an Innodb table? I see there's a myisam_ftdump function but don't see an innodb one. I'm thinking it'd be informative to see what the most common words are. Perhaps then I could formulate some query using another common word and see if it's also slow.
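          For InnoDB, the documented route appears to be INFORMATION_SCHEMA rather than a dump utility; a sketch (database/table names hypothetical; requires elevated privileges):

```sql
-- Point the diagnostic tables at the full-text index to inspect...
SET GLOBAL innodb_ft_aux_table = 'my_db/tasks';

-- ...then list the most common words in that index.
SELECT word, doc_count
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE
ORDER BY doc_count DESC
LIMIT 20;
```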

          If anyone has thoughts about why "and" makes a search slow and, more importantly, if this problem might also happen with other strings, then I'd very much like to hear your thoughts.

          In the meantime, there seem to be some good reasons for excluding punctuation and words like "and."
          - for this application, users are probably unlikely to search for specific punctuation sequences or attempt to use regex patterns
          - punctuation in particular introduces regex problems for my original search function.
          - mysql full text search excludes any strings less than 3 chars long by default
          - common words like "and" appear to cause performance problems
          - better devs than I (the mysql devs) have utilized a stopwords list (possibly for performance reasons?)
          - search results get dramatically expanded with unhelpful matches when common words are used, clouding the utility of the results

            It's more concerning than that. A quick squizz at the source code says "and" is a stopword.

            ...skip 23 words ... am among amongst an and another any anybody anyhow ... skip 511 words

            Oh, wait, that's the myisam implementation: storage/myisam/ft_static.cc (not ft_stopwords.cc it turns out).

            Yeah.... different storage engines do full-text search differently. Let's see if they use different lists of stopwords, too. (Refactoring? What's that?)


            Let's look under innodb: Um, that would be storage/innobase/fts I suppose.

            /** InnoDB default stopword list:
            There are different versions of stopwords, the stop words listed
            below comes from "Google Stopword" list. Reference:
            http://meta.wikimedia.org/wiki/Stop_word_list/google_stop_word_list.
            The final version of InnoDB default stopword list is still pending
            for decision */
            const char *fts_default_stopword[] = {
                "a",    "about", "an",  "are",  "as",   "at",    "be",   "by",
                "com",  "de",    "en",  "for",  "from", "how",   "i",    "in",
                "is",   "it",    "la",  "of",   "on",   "or",    "that", "the",
                "this", "to",    "was", "what", "when", "where", "who",  "will",
                "with", "und",   "the", "www",  NULL};
            

            (from fts0fts.cc)

            For extra giggles, have a look at the list referenced in the comment. "Pending for decision". This list in its current form and that comment are at least eight years old.


            <Insomniak`> Stupid Google
            <Insomniak`> "The" is a common word, and was not included in your search
            <Insomniak`> "Who" is a common word, and was not included in your search

            http://bash.org/?514353

            I just wrote a lengthy response to this and firefox crashed 🤮

              lengthy response condensed:

              Weedpacket It's more concerning than that. A quick squizz at the source code says "and" is a stopword.

              Are things really so bad? Docs explain that you can check default stopwords like so:

              SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD

              On my workstation, "and" is not among them:

              a about an are as at be by com de en for from how i in is it la of on or that the this to was what when where

              The docs also offer fairly detailed recipes/instructions for defining your own stopword tables, so I'm not sure the source code (which presumably contains the default stopwords) is so critical. They also point out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

              Mostly worried about:
              - When will I encounter a 20-second search? Is "and" the only culprit or are there others?
              - MySQL FTS only matches whole words. E.g., a search for "soft" won't match "software." (Boolean mode does allow a trailing wildcard like soft*, but natural language mode does not.) My slow legacy function would find these partial words.
              - is there some way to inspect the contents of a MySQL InnoDB full-text index? The alternative is to roll my own code to compile word counts. I was hoping to check what the second-most-popular word is (and how popular it is in comparison to "and") and maybe search for that. I've already tried a few candidates with no joy.

              I wrote a quick script to count the frequency of words in a column in my tasks table:

              public function words() {
              	$sql = "SELECT c.id_code AS ci, c.title as c_t, ts.task FROM tasks ts, careers c
              WHERE ts.id_code = c.id_code";
              	$query = $this->db->query($sql);
              	if (!$query) {
              		throw new Exception("Query failed");
              	}
              	$retval = array();
              	$i = 0;
              	while ($row = $query->unbuffered_row("array")) {
              		$i++;
              		$word_arr = self::parse_words($row["task"]);
              		foreach($word_arr as $word) {
              			if (array_key_exists($word, $retval)) {
              				$retval[$word]++;
              			} else {
              				$retval[$word] = 1;
              			}
              		}
              	}
              	echo "$i records<br>";
              	echo count($retval) . " distinct words encountered<br>";
              	arsort($retval);
              	var_dump($retval);
              }
              
              private static function parse_words($str) {
              	// clean spaces
              	$clean_string = trim(preg_replace('/\s+/', " ", $str));
              	// replace all but letters & numbers & spaces with a space
              	// (so "and/or" splits into "and" + "or" rather than gluing into "andor")
              	$clean_string = preg_replace("/[^\pL\pN\pZ]/u", " ", $clean_string);
              	// return an array of words, skipping any empty pieces
              	return preg_split('/\s+/', trim($clean_string), -1, PREG_SPLIT_NO_EMPTY);
              }

              As expected, "and" is the most common word. The first lines of output:

              19530 records
              13898 distinct words encountered
              
              array (size=13898)
                'and' => int 17498
                'or' => int 15182
                'to' => int 10451
                'of' => int 6098
                'as' => int 3492
                'for' => int 3413
                'such' => int 2896
                'in' => int 2745
                'the' => int 2479
                'with' => int 2350
                'using' => int 2153
                'equipment' => int 2125
              ...

                sneakyimp Are things really so bad? Docs explain that you can check default stopwords like so:

                Well yeah, but that would mean installing MySQL.

                The docs also offer fairly detailed recipes/instructions for defining your own stopword tables, so I'm not sure the source code (which presumably contains the default stopwords) is so critical.

                So you define a stopword table that does include "and".
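
                A sketch, if I'm reading the manual right (names hypothetical; the stopword table must be InnoDB with a single VARCHAR column named value, and it must be assigned before the FULLTEXT index is built or rebuilt):

                ```sql
                CREATE TABLE my_stopwords (value VARCHAR(30)) ENGINE = INNODB;
                INSERT INTO my_stopwords (value) VALUES ('and'), ('or'), ('to'), ('of');

                -- assign the custom list server-wide
                SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';

                -- rebuild the index so the new list takes effect
                ALTER TABLE tasks DROP INDEX task_ft;
                ALTER TABLE tasks ADD FULLTEXT INDEX task_ft (task);
                ```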

                They also point out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

                As is "The Who".

                // clean spaces
                $clean_string = trim(preg_replace('/\s+/', " ", $str));

                Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here. Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

                Weedpacket Well yeah, but that would mean installing MySQL.

                I'm guessing you use PostgreSQL?

                Weedpacket So you define a stopword table that does include "and".

                I have certainly been considering this since realizing the nasty effect "and" has on my search performance. The reason I didn't simply do so is because I was worried that the problem might happen with other words -- and I'm more convinced now that it would if such a word appeared commonly.

                I feel like just adding stop words is a bit like whack-a-mole. I believe that frequent+short words might cause performance problems, but I don't really understand why this would make things slow. Is it because the code has to munge larger volumes of data somehow? It doesn't seem like the issue arises at the UNION stage because the individual subqueries --which just calculate a floating-point score for each record -- are slow. I feel like something must be really inefficient, but don't really have the time (or coding chops) to get to the bottom of it.

                Weedpacket As is "The Who".

                I expect I will define an array of stopwords -- in code rather than in additional tables -- because the right stopwords clearly depend on the text being searched and the type of search we want to run against it. I doubt anyone will search career-related data for "the who" or "to be or not to be", and I'd assert, as before, that "and" is totally unimportant to the types of searches to be performed on this data.

                Weedpacket Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here.

                Thanks for that suggestion. It occurs to me now that preg_split along \s+ is sufficient.
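
                Something like this, I think (simplified to a plain function for illustration; the real thing stays a private static method):

                ```php
                <?php
                // Simplified parse_words: strip anything that isn't a letter, number,
                // or whitespace, then split directly on runs of whitespace.
                function parse_words(string $str): array {
                    // replace each run of non-word characters with a single space
                    $clean = preg_replace('/[^\pL\pN\s]+/u', ' ', $str);
                    // PREG_SPLIT_NO_EMPTY keeps an all-punctuation input from yielding ['']
                    return preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
                }
                ```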

                Weedpacket Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

                Not sure what you mean by 'final array' but I'm guessing that you are referring to my array of search words. I want to check first if duplicate search words change the scores. E.g., if a search for "tiger" yields any different results than a search for "tiger tiger."

                Also, I might pick a nit and say that the appearance of a search term more than once in the record being searched does matter and, in fact, will yield a higher score. I want to say (but have no proof) that the additional effort to score multiple matches higher than a single match is closely related to the performance problem with searching for "and." My old approach, using REGEXP and word-boundary markers, is actually faster than the MySQL natural language search for this query:

                search term is "such using equipment"
                old has 1011 matches
                old_elapsed: 1.6434950828552
                new has 1011 matches
                new_elapsed: 4.6821620464325

                I believe this is because the REGEXP just returns a binary true/false once a match is found, whereas the MySQL natural language search continues to munge the text looking for additional matches to finish calculating a relevance score.
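
                For reference, the legacy word-boundary approach boils down to something like this (table/column names borrowed from my counting script, not the real function; MySQL 8's ICU regex accepts \b, while older servers used [[:<:]]word[[:>:]] instead):

                ```sql
                -- One REGEXP per keyword; each test returns TRUE as soon as
                -- a single whole-word match is found, with no scoring pass.
                SELECT ts.task
                FROM tasks ts
                WHERE ts.task REGEXP '\\bsuch\\b'
                   OR ts.task REGEXP '\\busing\\b'
                   OR ts.task REGEXP '\\bequipment\\b';
                ```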

                  I've developed a theory about the slowness of the mysql natural language search (MNLS). The slowness is because it must fully munge all of the text in any matching record. If you get a lot of matches, this is a lot of munging. The MNLS benefits from a full-text search index but that just identifies which records contain a given word. My original REGEXP query must do a full table scan every time apparently, but as soon as it finds a match, it can return TRUE and ignore the rest of the text. Because MNLS must calculate a relevance score, it has to munge all the text stored in that record's column to fully calculate the relevance score.

                    In support of my theory, I created a few entries in my tasks table where the task column contains only the word "tiger". I made four records:

                    99999  - tiger
                    100000 - tiger tiger
                    100001 - tiger tiger tiger tiger 
                    100002 - tiger tiger tiger tiger tiger tiger tiger tiger 

                    and ran this query:

                    SELECT ts.task_id, MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE) AS score
                    FROM onet_task_statements ts
                    WHERE MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE)
                    ORDER BY score ASC

                    Sure enough, MNLS gives higher scores to the records with more instances of the word. In fact, the score is precisely proportional to the number of occurrences:

                    99999 	13.566825866699219
                    100000 	27.133651733398438
                    100001 	54.267303466796875
                    100002 	108.53460693359375

                    More precisely: score = number_of_occurrences * 13.566825866699219.