Free-form text search of multiple tables

sneakyimp · Sep 7, 2018

As far as escaping for REGEXP goes: since (I'm assuming) you don't want users writing arbitrary regular expressions as their search terms, you'd need to quote anything your DBMS considers having significance in regular expressions (although you probably aren't interested in any of them: just keep whitespace, numberlike, and letterlike characters, and maybe the hyphen and there wouldn't be anything to disrupt a regexp).

If only there were some function like preg_quote for escaping keywords to be fed into an SQL REGEXP expression. I expect this would be a fairly involved problem to solve in any general way -- especially considering that hyphens are meaningful within square brackets but not outside of the square brackets. I'm reminded of the complicated url validation issue, which is what led me to the decision of stripping all but numbers & letters & spaces (and maybe hyphens too).

Weedpacket As far as my soundex suggestion goes, even without that there is still the literal index lookup idea...

I remember seeing this approach used by PHPBB back in the day and, while it seems useful and fairly clever, I expect that there's a fair amount of effort involved both initially and ongoing to munge the various tables' columns and generate the index tables and update these indexes when the data changes, etc. I cannot help but think that to do so would be reinventing the wheel.

I looked into MySQL Natural Language Full-Text Searches and it's pretty exciting. PostgreSQL has something similar.

With MySQL, I can simply feed the user's original search string into my SQL query and it handles all the detail:

SELECT id_code, title, description, MATCH(description) AGAINST ('video games' IN NATURAL LANGUAGE MODE) AS score FROM other_data_table ORDER by score DESC

I've only just done a cursory inspection of the search results but it would appear that this natural language search, like my suggested approach, ignores punctuation. This is hardly scientific or comprehensive, but this search yields essentially the same search results:

SELECT id_code, title, description, MATCH(description) AGAINST ('[video]--! * ., ., .,  games\\\\ ' IN NATURAL LANGUAGE MODE) AS score FROM other_data_table ORDER by score DESC

The scores returned in the second gobbledygook punctuation search are slightly lower (maybe 2-5%) than in the first query, but the ids and titles returned are the same and in the same order. I feel that, to some degree, this vindicates my earlier decision to strip out the punctuation.

It is noteworthy that I must add a fulltext index on the columns to be searched, but it's a lot easier to run one SQL command or click a link in phpmyadmin than to create my own indexing scheme. Additionally, I can apparently feed the user's search query right into the query without massaging the keywords and constructing numerous queries of my own. This is a considerable improvement over my code which has 4 queries per keyword.

Weedpacket · Sep 7, 2018

sneakyimp especially considering that hyphens are meaningful within square brackets but not outside of the square brackets.

Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.

And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.

sneakyimp · Sep 9, 2018

Weedpacket Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.

My (perhaps cowardly) point about the hyphen in a regex expression was to suggest dread at the prospect of writing some function, analagous to preg_quote, to escape keywords designed to be fed from PHP code into a REGEXP inside an SQL statement. I attempted to point out that a hyphen in a regex need not be escaped unless it is part of a square-bracketed character range that you want to actually match a hyphen e.g., this one that matches either a single digit or a hypen: /[0-9\-]/

Interestingly, this script has the same output for the first 3 regexes:

$regexes = array(
  '/-/',
  '/\-/',
  '/\\-/',
  '/\\\\-/'
);

foreach ($regexes as $r) {
  echo $r . "\n";
  $matches = null;
  if (!preg_match($r, "\-", $matches)) {
    echo "no match\n";
  } else {
    var_dump($matches);
  }
  echo "\n";
}

Also interesting is that preg_quote escapes a hyphen:

// outputs:  string(2) "\-"
var_dump(preg_quote("-", "/"));

And this is only one of the many special regex characters. Check out this script:

$matches = null;
if (!preg_match('/[*]/', "*", $matches)) {
  echo "no match\n";
} else {
  var_dump($matches);
}

The bracketed char range [*] matches the string with one asterisk. The output:

array(1) {
  [0] =>
  string(1) "*"
}

Whereas preg_quote will definitely escape an asterisk:

// outputs: string(2) "\*"
var_dump(preg_quote("*", "/"));

I see that [\*] also will match a string containing a single asterisk, but I'm definitely feeling uneasy about the prospect that will properly escape any keyword or char that I might want to feed into any partial SQL expression which is going inside from SQL statement. The way that multiple regexes with our without escaped special chars identically match a given string seems especially complicated when I might be feeding it into either of these expressions:

SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]mychar'
SELECT * FROM my_table WHERE my_col REGEXP '[mychar0-9]'

Then also consider that I might be using a prepared statement where each value to be merged is represented by the bind character, ?.

As I said before, it starts to remind me of that nasty url-validating issue.

Weedpacket And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.

I am always impressed with the depth to which you understand code. Such a thing would never have occurred to me. I will not be doing this. I'd much rather pose the question why allow punctuation or line breaks in a full text search at all? For coders and pedants like us, we might want to search for some peculiar series of characters. E.g., I frequently grep search for something->methodName or whatever, but I doubt the ~~chuckleheads~~ very nice people who use my site have any need at all for such a thing. I'm pretty comfortable depriving my users of punctuation search -- unless someone can give a good reason not to.

MEANWHILE...

I've had some good luck using MySQL's natural language search functionality. A couple of noteworthy points:
- The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. This setting can apparently be configured and is applied at the time an index is generated. See caveats in the docs for more info.
- There are numerous stopwords.
- "and" is apparently not one of the default stop words and, for some mystifying reason, causes a modest fulltext search to run VERY slowly and return A LOT more results. I don't know why this happens.
- overall, my old search approach where I generate the regex tends to return more results (unless "and" is present in the search string)
- the old search that uses regexes is MUCH slower (unless "and" is present in the search terms).
- I've not done a thorough test yet, but the sequence of the search keywords apparently doesn't make any difference

ALSO: New code uses PDO

I'll post the new code momentarily. This post is already a bit long.

sneakyimp · Sep 9, 2018

This new function uses PDO and MySQL natural language search. It is about 4 times faster than the prior function I posted unless the word "and" is in my search terms.

/**
 * This function trims and cleans the search query so that it can be
 * safely fed directly to a mysql natural language search
 * @param string $search_query
 * @return string
 */
private static function clean_search_string($search_query) {
	// convert any newline chars or multiple spaces into single spaces
	$clean_string = trim(preg_replace('/\s+/', " ", $search_query));
	
	// TODO: consider removing trivial or bad words like and, or, etc. Maybe scrubbing for hacker shit?
	
	return $clean_string;
}

/**
 * Updated search function that uses MySQL natural language searching; only requires 4 queries
 * @see https://dev.mysql.com/doc/refman/8.0/en/fulltext-natural-language.html
 * @param CI_DB $db CodeIgniter DB
 * @param string $search_query user-supplied search query MAY COME DIRECTLY FROM USER INPUT SO USE CAUTION
 * @throws Exception
 */
public static function career_search_new($db, $search_query) {
	$clean_string = self::clean_search_string($search_query);
	if (mb_strlen($clean_string) == 0) {
		throw new Exception("search_query empty after cleaning");
	}
	
	$sql = array();
	// careers.title
	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, (MATCH(title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_title_factor AS score, 'q_ct' AS qid
			FROM " . TABLE_1 . " c
			WHERE MATCH(title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)";
	
	// career task statements
	// takes an average of all rows matching the current c_i
	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(ts.task) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_task_statement_factor AS score, 'q_ts' AS qid
			FROM " . TASK_STATEMENTS_TABLE . " ts, " . TABLE_1 . " c
			WHERE ts.id_code = c.id_code
				AND MATCH(ts.task) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
			GROUP BY c_i";
	
	//	alternate titles
	// takes an average of all rows matching the current c_i
	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(oat.alternate_title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_alternate_title_factor AS score, 'q_at' AS qid
			FROM " . ALTERNATE_TITLES_TABLE . " oat, " . TABLE_1 . " c
			WHERE oat.id_code = c.id_code
				AND MATCH(oat.alternate_title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
			GROUP BY c_i";

	// occupation data
	// takes an average of all rows matching the current c_i
	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(od.description) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_occupation_data_factor AS score, 'q_od' AS qid
			FROM " . OD_TABLE . " od, " . TABLE_1 . " c
			WHERE od.id_code = c.id_code
				AND MATCH(od.description) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
			GROUP BY c_i";
	
	// aggregate the above queries using UNION into a single query for efficiency
	$combined_sql = "SELECT c_i, c_t, seo_title, SUM(score) AS score
			FROM (" . implode("\nUNION\n", $sql) . ") AS union_query
			GROUP BY c_i";

	// for testing/inspection
// 		$combined_sql = implode("\nUNION\n", $sql);
	
	// params to supply to the PDO prepared statement
	$sql_params = array(
			":clean_string" => $clean_string,
			":career_title_factor" => self::career_title_factor,
			":career_task_statement_factor" => self::career_task_statement_factor,
			":career_alternate_title_factor" => self::career_alternate_title_factor,
			":career_occupation_data_factor" => self::career_occupation_data_factor
	);
	
	$retval = self::pdo_fetch_all($db, $combined_sql, $sql_params);
	
	// for testing, remove for production
// 		usort($retval, "self::sort_by_score");
	
	return $retval;
	
}

/**
 * Bypasses CodeIgniter db and uses PDO object directly to prepare statement and execute with provided parameters
 * @param CI_DB $db Codeigniter DB object
 * @param string $sql SQL statement to be prepared for execution
 * @param array $params Either an associative array of named bindings or just an array of values for ? bindings
 * @throws Exception
 */
private static function pdo_fetch_all($db, $sql, $params) {

	// when using pdo for db connection, the PDO object is $db->conn_id;
	// set PDO to throw exceptions for errors or you might have trouble figuring out problems
	if (!$db->conn_id->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION)) {
		throw new Exception("Unable to set PDO attribute");
	}
	
	$stmt = $db->conn_id->prepare($sql);
	if (!$stmt) {
		throw new Exception("Statement prepare failed");
	}
	$query = $stmt->execute($params);
	if (!$query) {
		throw new Exception("query failed");
	}
	// this would be an array of arrays
	$retval = $stmt->fetchAll(PDO::FETCH_ASSOC);
	
	return $retval;
	
}

Weedpacket · Sep 9, 2018

sneakyimp Also interesting is that preg_quote escapes a hyphen:

Because preg_quote itself doesn't know your intentions about where in the regular expression you're going to be inserting the string it was passed: for all it knows you might have been collecting (and quoting) a set of characters with the intention of putting it inside a [] pair. All it's able to assume is that the characters of the string you give it are not to be interpreted as regexp operators.

sneakyimp · Sep 9, 2018

Weedpacket Because preg_quote itself doesn't know your intentions about where in the regular expression you're going to be inserting the string it was passed: for all it knows you might have been collecting (and quoting) a set of characters with the intention of putting it inside a [] pair. All it's able to assume is that the characters of the string you give it are not to be interpreted as regexp operators.

I want to say it's noteworthy that the devs of preg have not bothered to make some kind of prepare_preg function that is context-aware when escaping some abritrary piece of data to be merged into a regex. I expect there's probably little need or demand for such a function. I do wonder about 3/4 of regexes with varying numbers of backslashes above all returning the same matching results. I don't think I'm equipped to really ruminate on this and have some grand epiphany about how to generally solve the problem of escaping from PHP->SQL->REGEXP with deterministic results. Certainly not on this project.

I also believe that my experience with the MySQL Natural Language Full-Text Search functionality is reinforcing my instincts as far as stripping keywords goes, but I don't want to get too uppity. As previously mentioned, the presence of the word "and" is bogging down my searches. Having somewhat painstakingly examined various search strings and their results, various EXPLAIN statements, forum posts, documentation, and rants I am starting to think this and-keyword-slowness problem is not due to any functionally different treatment of this word as an operator but rather the fact that the word "and" is very common, yielding a very large number of matches which forces a lot of data comparisons. I find it puzzling that "and" is missing from the default MySQL stopwords file precisely because it is going to appear in pretty much any English string of sufficient length. I think my instincts were right to exclude this word from a full text search. HOWEVER, I'd like to definitively resolve why the presence of "and" in any search will bog things down.

You don't even need to run the whole big UNIONized query to get slowness. My tasks subquery runs slow -- around 8 seconds:

SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE)) * 3 AS score, 'q_ts' AS qid FROM tasks ts, careers c WHERE ts.id_code = c.id_code AND MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE) GROUP BY c_i

If I EXPLAIN it I get this. Seems OK to me but I might be overlooking something? I'm not really sure what to make of it.

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra   
1   SIMPLE  ts  fulltext    id_code,task    task    0   NULL    1   Using where; Using temporary; Using filesort
1   SIMPLE  c   ALL     id_code     NULL    NULL    NULL    1110    Using where

The query returns 87% of the careers (967 out of 1110). If I drop "and" from the search and just search "pig whistle" I get only 2 careers out of 1110 and the query runs in about 5 milliseconds.

Is there some way to inspect the words stored in a MySQL full-text search index for an Innodb table? I see there's a myisam_ftdump function but don't see an innodb one. I'm thinking it'd be informative to see what the most common words are. Perhaps then I could formulate some query using another common word and see if it's also slow.

If anyone has thoughts about why "and" makes a search slow and, more importantly, if this problem might also happen with other strings, then I'd very much like to hear your thoughts.

In the meantime, there seem to be some good reasons for excluding punctuation and words like "and."
- for this application, users are probably unlikely to search for specific punctuation sequences or attempt to use regex patterns
- punctuation in particular introduces regex problems for my original search function.
- mysql full text search excludes any strings less than 3 chars long by default
- common words like "and" appear to cause performance problems
- better devs than I (the mysql devs) have utilized a stopwords list (possibly for performance reasons?)
- search results get dramatically expanded with unhelpful matches when common words are used, clouding the utility of the results

Weedpacket · Sep 10, 2018

It's more concerning than that. A quick squizz at the source code says "and" is a stopword.

...skip 23 words ... am among amongst anandanother any anybody anyhow... skip 511 words

Oh, wait, that's the myisam implementation: storage/myisam/ft_static.cc (not ft_stopwords.cc it turns out).

Yeah.... different storage engines do full-text search differently. Let's see if they use different lists of stopwords, too Refactoring? What's that?.

Let's look under innodb: Um, that would be storage/innobase/fts I suppose.

/** InnoDB default stopword list:
There are different versions of stopwords, the stop words listed
below comes from "Google Stopword" list. Reference:
http://meta.wikimedia.org/wiki/Stop_word_list/google_stop_word_list.
The final version of InnoDB default stopword list is still pending
for decision */
const char *fts_default_stopword[] = {
    "a",    "about", "an",  "are",  "as",   "at",    "be",   "by",
    "com",  "de",    "en",  "for",  "from", "how",   "i",    "in",
    "is",   "it",    "la",  "of",   "on",   "or",    "that", "the",
    "this", "to",    "was", "what", "when", "where", "who",  "will",
    "with", "und",   "the", "www",  NULL};

(from fts0fts.cc)

For extra giggles, have a look at the list referenced in the comment. "Pending for decision". This list in its current form and that comment are at least eight years old.

<Insomniak`> Stupid Google <Insomniak`> "The" is a common word, and was not included in your search <Insomniak`> "Who" is a common word, and was not included in your search
http://bash.org/?514353

sneakyimp · Sep 10, 2018

I just wrote a lengthy response to this and firefox crashed

sneakyimp · Sep 10, 2018

lengthy response condensed:

Weedpacket It's more concerning than that. A quick squizz at the source code says "and" is a stopword.

Are things really so bad? Docs explain that you can check default stopwords like so:

SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD

On my workstation, "and" is not among them:

a about an are as at be by com de en for from how i in is it la of on or that the this to was what when where

Also offer fairly detailed recipes/instructions for defining your own stopword tables so I'm not sure the source code (presumably containing default stopwords) is so critical. Also points out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

Mostly worried about:
- When will I encounter a 20-second search? Is "and" the only culprit or are there others?
- MySQL FTS only searches whole words. I.e., search for "soft" won't match "software." My slow legacy function would find these partial words.
- is there some way to inspect the contents of MySQL Innodb full-text index? Alternative is that I must roll my own code to compile word counts. Was hoping to check what the second-most-popular word is (and how popular in comparison to "and") and maybe search for that. Already tried a few candidates with no joy.

sneakyimp · Sep 10, 2018

I wrote a quick script to count the frequency of words in a column in my tasks table:

public function words() {
	$sql = "SELECT c.id_code AS ci, c.title as c_t, ts.task FROM tasks ts, careers c
WHERE ts.id_code = c.id_code";
	$query = $this->db->query($sql);
	if (!$query) {
		throw new Exception("Query failed");
	}
	$retval = array();
	$i = 0;
	while ($row = $query->unbuffered_row("array")) {
		$i++;
		$word_arr = self::parse_words($row["task"]);
		foreach($word_arr as $word) {
			if (array_key_exists($word, $retval)) {
				$retval[$word]++;
			} else {
				$retval[$word] = 1;
			}
		}
	}
	echo "$i records<br>";
	echo count($retval) . " distinct words encountered<br>";
	arsort($retval);
	var_dump($retval);
}

private static function parse_words($str) {
	// clean spaces
	$clean_string = trim(preg_replace('/\s+/', " ", $str));
	// remove all but letters & numbers & spaces
	$clean_string = preg_replace("/[^\pL\pN\pZ]/u", "", $clean_string);
	// return an array of words
	return preg_split('/\s+/', $clean_string);
}

As expected, "and" is the most common word....the first lines of output:

19530 records
13898 distinct words encountered

array (size=13898)
  'and' => int 17498
  'or' => int 15182
  'to' => int 10451
  'of' => int 6098
  'as' => int 3492
  'for' => int 3413
  'such' => int 2896
  'in' => int 2745
  'the' => int 2479
  'with' => int 2350
  'using' => int 2153
  'equipment' => int 2125
...

Weedpacket · Sep 10, 2018

sneakyimp Are things really so bad? Docs explain that you can check default stopwords like so:

Well yeah, but that would mean installing MySQL.

Also offer fairly detailed recipes/instructions for defining your own stopword tables so I'm not sure the source code (presumably containing default stopwords) is so critical.

So you define a stopword table that does include "and".

Also points out that "to be or not to be" is a reasonable search string utterly obliterated by stopwords, depending on context.

As is "The Who".

// clean spaces $clean_string = trim(preg_replace('/\s+/', " ", $str));

Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here. Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

sneakyimp · Sep 11, 2018

Weedpacket Well yeah, but that would mean installing MySQL.

I'm guessing you use PostGreSQL?

Weedpacket So you define a stopword table that does include "and".

I have certainly been considering this since realizing the nasty effect "and" has on my search performance. The reason I didn't simply do so is because I was worried that the problem might happen with other words -- and I'm more convinced now that it would if such a word appeared commonly.

I feel like just adding stop words is a bit like whack-a-mole. I believe that frequent+short words might cause performance problems, but I don't really understand why this would make things slow. Is it because the code has to munge larger volumes of data somehow? It doesn't seem like the issue arises at the UNION stage because the individual subqueries --which just calculate a floating-point score for each record -- are slow. I feel like something must be really inefficient, but don't really have the time (or coding chops) to get to the bottom of it.

Weedpacket As is "The Who".

I expect I will define an array of stopwords -- in code rather than defining additional tables because the stop words clearly seem important to the text being searched and the type of search we would like to work on that text. I doubt anyone will search career-related data for "the who" or "to be or not to be" and assert, as before, that "and" is totally unimportant to the types of searches to be performed on this data.

Weedpacket Small point: you don't really need to normalise spaces because you take them out later anyway; trim alone would be sufficient here.

Thanks for that suggestion. It occurs to me now that preg_split along \s+ is sufficient.

Weedpacket Also, throwing an array_unique around the final array might be more accurate, since it shouldn't really matter if "of", say, appears more than once in a single record.

Not sure what you mean by 'final array' but I'm guessing that you are referring to my array of search words. I want to check first if duplicate search words change the scores. E.g., if a search for "tiger" yields any different results than a search for "tiger tiger."

Also, I might pick a nit and say that the appearance of a search term more than once in the record to be searched does matter and, in fact, will yield a higher score. I want to say (but have no proof) that the additional effort to score multiple matches higher than a single match is very closely related to the performance problem with searching for "and." My old approach, using REGEXP and the word-boundary markers is actually faster than the mysql natural language search for this query:

search term is "such using equipment"
old has 1011 matches
old_elapsed: 1.6434950828552
new has 1011 matches
new_elapsed: 4.6821620464325

I believe this is because the REGEXP just returns a binary true/false once a match is found, whereas the mysql natural language search continues to munge the text looking for additional matches to finish calculating a relevance score.

sneakyimp · Sep 11, 2018

I've developed a theory about the slowness of the mysql natural language search (MNLS). The slowness is because it must fully munge all of the text in any matching record. If you get a lot of matches, this is a lot of munging. The MNLS benefits from a full-text search index but that just identifies which records contain a given word. My original REGEXP query must do a full table scan every time apparently, but as soon as it finds a match, it can return TRUE and ignore the rest of the text. Because MNLS must calculate a relevance score, it has to munge all the text stored in that record's column to fully calculate the relevance score.

sneakyimp · Sep 11, 2018

in support of my theory, i created a few entries in my tasks table where the task column just contains the word "tiger". I made four records

99999  - tiger
100000 - tiger tiger
100001 - tiger tiger tiger tiger 
100002 - tiger tiger tiger tiger tiger tiger tiger tiger

and ran this query:

SELECT ts.task_id, MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE) AS score
FROM onet_task_statements ts
WHERE MATCH(ts.task) AGAINST ('tiger' IN NATURAL LANGUAGE MODE)
ORDER BY score ASC

Sure enough, the MNLS scores the records higher that have more instances of the word. In fact, the score is precisely proportional to the number of occurrences:

99999 	13.566825866699219
100000 	27.133651733398438
100001 	54.267303466796875
100002 	108.53460693359375

More precisely score=number_of_occurrences * 13.566825866699219

dalecosp · Sep 11, 2018

I hate to jump into a highly technical discussion, especially with a less than technical point; but is there any reason why you couldn't bolt a real search engine into your app? (Sphinx, Elasticsearch, etc.)?

MySQL's relatively poor search methods were part of the reason we decided to do this a few years back.

sneakyimp · Sep 11, 2018

dalecosp A welcome suggestion! Can you elaborate? The data is from a third party and arrives in a DB. I'm wondering how outside app would get at the data?

sneakyimp · Sep 11, 2018

Two additional observations.

First, assuming I've got those tiger*n entries just mentioned above, a search for just "tiger" yields different relevance scores than a search for "tiger tiger" but the latter search also matches the record with a just a single "tiger" in it. The latter search yields all the same results as the former search, except the scores are all reduced by a factor of 1.4289674259:

// search results for 'tiger tiger':
99999 	9.494146347045898
100000 	18.988292694091797
100001 	37.976585388183594
100002 	75.95317077636719

Second observation. MNLS recognizes quotes in your search string. This search:

SELECT ts.task_id, MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE) AS score
FROM onet_task_statements ts
WHERE MATCH(ts.task) AGAINST ('"tiger tiger"' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC

Excludes the record with only one "tiger" in it.

Weedpacket · Sep 11, 2018

It looks like Sphinx can suck text directly from a MySQL db.
I presume Elasticsearch (based on Lucene) can as well, but being a Java application there's a shipload of configuration involved and I didn't wave my machete enough to find the source of the data.

NogDog · Sep 12, 2018

We use Elastisearch (along with the entire ELK stack) at work for a number of things, and it seems quite good -- though I'm far from an expert, mostly just using the Kibana part of the stack for my needs. I dabbled with Sphinx several years ago, and it seemed quite performant and relatively easy to use in conjunction with MySQL, if my memory serves. Either is almost certainly going to perform much better than full table scans in MySQL against regexes if the data reaches any significant size (for an undefined value of "significant").

PS: I suppose somewhere in between might be to leverage "full text search" in MySQL? https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html

dalecosp · Sep 12, 2018

Sphinxsearch is written in C, but most distros and FreeBSD, etc., have packages available or you can build from a source tarball. It requires an SQL server; unlike Weed we're into MySQL (not necessarily completely or by choice), and Sphinx is good at that---off the top of my head I'm not sure what else they support.

Sphinx runs its own "searchd" on a different port (9306 is default as I recall) and the syntax for querying it is pretty much standard/MySQL. So, we have a "searchDB" object that manages it own connection to localhost 9306, and our search objects do something like "select foo, bar, baz from ombe_index where match('$term_string')" to get data to send back. IIRC, I clean $term_string with a must-match regexp, which is pretty much what you said you wanted to do originally.

A cronjob runs "indexer" just after 3 AM daily to update ombe_index, and a secondary "delta" indexer runs a few times during the day to catch up on new product listings, etc. --- which probably means my example query above is too simple ... there's probably a UNION statement so the delta index is also included in search results.

Hope that's helpful.

Free-form text search of multiple tables

Ssneakyimp

Weedpacket

Ssneakyimp

Ssneakyimp

Weedpacket

Ssneakyimp

Weedpacket

Ssneakyimp

Ssneakyimp

Ssneakyimp

Weedpacket

Ssneakyimp

Ssneakyimp

Ssneakyimp

dalecosp

Ssneakyimp

Ssneakyimp

Weedpacket

NogDog

dalecosp