Weedpacket Because preg_quote itself doesn't know your intentions about where in the regular expression you're going to be inserting the string it was passed: for all it knows you might have been collecting (and quoting) a set of characters with the intention of putting it inside a [] pair. All it's able to assume is that the characters of the string you give it are not to be interpreted as regexp operators.
I want to say it's noteworthy that the devs of preg have not bothered to make some kind of prepare_preg function that is context-aware when escaping some abritrary piece of data to be merged into a regex. I expect there's probably little need or demand for such a function. I do wonder about 3/4 of regexes with varying numbers of backslashes above all returning the same matching results. I don't think I'm equipped to really ruminate on this and have some grand epiphany about how to generally solve the problem of escaping from PHP->SQL->REGEXP with deterministic results. Certainly not on this project.
I also believe that my experience with the MySQL Natural Language Full-Text Search functionality is reinforcing my instincts as far as stripping keywords goes, but I don't want to get too uppity. As previously mentioned, the presence of the word "and" is bogging down my searches. Having somewhat painstakingly examined various search strings and their results, various EXPLAIN statements, forum posts, documentation, and rants I am starting to think this and-keyword-slowness problem is not due to any functionally different treatment of this word as an operator but rather the fact that the word "and" is very common, yielding a very large number of matches which forces a lot of data comparisons. I find it puzzling that "and" is missing from the default MySQL stopwords file precisely because it is going to appear in pretty much any English string of sufficient length. I think my instincts were right to exclude this word from a full text search. HOWEVER, I'd like to definitively resolve why the presence of "and" in any search will bog things down.
You don't even need to run the whole big UNIONized query to get slowness. My tasks subquery runs slow -- around 8 seconds:
SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE)) * 3 AS score, 'q_ts' AS qid FROM tasks ts, careers c WHERE ts.id_code = c.id_code AND MATCH(ts.task) AGAINST ('pig and whistle' IN NATURAL LANGUAGE MODE) GROUP BY c_i
If I EXPLAIN it I get this. Seems OK to me but I might be overlooking something? I'm not really sure what to make of it.
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ts fulltext id_code,task task 0 NULL 1 Using where; Using temporary; Using filesort
1 SIMPLE c ALL id_code NULL NULL NULL 1110 Using where
The query returns 87% of the careers (967 out of 1110). If I drop "and" from the search and just search "pig whistle" I get only 2 careers out of 1110 and the query runs in about 5 milliseconds.
Is there some way to inspect the words stored in a MySQL full-text search index for an Innodb table? I see there's a myisam_ftdump function but don't see an innodb one. I'm thinking it'd be informative to see what the most common words are. Perhaps then I could formulate some query using another common word and see if it's also slow.
If anyone has thoughts about why "and" makes a search slow and, more importantly, if this problem might also happen with other strings, then I'd very much like to hear your thoughts.
In the meantime, there seem to be some good reasons for excluding punctuation and words like "and."
- for this application, users are probably unlikely to search for specific punctuation sequences or attempt to use regex patterns
- punctuation in particular introduces regex problems for my original search function.
- mysql full text search excludes any strings less than 3 chars long by default
- common words like "and" appear to cause performance problems
- better devs than I (the mysql devs) have utilized a stopwords list (possibly for performance reasons?)
- search results get dramatically expanded with unhelpful matches when common words are used, clouding the utility of the results