Weedpacket Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.
My (perhaps cowardly) point about the hyphen in a regex expression was to suggest dread at the prospect of writing some function, analagous to preg_quote, to escape keywords designed to be fed from PHP code into a REGEXP inside an SQL statement. I attempted to point out that a hyphen in a regex need not be escaped unless it is part of a square-bracketed character range that you want to actually match a hyphen e.g., this one that matches either a single digit or a hypen: /[0-9\-]/
Interestingly, this script has the same output for the first 3 regexes:
$regexes = array(
'/-/',
'/\-/',
'/\\-/',
'/\\\\-/'
);
foreach ($regexes as $r) {
echo $r . "\n";
$matches = null;
if (!preg_match($r, "\-", $matches)) {
echo "no match\n";
} else {
var_dump($matches);
}
echo "\n";
}
Also interesting is that preg_quote escapes a hyphen:
// outputs: string(2) "\-"
var_dump(preg_quote("-", "/"));
And this is only one of the many special regex characters. Check out this script:
$matches = null;
if (!preg_match('/[*]/', "*", $matches)) {
echo "no match\n";
} else {
var_dump($matches);
}
The bracketed char range [*] matches the string with one asterisk. The output:
array(1) {
[0] =>
string(1) "*"
}
Whereas preg_quote will definitely escape an asterisk:
// outputs: string(2) "\*"
var_dump(preg_quote("*", "/"));
I see that [\*] also will match a string containing a single asterisk, but I'm definitely feeling uneasy about the prospect that will properly escape any keyword or char that I might want to feed into any partial SQL expression which is going inside from SQL statement. The way that multiple regexes with our without escaped special chars identically match a given string seems especially complicated when I might be feeding it into either of these expressions:
SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]mychar'
SELECT * FROM my_table WHERE my_col REGEXP '[mychar0-9]'
Then also consider that I might be using a prepared statement where each value to be merged is represented by the bind character, ?.
As I said before, it starts to remind me of that nasty url-validating issue.
Weedpacket And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.
I am always impressed with the depth to which you understand code. Such a thing would never have occurred to me. I will not be doing this. I'd much rather pose the question why allow punctuation or line breaks in a full text search at all? For coders and pedants like us, we might want to search for some peculiar series of characters. E.g., I frequently grep search for something->methodName or whatever, but I doubt the chuckleheads very nice people who use my site have any need at all for such a thing. I'm pretty comfortable depriving my users of punctuation search -- unless someone can give a good reason not to.
MEANWHILE...
I've had some good luck using MySQL's natural language search functionality. A couple of noteworthy points:
- The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. This setting can apparently be configured and is applied at the time an index is generated. See caveats in the docs for more info.
- There are numerous stopwords.
- "and" is apparently not one of the default stop words and, for some mystifying reason, causes a modest fulltext search to run VERY slowly and return A LOT more results. I don't know why this happens.
- overall, my old search approach where I generate the regex tends to return more results (unless "and" is present in the search string)
- the old search that uses regexes is MUCH slower (unless "and" is present in the search terms).
- I've not done a thorough test yet, but the sequence of the search keywords apparently doesn't make any difference
ALSO: New code uses PDO
I'll post the new code momentarily. This post is already a bit long.