I'm updating some code from about fifteen years ago (!) and, looking at the search code written by some subcontractors, I see that it looks problematic for a few reasons:
1) vulnerable to SQL injection
2) strange decisions made in formulating SQL queries
3) queries get run in loops, yielding about 8 queries per keyword entered

I'll start with #1. When a user enters a search term, I want to safely search my DB tables without worrying about SQL injection. I know that if I properly escape or quote these keywords as I'm creating my SQL, that things should be pretty safe, but I've seen some pretty crazy SQL injection attempts that try to use delete and backspace characters, foreign language quotes, and all manner of crazy stuff. Seems to me that the easiest thing to do would be to just whitelist permissible characters such that we only permit upper and lowercase letters, numbers, spaces, and maybe some dashes and possibly other punctuation.

Can anyone suggest improvements on this function? It takes $search_query, a string of search terms supplied directly by the user from an input type="text" on the form. It converts newline chars and repeated space chars into single spaces and then explodes the resulting string along the spaces to create an array of search words. From this array, we exclude trivial words like and and or. This array is subsequently used to formulate some SQL to search various columns in multiple different tables, but I'll get to that later.

	/**
	 * New, improved function to clean up search query and provide safe, nontrivial search terms
	 * @param string $search_query the input string supplied directly from user input MUST BE SCANNED AND CLEANED
	 * @return array numerically indexed array of non-trivial search terms 
	 */
	private static function clean_keywords($search_query) {
		// TODO: should we allow fancy search syntax? NOTE that we'll feed this into
		// a MySQL REGEXP expression so we would need to escape and filter properly

		// convert any newline chars or multiple spaces into single spaces
		$pattern = "/\s+/";
		$clean_string = trim(preg_replace($pattern, " ", $search_query));

		// use unicode regex to whitelist valid chards, excludes any chars that aren't letters, numbers, or spaces
		$pattern = "/[^\pL\pN\pZ]/u";
		$clean_string = preg_replace($pattern, "", $clean_string);
		
		// get an array of all the words, this breaks along whitespace and punctuation
		// this function seems to happily avoid empty entries and leading/trailing spaces
		preg_match_all("/([\w\.]+)/", $clean_string, $search_words);
		
		// remove trivial words
		// TODO consider adding additional words here and move this into a static var or class constant
		$skip_words = array('and', 'or');
		foreach ($search_words[0] as $k=>$v) {
			if (in_array(mb_strtolower($v), $skip_words)) {
				unset($search_words[0][$k]);
			}
		}
		
		return $search_words[0];
		
	} // clean_keywords()

I'm hoping this function will return a safe and clean array of search terms. I plan to escape/quote these search terms before inserting into any SQL, but it's my hope that the regex operations will preclude any of those crazy sql injection techniques. I'm also wondering if I should drop any words that are less than some minimum length -- I'm thinking 2 chars.

Any input would be greatly appreciated here. I'm sort of hoping for any thoughts you experienced folks might have on my regex or things I might be over looking.

    Couldn't you just change the code that performs the SQL queries to use prepared statements with bound parameters instead? Besides, maybe at some later date you might want to actually keep "and" and "or" in the search terms so as to perform more intelligent searching.

    Agreeing with laserlight here re: binding parameters.

    But if you do want to do some sort of sanitisation - basically, what's legitimate depends on what you're actually searching. The whitelist would be "anything that matches what is allowed in the column being searched".

    Oh, and consider search.php?q=help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck+help+my+keyboard+is+stuck

    laserlight

    The new project uses Codeigntier and, while Codeigniter is apparently not built to make it easy to use prepared statements, this is apparently feasible. I supose you are suggesting this because it would provide some security as far as exploits go, but I can't help but wonder if that's a foolproof way to just feed user input directly into a query?

    I also wonder if using a prepared statement via PDO is any more secure than using PDO::quote, which is what CodeIgniter apparently does with its PDO driver. I worry about just cramming user input into a query and relying on the prepared statement behaviors to prevent SQL injection. Is it really to be trusted? Will it prevent an SQL injection attack that uses a DELETE char, for instance?

    And simply using prepared statements doesn't really address the issue of generating suitable SQL. I don't have any special attachment to the SQL generation code in this project but it does function adequately and it works by splitting any supplied user search string into keywords and searching various columns in various tables for whole and partial word matches. It sorts the search results by allocating score to different types REGEXP matches. Some tables count more than others, a full-word match counts more than just a partial word match. This rather ad-hoc logic, which I've cleaned up below, was originally dictated by the client and seems to serve the users reasonably well.

    The following PHP functions are static methods in a class:

    	public static function search($db, $search_query){
    		$clean_keywords = self::clean_keywords($search_query);
    		$keyword_count = sizeof($clean_keywords);
    
    		// these generate arrays of regex patterns around each keyword	
    		$full_regex = self::key_regex($clean_keywords, TRUE);
    		$part_regex = self::key_regex($clean_keywords, FALSE);
    	
    	
    		// array to accumulate all the disparate queries
    		$sql = array();
    	
    		// record titles
    		for ($j=0; $j<$keyword_count; $j++) {
    			//	TABLE 1: get records with keyword in title
    			$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, ROUND(IF(c.title REGEXP $full_regex[$j], 40, 20),1) AS score, 'q_ct_{$j}' AS qid
      			FROM " . TITLE_TABLE . " c
    	  			WHERE c.title REGEXP $full_regex[$j] OR c.title REGEXP $part_regex[$j]";
    		}
    	
    		// task statements
    		for ($j=0; $j<$keyword_count; $j++) {
    			//	get records from tasks by full or partial word match
    			// NOTE: this sql is carefully constructed to allow for multiple tasks per record, some full match, some part. we score higher if both exist, BUT DO NOT SUM
    			$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, ROUND(IF(ts.task REGEXP $full_regex[$j], 25, 12.5),1) AS score, CONCAT('q_ts_{$j}_', IF(ts.task REGEXP $full_regex[$j], 'full', 'part')) AS qid
    				FROM " . TASK_TABLE . " ts, " . TITLE_TABLE . " c
    					WHERE (ts.task REGEXP $full_regex[$j] OR ts.task REGEXP $part_regex[$j]) AND
    					ts.id_code = c.id_code
    					GROUP BY c_i, qid";
    		}
    	
    		// alternate titles
    		for ($j=0; $j<$keyword_count; $j++) {
    			//	get records from alternate titles by full or partial word match
    			// NOTE: this sql is carefully constructed to allow for multiple alternate titles per record, some full match, some part. we score higher if both exist, BUT DO NOT SUM
    			$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, ROUND(IF(oat.alternate_title REGEXP $full_regex[$j], 20, 10),1) AS score, CONCAT('q_at_{$j}_', IF(oat.alternate_title REGEXP $full_regex[$j], 'full', 'part')) AS qid
      			FROM " . ALTERNATE_TITLE_TABLE . " oat, " . TITLE_TABLE . " c
    	  			WHERE (oat.alternate_title REGEXP $full_regex[$j] OR oat.alternate_title REGEXP $part_regex[$j]) AND
    	  			oat.id_code = c.id_code
    	  			GROUP BY c_i, qid";
    		}
    	
    		// other data
    		for ($j=0; $j<$keyword_count; $j++) {
    			//	get records from other data by full or partial word;
    			// NOTE: this sql is carefully constructed to allow for multiple occupation data per record, some full match, some part. we score higher if both exist, BUT DO NOT SUM
    			$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, ROUND(IF(od.description REGEXP $full_regex[$j], 15, 7.5),1) AS score, CONCAT('q_od_{$j}_', IF(od.description REGEXP $full_regex[$j], 'full', 'part')) AS qid
      			FROM " . OTHER_DATA_TABLE . " od, " . TITLE_TABLE . " c
    	  			WHERE (od.description REGEXP $full_regex[$j] OR od.description REGEXP $part_regex[$j]) AND
    	  			od.id_code = c.id_code
    	  			GROUP BY c_i, qid";
    		}
    
    		// after we've accumulated all the keyword-specific queries, we make a UNION query to combine them into a single one
    		$combined_sql = "SELECT c_i, c_t, seo_title, SUM(score) AS score
    				FROM (" . implode("\nUNION\n", $sql) . ") AS union_query
    				GROUP BY c_i";
    	
    		$retval = self::db_fetch($db, $combined_sql);
    		return $retval;
    	} // search()
    
    	/**
    	 * Runs the specified query and returns an array of arrays
    	 * @param CI_DB $db
    	 * @param string $sql
    	 */
    	private static function db_fetch($db, $sql) {
    		$query = $db->query($sql);
    		$retval = array();
    		while ($row = $query->unbuffered_row("array")) {
    			$row["score"] = floatval($row["score"]);
    			$retval[] = $row;
    		}
    		return $retval;
    	}
    
    	/**
    	 * Function to generate MySQL REGEXP patterns from each keyword
    	 * @param array $clean_keywords numerically indexed array of keywords for which we search
    	 * @param boolean $full TRUE if you want full word match, FALSE if you just want to match start-of-word
    	 * @throws Exception
    	 */
    	private static function key_regex($clean_keywords, $full) {
    		if (!is_array($clean_keywords)) {
    			throw new Exception("Invalid clean keywords. Should be an array");
    		}
    		$i = 0;
    		$retval = array();
    		// FIXME we need to db escape and regex escape these values if we want the search to be safe and work for all input
    		foreach ($clean_keywords as $key => $value) {
    			if ($full) {
    				$retval[$i] = '\'[[:<:]]' . $value . '[[:>:]]\'';
    			} else {
    				$retval[$i] = '\'[[:<:]]' . $value. '\'';
    			}
    			$i++;
    		}
    		
    		return $retval;
    	}
    
    	private static function clean_keywords($search_query) {
    		// TODO: should we allow fancy search syntax? NOTE that we'll feed this into
    		// a MySQL REGEXP expression so we would need to escape and filter properly
    
    		// convert any newline chars or multiple spaces into single spaces
    		$pattern = "/\s+/";
    		$clean_string = trim(preg_replace($pattern, " ", $search_query));
    
    		// use unicode regex to whitelist valid chars, excludes any chars that aren't letters, numbers, or spaces
    		$pattern = "/[^\pL\pN\pZ]/u";
    		$clean_string = preg_replace($pattern, "", $clean_string);
    		
    		// get an array of all the words, this breaks along whitespace and punctuation
    		// this function seems to happily avoid empty entries and leading/trailing spaces
    		preg_match_all("/([\w\.]+)/", $clean_string, $search_words);
    		
    		// remove trivial words
    		// TODO consider adding additional words here and move this into a static var or class constant
    		$skip_words = array('and', 'or');
    		foreach ($search_words[0] as $k=>$v) {
    			if (in_array(mb_strtolower($v), $skip_words)) {
    				unset($search_words[0][$k]);
    			}
    		}
    		
    		return $search_words[0];
    		
    	} // clean_keywords()

    To call this function, you do something like this:

    $search_query = trim($this->input->post("q"));
    		
    if (mb_strlen($search_query) == 0) {
    	// FIXME show the user an alert
    	show_error("You did not enter anything for your search.", 400);
    }
    		
    // FIXME also check maximum string length here
    		
    		
    $search_results = MY_site_search::search($this->db, $search_query);

    Weedpacket

    But if you do want to do some sort of sanitisation - basically, what's legitimate depends on what you're actually searching. The whitelist would be "anything that matches what is allowed in the column being searched".

    Certainly sanitation seems necessary (and I'll definitely be limiting the length of the search query to something quite short). Having ruminated on this a fair amount, it seemed quite fair to me that only alphanumerics, spaces, and maybe some dashes be permitted. The data being searched are textual columns in government occupational data. I don't expect anyone should be searching for complicated computer code or markup of any kind.

    As for laserlight's suggestion that we might want to keep the and and or words and so on, I'll honestly say it sounds unnecessarily complicated to implement boolean type logic from these keywords when the users probably won't be especially technical. They'll probably want to type "CEO" or "film editor" or "video games" something and see the occupational data that matches.

    EDIT: I'm quite open to suggestions regarding the search logic. I think that any scheme adopted should preserve the proportional, preferential ranking of the 4 tables being queried as they are in the logic I've posted here, but I'm quite curious about Natural Language Searching and I haven't much clue at all about what sorts of indexes would best serve this type of operation.

      sneakyimp I also wonder if using a prepared statement via PDO is any more secure than using PDO::quote, which is what CodeIgniter apparently does with its PDO driver. I worry about just cramming user input into a query and relying on the prepared statement behaviors to prevent SQL injection. Is it really to be trusted? Will it prevent an SQL injection attack that uses a DELETE char, for instance?

      I feel we've been down this path before, sneakyimp 🙄 😛

      Generally, the PDO drivers are written to link against DBMS's own APIs and use their own libraries (e.g. PDO::pgsql links against PostgreSQL's libpq-fe.h, the same library that all of PG's own client apps use). You can consult the source to see what is going on there, noting that since the heavy lifting is being done by the DBMS's own code you're really asking if there is a bug in its own internal APIs.

      You can write a stack of unit and regression tests, starting with those that are run against the applications themselves as part of their own build process and adding others if you find code paths that they don't cover.

        Oh, and have you considered spelling? Last time I had to do a general text search against a db that didn't support full-text searching I found I needed to a soundex check as well because a lot of users would type phonetically (there was some pretty hairy words in this particular corpus).

        This was for autocomplete purposes, so I couldn't muck around for very long before I had to start serving up a response.

        I had a couple of lookup tables: one that matched each distinct word (less stopwords) against all the records that used that word. Because I had that, I could precompute their soundexes and have a second table that matched soundexes to their corresponding homonyms.

          sneakyimp wrote:

          I also wonder if using a prepared statement via PDO is any more secure than using PDO::quote, which is what CodeIgniter apparently does with its PDO driver. I worry about just cramming user input into a query and relying on the prepared statement behaviors to prevent SQL injection. Is it really to be trusted? Will it prevent an SQL injection attack that uses a DELETE char, for instance?

          The PHP manual entry that you linked to answers your own question:

          If you are using this function to build SQL statements, you are strongly recommended to use PDO::prepare() to prepare SQL statements with bound parameters instead of using PDO::quote() to interpolate user input into an SQL statement. Prepared statements with bound parameters are not only more portable, more convenient, immune to SQL injection, but are often much faster to execute than interpolated queries, as both the server and client side can cache a compiled form of the query.

          The key idea is that unless the prepared statement's bound parameters lacks native support such that it is actually emulated by quoting and escaping, the data is separated from the SQL statement. Hence, security related bugs aside, it is impossible to construct an SQL injection because the data is always merely data; it will never be treated as SQL code. If you use PDO::quote, then at some point the data becomes part of the SQL statement, although it is after quoting and escaping.

            I have certainly participated in numerous discussions that involve prepared statements. I apologize if it seems like we are rehashing things. I suppose the benefits, aside from syntactical ones and automatic escaping, have never really sunk in to my sluggish brain and I've only just been asymptotically approaching true understanding.

            In my estimation, the 'higher efficiency' benefit for multiple-query situations has never seemed especially important inasmuch as queries inside a loop are strongly discouraged. I've always striven to avoid looping query situations whenever possible -- here evidenced by my use of UNION to roll the numerous (dozens of?) queries in this search function into a single query. I find myself wondering what performance/efficiency benefit, if any, prepared statement caching would bring to such a query and can't think of any.

            As for the automatic escaping with prepared statements, it never sank in that this type of escaping would somehow be superior to the use of PDO::quote or mysqli_real_escape_string. Looking now at the docs on mysqli_real_escape_string, I'm more than a bit horrified by this little description:

            Characters encoded are NUL (ASCII 0), \n, \r, \, ', ", and Control-Z.

            Is that ALL this function does is change those characters? Is that adequate to prevent sql injection? Given the SQL injection attempts I've seen in my apache logs, this seems dangerously inadequate, but I don't fully understand how various SQL engines interpret quotes, DELETE chars, cursor-moving chars, and all that sort of thing.

            Prepared statements with bound parameters are not only more portable, more convenient, immune to SQL injection, but are often much faster to execute than interpolated queries

            I had seen this -- and I may be paranoid -- but I'm very suspicious of this claim of immunity and believe that relying on this automatic escaping to be far too credulous. I'm curious what sort of logic/mechanism might insure this 'immunity' but dread the prospect of trying to read source code for MySQL, PostGreSQL, etc. I'd also point out that PHP's PDO sometimes just emulates real PDO behavior, depending on how one has installed the MySQL client. Furthermore, application-level validation can check if your data matches application-level requirements. E.g., the DBMS engine's value escaping won't check if the supplied email address is valid before cramming it into a VARCHAR field.

            Given that I only need a simple text search function and given the current implementation (which everyone should feel free to criticize), my decision to use preg_replace to scour the user input string seems reasonable given all these considerations. I know I still need to use PDO::quote or prepared statements on these keywords but given that they need to be escaped not just for the DB but also escaped for use in a REGEXP pattern within the SQL generated (does any escape function for exist for this purpose?) I thought it reasonable to seek out some sage input from my esteemed colleagues. Only in preparing this post and responding to your input (which I value very much) have I (re)discovered limitations in CodeIgniter and in PHP's PDO implementation.

            And Weedpacket is entirely right that I should work up some unit/regression tests. My skills in this realm are woefully inadequate. I realized I've avoided lots of tedious SQL work for years now thanks to the convenience of a system I've constructed which lets me interact with DB_x objects (PHP classes) to insert records rather than writing SQL for this purpose. I'm only aware today that this system may rely on some false assumptions as far as escaping goes.

              I suspect the prepared statements use the same underlying escape mechanism, though possibly based on the column type using type-casting instead when applicable (e.g. to int or float)? I don't think it's so much a case of being a "better" escape mechanism, as opposed to one that will essentially happen automatically, rather than depending on the developer to do it (and do it correctly) in each and every place it's needed.

              CodeIgniter may be abstracting that away if you use their modelling paradigm, and thus be fine -- unless/until you override it with your own explicit SQL.

              NogDog

              If I understand laserlight's post correctly -- and i haven't bothered to examine any DBMS source code here -- the theoretical difference is that prepared statements deliver your query to the DBMS with SQL as one string and the parameters as entirely separate data structures. Because of this separation, the absence of quotes in your SQL or the presence of umatched quotes in these separate data structures no longer leads to an SQL injection vulnerability. An analogy might be found in parsing CSV data versus having an array containing distinct values. The data in prepared statements is more highly structured such that the parsing engine needn't bother to check for escape sequences or quotes -- it has the separate data objects and treats them as such. Phrased another way, prepared statements provide better data integrity because your query and its data are more highly isolated structured in transit from your PHP code to the DBMS and less vulnerable to exploit because they don't have to be translated into the less-structured of format of an SQL string, which is in reality a mixture of data and instructions for manipulating the data. Phrased yet another way: prepared statements are a more highly structured client-server protocol.

              CodeIgniter may be abstracting that away if you use their modelling paradigm, and thus be fine -- unless/until you override it with your own explicit SQL.

              Codeigniter's query builder apparently mimics the behavior of prepared statements, but does not use the actual PDO prepare & execute functions even if your config file sets dbdriver=pdo. That being the case, its success & security against SQL injection will depend on its own implementation of quoting/escaping functionality which is probably NOT aware of column data types. I don't know for certain that it's not column-aware, but suspect it is because all data retrieved from the DBMS comes back as string values.

              In my search code above, you can see that I am in fact constructing explicit SQL. That code doesn't yet quote/escape the keywords supplied. I'm looking into using PDO and prepared statements first, but the fact that these keywords are fed into REGEXP expressions complicates matters.

                I admittedly have not followed this whole discussion, but the escaping does not actually end up in the DB, just in the SQL*. So when you either do subsequent queries using DB regex functions, or pull the data into your app and apply application regex functions, you should not have to worry about that escaping affecting it -- or I'm totally missing the context and you should just ignore me.

                ======================
                * Sort of like the backslashes do not actually get output in PHP:

                echo 'This here\'s a test, y\'all.';

                NogDog

                I'm familiar with how escape characters in one's code don't actually make their way into the string defined. Escaping them is necessary because, for instance, you need to distinguish quotes that delimit your string from the quotes you might want IN your string. Totally understand that.

                What concerns me is that I want my keywords to be part of a REGEXP expression inside and sql expression and escaping is different for SQL than it is for REGEXP. PHP has a preg_quote function, for example.

                Some example questions:
                - what if one of my keywords is wh*ee -- do I need to escape the asterisk? Is there a function or guidelines for this?
                - what if one of my keywords is [[:alnum:]]+ -- should I escape the square brackets?
                - what if the db escape (like PDO::quote) function returns quotes in its output? Does that mean I need to escape my entire regexp expression even though this expression is intended to be interpreted as SQL regexp?

                For that last example, consider this code:

                $db = new PDO("mysql:host=localhost;dbname=my_db", "user", "pass");
                $keyword = "foo";
                $sql = "SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]" . $db->quote($keyword) . "[[:>:]]'";
                echo $sql . "\n";

                The output is broken SQL because the PDO::quote function adds single quotes:

                SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]'foo'[[:>:]]'

                I can fix that particular query by escaping the keyword surrounded by the regexp along with the keyword:

                $keyword = "[[:<:]]foo[[:>:]]";
                $sql = "SELECT * FROM my_table WHERE my_col REGEXP " . $db->quote($keyword);

                But is this really what I want to do for a general solution? Are there any chars that might be in my regex which I DON'T want quoted. Like what if my keyword itself contained a single or double quote? This code:

                $keyword = '[[:<:]]f"oo[[:>:]]';
                $sql = "SELECT * FROM my_table WHERE my_col REGEXP " . $db->quote($keyword);

                results in the double quote also being escaped:

                SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]f\"oo[[:>:]]'

                NOTE this does actually work -- if you run that query it'll locate records containing the string f"oo but I think it illustrates my concern about escaping. I'd like to avoid crosstalk between escaping REGEXP chars and escaping SQL search keywords.

                I'd also point out that the mysql docs on regex don't talk much about escaping anything. Nor does the ICU Reference that it refers to.

                  sneakyimp wrote:

                  I had seen this -- and I may be paranoid -- but I'm very suspicious of this claim of immunity and believe that relying on this automatic escaping to be far too credulous. I'm curious what sort of logic/mechanism might insure this 'immunity' but dread the prospect of trying to read source code for MySQL, PostGreSQL, etc.

                  I explained this in my elaboration. It isn't "automatic escaping", unless emulated. I'm more familiar with SQLite because they have a digestible newbie explanation for what's going on, but basically we're looking at the SQL statement getting parsed into byte code, and from there the data is processed by the byte code. Preparing the statement means that this parsing doesn't have to happen each time the statement is executed (without using a query cache), which can provide a small advantage in efficiency, but where we are concerned about SQL injection, it means that the parser has no chance to mistake data for SQL code and hence produce byte code that is different from what was intended by the author of the SQL code. That's what grants immunity from SQL injection.

                  sneakyimp wrote:

                  I'd also point out that PHP's PDO sometimes just emulates real PDO behavior, depending on how one has installed the MySQL client.

                  It sounds like you didn't read my elaboration, otherwise you would have seen that I already mentioned it:

                  laserlight wrote:

                  The key idea is that unless the prepared statement's bound parameters lacks native support such that it is actually emulated by quoting and escaping, the data is separated from the SQL statement. Hence, security related bugs aside, it is impossible to construct an SQL injection because the data is always merely data; it will never be treated as SQL code. If you use PDO::quote, then at some point the data becomes part of the SQL statement, although it is after quoting and escaping.

                  Of course, if it does fall back to emulation, then while it is no better than plain escaping, it's no worse either (and I guess that's where "automatic" as an advantage comes in, but it probably isn't much of an advantage when you have frameworks that take care of things for you).

                  sneakyimp wrote:

                  Furthermore, application-level validation can check if your data matches application-level requirements. E.g., the DBMS engine's value escaping won't check if the supplied email address is valid before cramming it into a VARCHAR field.

                  Yes, of course: preventing SQL injection through the use of bound parameters, or some kind of escaping mechanism, is not the same thing as application-level validation, and typically they would both be applied. But do you have application-level requirements other than "stop SQL injection"? You didn't mention any other than "we exclude trivial words like and and or", and that's technically not validation either: you're cleaning the data through the removal of those words, but not validating it to reject it if it contains those words.

                  laserlight
                  I apologize if I did not make reference to your elaboration, which I appreciate very much. I did in fact read it several times and had difficulty understanding the first half. It was the latter half that helped me to comprehend the actual mechanism by which prepared statements prevent SQL injection. Only you calling me out here brought me to understand that I was paraphrasing what you already explained. Your further elucidation has made it even clearer and I am even deeper in your debt. Thank you for the additional detail about byte code et. al. My reading comprehension skills are not especially sharp sometimes and I apologize if I seem obtuse or ungrateful. Also -- and this may be an irritating habit -- paraphrasing the descriptions of others really helps me to solidify my understanding of complicated concepts.

                  Regarding my application-level validation, I'm stripping out everything (or trying to anyway) except letters, numbers, and spaces from the user search string. This serves two purposes (albeit in a ham-fisted way): 1) it should prevent SQL injection arguably more effectively than the dubious stack of PDO emulation and framework code I'm building on and 2) should eliminate the need to worry about escaping any special REGEX chars like asterisks or square brackets or $ or ^ or whatever.

                  No one has commented much on the SQL-generating code I posted which makes use of the word-end boundary markers ([[:<:]] and [[:>:]]). My regex cleanup in PHP should still allow users to search for actual words and numbers so it still supports fundamentally useful functionality and at the same time I need not face the task of sorting out how to escape my keywords and/or regex expressions in order to preserve the original keywords while still providing safe, functional code.

                  I guess you could say that one of my application requirements is "search the four different columns in the four different tables while preserving the the original keywords, my word-boundary logic, and the relative scoring of the tables and full matches versus partial matches" I'm concerned that the word-boundary regex I have (which does a fair job on our production server) will just run into trouble if we start allowing more punctuation into the SQL. I'm entirely open to other methods of full text search, but have no experience with them. If anyone wants to suggest something, I'd be grateful. I appreciated Weedpacket's sound-alike suggestion very much but this sounds like next-level enhancement. I'm mostly concerned about performance and precision at this point.

                    As far as escaping for REGEXP goes: since (I'm assuming) you don't want users writing arbitrary regular expressions as their search terms, you'd need to quote anything your DBMS considers having significance in regular expressions (although you probably aren't interested in any of them: just keep whitespace, numberlike, and letterlike characters, and maybe the hyphen and there wouldn't be anything to disrupt a regexp).

                    As far as my soundex suggestion goes, even without that there is still the literal index lookup idea: keep a table <word> | <record containing that word>: look in that table for each word in the search string, that gives you the records in which each word appears. That saves having to do potentially hairy searching of the records themselves. One thing I did in my code was score each record by how many of the search terms it contained, and sorted matches by that. (Note that this is basically a dumbed-down full-text-search index; if I had one at the time I'd've used that.) Of course it's necessary to keep that table up to date if/when the records being searched change.

                    (Oh, and I've finally remembered what the idea behind the extra "keyword lookup" table is called: KWIC indexing).

                    Weedpacket

                    As far as escaping for REGEXP goes: since (I'm assuming) you don't want users writing arbitrary regular expressions as their search terms, you'd need to quote anything your DBMS considers having significance in regular expressions (although you probably aren't interested in any of them: just keep whitespace, numberlike, and letterlike characters, and maybe the hyphen and there wouldn't be anything to disrupt a regexp).

                    If only there were some function like preg_quote for escaping keywords to be fed into an SQL REGEXP expression. I expect this would be a fairly involved problem to solve in any general way -- especially considering that hyphens are meaningful within square brackets but not outside of the square brackets. I'm reminded of the complicated url validation issue, which is what led me to the decision of stripping all but numbers & letters & spaces (and maybe hyphens too).

                    Weedpacket As far as my soundex suggestion goes, even without that there is still the literal index lookup idea...

                    I remember seeing this approach used by PHPBB back in the day and, while it seems useful and fairly clever, I expect that there's a fair amount of effort involved both initially and ongoing to munge the various tables' columns and generate the index tables and update these indexes when the data changes, etc. I cannot help but think that to do so would be reinventing the wheel.

                    I looked into MySQL Natural Language Full-Text Searches and it's pretty exciting. PostgreSQL has something similar.

                    With MySQL, I can simply feed the user's original search string into my SQL query and it handles all the detail:

                    SELECT id_code, title, description, MATCH(description) AGAINST ('video games' IN NATURAL LANGUAGE MODE) AS score FROM other_data_table ORDER by score DESC

                    I've only just done a cursory inspection of the search results but it would appear that this natural language search, like my suggested approach, ignores punctuation. This is hardly scientific or comprehensive, but this search yields essentially the same search results:

                    SELECT id_code, title, description, MATCH(description) AGAINST ('[video]--! * ., ., .,  games\\\\ ' IN NATURAL LANGUAGE MODE) AS score FROM other_data_table ORDER by score DESC

                    The scores returned in the second gobbledygook punctuation search are slightly lower (maybe 2-5%) than in the first query, but the ids and titles returned are the same and in the same order. I feel that, to some degree, this vindicates my earlier decision to strip out the punctuation.

                    It is noteworthy that I must add a fulltext index on the columns to be searched, but it's a lot easier to run one SQL command or click a link in phpmyadmin than to create my own indexing scheme. Additionally, I can apparently feed the user's search query right into the query without massaging the keywords and constructing numerous queries of my own. This is a considerable improvement over my code which has 4 queries per keyword.

                    sneakyimp especially considering that hyphens are meaningful within square brackets but not outside of the square brackets.

                    Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.

                    And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.

                    Weedpacket Well, that's not really relevant: quoting is to ensure a literal string gets treated as one when interpolated into a regular expression; literal strings don't go inside character class square brackets.

                    My (perhaps cowardly) point about the hyphen in a regex expression was to suggest dread at the prospect of writing some function, analagous to preg_quote, to escape keywords designed to be fed from PHP code into a REGEXP inside an SQL statement. I attempted to point out that a hyphen in a regex need not be escaped unless it is part of a square-bracketed character range that you want to actually match a hyphen e.g., this one that matches either a single digit or a hypen: /[0-9\-]/

                    Interestingly, this script has the same output for the first 3 regexes:

                    $regexes = array(
                      '/-/',
                      '/\-/',
                      '/\\-/',
                      '/\\\\-/'
                    );
                    
                    foreach ($regexes as $r) {
                      echo $r . "\n";
                      $matches = null;
                      if (!preg_match($r, "\-", $matches)) {
                        echo "no match\n";
                      } else {
                        var_dump($matches);
                      }
                      echo "\n";
                    }

                    Also interesting is that preg_quote escapes a hyphen:

                    // outputs:  string(2) "\-"
                    var_dump(preg_quote("-", "/"));

                    And this is only one of the many special regex characters. Check out this script:

                    $matches = null;
                    if (!preg_match('/[*]/', "*", $matches)) {
                      echo "no match\n";
                    } else {
                      var_dump($matches);
                    }

                    The bracketed char range [*] matches the string with one asterisk. The output:

                    array(1) {
                      [0] =>
                      string(1) "*"
                    }

                    Whereas preg_quote will definitely escape an asterisk:

                    // outputs: string(2) "\*"
                    var_dump(preg_quote("*", "/"));

                    I see that [\*] also will match a string containing a single asterisk, but I'm definitely feeling uneasy about the prospect that will properly escape any keyword or char that I might want to feed into any partial SQL expression which is going inside from SQL statement. The way that multiple regexes with our without escaped special chars identically match a given string seems especially complicated when I might be feeding it into either of these expressions:

                    SELECT * FROM my_table WHERE my_col REGEXP '[[:<:]]mychar'
                    SELECT * FROM my_table WHERE my_col REGEXP '[mychar0-9]'

                    Then also consider that I might be using a prepared statement where each value to be merged is represented by the bind character, ?.

                    As I said before, it starts to remind me of that nasty url-validating issue.

                    Weedpacket And if you're still paranoid about what characters are being entered into search terms, and whether escaping is sufficient to fully escape string literals, you could avoid the question by "SELECT ... decode('" . base64_encode($value) . "', 'base64') ...". Of course, anyone else who sees it later will ask why you didn't just escape it or use parameter binding.

                    I am always impressed with the depth to which you understand code. Such a thing would never have occurred to me. I will not be doing this. I'd much rather pose the question why allow punctuation or line breaks in a full text search at all? For coders and pedants like us, we might want to search for some peculiar series of characters. E.g., I frequently grep search for something->methodName or whatever, but I doubt the chuckleheads very nice people who use my site have any need at all for such a thing. I'm pretty comfortable depriving my users of punctuation search -- unless someone can give a good reason not to.

                    MEANWHILE...

                    I've had some good luck using MySQL's natural language search functionality. A couple of noteworthy points:
                    - The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. This setting can apparently be configured and is applied at the time an index is generated. See caveats in the docs for more info.
                    - There are numerous stopwords.
                    - "and" is apparently not one of the default stop words and, for some mystifying reason, causes a modest fulltext search to run VERY slowly and return A LOT more results. I don't know why this happens.
                    - overall, my old search approach where I generate the regex tends to return more results (unless "and" is present in the search string)
                    - the old search that uses regexes is MUCH slower (unless "and" is present in the search terms).
                    - I've not done a thorough test yet, but the sequence of the search keywords apparently doesn't make any difference

                    ALSO: New code uses PDO

                    I'll post the new code momentarily. This post is already a bit long.

                    This new function uses PDO and MySQL natural language search. It is about 4 times faster than the prior function I posted unless the word "and" is in my search terms.

                    /**
                     * This function trims and cleans the search query so that it can be
                     * safely fed directly to a mysql natural language search
                     * @param string $search_query
                     * @return string
                     */
                    private static function clean_search_string($search_query) {
                    	// convert any newline chars or multiple spaces into single spaces
                    	$clean_string = trim(preg_replace('/\s+/', " ", $search_query));
                    	
                    	// TODO: consider removing trivial or bad words like and, or, etc. Maybe scrubbing for hacker shit?
                    	
                    	return $clean_string;
                    }
                    
                    /**
                     * Updated search function that uses MySQL natural language searching; only requires 4 queries
                     * @see https://dev.mysql.com/doc/refman/8.0/en/fulltext-natural-language.html
                     * @param CI_DB $db CodeIgniter DB
                     * @param string $search_query user-supplied search query MAY COME DIRECTLY FROM USER INPUT SO USE CAUTION
                     * @throws Exception
                     */
                    public static function career_search_new($db, $search_query) {
                    	$clean_string = self::clean_search_string($search_query);
                    	if (mb_strlen($clean_string) == 0) {
                    		throw new Exception("search_query empty after cleaning");
                    	}
                    	
                    	$sql = array();
                    	// careers.title
                    	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, (MATCH(title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_title_factor AS score, 'q_ct' AS qid
                    			FROM " . TABLE_1 . " c
                    			WHERE MATCH(title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)";
                    	
                    	// career task statements
                    	// takes an average of all rows matching the current c_i
                    	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(ts.task) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_task_statement_factor AS score, 'q_ts' AS qid
                    			FROM " . TASK_STATEMENTS_TABLE . " ts, " . TABLE_1 . " c
                    			WHERE ts.id_code = c.id_code
                    				AND MATCH(ts.task) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
                    			GROUP BY c_i";
                    	
                    	//	alternate titles
                    	// takes an average of all rows matching the current c_i
                    	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(oat.alternate_title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_alternate_title_factor AS score, 'q_at' AS qid
                    			FROM " . ALTERNATE_TITLES_TABLE . " oat, " . TABLE_1 . " c
                    			WHERE oat.id_code = c.id_code
                    				AND MATCH(oat.alternate_title) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
                    			GROUP BY c_i";
                    
                    	// occupation data
                    	// takes an average of all rows matching the current c_i
                    	$sql[] = "SELECT c.id_code AS c_i, c.title AS c_t, c.seo_title, AVG(MATCH(od.description) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)) * :career_occupation_data_factor AS score, 'q_od' AS qid
                    			FROM " . OD_TABLE . " od, " . TABLE_1 . " c
                    			WHERE od.id_code = c.id_code
                    				AND MATCH(od.description) AGAINST (:clean_string IN NATURAL LANGUAGE MODE)
                    			GROUP BY c_i";
                    	
                    	// aggregate the above queries using UNION into a single query for efficiency
                    	$combined_sql = "SELECT c_i, c_t, seo_title, SUM(score) AS score
                    			FROM (" . implode("\nUNION\n", $sql) . ") AS union_query
                    			GROUP BY c_i";
                    
                    	// for testing/inspection
                    // 		$combined_sql = implode("\nUNION\n", $sql);
                    	
                    	// params to supply to the PDO prepared statement
                    	$sql_params = array(
                    			":clean_string" => $clean_string,
                    			":career_title_factor" => self::career_title_factor,
                    			":career_task_statement_factor" => self::career_task_statement_factor,
                    			":career_alternate_title_factor" => self::career_alternate_title_factor,
                    			":career_occupation_data_factor" => self::career_occupation_data_factor
                    	);
                    	
                    	$retval = self::pdo_fetch_all($db, $combined_sql, $sql_params);
                    	
                    	// for testing, remove for production
                    // 		usort($retval, "self::sort_by_score");
                    	
                    	return $retval;
                    	
                    }
                    
                    /**
                     * Bypasses CodeIgniter db and uses PDO object directly to prepare statement and execute with provided parameters
                     * @param CI_DB $db Codeigniter DB object
                     * @param string $sql SQL statement to be prepared for execution
                     * @param array $params Either an associative array of named bindings or just an array of values for ? bindings
                     * @throws Exception
                     */
                    private static function pdo_fetch_all($db, $sql, $params) {
                    
                    	// when using pdo for db connection, the PDO object is $db->conn_id;
                    	// set PDO to throw exceptions for errors or you might have trouble figuring out problems
                    	if (!$db->conn_id->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION)) {
                    		throw new Exception("Unable to set PDO attribute");
                    	}
                    	
                    	$stmt = $db->conn_id->prepare($sql);
                    	if (!$stmt) {
                    		throw new Exception("Statement prepare failed");
                    	}
                    	$query = $stmt->execute($params);
                    	if (!$query) {
                    		throw new Exception("query failed");
                    	}
                    	// this would be an array of arrays
                    	$retval = $stmt->fetchAll(PDO::FETCH_ASSOC);
                    	
                    	return $retval;
                    	
                    }