I am writing a web front end to a MySQL database. The front end is Japanese and will accept primarily (and possibly only) Japanese input, that is, multibyte input. In the past, when I have written content management systems, I have been careful to treat all user input as tainted and to use regular expressions to strip it of characters that are meaningful to a database (quotes and the like). Usually I replaced them with their HTML entity equivalents (&quot;, etc). This time, however, I am pretty sure I can't rely on this method, since dozens or even hundreds of individual Japanese characters may contain the same byte values as these characters in their multibyte encodings.
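To make the concern concrete, here is a minimal sketch in Python (not PHP, purely to look at the bytes). The sample string, the `SENSITIVE` set, and the `sensitive_bytes` helper are all made up for illustration, and I am assuming the input could arrive as Shift_JIS, EUC-JP, or UTF-8, since I don't yet know which one the forms will actually use.

```python
# Byte values that matter inside SQL string literals:
# single quote, double quote, backslash.
SENSITIVE = {0x27, 0x22, 0x5C}

def sensitive_bytes(text, encoding):
    """Return the sensitive ASCII byte values found in the encoded form of text."""
    return sorted(b for b in set(text.encode(encoding)) if b in SENSITIVE)

# Ordinary Japanese text; it contains no ASCII characters at all.
sample = "表示ソフト"

for enc in ("shift_jis", "euc_jp", "utf-8"):
    found = [hex(b) for b in sensitive_bytes(sample, enc)]
    print(enc, found)
```

In Shift_JIS, the second byte of characters such as 表 and ソ is 0x5C, the same value as an ASCII backslash, so a byte-oriented filter would see a backslash there. In EUC-JP and UTF-8, every byte of a multibyte character is 0x80 or above, so this particular collision should not occur with those encodings.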
Also, PHP automatically escapes quotes in user input (when the magic_quotes_gpc setting is enabled). For the same reason as above, this could turn some characters into garbage (or worse, change their meanings!), though I have not yet tested whether this is the case.
There is a TON of PHP support material here in Japan, but all of it is in Japanese, which at this point I cannot read well enough to understand.
Does anyone have any ideas on how I can deal with SQL, tainted data, and multibyte form input?