drawmack,
Thanks for the response. I think I should clarify:
Curly quotes, trademark symbols, bullets, em & en dashes, etc. are all fine for my needs. They all sit at or below 0xFF (255) in the single-byte character set I'm using (ISO 8859-1, which browsers treat as Windows-1252, where those punctuation marks live in the 0x80-0x9F range). Above that there are something like 86,000+ other code points, and those produce the "unwanted" characters I'm trying to get rid of.
Run this code sample to see the junk I'm trying to get rid of:
<?php
// Print the numeric character reference for every code point from 256 to 99,999.
for ($i = 256; $i < 100000; $i++) {
    echo "&#$i;<br />\n";
}
?>
So when a user copies and pastes text into the form from MS Word, sometimes the formatting (e.g., colored text, table formatting, indents, etc.) turns into strange invisible characters that I don't believe I can identify with the technique you outlined.
It was one (or more) of those "gremlins" that screwed up the columns in one of my MySQL tables. I know I can try working with UTF-8 and using PHP's mb_ (multibyte) functions, but that's over my head, and from what I've read, it looks like it has a whole other set of headaches.
Ideally I'd love to just have a function that says "for every character in $string, if its code is anything outside the single-byte range 0-255 (ASCII itself only covers 0-127), delete it."
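Roughly what I have in mind is something like the sketch below. It's only a guess at how this might look, and it rests on assumptions I haven't verified for my setup: that the submitted text arrives as valid UTF-8, and that some of the junk may show up as numeric character references like &#8230;. The function name strip_above_255 is just something I made up for illustration.

```php
<?php
// Hypothetical helper (name invented for this post): removes every character
// whose code point is above 255. Assumes $string is valid UTF-8.
function strip_above_255(string $string): string
{
    // Drop numeric character references such as &#8230; when the number
    // is greater than 255; leave references at or below 255 untouched.
    $string = preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            return ((int) $m[1] > 255) ? '' : $m[0];
        },
        $string
    );

    // With the /u modifier the pattern matches whole code points, not bytes,
    // so this deletes anything outside U+0000 to U+00FF.
    $clean = preg_replace('/[^\x{00}-\x{FF}]/u', '', $string);

    // preg_replace() returns null if the input is not valid UTF-8.
    return $clean ?? '';
}
```

For example, strip_above_255("a&#8230;b") should give back "ab". Note that the result is still UTF-8, so a character like é is kept but still takes two bytes; converting to ISO 8859-1 would be a separate step.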
Is this doable? Is it a bad idea?
Thanks again.
Joe