Right now I'm using PSPad (Windows), but I've also tried Notepad++ (Windows), KWrite (Linux) and others, and I get the same results in all of them.
I'm starting to wonder whether what I have is actually UTF-8, or whether it might be UTF-16. This is all new to me, so I really don't have a clue what I'm doing. All I know is that this is how it comes out of InDesign CS.
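One way to narrow down the encoding question is to look at the raw bytes. The following is only a rough sketch: guess_encoding() is a made-up helper name, and it assumes that a UTF-16 export starts with a byte order mark, which may not hold for every file.

```php
<?php
// Hypothetical helper (not from php.net): guess the encoding from raw bytes.
function guess_encoding($bytes)
{
    // Assumption: a UTF-16 file starts with a byte order mark (BOM).
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    // With the /u modifier, preg_match() returns false if the subject
    // is not valid UTF-8, so an empty pattern works as a validity check.
    if (preg_match('//u', $bytes) === 1) {
        return 'UTF-8';
    }
    return 'unknown';
}

// E2 80 9C is a left curly double quote encoded as UTF-8.
echo guess_encoding("\xE2\x80\x9CHello"), "\n"; // UTF-8
```

Plain ASCII text also passes the UTF-8 check, so a 'UTF-8' result only means the bytes are consistent with UTF-8, not that the file was deliberately saved that way.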
I found this function on php.net and modified it slightly (adding one character code):
function str_sanitize($input)
{
    $search = array(
        '/[\x60\x82\x91\x92\xb4\xb8]/i', // single quotes
        '/[\x84\x93\x94\xe2]/i', // double quotes
        '/[\x85]/i', // ellipsis ...
        '/[\x00-\x0d\x0b\x0c\x0e-\x1f\x7f-\x9f]/i' // all other non-ascii
    );
    $replace = array(
        '\'',
        '"',
        '...',
        ''
    );
    return preg_replace($search, $replace, $input);
}
I added the \xe2 because echo dechex(ord('bad_char')); returned e2 for the offending character.
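One thing worth checking here: ord() only reports the first byte of its argument, so e2 may be just the start of a longer sequence. The sketch below assumes the bad character is a left curly double quote stored as UTF-8 (three bytes); bin2hex() shows all of them.

```php
<?php
// Assumption: the offending character is U+201C (left curly double quote)
// in UTF-8. Written as byte escapes so the file's own encoding doesn't matter.
$quote = "\xE2\x80\x9C";

echo strlen($quote), "\n";      // 3 (bytes, not characters)
echo dechex(ord($quote)), "\n"; // e2 (first byte only)
echo bin2hex($quote), "\n";     // e2809c (every byte)
```

If that assumption holds, the single curly quote and the long hyphen would start with the same e2 byte, which would fit with all of them being caught by the same \xe2 rule.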
Consider this string:
"“We have just been given two barrels of FEM-12SC for testing purposes,” said Joe. “We have high hopes of incorporating the new technology into our current fire fighting arsenal. It’ll help us learn..."
Notice the single and double quotes in the string. When I run the string through the function, I get this:
"We have just been given two barrels of FEM-12SC for testing purposes," said Joe. "We have high hopes of incorporating the new technology into our current fire fighting arsenal. It"ll help us learn..."
Notice that the curly single quote in "It’ll" was grabbed as well and turned into a double quote. The same thing happens with the long hyphen (the en dash, alt+0150 in Windows). When I take the \xe2 out of the function, it doesn't do anything: neither the single nor the double curly quotes get replaced.
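If the input really is UTF-8, one approach would be to replace whole multi-byte sequences instead of single bytes. This is a minimal sketch, not the php.net function: str_sanitize_utf8() is a hypothetical name, the input is assumed to be UTF-8, and the list only covers the characters mentioned above.

```php
<?php
// Hypothetical variant of str_sanitize() that assumes $input is UTF-8.
function str_sanitize_utf8($input)
{
    // Each key is the complete UTF-8 byte sequence for one character.
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote / apostrophe
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash (alt+0150)
        "\xE2\x80\x94" => '-',   // U+2014 em dash
        "\xE2\x80\xA6" => '...', // U+2026 ellipsis
    );
    // strtr() with an array replaces each key wherever it occurs.
    return strtr($input, $map);
}

echo str_sanitize_utf8("\xE2\x80\x9CIt\xE2\x80\x99ll help,\xE2\x80\x9D said Joe."), "\n";
// "It'll help," said Joe.
```

Because every key is a full three-byte sequence, the single and double quotes stay distinct, unlike a rule that matches on the shared leading \xe2 byte alone.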
I also tried calling utf8_decode() and utf8_encode() on the string before passing it to the function above (just for kicks), and I get question marks in place of the characters. So that obviously doesn't work.
I'm lost... shooting in the dark. Ideas???