I'm having a really interesting problem and I'm not sure how to get around it.

I've got data coming from InDesign CS (on PC) that goes into a UTF-8 XML template for inclusion into a MySQL 5 database (MyISAM table). This is done with a PHP (5.0.x) importer.

Everything works fine, except that I need the curly ("smart") quotes replaced with standard quotes before the data goes into the database, and I can't seem to get it to work.

I've tried a couple of different code snippets from various PHP sites I found by Googling around, but to no avail.

PHP's strlen() function reports that the left curly quote is 3 characters long. I'm assuming this means I'm looking at multi-byte strings, but I know very little about them or about character sets (just enough to get by most of the time).
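
For context, strlen() counts bytes rather than characters, so a result of 3 is what one would expect if the quote is a single character stored as UTF-8. A minimal sketch, assuming the character is U+201C (written as a byte escape so the script stays pure ASCII) and that the optional mbstring extension may or may not be installed:

```php
<?php
// Assumption: the left curly double quote is U+201C in UTF-8 (bytes E2 80 9C).
$leftQuote = "\xE2\x80\x9C";

// strlen() counts bytes, not characters.
echo strlen($leftQuote), "\n";

// mb_strlen() counts characters when told the encoding (needs mbstring).
if (function_exists('mb_strlen')) {
    echo mb_strlen($leftQuote, 'UTF-8'), "\n";
}
```

If the byte count is 3 and the character count is 1, the data is behaving like ordinary UTF-8.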

And I can't just paste the curly quote into my PHP editor. Depending on which editor I open the script with, the character gets converted into ugliness. That, and it doesn't seem to work for me anyway.
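
One way around the editor problem is to never type the curly quote at all and to spell out its bytes with "\x" escapes instead, so the script contains only ASCII. The following is a sketch, not a definitive fix: it assumes the incoming data is UTF-8, and replace_smart_quotes() is a hypothetical helper name.

```php
<?php
/**
 * Replace UTF-8 curly quotes with plain ASCII quotes.
 * Hypothetical helper; assumes $text is UTF-8 encoded.
 */
function replace_smart_quotes($text)
{
    $map = array(
        "\xE2\x80\x98" => "'", // U+2018 left single quote
        "\xE2\x80\x99" => "'", // U+2019 right single quote / apostrophe
        "\xE2\x80\x9C" => '"', // U+201C left double quote
        "\xE2\x80\x9D" => '"', // U+201D right double quote
    );
    // str_replace() compares raw bytes, so only complete 3-byte
    // sequences are replaced and other characters are left alone.
    return str_replace(array_keys($map), array_values($map), $text);
}

// Prints: "It'll help," said Joe.
echo replace_smart_quotes("\xE2\x80\x9CIt\xE2\x80\x99ll help,\xE2\x80\x9D said Joe."), "\n";
```

Because the replacement works on whole byte sequences, it does not matter which editor or code page the script is saved in.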

Any ideas? This is driving me crazy...

    Right now I'm using PSPad (Win), but I've also used Notepad++ (Win), KWrite (Linux), and others. I get the same results.

    I'm starting to wonder if what I have is actually UTF-8. I'm wondering if it's UTF-16. This is all new to me so I really don't have a clue what I'm doing. All I know is that's how it's coming out of InDesign CS.
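
A quick way to test the UTF-8 theory: in UTF-16 the left curly quote (U+201C) would be two bytes (20 1C or 1C 20, depending on byte order), not three. The sketch below checks whether a string is well-formed UTF-8; looks_like_utf8() is a hypothetical helper, and the Windows-1252 sample is only there as a counterexample.

```php
<?php
// Hypothetical helper: true when $s is a well-formed UTF-8 byte sequence.
function looks_like_utf8($s)
{
    // With the /u modifier, PCRE validates the subject as UTF-8 before
    // matching. The empty pattern matches once on valid input, so any
    // result other than 1 means the bytes are not valid UTF-8.
    return preg_match('//u', $s) === 1;
}

$utf8Quote   = "\xE2\x80\x9C"; // U+201C as UTF-8 (3 bytes)
$cp1252Quote = "\x93";         // the same quote in Windows-1252 (1 byte)

echo looks_like_utf8($utf8Quote) ? "valid UTF-8\n" : "not UTF-8\n";
echo looks_like_utf8($cp1252Quote) ? "valid UTF-8\n" : "not UTF-8\n";
```

Note that plain ASCII text also passes this check, so it can only rule UTF-8 out, not prove it.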

    I found this function on php.net and modified it slightly (adding one character code):

    function str_sanitize($input)
    {
        $search = array(
            '/[\x60\x82\x91\x92\xb4\xb8]/i', // single quotes
            '/[\x84\x93\x94\xe2]/i', // double quotes
            '/[\x85]/i', // ellipsis ...
            '/[\x00-\x0d\x0b\x0c\x0e-\x1f\x7f-\x9f]/i' // all other non-ascii
        );
        $replace = array(
            '\'',
            '"',
            '...',
            ''
        );
        return preg_replace($search, $replace, $input);
    }

    I added the \xe2 because echo dechex(ord ('bad_char')); returned e2.
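
Worth noting: ord() only looks at the first byte of a string, so e2 is just the lead byte of a longer sequence. A small sketch, assuming the bad character is U+201C in UTF-8, that dumps all of its bytes with bin2hex():

```php
<?php
$leftQuote = "\xE2\x80\x9C"; // assumption: U+201C encoded as UTF-8

// ord() inspects only the first byte of the string.
echo dechex(ord($leftQuote)), "\n"; // e2

// bin2hex() dumps every byte of the string.
echo bin2hex($leftQuote), "\n";     // e2809c
```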

    Consider this string:
    "“We have just been given two barrels of FEM-12SC for testing purposes,” said Joe. “We have high hopes of incorporating the new technology into our current fire fighting arsenal. It’ll help us learn..."

    Notice the single and double quotes in the string. When I run the string through the function, I get this:

    "We have just been given two barrels of FEM-12SC for testing purposes," said Joe. "We have high hopes of incorporating the new technology into our current fire fighting arsenal. It"ll help us learn..."

    Notice that the curly single quote was replaced with a double quote as well. The same thing happens with the en dash (Alt+0150 in Windows). When I take the \xe2 out of the function, it doesn't do anything (it catches neither the single nor the double curly quotes).
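
A likely explanation, assuming the data is UTF-8: the curly single quote (E2 80 99), the curly double quotes (E2 80 9C and E2 80 9D), and the en dash (E2 80 93) all start with the byte E2. Without the /u modifier, PCRE treats every byte as a separate character, so \xe2 in the double-quote class matches the first byte of all of them, and the last pattern then strips the remaining bytes. A sketch of a code-point-aware variant follows; str_sanitize_utf8() is a hypothetical name, and mapping dashes to a plain hyphen is an assumption about what is wanted.

```php
<?php
// Hypothetical variant of str_sanitize(); assumes $input is valid UTF-8.
// preg_replace() returns NULL if the input is not valid UTF-8.
function str_sanitize_utf8($input)
{
    $search = array(
        '/[\x{2018}\x{2019}]/u', // curly single quotes
        '/[\x{201C}\x{201D}]/u', // curly double quotes
        '/\x{2026}/u',           // ellipsis
        '/[\x{2013}\x{2014}]/u', // en dash and em dash
    );
    // The /u modifier makes PCRE match whole code points, not bytes.
    $replace = array("'", '"', '...', '-');
    return preg_replace($search, $replace, $input);
}

// Prints: "It'll help us learn..."
echo str_sanitize_utf8("\xE2\x80\x9CIt\xE2\x80\x99ll help us learn\xE2\x80\xA6\xE2\x80\x9D"), "\n";
```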

    I tried running utf8_decode() and utf8_encode() on the string before calling the above function (just for kicks), and I got question marks in place of the characters. So that obviously doesn't work.
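
The question marks are expected: utf8_decode() converts to ISO-8859-1, which has no curly quotes at all. The byte values in the php.net function (\x91 to \x94, \x85) look like Windows-1252 code points, so it was probably written for Windows-1252 input. A sketch of converting to that charset instead, assuming the iconv extension is installed and supports the name "Windows-1252":

```php
<?php
// Assumption: the iconv extension is available and knows "Windows-1252".
$utf8Quote = "\xE2\x80\x9C"; // U+201C as UTF-8

// Windows-1252 stores the left curly double quote as the single byte 0x93.
$cp1252Quote = iconv('UTF-8', 'Windows-1252', $utf8Quote);

echo bin2hex($cp1252Quote), "\n";
```

After such a conversion the single-byte patterns would line up with the data, although the result would then need converting back to UTF-8 before it goes into a UTF-8 database.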

    I'm lost... shooting in the dark. Ideas???
