Hi,

I have a string variable:

$billing['street_address'] = "Præstelængen";

It is in UTF-8:

dd(mb_detect_encoding($billing['street_address']));

// UTF-8

I do the following to make a CSV file:

($billing is now part of the $OrderData array)



		// open the output file for writing
		$fp = fopen($filename, 'w');

		// write one CSV row per record
		foreach ($OrderData as $records) {
			fputcsv($fp, $records);
		}
		fclose($fp);

The string comes out in the CSV file as: Pr?stel?ngen

Googling around, I saw some posts pointing to this:

fputs($fp, $bom = (chr(0xEF) . chr(0xBB) . chr(0xBF)));

But it didn't work for me. It gives me question marks in diamonds (�) instead.

Any ideas?

    markjohnson;11047033 wrote:

    Any ideas?

    The first idea is that whatever you're using to read the CSV file doesn't know the file is UTF-8 encoded, or does know but isn't equipped to display it.
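    If the consumer is Excel, one common workaround is to write a UTF-8 BOM once at the very top of the file, before any rows -- Excel sniffs for it when deciding how to decode the file. This is just a sketch, and it only helps if your strings really are UTF-8 bytes to begin with:

    $fp = fopen($filename, 'w');
    fwrite($fp, "\xEF\xBB\xBF"); // UTF-8 BOM, written once, before any rows
    foreach ($OrderData as $records) {
        fputcsv($fp, $records);  // fputcsv writes the string bytes as-is
    }
    fclose($fp);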

    Gotta love encoding issues... 😉

      When you are dealing with character encoding, you must consider every step from the moment the characters first arrive in your hands to the point where you output them.

      What was the source of these Norwegian characters?
      * submitted in a form? Make sure the form's accept-charset is UTF-8
      * read from a file? You probably don't need to do anything
      * from a database? What is the charset of the db connection (and of the table, column, etc. in the db)? See the sketch after this list.
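
      A rough sketch of the form and database checks (the names and credentials here are made up for illustration):

      // 1. form input: declare the charset on the form itself, e.g.
      //    <form action="order.php" method="post" accept-charset="UTF-8">
      // 2. database input: set the connection charset before querying
      $mysqli = new mysqli('localhost', 'user', 'pass', 'shop'); // made-up credentials
      $mysqli->set_charset('utf8mb4'); // connection now sends/receives UTF-8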

        sneakyimp;11047073 wrote:

        * read from a file? You probably don't need to do anything

        Wouldn't the file's character encoding play a role here?

          In PHP, a string is merely a series of bytes. Unless you are using some kind of string-parsing functions, PHP doesn't really care at all what the encoding of the string is. If you use file_put_contents or file_get_contents to write/read a string to/from a file, this will have no impact at all on the char encoding.
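
          A quick sketch of that byte-oriented view (assuming this snippet is itself saved as UTF-8):

          $s = "æ"; // two bytes in UTF-8: 0xC3 0xA6
          echo strlen($s);             // 2 -- strlen counts raw bytes
          echo mb_strlen($s, 'UTF-8'); // 1 -- one character once you name an encoding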

          The only thing I can think of that might come into play with files is a Byte-Order Mark (BOM), the use of which is optional. On my Ubuntu workstation, I don't think any software writes a BOM.
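
          If you do run into one, a BOM is easy to detect and strip yourself; a minimal sketch ($somefile is hypothetical):

          $data = file_get_contents($somefile);
          if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
              $data = substr($data, 3); // drop the 3-byte UTF-8 BOM
          }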

          Weedpacket effectively explained a lot of these concepts to me a long time ago in this thread, which helped me tremendously.

            I just remember in the past doing a German site for a client where I had to change my default encoding in Komodo to UTF-8 because the characters were not displaying correctly, even when the browser was set to UTF-8. Once I changed the encoding and resaved the file everything displayed correctly. shrug

              Bonesnap;11047165 wrote:

              I just remember in the past doing a German site for a client where I had to change my default encoding in Komodo to UTF-8 because the characters were not displaying correctly, even when the browser was set to UTF-8. Once I changed the encoding and resaved the file everything displayed correctly. shrug

              Yep; if the editor you're using uses the wrong character set when saving the file, then string literals (like [font=monospace]'Præstelængen'[/font]) that appear in the file will be encoded using the wrong character set.

                I'm not sure what the default charset of Komodo might have been, but any given text editor will have its own behavior when it comes to how it encodes text. E.g., if you take some editor (say Notepad or Komodo), create a new empty document, paste in some text from god-knows-where, and then save it as myfile.txt, you might not be able to make any assumptions about what charset is used.

                It's been my experience on Ubuntu that pretty much all text I create this way using nano or gedit is UTF-8. On a Windows machine using Notepad, it's often been Latin-1. In either Ubuntu/Gedit or Windows/Notepad, you can change the character encoding using "Save As..." -- the resulting dialog has a drop-down that lets you specify the encoding you want. I have no idea whether Windows/Notepad will add a Byte-Order Mark (BOM) or not.

                On my Ubuntu machine using Gedit, I opened a new empty document, typed a single A character, and saved it as /tmp/foo.txt. I can inspect its contents using the xxd command:

                $ xxd /tmp/foo.txt
                0000000: 410a                                     A.
                

                This tells me the file is two bytes long: the first is 41 (hex for "A" in both ASCII and UTF-8) and the second is 0a (hex for a line feed in both ASCII and UTF-8). Note that basic ASCII chars (not extended ASCII) are encoded in UTF-8 as the same single byte. There is no BOM in my file.

                If I'm a bit more adventurous and replace the A in my file with æ from your example, this changes things:

                $ xxd /tmp/foo.txt
                0000000: c3a6 0a                                  ...

                My file is now 3 bytes long: æ is encoded with two bytes (c3a6 is the UTF-8 encoding of U+00E6), and the file still ends with a LF char.

                Another key issue is that your PHP source files are themselves saved with a text editor and may or may not be encoded as UTF-8. The PHP language specification itself only makes use of ASCII chars, but if you save a PHP file as UTF-8, the string literals you type in it can contain non-ASCII chars like æ:

                // save this file as utf-8
                $my_utf8_string = "æ";
                

                If you were to save this file as Latin-1 instead of UTF-8, æ would be stored as a different byte sequence: in Latin-1, the single byte e6 rather than c3a6.
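
                You can check what your editor actually saved from inside PHP itself; a quick sketch:

                // "c3a6" if this file was saved as UTF-8, "e6" if Latin-1
                echo bin2hex("æ");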

                So once you're clear on what charset is used for the strings that get INTO your PHP script -- whether by reading external files or by defining string literals in the PHP file itself -- you can start to concentrate on where you plan to send those strings. Want to show them in a browser? Declare your charset as UTF-8 so the browser knows. Want to put them in a database? Make sure your connection to the db uses UTF-8, and make sure the db and/or table and/or column you are storing the string in are all declared as UTF-8.
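
                For the browser case, that declaration is just a header (sent before any other output):

                header('Content-Type: text/html; charset=utf-8');
                echo $my_utf8_string;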

                Getting charsets right is all about understanding the chain of custody for data and making sure that any required translations happen if charsets change in that custody chain. E.g., you might have to translate from utf8 encoding to latin-1 if your database table has a column that is latin-1 encoded.
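
                A sketch of that last case, going from UTF-8 to Latin-1 (the variable names are made up; mb_convert_encoding is the real function):

                $latin1 = mb_convert_encoding($utf8_string, 'ISO-8859-1', 'UTF-8');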
