sneakyimp;11019229 wrote:
* the char encoding of your PHP script -- unless I'm mistaken, the encoding of a file opened by a text editor is often guessed at depending on the presence or absence of a Byte-Order Mark (BOM) that your editor and/or operating system will place at the beginning of a file
First off, I'd say it's the editor that places the BOM, never the OS. Secondly, the BOM is supposed to indicate the byte order, not the encoding. Do note that the specification allows a BOM in utf-8 files, so there is no violation of any kind, except perhaps a violation of common sense 😉
Since there is only one possible byte order in UTF-8, there is no need or use for a BOM there, but there is a whole lot of potential for problems with it.
First line: the utf-8 BOM (the bytes EF BB BF)
Second line: what would be found in a shell script starting with a shebang but saved with a BOM (the shebang is no longer at the very start of the file, so this won't work)
Third line: what would be found in an included/required php file saved with a BOM (the BOM bytes are emitted as output before the opening tag, which might break setting headers/cookies, or garble output)

EF BB BF
#!/bin/sh
<?php
So unless you have a file system which somehow supports saving metadata, including encoding information, there'd be no certain way to tell. Detection works fine in some cases, lousy in others: see the stackoverflow post + comments for a discussion on the topic. And do note that there are some errors in that discussion, such as claiming that an xml file with no encoding declaration is in utf-8 (it can also be utf-16).
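Detecting the BOM itself is the easy part, since the byte sequences are fixed by the Unicode standard. A minimal sketch (Python here just to keep the byte handling explicit; the function name and return convention are my own):

```python
# The BOM byte sequences are fixed by the Unicode standard.
# Note the utf-32-le BOM starts with the utf-16-le BOM, so the
# longer sequences must be tested first.
BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if a known BOM is present, else (None, 0)."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc, len(bom)
    return None, 0
```

Note that a missing BOM tells you nothing: the file could still be utf-8, or anything else. That is exactly the "lousy in others" part above.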
sneakyimp;11019229 wrote:
which is a text file.
The problem actually begins here. How can you tell if a file is a text file? If you find a file named "a.out" on a windows machine, that might be text output from a program, while on a *NIX it will most likely be "assembler output", i.e. a compiled program built without a specified target file name. My point here is that without specific conventions, you can't tell if a file is this or that. So, just like there are no characters without a charset, only bytes... completely lacking context, there is not even a file type, only bytes. Well, without a file system, there are not even files 🙂
sneakyimp;11019229 wrote:
I'm not super clear on what the conventions are for specifying the text encoding of a text file,
Some things will differ, such as whether to put a BOM in utf-8 files by default, or even letting you specify it explicitly. But what you can always be certain of happening is
1. The text editor is currently treating the file as being of some encoding "X" (whether this is the correct encoding or not doesn't matter, but you'd most likely be able to tell by looking at what is displayed). This results in every byte or sequence of bytes being interpreted as a character under that encoding
2. You specify a specific encoding "Y"
3. When saving the file, each byte or multi-byte sequence that represents some character under X is therefore exchanged for the byte or byte sequence that represents the same character under Y
In the file, there are no characters, only bytes. It is not until you decide to treat a sequence of bytes as being of a specific encoding that you can start talking about characters. The character encoding simply maps one byte or one sequence of bytes into one character.
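The three steps above can be sketched in a few lines (Python for the sake of a runnable example; the mechanism is the same in any language). The bytes only become characters once you pick an encoding to decode them with, and re-encoding those characters under another encoding may produce entirely different bytes:

```python
# The single byte 0xE4 is the character 'ä' under ISO-8859-1 (our "X").
data_x = b"\xe4"

# Step 1: treat the bytes as encoding X; now we have characters.
text = data_x.decode("iso-8859-1")   # the one-character string 'ä'

# Steps 2-3: pick encoding Y (utf-8) and re-encode the same character.
data_y = text.encode("utf-8")        # two bytes: 0xC3 0xA4
```

Same character on screen before and after saving, but the bytes in the file changed.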
sneakyimp;11019229 wrote:
but I believe that the default text editors on various different OSes typically assume some encoding by default.
I'd go so far as to guess that default text editors on some OSes would actually try to guess the encoding, while on others they simply assume the default encoding, or perhaps even assume a specific encoding depending on the currently chosen language.
sneakyimp;11019229 wrote:
* the char encoding that you specify when displaying XML -
The xml declaration has to be written using the ASCII bytes for <?xml ...> (or their utf-16 counterparts when a utf-16 BOM is present).
So, as long as you start parsing an XML file as ASCII (allowing for / disregarding a BOM) you will arrive at an xml declaration either containing encoding="some_encoding" or lacking one. If the XML declaration has no encoding attribute, the encoding is determined by the BOM (allowing for no BOM under utf-8), unless some other way of specifying the encoding was used (e.g. the http content-type header).
But do note that an http header combined with a missing encoding attribute might lead to a later error in encoding interpretation. That is, if you receive an external encoding specification, you should add that encoding to the xml declaration before saving the file (unless you simply retransmit it elsewhere along with proper headers and leave the headache to someone else).
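The parsing order described above can be sketched roughly like this (Python, purely illustrative: real parsers follow Appendix F of the XML spec and handle far more cases; the function name and the regex are my own simplifications for ASCII-compatible input):

```python
import re

def xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding: BOM first, then the declaration."""
    bom = None
    if data.startswith(b"\xef\xbb\xbf"):
        bom, data = "utf-8", data[3:]          # utf-8 BOM is allowed, skip it
    elif data.startswith(b"\xff\xfe"):
        bom = "utf-16-le"
    elif data.startswith(b"\xfe\xff"):
        bom = "utf-16-be"
    if bom in (None, "utf-8"):
        # Parse as ASCII and look for an encoding attribute in the declaration.
        m = re.match(rb'<\?xml[^?]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
        if m:
            return m.group(1).decode("ascii").lower()
    # No usable declaration: the BOM decides, else the utf-8 default.
    return bom or "utf-8"
```

An external specification such as a content-type header would have to be checked before falling back to the default, which is exactly the headache mentioned above.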
sneakyimp;11019229 wrote:
I don't know offhand if a browser will automatically translate character encodings from a user's local encoding to that declared on your page
In short: if you specify only the encoding used for your page (content-type header, meta charset, other), then that encoding will be used. While you could specify accept-charset for your form element, IE won't (wouldn't?) switch to another charset if the user inputs characters outside the specified charset's range; it will alternate between encodings, leaving you with no way of knowing what's what (unless you start testing for validity on a per character basis). I've only read this someplace and don't know the validity of the claim.
Apart from that, browsers should send content-type headers for each part when posting as multipart, but I don't know if they actually do. The reason I've no idea what special stuff does and doesn't work here is simply that I've never had issues. That doesn't necessarily mean there can't be problems, but if you really need to deal with characters outside of utf-8, then you could perhaps serve the page as something else, or failing that, you will hopefully know enough to deal with this anyway 🙂
sneakyimp;11019229 wrote:
defined your tables and/or collations to in your table definitions to use utf8-encoding
It should be noted that collation assigns informative value to the characters as such (and, I suppose, assigns the same information to the bytes of those characters by proxy).
Consider the letters 'a' and 'A'. Which comes first? Without collation and using US-ASCII, 'A' is 0x41 while 'a' is 0x61, so A comes before a. In EBCDIC, the reverse order is true. To make these sort the same way independently of the character set used, you'd need to specify collations that decide whether an 'a' comes before or after an 'A'.
Now, enter the letters ä and Ä. In the collation they are not the same as a and A with the diacritic mark umlaut/diaeresis/trema, while being the exact same thing in the character set; that is, there is no charset difference between swedish ä and german ä. In Swedish, the letters ä and Ä sort at the end of the alphabet, while a and A with trema (visually the same ä and Ä) sort as a and A. So, if you're using mysql and specify a collation of utf8_general_ci, you'd get this sort order
Äa
Az
B
while specifying utf8_swedish_ci (where ä and Ä are actual letters), you'd get
Az
B
Äa
Also note that the collation specifies whether "a" == "ä" or not (general_ci => true, swedish_ci => false).
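The two sort orders can be imitated with hand-rolled sort keys (Python, purely illustrative: both key functions are my own hypothetical stand-ins for what the mysql collations do for these particular strings, not real collation implementations):

```python
import unicodedata

names = ["Az", "B", "Äa"]

def general_ci_key(s: str) -> str:
    # Stand-in for utf8_general_ci: case-insensitive, and accented latin
    # letters compare equal to their base letter (ä sorts as a).
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def swedish_ci_key(s: str):
    # Stand-in for utf8_swedish_ci: case-insensitive, but ä is a letter
    # in its own right that sorts after z.
    return [ord("z") + 1 if c == "ä" else ord(c) for c in s.casefold()]

sorted(names, key=general_ci_key)   # Äa, Az, B
sorted(names, key=swedish_ci_key)   # Az, B, Äa
```

And the equality difference falls out of the same keys: under the general_ci stand-in "a" and "ä" map to the same key, under the swedish_ci stand-in they don't.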