sneakyimp;11019229 wrote:
* the char encoding of your PHP script -- unless I'm mistaken, the encoding of a file opened by a text editor is often guessed at depending on the presence or absence of a Byte-Order Mark (BOM) that your editor and/or operating system will place at the beginning of a file
First off, I'd say it's the editor that places the BOM, never the OS. Secondly, the BOM is supposed to indicate the byte order, not the encoding. Do note that the specification allows a BOM in utf-8 files, so there is no violation of any kind, except perhaps a violation of common sense 😉
Since there is only one possible byte order in UTF-8, there is no need or use for a BOM there, but there is a whole lot of potential for problems with it.
First line: the utf-8 BOM (the bytes EF BB BF)
Second line: what would be found in a shell script starting with a shebang but saved with a BOM (the shebang is no longer at the very start of the file, so this won't work)
Third line: what would be found in an included/required php file saved with a BOM (the BOM bytes are emitted as output before the opening tag, which might break setting headers/cookies, or garble output)

EF BB BF
#!/bin/sh
<?php
So unless you have a file system which somehow supports saving metadata, including encoding information, there'd be no certain way to tell. Detection works fine in some cases, lousy in others: see the stackoverflow post + comments for a discussion on the topic. And do note that there are some errors in that discussion, such as claiming that an xml file with no encoding declaration is in utf-8 (it can also be utf-16).
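Detecting the BOM itself is the easy part, since the byte sequences are fixed by the Unicode standard. A minimal sketch (Python here just to keep the byte handling explicit; the function name and return convention are my own):

```python
# The BOM byte sequences are fixed by the Unicode standard.
# Note the utf-32-le BOM starts with the utf-16-le BOM, so the
# longer sequences must be tested first.
BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if a known BOM is present, else (None, 0)."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc, len(bom)
    return None, 0
```

Note that a missing BOM tells you nothing: the file could still be utf-8, or anything else. That is exactly the "lousy in others" part above.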
sneakyimp;11019229 wrote:
which is a text file.
The problem actually begins here. How can you tell if a file is a text file? If you find a file named "a.out" on a windows machine, that might be text output from a program, while on a *NIX it will most likely be "assembler output", i.e. a compiled program built without a specified target file name. My point here is that without specific conventions, you can't tell if a file is this or that. So, just like there are no characters without a charset, only bytes... completely lacking context, there is not even a file type, only bytes. Well, without a file system, there are not even files 🙂
sneakyimp;11019229 wrote:
I'm not super clear on what the conventions are for specifying the text encoding of a text file,
Some things will differ, such as whether to put a BOM in utf-8 files by default, or even letting you specify it explicitly. But what you can always be certain of happening is
1. The text editor is currently treating the file as being of some encoding "X" (whether this is the correct encoding or not doesn't matter, but you'd most likely be able to tell by looking at what is displayed). This results in every byte or sequence of bytes being interpreted as a character under that encoding
2. You specify a specific encoding "Y"
3. When saving the file, each byte or multi-byte sequence that represents some character under X is therefore exchanged for the byte or byte sequence that represents the same character under Y
In the file, there are no characters, only bytes. It is not until you decide to treat a sequence of bytes as being of a specific encoding that you can start talking about characters. The character encoding simply maps one byte or one sequence of bytes into one character.
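The three steps above can be sketched in a few lines (Python for the sake of a runnable example; the mechanism is the same in any language). The bytes only become characters once you pick an encoding to decode them with, and re-encoding those characters under another encoding may produce entirely different bytes:

```python
# The single byte 0xE4 is the character 'ä' under ISO-8859-1 (our "X").
data_x = b"\xe4"

# Step 1: treat the bytes as encoding X; now we have characters.
text = data_x.decode("iso-8859-1")   # the one-character string 'ä'

# Steps 2-3: pick encoding Y (utf-8) and re-encode the same character.
data_y = text.encode("utf-8")        # two bytes: 0xC3 0xA4
```

Same character on screen before and after saving, but the bytes in the file changed.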
sneakyimp;11019229 wrote:
but I believe that the default text editors on various different OSes typically assume some encoding by default.
I'd go so far as to guess that default text editors on some OSes would actually try to guess the encoding, while on others they simply assume the default encoding, or perhaps even assume a specific encoding depending on the currently chosen language.
sneakyimp;11019229 wrote:
* the char encoding that you specify when displaying XML -
The xml declaration has to be written using the ASCII bytes for <?xml ...> (or their utf-16 counterparts when a utf-16 BOM is present).
So, as long as you start parsing an XML file as ASCII (allowing for / disregarding a BOM) you will arrive at an xml declaration either containing encoding="some_encoding" or lacking one. If the XML declaration has no encoding attribute, the encoding is determined by the BOM (allowing for no BOM under utf-8), unless some other way of specifying the encoding was used (e.g. the http content-type header).
But do note that an http header combined with a missing encoding attribute might lead to a later error in encoding interpretation. That is, if you receive an external encoding specification, you should add that encoding to the xml declaration before saving the file (unless you simply retransmit it elsewhere along with proper headers and leave the headache to someone else).
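The parsing order described above can be sketched roughly like this (Python, purely illustrative: real parsers follow Appendix F of the XML spec and handle far more cases; the function name and the regex are my own simplifications for ASCII-compatible input):

```python
import re

def xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding: BOM first, then the declaration."""
    bom = None
    if data.startswith(b"\xef\xbb\xbf"):
        bom, data = "utf-8", data[3:]          # utf-8 BOM is allowed, skip it
    elif data.startswith(b"\xff\xfe"):
        bom = "utf-16-le"
    elif data.startswith(b"\xfe\xff"):
        bom = "utf-16-be"
    if bom in (None, "utf-8"):
        # Parse as ASCII and look for an encoding attribute in the declaration.
        m = re.match(rb'<\?xml[^?]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
        if m:
            return m.group(1).decode("ascii").lower()
    # No usable declaration: the BOM decides, else the utf-8 default.
    return bom or "utf-8"
```

An external specification such as a content-type header would have to be checked before falling back to the default, which is exactly the headache mentioned above.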
sneakyimp;11019229 wrote:
I don't know offhand if a browser will automatically translate character encodings from a user's local encoding to that declared on your page
In short: if you specify only the encoding used for your page (content-type header, meta charset, other), then that encoding will be used. While you could specify accept-charset for your form element, IE won't (wouldn't?) switch to another charset if the user inputs characters outside the specified charset's range; it will alternate between encodings, leaving you with no way of knowing what's what (unless you start testing for validity on a per character basis). I've only read this someplace and don't know the validity of the claim.
Apart from that, browsers should send content-type headers for each part when posting as multipart, but I don't know if they actually do. The reason I've no idea what special stuff does and doesn't work here is simply that I've never had issues. That doesn't necessarily mean there can't be problems, but if you really need to deal with characters outside of utf-8, then you could perhaps serve the page as something else, or failing that, you will hopefully know enough to deal with this anyway 🙂
sneakyimp;11019229 wrote:
defined your tables and/or collations to in your table definitions to use utf8-encoding
It should be noted that collation assigns informative value to the characters as such (and, I suppose, assigns the same information to the bytes of those characters by proxy).
Consider the letters 'a' and 'A'. Which comes first? Without collation and using US-ASCII, 'A' is 0x41 while 'a' is 0x61, so A comes before a. In EBCDIC, the reverse order is true. To make these sort the same way independently of the character set used, you'd need to specify collations that decide whether an 'a' comes before or after an 'A'.
Now, enter the letters ä and Ä. In the collation they are not the same as a and A with the diacritic mark umlaut/diaeresis/trema, while being the exact same thing in the character set; that is, there is no charset difference between swedish ä and german ä. In Swedish, the letters ä and Ä sort at the end of the alphabet, while a and A with trema (visually the same ä and Ä) sort as a and A. So, if you're using mysql and specify a collation of utf8_general_ci, you'd get this sort order
Äa
Az
B
while specifying utf8_swedish_ci (where ä and Ä are actual letters), you'd get
Az
B
Äa
Also note that the collation specifies whether "a" == "ä" or not (general_ci => true, swedish_ci => false).
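The two sort orders can be imitated with hand-rolled sort keys (Python, purely illustrative: both key functions are my own hypothetical stand-ins for what the mysql collations do for these particular strings, not real collation implementations):

```python
import unicodedata

names = ["Az", "B", "Äa"]

def general_ci_key(s: str) -> str:
    # Stand-in for utf8_general_ci: case-insensitive, and accented latin
    # letters compare equal to their base letter (ä sorts as a).
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def swedish_ci_key(s: str):
    # Stand-in for utf8_swedish_ci: case-insensitive, but ä is a letter
    # in its own right that sorts after z.
    return [ord("z") + 1 if c == "ä" else ord(c) for c in s.casefold()]

sorted(names, key=general_ci_key)   # Äa, Az, B
sorted(names, key=swedish_ci_key)   # Az, B, Äa
```

And the equality difference falls out of the same keys: under the general_ci stand-in "a" and "ä" map to the same key, under the swedish_ci stand-in they don't.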