[RESOLVED] ISO-8859 / UTF-8 and PHP behaviour!

Hendricus · Mar 23, 2008

Hi all,

I have always been struggling with character sets and such, but I've managed so far... even though I don't fully understand it. I thought it would be a good idea to play with this a bit more and thus learn as I go... but omg charsets are HELL! :mad:

I gave myself an assignment to create a database with country information, ISO codes, country names in different translations and such. I thought it would be smart to create a UTF-8 database instead of an ISO-8859-1 database. The reason being: not all alphabets are supported by ISO-8859-1, and maybe I want to add Greek translations...

When I started scripting I found out that PHP defaults to ISO-8859-1. So simple string operations like strtolower or an ereg_replace simply return corrupted strings. A solution could be utf8_decode -> strtolower -> utf8_encode... That would be OK if you hardly ever want to 'do' stuff to a string. But when you're constantly working with them it becomes a pain, and besides, it takes up processing time!
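A minimal sketch of the alternative to the utf8_decode -> strtolower -> utf8_encode round trip: calling mb_strtolower() with an explicit encoding. It assumes the mbstring extension is loaded. The sample strings are made up for illustration and written as byte escapes, so the result does not depend on how the script file itself is saved.

```php
<?php
// "\xC3\x84" is the UTF-8 encoding of "Ä"; "\xCE\xA9" is the Greek capital "Ω".
$german = "\xC3\x84PFEL";
$greek  = "\xCE\xA9";

// Passing 'UTF-8' explicitly avoids relying on mbstring.internal_encoding.
$germanLower = mb_strtolower($german, 'UTF-8');
$greekLower  = mb_strtolower($greek, 'UTF-8');

// Hex dumps show the byte-level result regardless of the terminal's charset.
echo bin2hex($germanLower), "\n"; // c3a47066656c ("äpfel")
echo bin2hex($greekLower), "\n";  // cf89 ("ω")
```

The Greek case is the one the round trip cannot handle at all: ISO-8859-1 has no Greek letters, so the text is already lost at the utf8_decode step.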

I've looked into setlocale, but I don't understand much of it. What I do understand is that, if you create distributable code, you shouldn't really mess with that.

I'm stuck really... I want to create a multi-lingual/multi-alphabet application including a database, but would love to work with PHP's built-in functions like strtolower etc. How should I go about this problem?

I found this: http://sourceforge.net/projects/phputf8
But I can't believe PHP is so narrow-minded that we need external libraries to work with multi-alphabet content?!?! :queasy:

How do you guys work with this problem? Does anyone know of a GOOD and very explanatory tutorial on this?

Cheers,
Hendricus

Hendricus · Mar 24, 2008

Hmmz, I found an interesting article that explains things a bit:
http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet

Yet wouldn't it be cool if PHP automatically checked a string for multibyte content and then used mb_strtolower() instead of strtolower()?? This way one could write scripts without having to take charsets into account. Or is there some cool feature I'm overlooking?
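One way to approximate that in your own code is a small wrapper, sketched below. The name utf8_lower is hypothetical (it is not a PHP built-in), and the sketch assumes all input is already UTF-8. When mbstring is missing it falls back to lower-casing only A-Z, which leaves multibyte sequences untouched instead of corrupting them.

```php
<?php
// Hypothetical helper, not part of PHP: lower-case a UTF-8 string.
function utf8_lower($text)
{
    if (function_exists('mb_strtolower')) {
        // Preferred path: full Unicode-aware lower-casing.
        return mb_strtolower($text, 'UTF-8');
    }

    // Fallback: translate ASCII letters only. Every byte of a multibyte
    // UTF-8 sequence is >= 0x80, so none of them is touched by strtr().
    return strtr(
        $text,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

echo utf8_lower('Hello World'), "\n"; // hello world
```

With the fallback, accented or non-Latin capitals stay upper case, which is incomplete but never produces invalid UTF-8.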

NogDog · Mar 24, 2008

I don't know of any good tutorials, but if the mbstring extension is enabled, I've had success using the [man]mb_convert_encoding[/man] function. This, along with starting your MySQL session with the following, seems to get everything in sync:

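<?php
// Connection parameters are elided here.
$connx = mysql_connect(/* connx params */);
// utf8_unicode_ci is a collation, not a character set, so it belongs in the COLLATE clause.
mysql_query("SET NAMES utf8 COLLATE utf8_unicode_ci");
// No source encoding is given, so mbstring's internal encoding is assumed for the input.
$text = mb_convert_encoding($_POST['text'], 'UTF-8');
mysql_query("INSERT INTO `table` (`text`) VALUES ('".mysql_real_escape_string($text)."')");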

PS: One of the supposed enhancements of PHP6 will be improved, native support for multi-byte character sets.

Hendricus · Mar 24, 2008

Hey NogDog,

Thanx for the reply... well, I'm getting closer to the solution. To make PHP default to UTF-8 instead of ISO-8859-1 you can change some php.ini settings, or do so through .htaccess.

Before I made the changes, phpinfo() returned:

Directive                        Local Value   Master Value
mbstring.detect_order            no value      no value
mbstring.encoding_translation    Off           Off
mbstring.func_overload           0             0
mbstring.http_input              pass          pass
mbstring.http_output             pass          pass
mbstring.internal_encoding       no value      no value
mbstring.language                neutral       neutral
mbstring.strict_detection        Off           Off
mbstring.substitute_character    no value      no value

### Use multibyte functions by default, so strtoupper automatically becomes mb_strtoupper
php_value func_overload 		7
### Set default language to Neutral(UTF-8) (default)
php_value mbstring.language 		 "Neutral"
### Set default internal encoding to UTF-8
php_value mbstring.internal_encoding	 "UTF-8"
### HTTP input encoding translation is enabled
php_value mbstring.encoding_translation	 "On"
### Set HTTP input character set detection to auto
php_value mbstring.http_input		 "auto"
### Set HTTP output encoding to UTF-8
php_value mbstring.http_output		 "UTF-8"
### Set default character encoding detection order to auto
php_value mbstring.detect_order		 "auto"
### Do not print invalid characters
php_value mbstring.substitute_character	 "none"
### Default character set for auto content type header
php_value default_charset 		 "UTF-8"

After I added these lines to a .htaccess file, phpinfo() returns:

Directive                        Local Value   Master Value
mbstring.detect_order            auto          no value
mbstring.encoding_translation    On            Off
mbstring.func_overload           0             0
mbstring.http_input              auto          pass
mbstring.http_output             UTF-8         pass
mbstring.internal_encoding       UTF-8         no value
mbstring.language                Neutral       neutral
mbstring.strict_detection        Off           Off
mbstring.substitute_character    no value      no value

So everything is changed to my settings except for mbstring.func_overload, which stays at its default of 0. I also tried ini_set("mbstring.func_overload", 7); in the PHP code itself, to no avail! I don't want to mess with php.ini because I want to create distributable code... besides, I can only touch the php.ini on my dev server, not on the server which will be running this script once I get it straightened out!

Any thoughts?

Hendricus · Mar 24, 2008

Sometimes it is better to leave it alone and check back after a break:

php_value func_overload 7

needs to be

php_value mbstring.func_overload 7

Problem solved!
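Two hedged notes on why the ini_set() attempt had no effect. As far as the PHP manual documents it, mbstring.func_overload cannot be changed at runtime, so .htaccess or php.ini are the only places to set it. It was also deprecated in PHP 7.2 and removed in PHP 8.0, so distributable code probably should not rely on it. Below is a small sketch for checking the effective value from a script; the function name overload_status is made up for this example.

```php
<?php
// Hypothetical helper: report whether mbstring overloads the string functions.
function overload_status()
{
    // ini_get() returns false when the directive does not exist,
    // e.g. mbstring is not loaded or the PHP version no longer has it.
    $value = ini_get('mbstring.func_overload');

    if ($value === false) {
        return 'unavailable';
    }

    // Bit value 2 covers the string functions such as strtolower() and strlen();
    // 1 is mail() and 4 is the ereg family, so 7 enables all three groups.
    return ((int) $value & 2) ? 'on' : 'off';
}

echo 'string function overloading: ', overload_status(), "\n";
```

A script that depends on the overload could refuse to run, or switch to explicit mb_* calls, whenever this does not report 'on'.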
