Hi all,

I have always struggled with character sets and the like, but I've managed so far, even though I don't fully understand them. I thought it would be a good idea to play with this a bit more and learn as I go... but omg charsets are HELL! :mad:

I gave myself an assignment: create a database with country information, ISO codes, country names in different translations and so on. I thought it would be smart to create a UTF-8 database instead of an ISO-8859-1 database, because not all alphabets are supported by ISO-8859-1 and I may want to add Greek translations...

When I started scripting I found out that PHP defaults to ISO-8859-1, so simple string operations like strtolower or an ereg_replace simply return corrupted strings. One solution could be utf8_decode -> strtolower -> utf8_encode... That would be OK if you hardly ever need to 'do' stuff to a string (although utf8_decode turns anything outside ISO-8859-1, such as Greek, into '?'). But when you're constantly working with strings it becomes a pain, and besides, it takes up processing time!

I've looked into setlocale but don't understand much of it. What I do understand is that if you create distributable code, you shouldn't really mess with that.

I'm really stuck... I want to create a multilingual, multi-alphabet application including a database, but I would love to keep working with PHP's built-in functions like strtolower etc. How should I go about this?

I found this: http://sourceforge.net/projects/phputf8
But I can't believe PHP is so narrow-minded that we need external libraries to work with multi-alphabet content?!?! :queasy:

How do you guys work with this problem? Does anyone know of a GOOD and very explanatory tutorial on this?

Cheers,
Hendricus

    Hmmz, I found an interesting article here that explains things a bit:
    http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet

    Yet wouldn't it be cool if PHP automatically checked a string for multibyte content and then used mb_strtolower() instead of strtolower()? That way one could write scripts without having to take charsets into account. Or is there some cool feature I'm overlooking?
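
    A rough sketch of what such an automatic check could look like in userland, assuming the mbstring extension is loaded and the strings are valid UTF-8. The function name utf8_aware_strtolower() is made up for this example; it is not part of PHP.

    ```php
    <?php
    // Hypothetical helper (not part of PHP): lowercases a string, using
    // mb_strtolower() only when the string contains multibyte characters.
    function utf8_aware_strtolower($string)
    {
        // strlen() counts bytes, mb_strlen() counts characters. For valid
        // UTF-8 the two differ when at least one character is multibyte.
        if (strlen($string) !== mb_strlen($string, 'UTF-8')) {
            return mb_strtolower($string, 'UTF-8');
        }
        return strtolower($string);
    }

    echo utf8_aware_strtolower('HELLO'), "\n";  // ASCII only: strtolower()
    echo utf8_aware_strtolower('ΕΛΛΆΔΑ'), "\n"; // Greek: mb_strtolower()
    ```

    In practice the check probably buys little, since mb_strtolower() handles plain ASCII fine as well, so calling it everywhere is simpler.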

      I don't know of any good tutorials, but if the mbstring extension is enabled, I've had success using the mb_convert_encoding() function. That, along with starting your MySQL session with the following, seems to get everything in sync:

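      <?php
      // Connect with the legacy mysql extension (removed in PHP 7).
      $connx = mysql_connect(<connx params>);
      // Tell MySQL the client talks UTF-8. utf8_unicode_ci is a collation,
      // not a character set, so it belongs in the COLLATE clause.
      mysql_query("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'");
      // Convert the submitted text to UTF-8 before storing it.
      $text = mb_convert_encoding($_POST['text'], 'UTF-8');
      mysql_query("INSERT INTO `table` (`text`) VALUES ('".mysql_real_escape_string($text)."')");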
      

      PS: One of the supposed enhancements of PHP 6 will be improved, native support for multi-byte character sets.
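
      A small sketch of the conversion step on its own, with the source encoding passed explicitly instead of left to the configured default. The helper name to_utf8() and the ISO-8859-1 fallback are assumptions for illustration; pick whatever legacy encoding your visitors' input is likely to arrive in.

      ```php
      <?php
      // Hypothetical helper: returns $input as UTF-8. Input that is not valid
      // UTF-8 is ASSUMED to be ISO-8859-1 unless another $fallback is given.
      function to_utf8($input, $fallback = 'ISO-8859-1')
      {
          if (mb_check_encoding($input, 'UTF-8')) {
              return $input; // already valid UTF-8, leave untouched
          }
          return mb_convert_encoding($input, 'UTF-8', $fallback);
      }

      $latin1 = "caf\xE9";                   // "café" encoded as ISO-8859-1
      echo bin2hex(to_utf8($latin1)), "\n";  // 636166c3a9
      ```

      Checking first avoids double-encoding text that was already UTF-8, which is an easy way to end up with garbage in the database.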

        Hey NogDog,

        Thanks for the reply... well, I'm getting closer to the solution. To make PHP default to UTF-8 instead of ISO-8859-1 you can change some php.ini settings or do so through .htaccess.

        Before I made the changes phpinfo() returned:

        Directive                       Local Value   Master Value
        mbstring.detect_order           no value      no value
        mbstring.encoding_translation   Off           Off
        mbstring.func_overload          0             0
        mbstring.http_input             pass          pass
        mbstring.http_output            pass          pass
        mbstring.internal_encoding      no value      no value
        mbstring.language               neutral       neutral
        mbstring.strict_detection       Off           Off
        mbstring.substitute_character   no value      no value
        
        ### Use multibyte functions by default, so strtoupper automatically becomes mb_strtoupper
        php_value func_overload 		7
        ### Set default language to Neutral(UTF-8) (default)
        php_value mbstring.language 		 "Neutral"
        ### Set default internal encoding to UTF-8
        php_value mbstring.internal_encoding	 "UTF-8"
        ### HTTP input encoding translation is enabled
        php_value mbstring.encoding_translation	 "On"
        ### Set HTTP input character set detection to auto
        php_value mbstring.http_input		 "auto"
        ### Set HTTP output encoding to UTF-8
        php_value mbstring.http_output		 "UTF-8"
        ### Set default character encoding detection order to auto
        php_value mbstring.detect_order		 "auto"
        ### Do not print invalid characters
        php_value mbstring.substitute_character	 "none"
        ### Default character set for auto content type header
        php_value default_charset 		 "UTF-8"
        

        After I added these lines to a .htaccess file, phpinfo() returns:

        Directive                       Local Value   Master Value
        mbstring.detect_order           auto          no value
        mbstring.encoding_translation   On            Off
        mbstring.func_overload          0             0
        mbstring.http_input             auto          pass
        mbstring.http_output            UTF-8         pass
        mbstring.internal_encoding      UTF-8         no value
        mbstring.language               Neutral       neutral
        mbstring.strict_detection       Off           Off
        mbstring.substitute_character   no value      no value
        

        So everything is changed to my settings except mbstring.func_overload, which stays at its default of 0. I also tried ini_set("mbstring.func_overload", 7); in the PHP code itself, to no avail! I don't want to mess with php.ini because I want to create distributable code... besides, I can only touch php.ini on my dev server, not on the server that will be running this script once I get it straightened out!

        Any thoughts?

          Sometimes it is better to leave it alone and check back after a break:

          php_value func_overload 7

          needs to be

          php_value mbstring.func_overload 7

          Problem solved!
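
          For distributable code it may still be safer not to depend on the overload at all: the host might not allow .htaccess overrides, and mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0. A minimal sketch, assuming only that the mbstring extension is loaded, that checks the setting at runtime and calls the mb_ function directly:

          ```php
          <?php
          // Pin the encoding once; mb_* calls without an explicit encoding
          // argument then work in UTF-8.
          mb_internal_encoding('UTF-8');

          // ini_get() returns false when the directive does not exist
          // (PHP 8 and later), which casts to 0 here.
          $overload = (int) ini_get('mbstring.func_overload');

          // Bit 2 of the value covers the string functions such as strtolower().
          $strOverloaded = ($overload & 2) === 2;
          echo $strOverloaded
              ? "strtolower() is overloaded\n"
              : "strtolower() is byte-based here\n";

          // Calling the mb_ function directly gives the same result either way.
          echo mb_strtolower('ÅNGSTRÖM'), "\n";
          ```

          That way the script behaves the same on the dev server and on a host where the setting cannot be touched.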
