Thanks for your help. Here is the code I am using to convert a UTF-8 string to Unicode code points. I got it off the web from Scott Reynen, who had published some code for Unicode manipulation. I changed his code a bit to run the value through "dechex" rather than store the character's decimal Unicode value.
I also emailed the author of the function to see if he has any idea what may be happening.
***** BEGINNING OF PHP CODE *****
function utf8_to_unicode( $str ) {
    $unicode = array();
    $values = array();      // bytes collected so far for the current multi-byte character
    $lookingFor = 1;        // number of bytes expected for the current character
    for ( $i = 0; $i < strlen( $str ); $i++ ) {
        $thisValue = ord( $str[ $i ] );
        if ( $thisValue < 128 ) $unicode[] = $thisValue;
        else {
            // A lead byte below 224 starts a 2-byte sequence, otherwise a 3-byte one
            if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
            $values[] = $thisValue;
            if ( count( $values ) == $lookingFor ) {
                $number = ( $lookingFor == 3 ) ?
                    ( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
                    ( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
                // **** Using dechex here ****
                $unicode[] = dechex( $number );
                // **** ORIGINAL CODE ****
                //$unicode[] = $number;
                $values = array();
                $lookingFor = 1;
            } // if
        } // else
    } // for
    return $unicode;
}
***** END OF PHP CODE *****
The function returns an array with one element per character, which is supposed to hold that character's Unicode value in hex.
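To check that I understand the arithmetic in the multi-byte branch, here is the 3-byte case worked by hand for the first character of my string, 上, whose UTF-8 bytes are E4 B8 8A. The variable names $bytes and $hex are just mine for this illustration; the formula is the one from the function above.

```php
<?php
// The three UTF-8 bytes of 上 (U+4E0A)
$bytes = array( 0xE4, 0xB8, 0x8A );

// Same formula as the 3-byte case in utf8_to_unicode():
// low 4 bits of the lead byte, then the low 6 bits of each continuation byte
$number = ( ( $bytes[0] % 16 ) * 4096 )   // 4 * 4096 = 16384
        + ( ( $bytes[1] % 64 ) * 64 )     // 56 * 64  = 3584
        + ( $bytes[2] % 64 );             // 10

$hex = dechex( $number );

echo $number, " ", $hex, "\n";            // prints "19978 4e0a"
```

That matches element [0] of the result, so the multi-byte path seems to be doing what I expect.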
The Chinese string being passed in comes from a UTF-8 encoded page. The string is:
***** START OF STRING *****
上半場完: 曼聯 3 - 列斯聯 1
曼聯
雲尼斯特羅 12,22
保耶 32
列斯聯
阿倫史密夫 27
***** END OF STRING *****
After running the function and doing a print_r on the array, I get this:
***** START OF RESULT *****
Array ( [0] => 4e0a [1] => 534a [2] => 5834 [3] => 5b8c [4] => ff1a [5] => 32 [6] => 66fc [7] => 806f [8] => 32 [9] => 51 [10] => 32 [11] => 45 [12] => 32 [13] => 5217 [14] => 65af [15] => 806f [16] => 32 [17] => 49 [18] => 13 [19] => 10 [20] => 13 [21] => 10 [22] => 66fc [23] => 806f [24] => 13 [25] => 10 [26] => 96f2 [27] => 5c3c [28] => 65af [29] => 7279 [30] => 7f85 [31] => 32 [32] => 49 [33] => 50 [34] => 44 [35] => 50 [36] => 50 [37] => 13 [38] => 10 [39] => 4fdd [40] => 8036 [41] => 32 [42] => 51 [43] => 50 [44] => 13 [45] => 10 [46] => 13 [47] => 10 [48] => 5217 [49] => 65af [50] => 806f [51] => 13 [52] => 10 [53] => 963f [54] => 502b [55] => 53f2 [56] => 5bc6 [57] => 592b [58] => 32 [59] => 50 [60] => 55 )
***** END OF RESULT *****
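The Chinese characters are converted OK, but the other characters (spaces, digits, the CR/LF line breaks) are not converted properly. For example, elements [59] and [60] are 50 and 55, which are the decimal values of "2" and "7"; in hex I would expect 32 and 37.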
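My only guess so far is that the single-byte branch ( $thisValue < 128 ) stores the raw value without going through dechex, since I only added dechex in the multi-byte branch. Below is a sketch of what I think the function would look like with both branches converted. I have not confirmed this against Scott's original, and utf8_to_unicode_hex is just a name I made up to keep it apart from the version above.

```php
<?php
// Hypothetical variant of utf8_to_unicode(): every value goes through dechex().
function utf8_to_unicode_hex( $str ) {
    $unicode = array();
    $values = array();
    $lookingFor = 1;
    $length = strlen( $str );
    for ( $i = 0; $i < $length; $i++ ) {
        $thisValue = ord( $str[ $i ] );
        if ( $thisValue < 128 ) {
            // Single-byte (ASCII) character: convert to hex here as well
            $unicode[] = dechex( $thisValue );
        } else {
            if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
            $values[] = $thisValue;
            if ( count( $values ) == $lookingFor ) {
                $number = ( $lookingFor == 3 )
                    ? ( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 )
                    : ( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
                $unicode[] = dechex( $number );
                $values = array();
                $lookingFor = 1;
            }
        }
    }
    return $unicode;
}

// "保耶 32" written as raw UTF-8 bytes so the test does not depend on file encoding
$result = utf8_to_unicode_hex( "\xE4\xBF\x9D\xE8\x80\xB6 32" );
print_r( $result );   // 4fdd, 8036, 20, 33, 32
```

If that is right, the space, "3" and "2" come out as 20, 33 and 32 instead of 32, 51 and 50.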
Thanks for your help in advance.