[Resolved] Parsing UTF-8 Encoded Files

mattymcg

(cross posted in Newbies)

Hi there.

I am trying to parse a rather large (~14MB) XML document so that I can insert it into a MySQL database. Being fairly inexperienced with PHP I have a few hurdles ahead of me, including the best way to parse the XML file, but the biggest problem I am running into at the moment is the document's encoding.

The original file (well, I am working with a much smaller subset of it for now) is UTF-8 encoded. The reason for this is that there are some Japanese, Korean and German characters in there.

I have made some progress parsing the document using a script called "class_path_parser" written by Luis Argerich, which I believe makes use of PHP's built-in expat-based XML parser functions. I can walk the document structure fine, and the standard alphabet and numbers display fine when I print them to the screen. Unfortunately the Japanese characters come up as gibberish.

I have tried everything I can think of: the utf8_decode() function, passing UTF-8 as the parameter when declaring a new parser, adding charset=UTF-8 in the header of the HTML that is output, even using good old MS Word to convert the XML data to Shift_JIS (which my browser displays ok) and trying to parse that.

Regardless of the encoding, the characters always display correctly when viewed as raw XML, but as soon as I try to parse the file and display individual tags with PHP the output of these Japanese characters is junk.
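For reference, the parser setup described above can be sketched roughly as follows. This is only a minimal sketch: collect_text() is a hypothetical helper name, class_path_parser itself is not shown in this thread, and whether the target-encoding option matters depends on the PHP version (older releases defaulted to ISO-8859-1 output, which cannot hold Japanese text).

```php
<?php
// Hypothetical helper: gathers all character data from an XML string
// using PHP's expat-based parser, asking for UTF-8 in and UTF-8 out.
function collect_text(string $xml): string
{
    $parser = xml_parser_create('UTF-8');
    // Keep the text handed to the handlers in UTF-8 rather than ISO-8859-1.
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

    $text = '';
    // Character data may arrive in several chunks, so append each one.
    xml_set_character_data_handler($parser, function ($p, string $data) use (&$text) {
        $text .= $data;
    });

    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $text;
}

// Sample document made up for illustration; not taken from the real file.
$xml = '<title>日本語</title>';
echo collect_text($xml), "\n";
```

If the bytes coming out of a handler like this are still valid UTF-8, the remaining suspects are the output header and any later conversion step.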

Any ideas, tips, experience to share?

I tried to include some simplified sample data in this post but unfortunately the UTF characters get converted to their ISO codes, so instead here is a subset of the file I am using.

Thanks in advance. I like to consider myself a fairly thorough problem-solver but this has me stumped.

AstroTeg

Double check and make sure the encoding type (I think that's what it's called) you're sending back is UTF-8 or a flavor of Unicode.

UTF-8 is just a multi-byte encoding of Unicode characters. The trick is that the tool (be it PHP or your browser) has to know which encoding it is looking at in order to make sense of the bytes. In theory, PHP doesn't have to do a single thing, unless it's converting the bytes above 127 (the ones outside the ASCII range) to something different, but I doubt it's doing that. Have you fiddled with all your browser's character encoding types (you mentioned JIS)? Try setting the encoding type to UTF-8 and see how it goes. If it works, then you need to have PHP send the encoding type as UTF-8.
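Having PHP send the encoding type could look something like the sketch below. html_content_type() is a hypothetical helper introduced only for illustration; header() and headers_sent() are the standard PHP functions, and the call has to happen before any output is written.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function html_content_type(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers can only be changed before the first byte of output is sent.
if (!headers_sent()) {
    header(html_content_type());
}

// Escape with the same charset that the header declares.
echo '<p>', htmlspecialchars('日本語', ENT_QUOTES, 'UTF-8'), '</p>';
```

A real HTTP header generally takes precedence over a meta tag in the page, so a server that sends a different charset can override what the HTML declares.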

Of course, you may have done all of this and/or this may not fix anything...

mattymcg

Have just been playing around a bit more, and learned that I hadn't installed or enabled the PHP multibyte string extension (mbstring). I am on Win32 so I googled for the DLL and installed it (hope the version is ok and compatible???)

I can confirm that the encoding is indeed UTF-8 as I just made use of my first of the functions available with this package - mb_detect_encoding(). It hasn't got me any Japanese characters on the screen yet but I know I am close. Unfortunately it is also late so will have to wait for tomorrow...
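A check along those lines might look like the sketch below. looks_like_utf8() is a hypothetical helper name, and the sample strings are made up for illustration. mb_check_encoding() answers "are these bytes valid UTF-8?" directly, whereas mb_detect_encoding() only guesses among the candidates it is given, so the result depends on the list and on strict mode.

```php
<?php
// Hypothetical helper: true when the bytes form valid UTF-8.
function looks_like_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8 = '日本語';
// The same text re-encoded as Shift_JIS, with the source encoding stated explicitly.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

var_dump(looks_like_utf8($utf8)); // bool(true)
var_dump(looks_like_utf8($sjis)); // bool(false)

// Detection with an explicit candidate list and strict mode enabled.
echo mb_detect_encoding($sjis, ['UTF-8', 'SJIS'], true), "\n";
```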

Like I said, the browser has no problem displaying Shift_JIS or UTF-8 encoded files, because I can view the raw XML file fine in each of those encodings. It is only the output of the PHP files that it won't display correctly.

Will sleep on it. Anyone with experience in this, please comment.

mattymcg

No progress yet, so here is some sample code I am using and a test file. Can anyone get this to correctly display Japanese characters on their machine? Fiddling with different settings for mbstring doesn't seem to make any difference (note you will possibly need Asian fonts installed for this to have a chance of working).

<html>
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>

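<?php
// Read the whole test file (expected to be UTF-8) into a string.
$file = "utf8_test.txt";
$contents = file_get_contents($file);
if ($contents === false) {
    die("Could not read $file");
}

// Split on the blank-line separator used in the test file.
$contents_array = explode("\n\n\n\n", $contents);
// Name the source encoding explicitly instead of relying on mbstring's internal default.
$SJIS_Var = mb_convert_encoding($contents, "SJIS", "UTF-8");

echo "Contents encoding is " . mb_detect_encoding($contents, ["UTF-8", "SJIS"], true) . "...<br />";
echo $contents;
echo "<hr />";

// These are Shift_JIS bytes, so they will not render correctly in a page declared as UTF-8.
echo "SJIS_Var encoding is " . mb_detect_encoding($SJIS_Var, ["UTF-8", "SJIS"], true) . "...<br />";
echo $SJIS_Var;
echo "<hr />";

// The original passed an undefined $Var here; check the first exploded chunk instead.
echo "Exploded array encoding is " . mb_detect_encoding($contents_array[0], ["UTF-8", "SJIS"], true) . "...<br />";
echo $contents_array[0];
echo "<hr />";
?>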

</body>
</html>

mattymcg

Forgot to attach the test file...

mattymcg

Ok, it's working

Because, and I quote, "utf8_decode can not properly store 'wide' code entities which have a numeric value too big for a byte", I had to use the ingenious numeric_entify_utf8() subroutine submitted by Morris Hirsch in the comments of the utf8_decode manual page.

It now works a treat. On to the XML coding. Thank you Morris if you ever read this!
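The numeric_entify_utf8() routine itself is not reproduced in this thread. As a rough modern equivalent (an assumption, not Morris Hirsch's code), mbstring's mb_encode_numericentity() can turn every non-ASCII character into an HTML numeric entity. to_numeric_entities() below is a hypothetical wrapper name. The reason utf8_decode() cannot help is that it converts to ISO-8859-1, a single-byte character set with no Japanese characters in it; the function is also deprecated as of PHP 8.2.

```php
<?php
// Hypothetical wrapper: replaces every character above ASCII with &#NNNN;.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// ASCII passes through untouched; the two kanji become decimal entities.
echo to_numeric_entities('Tokyo 東京'), "\n"; // Tokyo &#26481;&#20140;
```

Because the result is pure ASCII, it displays the same whatever charset the page declares, at the cost of larger output. Sending real UTF-8 with a matching header is normally the simpler route once the parser output is confirmed to be intact.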