(cross posted in Newbies)
Hi there.
I am trying to parse a rather large (~14MB) XML document so that I can insert it into a MySQL database. Being fairly inexperienced with PHP, I have a few hurdles ahead of me, including the best way to parse the XML file, but the biggest problem I am running into at the moment is the document's encoding.
The original file (well, I am working with a much smaller subset of it for now) is UTF-8 encoded. The reason for this is that there are some Japanese, Korean and German characters in there.
I have made some progress parsing the document using a script called "class_path_parser" written by Luis Argerich, which I believe makes use of PHP's built-in expat-based XML parser functions. I can walk the document structure fine, and the standard alphabet and numbers display fine when I print them to the screen. Unfortunately, the Japanese characters come up as gibberish.
I have tried everything I can think of: the utf8_decode() function, passing UTF-8 as the encoding parameter when creating a new parser, adding charset=UTF-8 to the header of the HTML file that is output, even using good old MS Word to convert the XML data to Shift_JIS (which my browser displays OK) and trying to parse that.
Regardless of the encoding, the characters always display correctly when viewed as raw XML, but as soon as I try to parse the file and display individual tags with PHP the output of these Japanese characters is junk.
Any ideas, tips, experience to share?
I tried to include some simplified sample data in this post, but unfortunately the UTF-8 characters get converted to their ISO codes, so instead here is a subset of the file I am using.
Thanks in advance. I like to consider myself a fairly thorough problem-solver but this has me stumped.