I have a class that uses SimpleXML to parse a raw XML feed from Wordpress (RSS2 format) and have a problem with the way SimpleXML handles the HTML entity for an apostrophe. Here's my code:

// step 1: get the feed
$rawFeed = file_get_contents($this->blog_url);
$xml = new SimpleXmlElement($rawFeed);

As soon as I load my XML into a SimpleXML object, it converts all occurrences of

&#8217;

into

â€™

I cannot figure out what combination of flags to pass to the constructor to keep this from happening. I know it has something to do with character encoding, but I've made sure the database is UTF-8 and WordPress is outputting the feed in UTF-8. So I have no idea why the characters are getting converted to something else by PHP's SimpleXML object.

    Just to clarify, are you seeing the "garbage" characters when you do a "view source" of the output in the browser, or if not, where are you seeing this?

      OK, I think I see what you're talking about:

      <?php
      header('Content-Type: text/plain');
      
      $text = "<?xml version='1.0' encoding='UTF-8'?><data><text>What&#8217;s up?</text></data>";
      
      $test = new SimpleXmlElement($text);
      
      print_r($test);
      

      Which outputs:

      SimpleXMLElement Object
      (
          [text] => What’s up?
      )
      

      It sort of seems as if it is running with the LIBXML_NOENT option by default, but a bit of searching did not turn up anything documenting that behavior, nor could I locate anything in bugs.php.net.

      However, in my case it did at least convert it to a right quote character and not the strange characters you reported, so I'm not sure what might be going on in that respect.
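A minimal sketch of what the test above seems to show, assuming a stock PHP build with SimpleXML enabled. Numeric character references such as `&#8217;` are resolved by any conforming XML parser, so LIBXML_NOENT (which governs entity substitution) should make no difference here. The sample document is hypothetical and modelled on the test case above.

```php
<?php
// Hypothetical sample document, modelled on the test case above.
$text = "<?xml version='1.0' encoding='UTF-8'?><data><text>What&#8217;s up?</text></data>";

// Parse once with default options and once with LIBXML_NOENT.
$default = new SimpleXMLElement($text);
$noent   = new SimpleXMLElement($text, LIBXML_NOENT);

$a = (string) $default->text;
$b = (string) $noent->text;

// "What" occupies bytes 0-3, so the decoded apostrophe starts at offset 4.
// U+2019 is stored as the three UTF-8 bytes E2 80 99.
$apostropheHex = bin2hex(substr($a, 4, 3));

echo $apostropheHex, "\n";             // e28099
echo var_export($a === $b, true), "\n"; // true
```

If the bytes are E2 80 99, the parser has produced valid UTF-8 and any "garbage" is introduced later, when those bytes are displayed.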

        Thanks for looking at this. It's hard to describe because you can't easily paste the character code into this forum without it being converted to a single right curly quote. It works if you paste it inside PHP bbcode tags, but if you then edit the post further it is converted into a right single quote. Anyway, in my first post I managed to get it to appear as the character code that is causing the problems. I checked the feed output and it contains that character code when the raw feed is retrieved via file_get_contents(), but as soon as I instantiate the SimpleXML object using the retrieved raw XML feed, the character code is converted into the garbage characters. I tried with and without the LIBXML_NOENT flag when I instantiate the SimpleXML object, but it seems to have no effect. I also tried doing a utf8_decode() on the parsed elements of the feed before they are displayed, but the garbage is converted to a "?" character when I do that. It's very strange.

          Just a stab in the dark here: does anything different happen if you skip the file_get_contents() step and directly load the file via the SimpleXmlElement constructor?

          $xml = new SimpleXmlElement($this->blog_url, null, true);
          
            Thanks for the idea, but it didn't have any effect. Here's a link to the example blog feed so you can see the XML that's causing the trouble. The second item in the feed has the apostrophe character code which seems to be incorrectly converted when I instantiate the SimpleXML object.

            http://75.126.106.225/blog/feed/

            Please let me know if you have any other ideas. I'm starting to think this may be an actual PHP5 or LibXML bug, though I don't see how something like this could escape notice for very long. Seems like anyone parsing Wordpress feeds with SimpleXML would have seen this problem. I have found one interesting bug report on the LibXML bug tracker and there may be more that could apply to my problem. My client's web host is using LibXML v2.6.32 and the latest release is 2.7.x so I should probably try to talk the web host into upgrading to rule that out as a possible cause. The PHP version is 5.2.6.

              Sure seems to be a bug (or undocumented feature?) somewhere in the process. Only thing I've come up with is to change them to "straight" quotes:

              <?php
              $xml = file_get_contents('http://75.126.106.225/blog/feed/');
              $quotes = array(
                 '&#8216;' => "'",
                 '&#8217;' => "'",
                 '&#8220;' => '"',
                 '&#8221;' => '"'
              );
              $xml = str_replace(array_keys($quotes), $quotes, $xml);
              $test = new simpleXmlElement($xml);
              echo "<pre>".print_r($test,1)."</pre>";
              

                Thanks again NogDog. I'm asking the web host to upgrade to the latest version of LibXML to see if that solves it. The SimpleXML object apparently relies heavily (or entirely) on LibXML, so it seems like that's probably where I should be looking. If that doesn't fix it (or if the web host refuses to upgrade) I will probably just end up querying the Wordpress DB directly and forget about these XML encoding problems. Fortunately everything is on the same server, so I don't really need to do this with an XML feed. I've had this working on another site for months with identical code, so there must be something screwy with either the server or maybe even the WordPress feed...perhaps the apostrophes in the feed should be converted to a different html entity rather than the character code being used. "SimpleXML" is an oxymoron. At this point I'm just confused. 😕

                  Just had another thought: you could try using the DOM extension instead of SimpleXML.
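A rough sketch of that suggestion, assuming the DOM extension is enabled and reusing a hypothetical one-element document rather than the real feed. Because DOM and SimpleXML both sit on top of libxml2, the decoded text would be expected to come out the same either way, so this is more a way to cross-check than a likely fix.

```php
<?php
// Hypothetical sample document; the real feed would be loaded the same way.
$text = "<?xml version='1.0' encoding='UTF-8'?><data><text>What&#8217;s up?</text></data>";

// Parse with the DOM extension and read the text of the first <text> element.
$dom = new DOMDocument();
$dom->loadXML($text);
$fromDom = $dom->getElementsByTagName('text')->item(0)->textContent;

// Parse the same document with SimpleXML for comparison.
$sxe = new SimpleXMLElement($text);
$fromSimpleXml = (string) $sxe->text;

// Compare the raw bytes rather than the rendered characters.
echo bin2hex($fromDom), "\n";
echo bin2hex($fromSimpleXml), "\n";
```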

                    euhh did you try the utf8_encode and utf8_decode functions ??

                      Similar problem as highlighted by NogDog

                      I had a problem with translations of extended ISO-8859-1 characters in a similar fashion to your problem.

                      I used utf8_encode / decode as mentioned by Ready F (cheers!) and managed to get over the problem. Problem and solution posted in the other thread.

                        Thanks for all the suggestions, but none of this has worked for me. At best, I end up with question marks replacing the characters that were previously just garbage, but it's still not right. I filed a bug report at PHP.net and they marked it as bogus, but I don't understand what the person meant who closed the bug. I replied to their message and am awaiting clarification. You can follow the bug report here:

                        http://bugs.php.net/bug.php?id=46129

                        It probably isn't a bug since it seems like other people are able to parse other feeds with SimpleXML, but I still don't know what I could possibly be doing wrong in my code at this point. I would love to see anyone write some code using SimpleXML that successfully parses the apostrophes in the title of this feed:

                        http://75.126.106.225/blog/?feed=rss2

                        If someone can verify that this feed can be parsed by SimpleXML without destroying the apostrophe in the title of the first item I'll be happily surprised and interested to know how.

                          Final Diagnosis (I think):
                          It turns out that SimpleXML was doing exactly what it should do. The garbage characters shown in the first post in this thread were simply the proper UTF-8 characters rendered as ISO-8859-1. Despite having the charset defined in the page via:

                          <meta http-equiv="Content-type" content="text/html;charset=UTF-8" />

                          It appears that Apache is still sending the charset header as ISO-8859-1 which is overriding the charset defined in the page. I have added the following directives to an .htaccess file to try to force the encoding to UTF-8 for all php files, but it has no effect:

                          AddDefaultCharset UTF-8
                          AddCharset UTF-8 .php

                          I think the host has disabled these directives in .htaccess, or I'm setting the directives incorrectly. Further suggestions anyone?
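The diagnosis above can be reproduced in a few lines. This is a sketch that assumes the mbstring extension is available, and that the browser treats an ISO-8859-1 label as Windows-1252, which is what browsers generally do. It also shows why utf8_decode() produced a "?": the right single quote has no ISO-8859-1 equivalent.

```php
<?php
// U+2019 (right single quotation mark) as its three UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// Pretend those bytes are Windows-1252 text and re-encode them as UTF-8.
// Each byte becomes its own character, which is the "garbage" sequence.
$misread = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // â€™

// Converting the real character to ISO-8859-1 fails to map, so mbstring
// emits its substitute character (a question mark by default).
$latin1 = mb_convert_encoding($apostrophe, 'ISO-8859-1', 'UTF-8');
echo $latin1, "\n"; // ?
```

In other words, the bytes coming out of SimpleXML are fine. Only the charset the page is served with needs to change.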

                            How about setting it via PHP's header() function?
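For illustration, a minimal sketch of that approach. The helper name contentTypeHeader() is hypothetical and not part of PHP or X-Cart. The call has to run before any output is sent, otherwise PHP cannot change the headers.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a charset.
function contentTypeHeader($charset = 'UTF-8')
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before output has started, so guard the call.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

If an application or framework sends its own Content-Type header later in the request, that later header may replace this one, so it is worth checking the actual response headers in the browser.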

                              You just gave me an idea that solved the problem. I'm doing this inside an X-Cart template and I just realized that there is a little-used feature in the X-Cart languages admin area where you can set the character encoding for whichever language you're editing. I now realize that X-Cart is using this to send a character-encoding header via PHP before it renders the templates. I just logged into the admin interface and edited the "Language" settings for English. The character encoding was set to iso-8859-1. I changed it to UTF-8 and I can hardly believe it...I'm almost embarrassed to say....the apostrophe problem is gone. Ugh. I can't believe how much time I wasted on this.

                              I'm going to go beat myself about the head and shoulders now and then possibly drink too much. Thanks again for your replies.

                                w00t!

                                Don't forget to mark this thread resolved (if it is).
