Hi guys,

My name's Aidan and this is my first post here. I work for a small company in NZ, and I'm currently trying to work through the unexpected behaviour of the php xml_parse function. 😕

I am having some trouble with the expat-based xml parser in php (xml_parser_create, xml_set_character_data_handler, et al) dropping whitespace. I'm hoping someone else has run into this problem and can help, or at least I can find someone to just share in my frustration... :bemused:

I've wrapped the functions into a nice OO package, and on the whole it's working perfectly. However on PHP4 on at least one platform (I think it's a mac), my character data callback function (set with xml_set_character_data_handler() ) never receives any whitespace, including newlines and carriage returns! For my purpose, this is unacceptable (the newlines/c-returns are highly significant in my CDATA). This problem does not occur on php5 on windows.

Here is an example:
XML

<?xml version='1.0' ?>
<some_text>
This is a short sentence followed by a new line.
Here is a second sentence.
</some_text>

The actual character data from the <some_text> node after parsing would be:

This is a short sentence followed by a new line.Here is a second sentence.
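
For anyone trying to reproduce this, a minimal sketch of the setup described above might look like the following. The function name `collect_cdata` and the `$GLOBALS['chunks']` buffer are made up for illustration and are not taken from the original class. One thing worth checking: expat is allowed to deliver a single run of text in several callback invocations, so a handler that overwrites or trims each piece, rather than appending it, can lose newlines even when the parser passes them through.

```php
<?php
// Hypothetical reproduction; $xml mirrors the sample document above.
$xml = "<?xml version='1.0' ?>\n"
     . "<some_text>\n"
     . "This is a short sentence followed by a new line.\n"
     . "Here is a second sentence.\n"
     . "</some_text>\n";

// Every piece handed to the callback, in arrival order.
$GLOBALS['chunks'] = array();

// Append each piece; the parser may call this several times per text node.
function collect_cdata($parser, $data)
{
    $GLOBALS['chunks'][] = $data;
}

$parser = xml_parser_create();
xml_set_character_data_handler($parser, 'collect_cdata');
$ok = xml_parse($parser, $xml, true);

// Join the pieces without trimming, then make line breaks visible.
$text = implode('', $GLOBALS['chunks']);
echo str_replace(array("\r", "\n"), array('\r', '\n'), $text), "\n";
```

If the printed string shows `\n` between the two sentences, the parser is delivering the newline and the loss happens later in the handling code; if not, the parser build itself is dropping it.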

I've had a wee hunt around in the manual, and I found an undocumented (in my manual) option 'XML_OPTION_SKIP_WHITE', which seems like it should be the culprit in my case. But this option seems to be set to false by default even on my php4/mac target.

No one else seems to mind this behaviour, which is fine, except that I can't find much information about the problem anywhere else on the net.

Has anyone figured out a way around this (stupid) parsing behaviour?

Thanks!

    does xml_parser_set_option(parser,XML_OPTION_SKIP_WHITE,false) not work either?

      konsu wrote:

      does xml_parser_set_option(parser,XML_OPTION_SKIP_WHITE,false) not work either?

      Nope. 🙁

      Although the function returns TRUE (success), and xml_parser_get_option(parser,XML_OPTION_SKIP_WHITE) returns FALSE, the whitespace (i.e. the newlines!) is still skipped.

      I've also tried setting it to TRUE, on a hunch, but that doesn't make a difference either...
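
The check described in this reply can be sketched roughly as follows. As far as I understand it, `XML_OPTION_SKIP_WHITE` is applied when building the arrays for `xml_parse_into_struct()` and not to data passed to a character data handler, which would fit with toggling it making no difference here; treat that as an assumption and not a confirmed diagnosis.

```php
<?php
$parser = xml_parser_create();

// Explicitly switch the option off, then read it back.
$set  = xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 0);
$skip = xml_parser_get_option($parser, XML_OPTION_SKIP_WHITE);

var_dump($set);   // whether the option was accepted
var_dump($skip);  // the value the parser reports afterwards
```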

      Thanks for your reply, konsu.

      😕

        i do not know. it is probably a bug which was fixed in version 5.

        strictly speaking, white space in xml data should not be relied upon.

          konsu wrote:

          i do not know. it is probably a bug which was fixed in version 5.

          Mmm, you're probably right. I'll probably have to look at a workaround involving generating the xml data differently. What a pain in the ass. :glare:

          konsu wrote:

          strictly speaking, white space in xml data should not be relied upon.

          Respectfully, I disagree with this entirely. Although whitespace within tags should not be relied on, whitespace within CDATA should be preserved by all xml parsers. The utility of the xml format is drastically reduced for me as a developer if CDATA whitespace is not guaranteed. I use the CDATA in my application for article text. How would you feel if all your forum posts were stripped of newline characters? :p

          Once again, thanks for your reply though!

            but your xml fragment did not have any <![CDATA[...]]> sections in it...

              Oops, I confused data between tags with a CDATA section... :queasy:

              What I mean is, whitespace in data between tags should be preserved.
              W3C XML Spec: White Space Handling

              An XML processor must always pass all characters in a document that are not markup through to the application.

              In any case, I didn't really make this post to get into a discussion about whether xml parsers should preserve whitespace... 😉 I was hoping someone might have bumped into a workaround that they would like to share.

                well, as a workaround, i would try to put the text into a CDATA section.

                  konsu wrote:

                  well, as a workaround, i would try to put the text into a CDATA section.

                  Excellent idea. I was sure it was going to work, and was excited about the prospect of posting a confirmed solution, but it doesn't... 🙁

                  I also tried to use the ' xml:space="preserve" ' attribute on the affected node, which is said to force preservation of whitespace, but this fails also.
                  (This is mentioned here: W3C XML spec )

                  I give up. My workaround is to put each paragraph into a separate element, instead of using newlines. However, if anyone encounters this problem and resolves it, please post here - I'll be very interested.
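
A rough sketch of that workaround, assuming each paragraph gets its own child element. The `<p>` element name, the handler names, and the `$GLOBALS` buffers are hypothetical; the thread does not say what the real markup looks like. The line breaks are rebuilt by the application, so nothing depends on the parser preserving them.

```php
<?php
// Hypothetical markup: one <p> per paragraph, no significant newlines.
$xml = "<?xml version='1.0' ?>\n"
     . "<some_text>"
     . "<p>This is a short sentence followed by a new line.</p>"
     . "<p>Here is a second sentence.</p>"
     . "</some_text>";

$GLOBALS['paragraphs'] = array();
$GLOBALS['in_p'] = false;

// Element names arrive upper-cased because case folding is on by default.
function start_el($parser, $name, $attrs)
{
    if ($name === 'P') {
        $GLOBALS['in_p'] = true;
        $GLOBALS['paragraphs'][] = '';   // open a new paragraph buffer
    }
}

function end_el($parser, $name)
{
    if ($name === 'P') {
        $GLOBALS['in_p'] = false;
    }
}

// Append text to the paragraph currently being read.
function p_text($parser, $data)
{
    if ($GLOBALS['in_p']) {
        $last = count($GLOBALS['paragraphs']) - 1;
        $GLOBALS['paragraphs'][$last] .= $data;
    }
}

$parser = xml_parser_create();
xml_set_element_handler($parser, 'start_el', 'end_el');
xml_set_character_data_handler($parser, 'p_text');
xml_parse($parser, $xml, true);

// Rebuild the article text with one newline between paragraphs.
$article = implode("\n", $GLOBALS['paragraphs']);
echo $article, "\n";
```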

                  (Thanks for your posts konsu!)

                    did you try to use a DOM parser or something else? also, another thing to try is to use unix style newlines. or msdos style newlines. maybe it makes a difference. just a guess.

                      True ... I will try the DOM XML parser and post my results.

                        24 days later

                        Alrighty, the DOM parser solves my problem (it preserves whitespace on mac php4).

                        Though I've had to bundle both parsers (the 'traditional' expat one and the dom one) because my implementation of the DOM parser doesn't work under php5 (because I'm stupid/lazy).

                        But since expat works on php5, and dom works on php4, I've covered it all I think...
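
For readers on PHP 5 or later, the DOM route could be sketched as below. Note the assumption: this uses the `DOMDocument` class from the PHP 5+ DOM extension, which is not the PHP 4 DOM XML extension used for the fix described above, so it only illustrates the general idea of reading the text through a DOM tree.

```php
<?php
// Same sample document as in the first post.
$xml = "<?xml version='1.0' ?>\n"
     . "<some_text>\n"
     . "This is a short sentence followed by a new line.\n"
     . "Here is a second sentence.\n"
     . "</some_text>\n";

$doc = new DOMDocument();
$doc->preserveWhiteSpace = true;   // already the default; stated for clarity
$loaded = $doc->loadXML($xml);

// textContent returns the element's text with its line breaks intact.
$text = $loaded ? $doc->documentElement->textContent : null;

echo str_replace("\n", '\n', (string) $text), "\n";
```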

                        Thanks for the feedback, konsu.
