I am having a problem processing XML with PHP, illustrated by the code below. I have tried to reproduce the error using code that is as close as possible to the examples on php.net, so I am fairly sure I have not done anything unusual. The functionality below is obviously a pointless exercise, but it reproduces the error.

The PHP writes an XML file containing numeric character references, then reads it back in, parses it and writes the result to another file. This should produce two identical files, but extra characters appear in the second.

For illustration I have used the non-breaking space character (&#160;), which after parsing appears preceded by an Acirc character (&#194;).

I think this might be a character encoding problem, so I have set default_charset = "UTF-8" in php.ini, but this did not solve the problem.

Any help greatly appreciated. Thank you.

My PHP file is:

<?php
$filename = 'test.xml';
$content = "<"."?xml version=\"1.0\" encoding=\"UTF-8\"?".">\n<test>\n\t<p>Add this to the <b>file</b>&#160;&#160;&#160;</p>\n</test>";

/*
*
*	File writing adapted from example at http://uk2.php.net/fwrite
*
*/

write_out_to_file ($filename, $content);

// made function as used twice
function write_out_to_file ($filename, $content) {
	if (!$handle = fopen($filename, 'w')) {
		echo "Cannot open file ($filename)";
		exit;
	}

	if (fwrite($handle, $content) === FALSE) {
		echo "Cannot write to file ($filename)";
		exit;
	}

	fclose($handle);
}



/*
*
*	XML parsing adapted from example at http://uk.php.net/xml
*
*/

global $output;
$output = "<"."?xml version=\"1.0\" encoding=\"UTF-8\"?".">\n";

$file = "test.xml";
$map_array = array(
    "TEST"     => "test",
    "P" => "p",
    "B"  => "b"
);

function startElement($parser, $name, $attrs) {
    global $map_array, $output;
    if (isset($map_array[$name])) {
        $output .= "<$map_array[$name]>";
    }
}

function endElement($parser, $name)
{
    global $map_array, $output;
    if (isset($map_array[$name])) {
        $output .= "</$map_array[$name]>";
    }
}

function characterData($parser, $data)
{
    global $output;
    $output .= strictify($data);
}

$xml_parser = xml_parser_create();
// use case-folding so we are sure to find the tag in $map_array
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);



/*
*
*	strictify adapted from an example in the comments at http://uk2.php.net/chr
*
*/
function strictify ( $string ) {

   $fixed = htmlspecialchars( $string, ENT_QUOTES );

   $trans_array = array();
   for ($i=127; $i<255; $i++) {
	   $trans_array[chr($i)] = "&#" . $i . ";";
   }

   $really_fixed = strtr($fixed, $trans_array);

   return $really_fixed;

}



/*
*
* write_out_to_file from above
*
*/
$filename = "test2.xml";
write_out_to_file ($filename, $output);
?>

test.xml looks like:

<?xml version="1.0" encoding="UTF-8"?>
<test>
	<p>Add this to the <b>file</b>&#160;&#160;&#160;</p>
</test>

test2.xml looks like:

<?xml version="1.0" encoding="UTF-8"?>
<test>
	<p>Add this to the <b>file</b>&#194;&#160;&#194;&#160;&#194;&#160;</p>
</test>
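
In case it helps anyone reproduce this, here is a minimal sketch for checking which bytes the parser hands to the character data handler, so it is easier to tell whether the extra character is already present before strictify runs. The function name dump_parsed_bytes is made up for this illustration, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: parse an XML string and return the collected
// character data as a hex string, one pair of digits per byte.
function dump_parsed_bytes($xml) {
    $collected = '';
    $parser = xml_parser_create();

    // The handler can be called more than once for a single text node,
    // so append each chunk instead of overwriting.
    xml_set_character_data_handler($parser, function ($parser, $data) use (&$collected) {
        $collected .= $data;
    });

    if (!xml_parse($parser, $xml, true)) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($parser)),
                    xml_get_current_line_number($parser)));
    }

    return bin2hex($collected);
}

// Show the bytes produced for a single non-breaking space reference.
echo dump_parsed_bytes('<p>&#160;</p>'), "\n";
```

If the hex dump shows more than one byte for the single character, the parser is returning a multi-byte encoding and anything that works byte by byte afterwards will see each byte separately.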

    A note for anyone replying to this: I had some trouble posting the entity references for the characters I was trying to write in my post. Below I have added whitespace between each character to illustrate this:

    At first I simply wrote (strip white space)
    & # 1 6 0 ;

    In preview, this replaced what I had written with an actual non-breaking space (&#160;), so I changed it to (again I have put precisely one white space character between every other character):

    & a m p ; # 1 6 0 ;

    This was reproduced entirely, i.e. "&amp;" was not replaced by the ampersand character. So I went with

    & # 3 8 ; # 1 6 0 ;

    which produces (here, at least) the desired effect of &#160; (ampersand, pound sign, 160, semi-colon).

    I hope this is helpful to anyone trying to write back!

      <?php
      
      // Create a new DOM instance
      $xml = new DomDocument('1.0','utf-8');
      
      // Make the output pretty
      $xml->formatOutput = true;
      
      //Create the test element and append it 
      $test = $xml->createElement('test');
      $xml->appendChild($test);
      
      //Create a CDATA section as html should be within one
      $cdata = $xml->createCDATASection('<p>Add this to the <b>file</b>&#160;&#160;&#160;</p>');
      //Append the cdata to the test node
      $test->appendChild($cdata);
      
      //return the created xml into the output variable
      $output = $xml->saveXML();
      
      
      //write the output to text.xml
      $file_handle = @fopen('test.xml','w+');
      
      if ($file_handle){
      
      	fwrite($file_handle, $output);
      	fclose($file_handle); 
      }
      
      //Now read the XML back with DOM
      $doc = new DOMDocument();
      $doc->load('test.xml');
      
      $test_value = $doc->getElementsByTagName('test')->item(0)->nodeValue;
      
      echo $test_value;
      
      
      //As you can see, utf-8 is no longer an issue.
      test.xml:
      
      <?xml version="1.0" encoding="utf-8"?>
      <test><![CDATA[<p>Add this to the <b>file</b>&#160;&#160;&#160;</p>]]></test>
      
      The HTML output:
      
      <p>Add this to the <b>file</b>&#160;&#160;&#160;</p>

      I used the PHP DOM extension, which as of PHP 5 is very effective, to read and write the XML example you provided. It should be pretty straightforward, and there is a lot more that can be done with the DOM in PHP.
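
As for why the extra character shows up in your original script, my understanding is this: in UTF-8 the non-breaking space (code point 160) is stored as two bytes, 194 and 160, and strictify replaces each byte from 127 upwards on its own, so one character becomes two references. If you want to keep the parser approach, a rough sketch of a version that works on whole characters instead of bytes could look like the following. The names strictify_utf8 and utf8_codepoint are made up for this example, it assumes the input is valid UTF-8, and the closure needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: turn one UTF-8 encoded character into its code point.
function utf8_codepoint($char) {
    $bytes = array_values(unpack('C*', $char));
    $lead = $bytes[0];

    if ($lead < 0x80) {
        return $lead;               // single-byte (ASCII) character
    } elseif ($lead < 0xE0) {
        $cp = $lead & 0x1F;         // lead byte of a 2-byte sequence
    } elseif ($lead < 0xF0) {
        $cp = $lead & 0x0F;         // lead byte of a 3-byte sequence
    } else {
        $cp = $lead & 0x07;         // lead byte of a 4-byte sequence
    }

    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < count($bytes); $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }

    return $cp;
}

// Hypothetical replacement for strictify: escape markup characters, then
// replace every non-ASCII character with one numeric reference.
function strictify_utf8($string) {
    $fixed = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

    // The /u modifier makes the pattern match whole UTF-8 characters
    // rather than single bytes.
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        return '&#' . utf8_codepoint($m[0]) . ';';
    }, $fixed);
}

// "\xC2\xA0" is the UTF-8 byte sequence for a non-breaking space.
echo strictify_utf8("file\xC2\xA0<b>"), "\n";
```

Calling strictify_utf8 from characterData in place of strictify should then give one reference per character. Note that both htmlspecialchars and a /u pattern reject invalid UTF-8, so the input needs to be clean.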

      This is also a good reference, alongside the PHP manual:
      http://www-128.ibm.com/developerworks/opensource/library/os-xmldomphp/

      Hope it helps. Let me know if I can explain anything in more depth for you.

      Mike

        Thanks Mike, this looks very interesting.
