I am having a problem processing XML with PHP, illustrated by the code below. I have tried to reproduce the error with code that stays as close as possible to the examples on php.net, so I am fairly sure I have not done anything unusual. The functionality itself is a pointless exercise, but it reproduces the error.
The PHP writes an XML file containing numeric character references, reads it back in, parses it and writes the result to another file. This should produce two identical files, but extra characters appear in the second.
For illustration I have used the non-breaking space character (&#160;), which after parsing appears preceded by an Acirc character (&#194;).
I think this might be a character encoding problem, so I have set default_charset = "UTF-8" in php.ini, but this did not solve the problem.
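A quick byte-level check seems consistent with the encoding suspicion. This is only a sketch: it assumes the parser hands character data back as UTF-8, in which U+00A0 (the non-breaking space) is stored as two bytes. The variable names are just for illustration.

```php
<?php
// U+00A0 (non-breaking space) written out as its explicit UTF-8 byte sequence.
$nbsp = "\xC2\xA0";

// Collect the decimal value of each byte, the way a byte-wise chr()/ord() table sees the string.
$bytes = array();
for ($i = 0; $i < strlen($nbsp); $i++) {
    $bytes[] = ord($nbsp[$i]);
}

echo implode(' ', $bytes), "\n"; // prints: 194 160
```

The two values match the &#194; and &#160; pair that shows up in the second file.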
Any help greatly appreciated. Thank you.
My PHP file is:
<?php
$filename = 'test.xml';
$content = "<"."?xml version=\"1.0\" encoding=\"UTF-8\"?".">\n<test>\n\t<p>Add this to the <b>file</b>&#160;&#160;&#160;</p>\n</test>";
/*
*
* File writing adapted from example at http://uk2.php.net/fwrite
*
*/
write_out_to_file ($filename, $content);
// made function as used twice
function write_out_to_file ($filename, $content) {
    if (!$handle = fopen($filename, 'w')) {
        echo "Cannot open file ($filename)";
        exit;
    }
    if (fwrite($handle, $content) === FALSE) {
        echo "Cannot write to file ($filename)";
        exit;
    }
    fclose($handle);
}
/*
*
* XML parsing adapted from example at http://uk.php.net/xml
*
*/
global $output;
$output = "<"."?xml version=\"1.0\" encoding=\"UTF-8\"?".">\n";
$file = "test.xml";
$map_array = array(
"TEST" => "test",
"P" => "p",
"B" => "b"
);
function startElement($parser, $name, $attrs) {
    global $map_array, $output;
    if (isset($map_array[$name])) {
        $output .= "<$map_array[$name]>";
    }
}
function endElement($parser, $name)
{
    global $map_array, $output;
    if (isset($map_array[$name])) {
        $output .= "</$map_array[$name]>";
    }
}
function characterData($parser, $data)
{
    global $output;
    $output .= strictify($data);
}
$xml_parser = xml_parser_create();
// use case-folding so we are sure to find the tag in $map_array
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
/*
*
* strictify adapted from an example in the comments at http://uk2.php.net/chr
*
*/
function strictify ( $string ) {
    $fixed = htmlspecialchars( $string, ENT_QUOTES );
    $trans_array = array();
    for ($i=127; $i<255; $i++) {
        $trans_array[chr($i)] = "&#" . $i . ";";
    }
    $really_fixed = strtr($fixed, $trans_array);
    return $really_fixed;
}
/*
*
* write_out_to_file from above
*
*/
$filename = "test2.xml";
write_out_to_file ($filename, $output);
?>
test.xml looks like:
<?xml version="1.0" encoding="UTF-8"?>
<test>
<p>Add this to the <b>file</b>&#160;&#160;&#160;</p>
</test>
test2.xml looks like:
<?xml version="1.0" encoding="UTF-8"?>
<test>
<p>Add this to the <b>file</b>&#194;&#160;&#194;&#160;&#194;&#160;</p>
</test>
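For comparison, here is a sketch of what a character-wise (rather than byte-wise) version of strictify might look like. It assumes the data reaching characterData() is valid UTF-8 and that PHP 5.3 or later is available for the closure. The names strictify_utf8 and utf8_codepoint are hypothetical helpers, not anything from php.net, and I have not confirmed that this is the right fix.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character into its code point.
function utf8_codepoint($char) {
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // Strip the length-marker bits from the lead byte.
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < $n; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Hypothetical UTF-8-aware variant of strictify(): one reference per character, not per byte.
function strictify_utf8($string) {
    $fixed = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
    // The /u modifier makes the pattern match whole UTF-8 characters outside ASCII.
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        return '&#' . utf8_codepoint($m[0]) . ';';
    }, $fixed);
}

echo strictify_utf8("file\xC2\xA0\xC2\xA0\xC2\xA0"), "\n"; // prints: file&#160;&#160;&#160;
```

Unlike the original, this starts at code point 128 rather than 127, so a DEL character would pass through unchanged.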