Help with preg_split and XML

Rodoom · Sep 27, 2005

Hello:

I need help to parse the content of an XML file so that I can store all the data fields into a database.

The structure of the XML is something like this:

<Licitacion>
<Numero>000060040092005</Numero>
<Modificada>1</Modificada>
<PrecioBases>$881.00</PrecioBases>
<Estado>9</Estado>
<Dependencia>SECRETARÍA DE HACIENDA Y CRÉDITO PÚBLICO</Dependencia>
<AceptaElectronica>SI</AceptaElectronica>
<Caracter>NACIONAL</Caracter>
<Publicacion>06/09/2005</Publicacion>
<Tipo>SERVICIOS</Tipo>
<LimiteBases>21/09/2005</LimiteBases>
<Junta>20/09/2005, 10:00</Junta>
<AperturaTecnica>27/09/2005, 11:00</AperturaTecnica>
<AperturaEconomica>27/09/2005, 11:01</AperturaEconomica>
<Visita></Visita>
<AreaConsulta>DIRECCIÝN DE CONTRATACIÝN DE SERVICIOS EN INTERNET EN HTTP://COMPRANET.GOB.MX</AreaConsulta>
<Domicilio>CALZADA DE LA VIRGEN No. 2799, EDIFICIO "C", SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÝN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL</Domicilio>
<DiasVenta>LUNES A VIERNES</DiasVenta>
<HorarioVenta>9:00 A 14:00</HorarioVenta>
<Telefono>91-58-45-89</Telefono>
<Fax>91-58-45-97</Fax>
<FormadePago></FormadePago>
<LugarEntrega>CENTRO DE CÝMPUTO INSTITUCIONAL TRIARA, SITA EN LIBRAMIENTO A SANTA ROSA NÝMERO 111, COL. FUTURO APODACA, CÝDIGO POSTAL 66600, APODACA, NUEVO LEÝN</LugarEntrega>
<DiasEntrega>DE LUNES A DOMINGO LOS 365 DÝAS DEL AÝO</DiasEntrega>
<HorarioEntrega>LAS 24 HORAS DEL DÝA</HorarioEntrega>
<PlazoEntrega>DEL 17 DE OCTUBRE DE 2005 AL 16 DE OCTUBRE DE 2008 (36 MESES)</PlazoEntrega>
<UnidadCompradora>
<Nombre>DIRECCIÓN DE CONTRATACIÓN DE SERVICIOS</Nombre>
<DomicilioUnidad>CALZADA DE LA VIRGEN No. 2799, EDIFICIO C, SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÁN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL</DomicilioUnidad>
</UnidadCompradora>
<Partidas>
<DetallePartida>
<NumeroPartida>1</NumeroPartida>
<CABMS>C810800000</CABMS>
<Descripcion>SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP </Descripcion>
<Cantidad>1</Cantidad>
<Unidad>SERVICIO</Unidad>
<EspecificacionTecnica>SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP</EspecificacionTecnica>
</DetallePartida>
</Partidas>
<Documentos>
<Archivo>
<Liga>http://web.compranet.gob.mx:8002/HSM/UNICOM/00006/004/2005/009/217249.doc</Liga>
<Docto>Convocatoria</Docto>
</Archivo>
</Documentos>
<FormaPago>http://web.compranet.gob.mx:8001/pls/cnet2k/C2Pagos.reg_pago?NumeroLc=000060040092005&TipCon=1</FormaPago>
<Proveedores>
<Proveedor>
<NombreProveedor>HEWLETT-PACKARD MEXICO, S. DE R.L. DE C.V.</NombreProveedor>
<FechaCompra>06/09/2005</FechaCompra>
</Proveedor>
</Proveedores>
</Licitacion>[/quote]
And I need to have the output like:
000060040092005
1
$881.00
9
SECRETARÍA DE HACIENDA Y CRÉDITO PÚBLICO
SI
NACIONAL
06/09/2005
SERVICIOS
21/09/2005
20/09/2005, 10:00
27/09/2005, 11:00
27/09/2005, 11:01

DIRECCIÝN DE CONTRATACIÝN DE SERVICIOS EN INTERNET EN HTTP://COMPRANET.GOB.MX
CALZADA DE LA VIRGEN No. 2799, EDIFICIO "C", SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÝN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL
LUNES A VIERNES
9:00 A 14:00
91-58-45-89
91-58-45-97

CENTRO DE CÝMPUTO INSTITUCIONAL TRIARA, SITA EN LIBRAMIENTO A SANTA ROSA NÝMERO 111, COL. FUTURO APODACA, CÝDIGO POSTAL 66600, APODACA, NUEVO LEÝN
DE LUNES A DOMINGO LOS 365 DÝAS DEL AÝO
LAS 24 HORAS DEL DÝA
DEL 17 DE OCTUBRE DE 2005 AL 16 DE OCTUBRE DE 2008 (36 MESES)
DIRECCIÓN DE CONTRATACIÓN DE SERVICIOS
CALZADA DE LA VIRGEN No. 2799, EDIFICIO C, SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÁN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL

1
C810800000
SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP
1
SERVICIO
SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP

http://web.compranet.gob.mx:8002/HSM/UNICO.../009/217249.doc
Convocatoria
[url=http://web.compranet.gob.mx:8001/pls/cnet2k/C2Pagos.reg_pago?NumeroLc=000060040092005&TipCon=1]http://web.compranet.gob.mx:8001/pls/cnet2...05&TipCon=1[/url]
HEWLETT-PACKARD MEXICO, S. DE R.L. DE C.V.
06/09/2005
Pretty much what I need is to strip all the XML tags and be able to add each line to the database matching field.

I have tried this regular expression: <(\w+)(\s(\w=".?")?)((/>)|((/?)>.?</\1>)) using the tester found here: http://www.regexlib.com/RETester.aspx?regexp_id=433 and it works* in the way that it does separate the XML fields, even though it leaves the opening and closing tag.

But when I tried the above regular expresion using preg_split I dont get any output, I am using this script for that:
<?php

if( isset($HTTP_POST_VARS['first']) && strlen($HTTP_POST_VARS['first'])>1 )
{
    $handle = fopen($HTTP_POST_VARS['first'] , "r");
    $contents = '';
    if($handle)
    {
     while (!feof($handle)) {
       $contents .= fread($handle, 8192);
     }
    fclose($handle);
    }
    $tt=preg_split('<(\w+)(\s(\w*=".*?")?)*((/>)|((/*?)>.*?</\1>))',$contents);
    $start = 0;
    $end = count($tt);
    while($start <= $end)
    {
            echo $tt[$start + 1];
        $start = $start + 1;
    }
    ask($HTTP_POST_VARS['first']);
}
else
{
    if( isset($HTTP_POST_VARS['first']) && strlen($HTTP_POST_VARS['first'])<=1 )
    {
        ask($HTTP_POST_VARS['first']);
    }
    else
    {
        ask("AdVig01JAN0131DEC05.xml");
    }
}

function ask($init)
{
    ?>
    <FORM ACTION="extract_dependencias.php" METHOD="POST">
    Archivo: <INPUT TYPE="text" NAME="first" VALUE="<?php echo $init; ?>">
    <BR>
    <INPUT TYPE="SUBMIT">
    </FORM>
    <?php
}
?> 
So my two problems:

a) how can I use a regular expression with preg_split so that I can strip all the XML tags from my example in order to add the data into a mysql database.

b) which regular expression would be best for this issue in order to only get the text between the tags and not the whole line including the tags.

Sorry for the huge post, but I couldn´t think of any other way to explain everything without so much information.

Hope someone can help me.

Thanks.

pd. By the way, the information (as well as the tags) are in spanish, so I have to deal with characters other than a-z, like accents and such.

Weedpacket · Sep 27, 2005

Why not use PHP's own [man]XML[/man] functionality, or one of the XML classes in PEAR? Either would seem more straightforward than writing Yet Another XML Parser yourself.

Anyway: preg_split() won't do what you want. It will remove all the parts of the string that match the pattern and leave you with an array of the pieces that remain. You want [man]preg_match_all[/man].

Rodoom · Sep 28, 2005

Oh, I see, ok, I am going to check PHP´s own XML functions, thanks a lot!

Rodoom · Sep 28, 2005

Ok, I found this code:

<?php
   // Output in plain text format
   //header('Content-Type: text/plain');
   // The name of the XML file to parse
   $file_name = 'AdVig01JAN0131DEC05';
   // Opening element handler function
   function startElement($parser, $name, $attrs) {
      // Output the element name
      echo ucfirst(strtolower($name)), ': ';
   }
   // Closing element handler function
   function endElement($parser, $name) {
      // Output a new line
      echo "<br>";
   }
   // Character data handler function
   function characterData($parser, $value) {
      // Output the character data (text)
      echo $value;
   }
   // Create a new instance of the XML parser
   $parser = xml_parser_create();
   // Set the handler functions
   xml_set_element_handler($parser, 'startElement', 'endElement');
   xml_set_character_data_handler($parser, 'characterData');
   // Open the XML file
   if ($file = @fopen($file_name, 'r')) {
      // Loop through the file and parse XML
      while ($data = fgets($file, 4096)) {
         // If there was an error with the parsing
         if (!xml_parse($parser, $data, feof($file))) {
            // Show an error message
            die('XML parsing error.');
         }
      }
      // Close the file
      fclose($file);
   // File couldn't be opened
   } else {
      // Show an error message
      die('Could not open file '.$file_name.'.');
   }
   // Free the parser instance
   xml_parser_free($parser);
?>

Which does exactly what I was looking for , now I just need to tweak it to actually store the values on my MySQL database. I am thinking on creating the SQL query at the startElement function, finishing it at the endElement function and performing the insert there... any better ideas?.

Thanks

Weedpacket · Sep 28, 2005

That's basically what I'd do. The only difference is that instead of building the query string itself on the fly I'd build an associative array using the field names for the keys. Then build the query in one spot at the end. But that's just me.

Rodoom · Sep 28, 2005

Thats a good idea... however the XML document I want to parse and insert into my MySQL database is almost 10 MB big, so I am afraid using an array to hold up that much data could be an issue. I will consider testing it according to how much time I´ve got left after doing the first option.

Thanks.

Weedpacket · Sep 28, 2005

10MB - ah..... possibly. The big hit would come once you've got the array and are building the query - at that point you'll be storing both (since xml_parse() can work on chunks at a time, you only ever need to store one chunk of the XML at any given time, so that's not a problem).

Just out of interest: all this is going into one record? I was just thinking that if it were feasible, you could have several insertions queries, doing each one as the relevant data comes to hand. Then you don't even need a 10MB query. (Frankly, what's worrying me at this point is MySQL's coping with such a large insert.)

Rodoom · Sep 29, 2005

Well, yes, I plan to store all the information into a single record, according to the corresponding field.

So pretty much the way I want to do this is adding an empty record into the database whenever I find the first element of the "data chunk" and use autoincremental ID´s, so that I can use that ID to add the rest of the information into the record using UPDATE.

The XML structure defines a "master" category (sorry for my lack of technical slang regarding XML, I am quite new to it) and for every one "master" category it has "n" number of two subcategories, so I was thinking on adding this ones into different tables, linking them back to the "master" category using the autoincremental ID. Perhaps now you are more confused because I suck at explaining this kind of stuff, however I think I have it all clear now, I just need to start doing it to see if I encounter any problems.

I dont care much about eating all the resources of my computer since this process will be performed at my local WAMP installation, so thats not really an issue.

I will update this post if I can succesfully add the data into the database and post the code I used, it might be useful for someone else.

Thanks again.

Rodoom · Sep 30, 2005

Weedpacket, I have one question and I am hoping that you can help me.

I am running a modified version of the above code to try to determine the maximum size of the text stored at each XML "line" to create the MySQL database, however I am finding some problems.

The code is basically the same, the part I modified was this:

   function startElement($parser, $name, $attrs) {
      global $tag_name;
      $tag_name = $name;
      // Output the element name 
      echo ucfirst(strtolower($name)), ': '; 
   } 
   // Closing element handler function 
   function endElement($parser, $name) { 
      // Output a new line 
      echo "<br>"; 
   } 

   // Character data handler function 
   function characterData($parser, $value) { 
     global $SizeOfFields;
      // Output the character data (text) 
	  if (strlen($value) > strlen($SizeOfFields[$tag_name]))
	  {
		$SizeOfFields[$tag_inicial] = strlen($value);
	  }
      echo $value; 
   }

Where $SizeOfFields is an empty array ($tag_name simply stores the current XML tag name so that I can identify to which value does $value come from at the characterData function) where I plan to store all the maximum sizes so that finally I can print it out with print_r($SizeOfFields);

However I am getting weird results:

[LICITACION] => 2
[NUMERO] => 13
[MODIFICADA] => 2
[PRECIOBASES] => 2
[ESTADO] => 2
[DEPENDENCIA] => 23
[ACEPTAELECTRONICA] => 2
[CARACTER] => 2
[PUBLICACION] => 2
[TIPO] => 13
[LIMITEBASES] => 2
[JUNTA] => 2
[APERTURATECNICA] => 2
[APERTURAECONOMICA] => 2
[VISITA] => 106
[AREACONSULTA] => 68
[DOMICILIO] => 116
[DIASVENTA] => 15
[HORARIOVENTA] => 33
[TELEFONO] => 14
[FAX] => 14
[FORMADEPAGO] => 2
[LUGARENTREGA] => 155
[DIASENTREGA] => 15
[HORARIOENTREGA] => 36
[PLAZOENTREGA] => 35
[NOMBRE] => 3
[DOMICILIOUNIDAD] => 113
[PRESUPUESTOMINIMO] => 4
[PRESUPUESTOMAXIMO] => 2
[DETALLEPARTIDA] => 4
[NUMEROPARTIDA] => 4
[CABMS] => 4
[DESCRIPCION] => 4
[CANTIDAD] => 4
[UNIDAD] => 4
[ESPECIFICACIONTECNICA] => 4
[NOMBREPROVEEDOR] => 4
[FECHACOMPRA] => 2
[LIGA] => 4
[DOCTO] => 2
[LICITACIONES] => 1
[UNIDADCOMPRADORA] => 3
[PARTIDAS] => 3
[DOCUMENTOS] => 3
[ARCHIVO] => 4
[FORMAPAGO] => 46
[PROVEEDORES] => 2
[PROVEEDOR] => 4
[REQUISITOS] => 973
[EXPERIENCIATECNICA] => 105
[CRITERIOSADJUDICACION] => 465
[CAPITALCONTABLE] => 11
[UBICACION] => 4
[FECHAINICIO] => 4
[PLAZOEJECUCION] => 4
[CLAVE] => 4
[ESPECIFICACION] => 2

Some of those numbers make sense, but many of them are way too small... as you can see from my first post, <EspecificacionTecnica> is much bigger than the 4 characters that this array is storing... I am not sure if this is a logical or another kind of problem.

If I cant get this to work I guess I will just make 255 big all of the fields (expect for those clearly bigger than that, where I would use text) and then make some SQL query to find out the real sizes and correct the fields.

I guess this is more a really proud question since I cant believe my reasoning is that wrong so that I cant get the correct numbers to be stored :p...

anyway, thanks for your help.

Help with preg_split and XML

RRodoom

Weedpacket

RRodoom

RRodoom

Weedpacket

RRodoom

Weedpacket

RRodoom

RRodoom