Hello:

I need help to parse the content of an XML file so that I can store all the data fields into a database.

The structure of the XML is something like this:

<Licitacion>
<Numero>000060040092005</Numero>
<Modificada>1</Modificada>
<PrecioBases>$881.00</PrecioBases>
<Estado>9</Estado>
<Dependencia>SECRETARÍA DE HACIENDA Y CRÉDITO PÚBLICO</Dependencia>
<AceptaElectronica>SI</AceptaElectronica>
<Caracter>NACIONAL</Caracter>
<Publicacion>06/09/2005</Publicacion>
<Tipo>SERVICIOS</Tipo>
<LimiteBases>21/09/2005</LimiteBases>
<Junta>20/09/2005, 10:00</Junta>
<AperturaTecnica>27/09/2005, 11:00</AperturaTecnica>
<AperturaEconomica>27/09/2005, 11:01</AperturaEconomica>
<Visita></Visita>
<AreaConsulta>DIRECCIÝN DE CONTRATACIÝN DE SERVICIOS EN INTERNET EN HTTP://COMPRANET.GOB.MX</AreaConsulta>
<Domicilio>CALZADA DE LA VIRGEN No. 2799, EDIFICIO "C", SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÝN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL</Domicilio>
<DiasVenta>LUNES A VIERNES</DiasVenta>
<HorarioVenta>9:00 A 14:00</HorarioVenta>
<Telefono>91-58-45-89</Telefono>
<Fax>91-58-45-97</Fax>
<FormadePago></FormadePago>
<LugarEntrega>CENTRO DE CÝMPUTO INSTITUCIONAL TRIARA, SITA EN LIBRAMIENTO A SANTA ROSA NÝMERO 111, COL. FUTURO APODACA, CÝDIGO POSTAL 66600, APODACA, NUEVO LEÝN</LugarEntrega>
<DiasEntrega>DE LUNES A DOMINGO LOS 365 DÝAS DEL AÝO</DiasEntrega>
<HorarioEntrega>LAS 24 HORAS DEL DÝA</HorarioEntrega>
<PlazoEntrega>DEL 17 DE OCTUBRE DE 2005 AL 16 DE OCTUBRE DE 2008 (36 MESES)</PlazoEntrega>
<UnidadCompradora>
<Nombre>DIRECCIÓN DE CONTRATACIÓN DE SERVICIOS</Nombre>
<DomicilioUnidad>CALZADA DE LA VIRGEN No. 2799, EDIFICIO C, SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÁN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL</DomicilioUnidad>
</UnidadCompradora>
<Partidas>
<DetallePartida>
<NumeroPartida>1</NumeroPartida>
<CABMS>C810800000</CABMS>
<Descripcion>SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP </Descripcion>
<Cantidad>1</Cantidad>
<Unidad>SERVICIO</Unidad>
<EspecificacionTecnica>SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP</EspecificacionTecnica>
</DetallePartida>
</Partidas>
<Documentos>
<Archivo>
<Liga>http://web.compranet.gob.mx:8002/HSM/UNICOM/00006/004/2005/009/217249.doc</Liga>
<Docto>Convocatoria</Docto>
</Archivo>
</Documentos>
<FormaPago>http://web.compranet.gob.mx:8001/pls/cnet2k/C2Pagos.reg_pago?NumeroLc=000060040092005&amp;TipCon=1</FormaPago>
<Proveedores>
<Proveedor>
<NombreProveedor>HEWLETT-PACKARD MEXICO, S. DE R.L. DE C.V.</NombreProveedor>
<FechaCompra>06/09/2005</FechaCompra>
</Proveedor>
</Proveedores>

</Licitacion>[/quote]

And I need to have the output like:

000060040092005
1
$881.00
9
SECRETARÍA DE HACIENDA Y CRÉDITO PÚBLICO
SI
NACIONAL
06/09/2005
SERVICIOS
21/09/2005
20/09/2005, 10:00
27/09/2005, 11:00
27/09/2005, 11:01

DIRECCIÝN DE CONTRATACIÝN DE SERVICIOS EN INTERNET EN HTTP://COMPRANET.GOB.MX
CALZADA DE LA VIRGEN No. 2799, EDIFICIO "C", SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÝN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL
LUNES A VIERNES
9:00 A 14:00
91-58-45-89
91-58-45-97

CENTRO DE CÝMPUTO INSTITUCIONAL TRIARA, SITA EN LIBRAMIENTO A SANTA ROSA NÝMERO 111, COL. FUTURO APODACA, CÝDIGO POSTAL 66600, APODACA, NUEVO LEÝN
DE LUNES A DOMINGO LOS 365 DÝAS DEL AÝO
LAS 24 HORAS DEL DÝA
DEL 17 DE OCTUBRE DE 2005 AL 16 DE OCTUBRE DE 2008 (36 MESES)

DIRECCIÓN DE CONTRATACIÓN DE SERVICIOS
CALZADA DE LA VIRGEN No. 2799, EDIFICIO C, SEGUNDO PISO COLONIA UNIDAD CTM CULHUACÁN, C.P. 04480 , DELEGACION COYOACÁN , DISTRITO FEDERAL

1

C810800000
SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP
1
SERVICIO
SERVICIOS DE CÝMPUTO INSTITUCIONAL DE VALOR AGREGADO, CON ALTOS NIVELES DE DISPONIBILIDAD, PARA CENTRALIZAR LA PLATAFORMA DE COMPUTO HP

http://web.compranet.gob.mx:8002/HSM/UNICO.../009/217249.doc
Convocatoria

[url=http://web.compranet.gob.mx:8001/pls/cnet2k/C2Pagos.reg_pago?NumeroLc=000060040092005&TipCon=1]http://web.compranet.gob.mx:8001/pls/cnet2...05&TipCon=1[/url]

HEWLETT-PACKARD MEXICO, S. DE R.L. DE C.V.
06/09/2005

Pretty much what I need is to strip all the XML tags and be able to add each line to the database matching field.

I have tried this regular expression: <(\w+)(\s(\w=".?")?)((/>)|((/?)>.?</\1>)) using the tester found here: http://www.regexlib.com/RETester.aspx?regexp_id=433 and it works* in the way that it does separate the XML fields, even though it leaves the opening and closing tag.

But when I tried the above regular expresion using preg_split I dont get any output, I am using this script for that:

<?php

if( isset($HTTP_POST_VARS['first']) && strlen($HTTP_POST_VARS['first'])>1 )
{
    $handle = fopen($HTTP_POST_VARS['first'] , "r");
    $contents = '';
    if($handle)
    {
     while (!feof($handle)) {
       $contents .= fread($handle, 8192);
     }
    fclose($handle);
    }
    $tt=preg_split('<(\w+)(\s(\w*=".*?")?)*((/>)|((/*?)>.*?</\1>))',$contents);
    $start = 0;
    $end = count($tt);
    while($start <= $end)
    {
            echo $tt[$start + 1];
        $start = $start + 1;
    }
    ask($HTTP_POST_VARS['first']);
}
else
{
    if( isset($HTTP_POST_VARS['first']) && strlen($HTTP_POST_VARS['first'])<=1 )
    {
        ask($HTTP_POST_VARS['first']);
    }
    else
    {
        ask("AdVig01JAN0131DEC05.xml");
    }
}

function ask($init)
{
    ?>
    <FORM ACTION="extract_dependencias.php" METHOD="POST">
    Archivo: <INPUT TYPE="text" NAME="first" VALUE="<?php echo $init; ?>">
    <BR>
    <INPUT TYPE="SUBMIT">
    </FORM>
    <?php
}
?> 

So my two problems:

a) how can I use a regular expression with preg_split so that I can strip all the XML tags from my example in order to add the data into a mysql database.

b) which regular expression would be best for this issue in order to only get the text between the tags and not the whole line including the tags.

Sorry for the huge post, but I couldn´t think of any other way to explain everything without so much information.

Hope someone can help me.

Thanks.

pd. By the way, the information (as well as the tags) are in spanish, so I have to deal with characters other than a-z, like accents and such.

    Why not use PHP's own [man]XML[/man] functionality, or one of the XML classes in PEAR? Either would seem more straightforward than writing Yet Another XML Parser yourself.

    Anyway: preg_split() won't do what you want. It will remove all the parts of the string that match the pattern and leave you with an array of the pieces that remain. You want [man]preg_match_all[/man].

      Oh, I see, ok, I am going to check PHP´s own XML functions, thanks a lot! 😃

        Ok, I found this code:

        <?php
           // Output in plain text format
           //header('Content-Type: text/plain');
           // The name of the XML file to parse
           $file_name = 'AdVig01JAN0131DEC05';
           // Opening element handler function
           function startElement($parser, $name, $attrs) {
              // Output the element name
              echo ucfirst(strtolower($name)), ': ';
           }
           // Closing element handler function
           function endElement($parser, $name) {
              // Output a new line
              echo "<br>";
           }
           // Character data handler function
           function characterData($parser, $value) {
              // Output the character data (text)
              echo $value;
           }
           // Create a new instance of the XML parser
           $parser = xml_parser_create();
           // Set the handler functions
           xml_set_element_handler($parser, 'startElement', 'endElement');
           xml_set_character_data_handler($parser, 'characterData');
           // Open the XML file
           if ($file = @fopen($file_name, 'r')) {
              // Loop through the file and parse XML
              while ($data = fgets($file, 4096)) {
                 // If there was an error with the parsing
                 if (!xml_parse($parser, $data, feof($file))) {
                    // Show an error message
                    die('XML parsing error.');
                 }
              }
              // Close the file
              fclose($file);
           // File couldn't be opened
           } else {
              // Show an error message
              die('Could not open file '.$file_name.'.');
           }
           // Free the parser instance
           xml_parser_free($parser);
        ?>

        Which does exactly what I was looking for 😃, now I just need to tweak it to actually store the values on my MySQL database. I am thinking on creating the SQL query at the startElement function, finishing it at the endElement function and performing the insert there... any better ideas?.

        Thanks

          That's basically what I'd do. The only difference is that instead of building the query string itself on the fly I'd build an associative array using the field names for the keys. Then build the query in one spot at the end. But that's just me.

            Thats a good idea... however the XML document I want to parse and insert into my MySQL database is almost 10 MB big, so I am afraid using an array to hold up that much data could be an issue. I will consider testing it according to how much time I´ve got left after doing the first option.

            Thanks.

              10MB - ah..... possibly. The big hit would come once you've got the array and are building the query - at that point you'll be storing both (since xml_parse() can work on chunks at a time, you only ever need to store one chunk of the XML at any given time, so that's not a problem).

              Just out of interest: all this is going into one record? I was just thinking that if it were feasible, you could have several insertions queries, doing each one as the relevant data comes to hand. Then you don't even need a 10MB query. (Frankly, what's worrying me at this point is MySQL's coping with such a large insert😉.)

                Well, yes, I plan to store all the information into a single record, according to the corresponding field.

                So pretty much the way I want to do this is adding an empty record into the database whenever I find the first element of the "data chunk" and use autoincremental ID´s, so that I can use that ID to add the rest of the information into the record using UPDATE.

                The XML structure defines a "master" category (sorry for my lack of technical slang regarding XML, I am quite new to it) and for every one "master" category it has "n" number of two subcategories, so I was thinking on adding this ones into different tables, linking them back to the "master" category using the autoincremental ID. Perhaps now you are more confused because I suck at explaining this kind of stuff, however I think I have it all clear now, I just need to start doing it to see if I encounter any problems.

                I dont care much about eating all the resources of my computer since this process will be performed at my local WAMP installation, so thats not really an issue.

                I will update this post if I can succesfully add the data into the database and post the code I used, it might be useful for someone else.

                Thanks again.

                  Weedpacket, I have one question and I am hoping that you can help me.

                  I am running a modified version of the above code to try to determine the maximum size of the text stored at each XML "line" to create the MySQL database, however I am finding some problems.

                  The code is basically the same, the part I modified was this:

                     function startElement($parser, $name, $attrs) {
                        global $tag_name;
                        $tag_name = $name;
                        // Output the element name 
                        echo ucfirst(strtolower($name)), ': '; 
                     } 
                     // Closing element handler function 
                     function endElement($parser, $name) { 
                        // Output a new line 
                        echo "<br>"; 
                     } 
                  
                     // Character data handler function 
                     function characterData($parser, $value) { 
                       global $SizeOfFields;
                        // Output the character data (text) 
                  	  if (strlen($value) > strlen($SizeOfFields[$tag_name]))
                  	  {
                  		$SizeOfFields[$tag_inicial] = strlen($value);
                  	  }
                        echo $value; 
                     } 
                  
                  

                  Where $SizeOfFields is an empty array ($tag_name simply stores the current XML tag name so that I can identify to which value does $value come from at the characterData function) where I plan to store all the maximum sizes so that finally I can print it out with print_r($SizeOfFields);

                  However I am getting weird results:

                  [LICITACION] => 2
                  [NUMERO] => 13
                  [MODIFICADA] => 2
                  [PRECIOBASES] => 2
                  [ESTADO] => 2
                  [DEPENDENCIA] => 23
                  [ACEPTAELECTRONICA] => 2
                  [CARACTER] => 2
                  [PUBLICACION] => 2
                  [TIPO] => 13
                  [LIMITEBASES] => 2
                  [JUNTA] => 2
                  [APERTURATECNICA] => 2
                  [APERTURAECONOMICA] => 2
                  [VISITA] => 106
                  [AREACONSULTA] => 68
                  [DOMICILIO] => 116
                  [DIASVENTA] => 15
                  [HORARIOVENTA] => 33
                  [TELEFONO] => 14
                  [FAX] => 14
                  [FORMADEPAGO] => 2
                  [LUGARENTREGA] => 155
                  [DIASENTREGA] => 15
                  [HORARIOENTREGA] => 36
                  [PLAZOENTREGA] => 35
                  [NOMBRE] => 3
                  [DOMICILIOUNIDAD] => 113
                  [PRESUPUESTOMINIMO] => 4
                  [PRESUPUESTOMAXIMO] => 2
                  [DETALLEPARTIDA] => 4
                  [NUMEROPARTIDA] => 4
                  [CABMS] => 4
                  [DESCRIPCION] => 4
                  [CANTIDAD] => 4
                  [UNIDAD] => 4
                  [ESPECIFICACIONTECNICA] => 4
                  [NOMBREPROVEEDOR] => 4
                  [FECHACOMPRA] => 2
                  [LIGA] => 4
                  [DOCTO] => 2
                  [LICITACIONES] => 1
                  [UNIDADCOMPRADORA] => 3
                  [PARTIDAS] => 3
                  [DOCUMENTOS] => 3
                  [ARCHIVO] => 4
                  [FORMAPAGO] => 46
                  [PROVEEDORES] => 2
                  [PROVEEDOR] => 4
                  [REQUISITOS] => 973
                  [EXPERIENCIATECNICA] => 105
                  [CRITERIOSADJUDICACION] => 465
                  [CAPITALCONTABLE] => 11
                  [UBICACION] => 4
                  [FECHAINICIO] => 4
                  [PLAZOEJECUCION] => 4
                  [CLAVE] => 4
                  [ESPECIFICACION] => 2

                  Some of those numbers make sense, but many of them are way too small... as you can see from my first post, <EspecificacionTecnica> is much bigger than the 4 characters that this array is storing... I am not sure if this is a logical or another kind of problem.

                  If I cant get this to work I guess I will just make 255 big all of the fields (expect for those clearly bigger than that, where I would use text) and then make some SQL query to find out the real sizes and correct the fields.

                  I guess this is more a really proud question since I cant believe my reasoning is that wrong so that I cant get the correct numbers to be stored :p...

                  anyway, thanks for your help.

                    Write a Reply...