Finding match using regex from Docx file

digi23

Hello,

I am trying to read questions and answers from a docx file and insert them into my MySQL tables.
Right now I can read the contents of the docx file using the function below.

Read Docx file

function read_file_docx($filename){

$striped_content = '';
$content = '';

if(!$filename || !file_exists($filename)) return false;

$zip = zip_open($filename);

if (!$zip || is_numeric($zip)) return false;

while ($zip_entry = zip_read($zip)) {

    if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

    if (zip_entry_name($zip_entry) != "word/document.xml") continue;

    $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

    zip_entry_close($zip_entry);
}// end while

zip_close($zip);

//echo $content;
//echo "<hr>";
//file_put_contents('1.xml', $content);

$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);

return $striped_content;
}
$filename = "filepath";// or /var/www/html/file.docx

$content = read_file_docx($filename);
if($content !== false) {

echo nl2br($content);
}
else {
    echo 'Couldn\'t read the file. Please check that file.';
}

My Docx file

D.1-5)   Study the picture. Complete each sentence by choosing the best option from the box below.	


Q.1)  The Q is ___ W.	
a)  (a)
b)  (b)
c)  (c)
d) (d)
e) (e)
 f)  (f)

Q.2)  The W is _________ Q and Z.
a)  (a)
b)  (b)
c)  (c)
d) (d)
e) (e)
 f)  (f)

I have split the table into two: one for storing questions and another one for answers.

I want to insert all the questions after Q.*) and all the options after a), b), c) into the corresponding tables with the same question ID.

Thank you

digi23

I have somehow finally managed to get regex working

$pattern = "/\)(.*)/";
if (preg_match_all($pattern, $content, $matches_out)) {
	// Only loop when there are matches to print.
	foreach ($matches_out[1] as $row) {
		print $row . "<br>";
	}
}

But this doesn't work for other questions, such as this one:

Q.46)  A. No one do the job as well as you.
          B. No one does the job as well as you.
a) A
b) B

The regex doesn't find "B. No one does the job as well as you."

dalecosp

Well, regex matching on text is pretty hard ... that's why Rasmus invented PHP, so he wouldn't have to be as concerned with regex as Larry Wall was :p

Your regex is pretty simple: "find anything after a closing parenthesis to the end of the line". So, in theory, adding the "s" pattern modifier (which lets the dot match newlines) to your regex would get the next line.

However, it would probably get the next line, too. And the next ... and the next ... until EOF.
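
A quick way to see what the modifiers do with this pattern is to run it three ways. This is only a minimal sketch: $text is a made-up sample shaped like the Q.46 example, and the variable names are mine. Note that "m" only changes what ^ and $ mean, while "s" is the modifier that lets the dot cross newlines.

```php
<?php
// Sample text shaped like the Q.46 example from this thread (assumed, not the real file).
$text = "Q.46)  A. No one do the job as well as you.\n"
      . "          B. No one does the job as well as you.\n"
      . "a) A\n"
      . "b) B\n";

// Default: "." stops at a newline, so each match ends at the end of its own line.
preg_match_all('/\)(.*)/', $text, $plain);

// "m" only affects ^ and $; "." still stops at a newline, so nothing changes here.
preg_match_all('/\)(.*)/m', $text, $multiline);

// "s" lets "." match newlines, so the first match swallows the rest of the text.
preg_match_all('/\)(.*)/s', $text, $dotall);

echo count($plain[1]), "\n";      // 3
echo count($multiline[1]), "\n";  // 3
echo count($dotall[1]), "\n";     // 1
```

The line starting with "B." has no closing parenthesis at all, which is why the original pattern never sees it unless the dot is allowed to run past the newline.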

Too bad you can't use DOM ....

digi23

So what do you think would be the best way to parse a docx document and insert its contents into MySQL tables?

NogDog

dalecosp;11054825 wrote:
...
Too bad you can't use DOM ....

Not sure if this was supposed to be a hint or not, but since DOCX files are XML, I would think the SimpleXML or DOM extension could deal with it?
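
Something along these lines might work as a starting point. It is a rough sketch, not tested against the OP's file: it assumes the zip and dom extensions are available, the function names docx_paragraphs() and docx_paragraphs_from_xml() are made up for illustration, and it only collects w:t text, so tabs and manual line breaks inside a paragraph are ignored.

```php
<?php
// Sketch: return the text of each paragraph (w:p) in a .docx as an array of strings.
function docx_paragraphs(string $filename): array
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        throw new RuntimeException("Unable to open $filename as a ZIP archive");
    }
    // The main document body lives in this entry of the archive.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }
    return docx_paragraphs_from_xml($xml);
}

// Split out so the XML handling can be tried on a string without a real file.
function docx_paragraphs_from_xml(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('document.xml is not well-formed XML');
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $paragraphs = [];
    foreach ($xpath->query('//w:body//w:p') as $p) {
        $text = '';
        // A paragraph may be split over several runs; join their text nodes.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}
```

Calling docx_paragraphs('/path/to/file.docx') would then give one array element per paragraph, which avoids the str_replace() and strip_tags() guesswork about where lines end.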

dalecosp

NogDog;11054847 wrote:
Not sure if this was supposed to be a hint or not, but since DOCX files are XML, I would think the SimpleXML or DOM extension could deal with it?

I actually hadn't considered that, exactly, but you might be right. My realm of recent experience on this is mostly converting HTML to DOCX, which is dead easy once the right content type is passed to header().

But as the X in DOCX does stand for XML, I'd think the OP would be well-served to give it a try. :-)

sneakyimp

I'd be interested in seeing the contents of the DOCX file when opened by a simple text editor like notepad or gedit or something. If experience is any guide at all, M$ files are a total nasty mess.

Weedpacket

The file is a ZIP archive. Because the XML isn't supposed to be directly read by a human there is no pretty indentation or line breaks.
The standard is published by ECMA and it is more Microsoft-centric than you'd expect from an "Open" standard. It also seems a bit less flexible and future-proofed than the Open Document Format (http://docs.oasis-open.org/office/v1.2/) (e.g., it uses several specific tags for things that could have been handled by one tag with attributes).

That said, I believe that since ISO has been given care of Office Open XML they've sharpened it up a bit and have been leaning on Microsoft to catch up. I haven't looked at how well that's gone since.

sneakyimp

I pasted that data into a document using LibreOffice Writer and saved it to a DOCX file. I think it goes without saying that the LibreOffice docx file is going to be different. I tweaked the code a bit:

<?php
function read_file_docx($filename){

$striped_content = '';
$content = '';

if(!$filename || !file_exists($filename)) throw new Exception("file does not exist");

$zip = zip_open($filename);

if (!$zip || is_numeric($zip)) throw new Exception("zip open failed");

while ($zip_entry = zip_read($zip)) {
	echo zip_entry_name($zip_entry) . "\n";

	if (zip_entry_open($zip, $zip_entry) == FALSE) throw new Exception("unable to open zip file");

	if (zip_entry_name($zip_entry) != "word/document.xml") continue; // skip the ones we don't want

	$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

	zip_entry_close($zip_entry);
}// end while

zip_close($zip);

//echo $content;
//echo "<hr>";
//file_put_contents('1.xml', $content);

// 	$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
// 	$content = str_replace('</w:r></w:p>', "\r\n", $content);
// 	$content = strip_tags($content);

return $content;
}
$filename = "/path/to/file/data.docx";// or /var/www/html/file.docx

$content = read_file_docx($filename);
var_dump($content);

The raw output is pretty gnarly:

string(3405) "<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"><w:body><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve">D.1-5)   Study the picture. Complete each sentence by choosing the best option from the box below.      </w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve"> </w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve">        </w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve">Q.1)  The Q is ___ W.   
</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>a)  (a)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>b)  (b)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>c)  (c)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>d) (d)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>e) (e)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve"> </w:t></w:r><w:r><w:rPr></w:rPr><w:t>f)  (f)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve">    </w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>Q.2)  The W is _________ Q and Z.</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>a)  (a)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>b)  (b)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>c)  (c)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>d) (d)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t>e) (e)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="PreformattedText"/><w:spacing w:before="0" w:after="283"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><w:t xml:space="preserve"> </w:t></w:r><w:r><w:rPr></w:rPr><w:t>f)  (f)</w:t></w:r></w:p><w:p><w:pPr><w:pStyle 
w:val="Normal"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr></w:r></w:p><w:sectPr><w:type w:val="nextPage"/><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:left="1134" w:right="1134" w:header="0" w:top="1134" w:footer="0" w:bottom="1134" w:gutter="0"/><w:pgNumType w:fmt="decimal"/><w:formProt w:val="false"/><w:textDirection w:val="lrTb"/></w:sectPr></w:body></w:document>"

But remove the tags as he does (uncomment the str_replace and strip_tags lines in the function) and you get something pretty clean:

string(295) "
D.1-5)   Study the picture. Complete each sentence by choosing the best option from the box below.      


Q.1)  The Q is ___ W.   

a)  (a)
b)  (b)
c)  (c)
d) (d)
e) (e)
 f)  (f)

Q.2)  The W is _________ Q and Z.
a)  (a)
b)  (b)
c)  (c)
d) (d)
e) (e)
 f)  (f)

"

Hard to say what the best way is to go about this. If the original docx file has formatting in it, that formatting might be key to deciding how to parse the document. Hard to say without more info about the docx.
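
If the stripped text really is as regular as the sample above, a line-by-line pass might be enough. Here is a rough sketch under those assumptions: questions start with "Q.<number>)", options start with a lowercase letter and ")", and any other line that follows a question before its first option is a continuation of the question text (the Q.46 case). The function name parse_questions() and the array layout are made up for illustration.

```php
<?php
// Sketch: group stripped text into questions, each with its options.
function parse_questions(string $text): array
{
    $questions = [];
    $current = null; // index of the question currently being filled in

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^Q\.(\d+)\)\s*(.*)$/', $line, $m)) {
            // A new question such as "Q.1)  The Q is ___ W."
            $questions[] = ['number' => (int) $m[1], 'text' => $m[2], 'options' => []];
            $current = count($questions) - 1;
        } elseif ($current !== null && preg_match('/^([a-z])\)\s*(.*)$/', $line, $m)) {
            // An option such as "a)  (a)", keyed by its letter.
            $questions[$current]['options'][$m[1]] = $m[2];
        } elseif ($current !== null && $questions[$current]['options'] === []) {
            // Extra line before the first option: treat it as more question text.
            $questions[$current]['text'] .= "\n" . $line;
        }
        // Anything else (for example the "D.1-5)" directions line) is skipped.
    }
    return $questions;
}
```

Each element then carries a question number, its text, and an options array, so the question row could be inserted first and its ID reused for the option rows. Real documents will probably break these assumptions somewhere, so it would be worth dumping the parsed array and checking it before inserting anything.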