PHP: split HTML document into multiple files automatically

bike5

Here is my situation. I have a couple hundred Word documents that vary from 5 to 100 pages each. Right now I manually save each document as HTML using OpenOffice, then run the HTML through the HTMLPurifier class to clean it up. In the past we have had someone manually add a special code every couple of pages in the documents, between paragraphs, wherever we wanted to split the document into a new "page". For instance, my PHP code might read a 100-page Word file converted into HTML and split it into 50 different HTML "snippets/pages" based on where the special code was found in the document.

This is becoming very difficult to manage. These are living legal documents that change quite often: the master file has to be changed before it gets split into chapter files, which then get the "special code" added to them, and we must maintain documents both with and without the codes for printing. So basically I need to break these files apart automatically, without someone placing the special code for page breaks. We realize the page breaks won't always be the best place to split logically, but we cannot keep up with the manual process. The only requirement is that we do not break paragraphs.

So I need code that can split out roughly 2 document pages of text at a time without breaking paragraphs. If a chunk ends up being anywhere from 1 to 3 pages in order to avoid breaking a paragraph, that's fine; it just needs to come as close as possible to 2 pages. HTMLPurifier does a pretty good job of making sure there are valid starting/ending tags everywhere.

Does anyone have any suggestions, or links to an article that might go over something like this? It needs to be fully automated, so no manual marks placed in the file to split anything. If you have any suggestions or links to anything helpful, please let me know. Thank you for your time.

Weedpacket

These documents don't have headings or other standard sectioning markup that you can use? Not even any <p> tags? Or does the original generation process (i.e., writing the Word document) miss the whole idea of markup?

PHP does have an HTML parser built in; after parsing it wouldn't be difficult to find where one paragraph ends and the next begins if each paragraph is marked up as a paragraph.
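
A minimal sketch of that idea, assuming the purified HTML is a well-formed fragment whose paragraphs are top-level block elements such as <p>. The function name splitHtmlIntoChunks, the wrapper id split-root, and the figure of roughly 500 words per page (so 1000 words for two pages) are assumptions made for illustration, not anything taken from your documents. It parses the fragment with DOMDocument and starts a new chunk only between top-level elements, so no paragraph is ever cut in half.

```php
<?php
// Hypothetical helper: split an HTML fragment into chunks of roughly
// $targetWords words, cutting only between top-level elements.
function splitHtmlIntoChunks(string $html, int $targetWords = 1000): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    // Wrap the fragment in a known container so its top-level nodes are easy
    // to find; the XML encoding hint asks libxml to read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="split-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="split-root"]')->item(0);

    $chunks = [];
    $current = '';
    $count = 0;
    foreach ($root->childNodes as $node) {
        $words = str_word_count($node->textContent);
        // Close the current chunk if this node would push it past the target.
        // A single oversized paragraph still becomes a chunk of its own.
        if ($count > 0 && $count + $words > $targetWords) {
            $chunks[] = $current;
            $current = '';
            $count = 0;
        }
        $current .= $doc->saveHTML($node);
        $count += $words;
    }
    if (trim($current) !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

// Tiny demonstration with a 5-word target instead of 1000.
$chunks = splitHtmlIntoChunks(
    '<p>one two three</p><p>four five six</p><p>seven eight</p>',
    5
);
```

Each element of $chunks could then be written to its own file with file_put_contents(). Word count is only a rough stand-in for printed pages, so the target would need tuning against the real documents, and headings or tables may deserve special handling so that a heading is not left stranded at the end of a chunk.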

bike5

Here is an example of the data I am working with after running it through the filters (attached: example-output.txt).