Here is my situation. I have a couple hundred Word documents that vary from 5 to 100 pages each. Right now I manually save each document as HTML using OpenOffice, then run the HTML through the HTMLPurifier class to clean it up. In the past we have had someone manually add a special code every couple of pages in the documents, between paragraphs, wherever we wanted to split the document into a new "page". For instance, my PHP code might read a 100-page Word file converted into HTML and split it into 50 different HTML "snippets/pages" based on where the special code was found in the document.
This is becoming very difficult to manage, as these are living legal documents that change quite often. The master file has to be changed before it gets split into chapter files, which then get the "special code" added to them, and we must maintain the documents both with the codes and without the codes (for printing). So basically I need to break these files apart automatically, without someone placing the special code for page breaks. We realize the page breaks won't always fall at the best place to split logically, but we cannot keep up with the manual process. The only requirement is that we do not break paragraphs.
So I need code that can split out roughly 2 document pages of text at a time without breaking paragraphs. If a chunk ends up being anywhere from 1 to 3 pages in order to avoid breaking a paragraph, that's fine; it just needs to come as close as possible to 2 pages without breaking a paragraph. HTMLPurifier does a pretty good job of making sure there are valid opening and closing tags everywhere.
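To make the idea concrete, here is a minimal sketch of the kind of splitting I have in mind. It is written in Python for brevity (the real thing would be PHP), and it rests on a few assumptions that are mine, not facts about the documents: a "page" is approximated as 500 words (`WORDS_PER_PAGE` is a guess, since the HTML has no notion of pages), paragraphs are top-level `<p>` elements in the purified HTML, and the function names are just placeholders. It cuts only right after a closing `</p>` and closes a chunk whenever adding the next paragraph would move it further from the 2-page target.

```python
import re

WORDS_PER_PAGE = 500  # assumption: rough number of words on one printed page
TARGET_PAGES = 2      # aim for roughly two pages per chunk


def word_count(fragment):
    """Count words in an HTML fragment, ignoring the tags themselves."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return len(text.split())


def split_blocks(html):
    """Cut the HTML right after every closing </p>, so no paragraph is broken.

    Anything between two paragraphs (a heading, for example) stays attached
    to the paragraph that follows it.
    """
    pieces = re.split(r"(?<=</p>)", html, flags=re.IGNORECASE)
    return [piece for piece in pieces if piece.strip()]


def chunk_blocks(blocks, target_words=WORDS_PER_PAGE * TARGET_PAGES):
    """Group blocks greedily so each chunk lands near target_words."""
    chunks, current, count = [], [], 0
    for block in blocks:
        words = word_count(block)
        # Close the chunk if adding this block would move it further
        # from the target than stopping here would.
        if current and abs(count + words - target_words) > abs(count - target_words):
            chunks.append("".join(current))
            current, count = [], 0
        current.append(block)
        count += words
    if current:
        chunks.append("".join(current))
    return chunks
```

Usage would be something like `chunks = chunk_blocks(split_blocks(purified_html))`, where `purified_html` is the HTMLPurifier output. A paragraph longer than the target simply becomes its own oversized chunk, and paragraphs nested inside tables or lists would need smarter handling than this regex-based cut.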
Does anyone have any suggestions, or links to an article that covers something like this? It needs to be fully automated, so no manual marks placed in the file to split anything. Thank you for your time.