Are HTML Tidy, preg_replace, and preg_match safe for anonymous input?

SteelCurtain67

This is a bit off topic (and long) for a PHP forum, but I have searched for an active HTML Tidy forum and can't find one. I apologize for any inconvenience, but I really need to know; I am calling tidy from a PHP script, and I am sure the experts here are knowledgeable about these security issues. I am setting up a couple of web forms to process HTML output from word processors like OpenOffice and MS Word. I would like to pass it through HTML Tidy to clean it up and give me more standardized code. Essentially, I save the code as a file, run tidy (FreeBSD 5.5-RELEASE, Apache/1.3.34 (Unix), PHP 4.4.2, tidy 200804), then recover the output and process it with preg_replace etc. I am planning on using the following code for the tidy step, but I could use the PHP module instead.

function MyTidy($D, $ID) {
    if (DEBUG > 79) DEBUG_L(LINE, "MyTidy() ", $this);

    // Build a unique temp-file prefix from the caller's ID.
    $TmpTidy = uniqid();
    $TmpTidy = "${ID}_" . base_convert($TmpTidy, 16, 36);
    // Strip slashes from the posted data before writing it out.
    $D = stripslashes($D);
    $IN = "/usr/local/www/libsci/tmp/${TmpTidy}_IN.html";
    $OUT = "/usr/local/www/libsci/tmp/${TmpTidy}_OUT.html";
    file_put_contents($IN, $D);
    // Run the command-line tidy on the saved file and capture its output.
    `/usr/local/bin/tidy -config /usr/local/etc/tidy.cfg -f /tmp/tidy.errors $IN > $OUT`;
    $D = file_get_contents($OUT);
    $D = addslashes($D);
    return $D;
}

where $D = $_POST['HTML_CODE'].

So my question is: can I do this on code/text that anyone can paste into a form box without worrying about a code exploit? The code that operates on the tidy output is mostly preg_replace and preg_match calls to extract the <a href and <img tags, plus some string concatenation to build up the epub .ncx and .opf XML files. There are a number of other small text input boxes for things like file names, but I can just limit those to allowed characters. The bulk HTML needs such a wide range of characters (for example, a lot of authors use backticks for quotes) that character limitation does not seem practical. The final result is converted with PHP's htmlentities before it is output, so it should be safe at that point.

I would appreciate any thoughts on this. I am doing this for myself to help in converting text documents to epub format, but there are a lot of people who ask for help doing this, and I would like to make it public, though not at the cost of a server compromise. Code safety is such a complex area that I don't want to bet my server on my guesses as to what is safe.
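One point the posted function leaves open is that $ID is interpolated into a shell command. The following is only a sketch of one way to tighten that step, not a full security review: it strips the ID down to harmless characters and quotes every path with escapeshellarg. The names safe_token, build_tidy_command, and my_tidy_sketch are hypothetical, and the sketch assumes PHP 5 or later (for file_put_contents) on a Unix-like host.

```php
<?php
// Hypothetical helper: keep only characters that are harmless in a file name.
function safe_token($id) {
    return preg_replace('/[^A-Za-z0-9_-]/', '', (string) $id);
}

// Hypothetical helper: build the tidy command with every argument shell-quoted.
function build_tidy_command($tidyBin, $config, $errors, $in, $out) {
    return escapeshellarg($tidyBin)
        . ' -config ' . escapeshellarg($config)
        . ' -f ' . escapeshellarg($errors)
        . ' ' . escapeshellarg($in)
        . ' > ' . escapeshellarg($out);
}

// Sketch of the posted function using the two helpers above.
// $tmpDir is assumed to be a directory outside the web root.
function my_tidy_sketch($html, $id, $tmpDir) {
    $token = safe_token($id) . '_' . base_convert(uniqid(), 16, 36);
    $in  = $tmpDir . '/' . $token . '_IN.html';
    $out = $tmpDir . '/' . $token . '_OUT.html';

    file_put_contents($in, $html);
    shell_exec(build_tidy_command(
        '/usr/local/bin/tidy',
        '/usr/local/etc/tidy.cfg',
        '/tmp/tidy.errors',
        $in,
        $out
    ));

    // Return false if tidy produced no output file.
    $result = is_file($out) ? file_get_contents($out) : false;

    // Remove the temporary files so they do not accumulate.
    @unlink($in);
    @unlink($out);
    return $result;
}
```

The pasted HTML itself never reaches the shell here, only file names do, so the quoting and the ID filter are the parts that matter for the command line.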

Weedpacket

Once you've tidied the document (and note that PHP has a [man]Tidy[/man] extension) then using the [man]DOM[/man] to process it further would be much more robust than regular expressions. If you're making XML then you might want to look further towards using XSLT to effect the transformation.
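A minimal sketch of the in-process route, assuming the Tidy extension is installed: tidy_repair_string cleans the markup without temp files or a shell command. The wrapper name tidy_in_process and the chosen options are illustrative assumptions, not the poster's configuration.

```php
<?php
// Hypothetical wrapper around the Tidy extension.
// Returns the repaired markup, or false when the extension is unavailable.
function tidy_in_process($html) {
    if (!function_exists('tidy_repair_string')) {
        return false;
    }
    // 'output-xhtml' and 'wrap' are standard HTML Tidy options.
    $config = array('output-xhtml' => true, 'wrap' => 0);
    return tidy_repair_string($html, $config, 'utf8');
}

$clean = tidy_in_process('<p>An unclosed paragraph');
```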

SteelCurtain67

Thanks for the reply, Weedpacket. I am sure using the DOM and XSLT would be a superior way of doing this, but I am not a professional programmer, and at my age I don't want to invest the time to learn them. I am able to get by in PHP pretty well (I have been using it since about version 1.5 or so). This is really a very simple project: it is just 3 forms, and I have the code written and am just adding a few more input boxes for bits of metadata that authors might want to use. Take a look at some of the examples at this epub blog. My code just parses out two tags (a href, img), takes a list of authors' names, and creates lines like

<dc:creator opf:role='aut' opf:file-as='Irving, Washington'>Washington Irving</dc:creator>

<item id='graphics5' media-type='image/png' href='America-1A_html_m701cd0f0.png' />

<navPoint id='navpoint-26' playOrder='28' ><navLabel><text>Chiral Stationary Phases</text></navLabel><content src='vfw1xhtml#RefHeading7173_629850089'/></navPoint>
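Lines like these are built by string concatenation, so any user-supplied value has to be escaped before it lands inside an attribute or element. A hedged sketch, with creator_line as a hypothetical helper name, using htmlspecialchars with ENT_QUOTES so that quotes, ampersands, and angle brackets in a name cannot break out of the markup:

```php
<?php
// Hypothetical helper: build one dc:creator line with both values escaped.
function creator_line($name, $fileAs) {
    $safeName   = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
    $safeFileAs = htmlspecialchars($fileAs, ENT_QUOTES, 'UTF-8');
    return "<dc:creator opf:role='aut' opf:file-as='" . $safeFileAs . "'>"
        . $safeName . "</dc:creator>";
}

$plain  = creator_line('Washington Irving', 'Irving, Washington');
$quoted = creator_line("Flann O'Brien", "O'Brien, Flann");
```

The second call shows the case that matters: the apostrophe in the name becomes a numeric character reference instead of closing the single-quoted attribute.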

It is really, really simple, but it is the security aspect that I am worried about, as I want to make this open to the public. I know it doesn't look like much of a problem, but take a look at the Amazon Kindle Support Forum and see the problems many of the authors have. They can write 500-page books that sell 100 copies a month, but they don't have a clue what an image tag is, and many of them don't really understand the difference between writing in a word processor and editing source code in a text editor. The Kindle guidelines mandate an .ncx navigation file, and epub readers (iPad, Nook, Sony) require a valid epub document.

Weedpacket

Okay. It's just that the DOM functions would be easier to work with because all the parsing would already have been done and you could skip all the regular expression stuff and go straight to taking the pieces you want (e.g., [font=monospace]$img_taglist = $document->getElementsByTagName('img');[/font]).
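As a rough illustration of that approach, assuming PHP 5 or later with the DOM extension (it is not available in PHP 4.4.2), the sketch below pulls the img sources and link targets out of a small made-up document. The sample markup and variable names are invented for the example.

```php
<?php
// Made-up sample input standing in for tidied HTML.
$html = '<html><body>'
      . '<p><a href="ch1.html#RefHeading1">Chapter One</a></p>'
      . '<img src="cover.png" alt="Cover">'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

// Gather every image source.
$images = array();
foreach ($document->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

// Gather every link target that actually has an href.
$links = array();
foreach ($document->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $links[] = $a->getAttribute('href');
    }
}
```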

You don't have to touch XSLT (and it probably would be overkill for this).
And while you could also use the DOM to build XML documents (reliably: [font=monospace]$itemref = $package->createElement('itemref'); $itemref->setAttribute('idref', 'preface'); $spine->appendChild($itemref);[/font]), you don't have to do that either.
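Put together as a self-contained sketch (again assuming PHP 5 or later with the DOM extension; the spine element is created here only so the fragment has a parent), the calls above look like this:

```php
<?php
// Create the document and a spine element to hold the itemref.
$package = new DOMDocument('1.0', 'UTF-8');
$spine = $package->createElement('spine');
$package->appendChild($spine);

// Build the itemref exactly as in the inline example.
$itemref = $package->createElement('itemref');
$itemref->setAttribute('idref', 'preface');
$spine->appendChild($itemref);

// Serialize just the spine node; attribute values are escaped by the DOM.
$xml = $package->saveXML($spine);
```

Because the DOM does the escaping and closes every element, the output stays well-formed whatever text ends up in the attribute values.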

SteelCurtain67 wrote:
I have been using it since about version 1.5 or so

You're ahead of me then; I only got introduced to it late in version 3.

SteelCurtain67

Excellent idea about the DOM; I will remember that for next time. Thanks again. I assume that, since no one has said this is unsafe, I can go ahead and let anyone submit code to be tidied.

Cheers,
Charlie