This is a bit off topic (and long) for a PHP forum, but I have searched for an active HTML Tidy forum and can't find one. I apologize for any inconvenience, but I am calling tidy from a PHP script and I am sure the experts here are knowledgeable about these security issues. I am setting up a couple of web forms to process HTML output from word processors like OpenOffice and MS Word. I would like to pass it through HTML Tidy to clean it up and give me more standardized code. Essentially, I save the code as a file, run tidy on it (FreeBSD 5.5-RELEASE, Apache/1.3.34 (Unix), PHP 4.4.2, tidy 200804), recover the output, and process it with preg_replace etc. I am planning on using the following code to call tidy, though I could use the PHP module instead.
function MyTidy($D, $ID) {
    if (DEBUG > 79) DEBUG_L(LINE, "MyTidy() ", $this);
    $TmpTidy = uniqid();
    $TmpTidy = "${ID}_" . base_convert($TmpTidy, 16, 36);
    $D = stripslashes($D);
    $IN = "/usr/local/www/libsci/tmp/${TmpTidy}_IN.html";
    $OUT = "/usr/local/www/libsci/tmp/${TmpTidy}_OUT.html";
    file_put_contents($IN, $D);
    `/usr/local/bin/tidy -config /usr/local/etc/tidy.cfg -f /tmp/tidy.errors $IN > $OUT`;
    $D = file_get_contents($OUT);
    $D = addslashes($D);
    return $D;
}
where $D = $_POST['HTML_CODE'].
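Since $ID and the temp file paths end up inside a backtick shell command, one thing that could be tightened is how that command line is built. The following is only a minimal sketch, assuming $ID may come from form input (which may not be the case here); safe_id() and tidy_command() are hypothetical helper names, not part of tidy or PHP. It strips $ID down to letters and digits and quotes both paths with escapeshellarg() before they reach the shell.

```php
<?php
// Hypothetical helper: reduce a caller-supplied ID to letters and digits so it
// cannot carry shell metacharacters or path separators into a file name.
function safe_id($id) {
    $clean = preg_replace('/[^A-Za-z0-9]/', '', (string) $id);
    // Fall back to a fixed prefix if nothing usable is left; cap the length.
    return $clean === '' ? 'doc' : substr($clean, 0, 32);
}

// Hypothetical helper: build the tidy command line with both paths quoted.
// The tidy options are the same ones used in MyTidy() above.
function tidy_command($in, $out) {
    return '/usr/local/bin/tidy -config /usr/local/etc/tidy.cfg'
         . ' -f /tmp/tidy.errors '
         . escapeshellarg($in) . ' > ' . escapeshellarg($out);
}

// Example: a hostile ID is reduced to harmless characters before use.
$id  = safe_id('42; rm -rf /');
$tmp = '/usr/local/www/libsci/tmp/' . $id . '_' . base_convert(uniqid(), 16, 36);
echo tidy_command($tmp . '_IN.html', $tmp . '_OUT.html'), "\n";
```

The pasted HTML itself never appears on the command line in this arrangement; it only goes into the input file, so the command string is built entirely from values the script controls.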
So my question is: can I do this on code/text that anyone can paste into a form box without worrying about a code exploit? The code that operates on the tidy output is mostly preg_replace and preg_match calls to extract the <a href and <img tags, plus some string concatenation to build up the epub .ncx and .opf XML files. There are a number of other small text input boxes for things like file names etc., but I can just limit them to allowed characters. The bulk HTML has such a wide range of characters that need to be allowed (for example, a lot of authors use backticks for quotes) that character limitation does not seem practical. The final result is converted with PHP's htmlentities before it is output, so it should be safe at that point.
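To make the post-processing described above concrete, here is a rough sketch of the three pieces: pulling out the href and src values, whitelisting the small text fields, and escaping on output. The helper names extract_links() and valid_file_name() are hypothetical, the 64-character limit is an arbitrary choice, and the patterns assume tidy has been configured to emit double-quoted attributes.

```php
<?php
// Hypothetical helper: collect href values from <a> tags and src values from
// <img> tags. Assumes attributes are double-quoted in tidy's output.
function extract_links($html) {
    $links = array('href' => array(), 'src' => array());
    if (preg_match_all('/<a\b[^>]*\bhref="([^"]*)"/i', $html, $m)) {
        $links['href'] = $m[1];
    }
    if (preg_match_all('/<img\b[^>]*\bsrc="([^"]*)"/i', $html, $m)) {
        $links['src'] = $m[1];
    }
    return $links;
}

// Hypothetical helper: whitelist check for the small text boxes (file names etc.).
// \A and \z anchor the whole string, so a trailing newline is rejected too.
function valid_file_name($name) {
    return preg_match('/\A[A-Za-z0-9_-]{1,64}\z/', $name) === 1;
}

$html  = '<p><a href="ch1.html">One</a> <img src="a&b.png" alt="x" /></p>';
$links = extract_links($html);

// Escape each extracted value at the point where it is written out.
echo htmlentities($links['src'][0], ENT_QUOTES), "\n"; // prints a&amp;b.png
```

Escaping each value individually as it is written into the .ncx or .opf markup, rather than once over the finished document, keeps the generated XML tags themselves intact.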
I would appreciate any thoughts on this. I am doing this for myself to help in converting text documents to epub format, but there are a lot of people who ask for help doing this and I would like to make it public, just not at the cost of a server compromise. Code safety is such a complex area that I don't want to bet my server on my guesses as to what is safe.