Could you be a little more specific about question 1? You could of course use regular expressions, but that may well be overkill.
As for your second question, I suggest you take a look at HTMLDoc, which runs on Linux. I've used it myself on a big project to generate PDF files, so it might work for you too!
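
In case it helps, here is a rough sketch of how HTMLDoc could be called from a script to turn an HTML file into a PDF. It assumes the `htmldoc` binary is installed and on your PATH. The function names and file names are made up for illustration, and I don't know your setup, so treat it as a starting point rather than a finished solution.

```python
import shutil
import subprocess


def build_htmldoc_command(html_path, pdf_path):
    """Build the htmldoc command line (hypothetical helper for this sketch).

    --webpage treats the input as a plain web page rather than a structured book,
    -f names the output file.
    """
    return ["htmldoc", "--webpage", "-f", pdf_path, html_path]


def html_to_pdf(html_path, pdf_path):
    """Convert one HTML file to PDF by shelling out to htmldoc."""
    # Assumption: htmldoc is installed; fail early with a clear message if not.
    if shutil.which("htmldoc") is None:
        raise RuntimeError("htmldoc was not found on PATH")
    # check=True raises CalledProcessError if htmldoc exits with an error.
    subprocess.run(build_htmldoc_command(html_path, pdf_path), check=True)
```

You would then call something like `html_to_pdf("report.html", "report.pdf")` with your own file names.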