I have several remote directories with .zip files containing .doc files that I need to parse and store in a local database for search. The directories can be accessed via HTTP, but I don't have write access to them.
New files are added frequently and need to be automatically indexed, possibly through a cron job.
Basically what I need is:
1. Find any unindexed .zip files.
2. Unzip them into a temporary directory as .doc files, or just into a temporary variable if possible.
3. Parse the .doc files. I'm on a Linux server, so I can't create a COM object.
4. Add the parsed information to a local database.
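To make the first two steps concrete, here is a minimal Python sketch. Python is only an assumption for illustration, and the function names `unindexed` and `extract_docs` are hypothetical. It shows one way to work out which archives are new and to read the .doc members straight into memory without touching the disk.

```python
import io
import zipfile


def unindexed(remote_names, indexed_names):
    """Return the remote .zip names that are not in the index yet, sorted."""
    return sorted(set(remote_names) - set(indexed_names))


def extract_docs(zip_bytes):
    """Read every .doc member of a zip archive into memory.

    Returns a dict mapping member name to raw bytes; nothing is written to disk.
    """
    docs = {}
    # BytesIO lets zipfile treat the downloaded bytes as a seekable file.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".doc"):
                docs[name] = archive.read(name)
    return docs
```

The `zip_bytes` argument could come from something like `urllib.request.urlopen(url).read()`, and `indexed_names` from a query against the local database; how the remote file names are listed depends on what the HTTP server exposes, so that part is left out.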
The steps I need help with are 2 and 3. I've been able to do some rudimentary parsing of the .doc files just by extracting all the text and splitting it on the headings in the files, as they are fairly standardized. But if it were possible to convert the files to real XML, I'm guessing that would be a better, more robust solution.
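For reference, the heading-based approach looks roughly like the Python sketch below. It is an illustration under assumptions, not the actual code: the heading names in `HEADINGS` are hypothetical placeholders, and the input is assumed to be plain text that has already been extracted from the .doc file by some other tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical heading names; the real ones depend on the documents.
HEADINGS = ("Summary", "Background", "Decision")


def text_to_xml(text, headings=HEADINGS):
    """Group plain text into <section> elements, one per known heading."""
    root = ET.Element("document")
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in headings:
            # A line that is exactly a known heading starts a new section.
            section = ET.SubElement(root, "section", name=stripped)
            section.text = ""
        elif section is not None and stripped:
            # Non-empty lines are appended to the current section.
            # Text before the first heading is dropped.
            section.text += stripped + "\n"
    return ET.tostring(root, encoding="unicode")
```

This only works while every heading sits on its own line and is spelled exactly as expected, which is why a conversion that preserves the document's real structure would be more robust.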
Any help would be appreciated.
Thank you.