Hey guys, the project spec is pretty simple, and making it work isn't that complicated.. but making it as EFFICIENT AS POSSIBLE is something else, and that's where I stand, asking other equally or more experienced developers out there for advice on the best way to execute this project.
Site Specs:
Unix, Apache, MySQL, and PHP
The Project:
Think of this project as a library. The client has PDF documents.. the documents range from 5-20 pages each..
He simply wants to bring all of this online and make it searchable (basic and advanced).. he wants the data displayed as plain HTML (no longer as PDF)..
He wants the users to be able to search through it via:
Author, date, title, category, and keywords found IN THE BODY OF THE DOCUMENTS... (maybe more search options but not important)
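To make the question concrete, here's a rough sketch of the table I have in mind (column names and sizes are just my placeholders, not a final design). As far as I know, MySQL's FULLTEXT indexes only work on MyISAM tables (InnoDB didn't get FULLTEXT support until 5.6), which ties into my engine question below:

```sql
-- Rough sketch only: column names/sizes are placeholders.
CREATE TABLE documents (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    author    VARCHAR(255) NOT NULL,
    title     VARCHAR(255) NOT NULL,
    category  VARCHAR(100) NOT NULL,
    pub_date  DATE         NOT NULL,
    body      MEDIUMTEXT   NOT NULL,     -- plain text extracted from the PDF
    FULLTEXT idx_search (title, body)    -- needs MyISAM on MySQL < 5.6
) ENGINE=MyISAM;

-- Keyword search in the body (natural-language mode):
SELECT id, title, author
FROM documents
WHERE MATCH (title, body) AGAINST ('search terms here');
```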
Pretty simple project really, but what complicates it is that he's got over 1.5 million PDF documents!!!! lol
So I need your advice on the BEST way to build this system in terms of:
1) how to make the searching as fast as possible and not kill the server
2) What's the best engine to use for this? MyISAM (always my favorite), or InnoDB (never used this)?
3) How the heck will I automate the conversion of PDF to plain text and insert it into the database properly (date, author, body, etc.)? (is this even possible?)
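Half-answering my own #3: the only route I've found so far is shelling out to the xpdf/poppler command-line tools, `pdftotext` (dumps the body text) and `pdfinfo` (dumps Title/Author/CreationDate as `Key: value` lines). A minimal sketch in Python, assuming those tools are on the PATH (the function names are mine):

```python
import subprocess
from pathlib import Path


def parse_pdfinfo(output):
    """Parse pdfinfo-style 'Key: value' lines into a dict with lowercase keys."""
    meta = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip().lower()] = value.strip()
    return meta


def convert_pdf(pdf_path):
    """Extract (metadata dict, plain text) from one PDF.

    Requires the xpdf/poppler `pdfinfo` and `pdftotext` tools on the PATH.
    """
    pdf_path = str(Path(pdf_path))
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True, check=True).stdout
    # "-" tells pdftotext to write the extracted text to stdout
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    return parse_pdfinfo(info), text
```

From there it's one INSERT per document.. with 1.5 million files this obviously has to run as a long offline batch job, not at request time.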
I'll leave it at that for now. hehe.
any advice, tips, and warnings will be MUCH APPRECIATED!
thanks all!
Tea