It depends on your architecture.
For example, you could have an HTTP gateway server that serves all the HTML pages for navigation, document searching, etc.; a repository server for storing the document images; and a database server to store the documents' metadata and their locations on disk. They could all exist on the same machine or on separate machines. For performance, I would put them on separate machines.
When the user wants to search, he or she would type in keywords to find the document, or use some other search criteria. The database server would hold the metadata describing those document images. This will likely be a transactional database (more on this below).
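A minimal sketch of that metadata lookup, using Python's built-in sqlite3 for a self-contained demo (the real system would use MySQL/InnoDB as suggested below); the table and column names here are assumptions, not anything prescribed:

```python
import sqlite3

# In-memory database for the sketch; production would be MySQL/InnoDB.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        doc_id      INTEGER PRIMARY KEY,
        name        TEXT,
        created     TEXT,
        description TEXT,
        disk_path   TEXT          -- where the rendered files live on disk
    )
""")
conn.execute(
    "INSERT INTO documents VALUES (1, 'Dandelion', '2005-04-05', "
    "'A picture of dandelions in my backyard that I need to weed.', "
    "'/store/00/01')"
)

def search_documents(keyword):
    """Return metadata rows whose name or description matches the keyword."""
    like = f"%{keyword}%"
    cur = conn.execute(
        "SELECT doc_id, name, created, description FROM documents "
        "WHERE name LIKE ? OR description LIKE ?", (like, like))
    return cur.fetchall()

print(search_documents("dandelion"))
```

Note the query only touches metadata; the image files themselves never pass through the database.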
On return of the document search query, a list of the matching documents would be generated. The list would allow the user to choose among various sizes and formats (JPG, PDF, etc.); the entries would essentially be HTML anchor links, each one the "GET" of an HTTP request.
So let's say the user searches for flowers. The list could look something like this:
Name: Dandelion
Created: 04/05/05
Description:
A picture of dandelions in my backyard that I need to weed.
PDF Full Thumb
JPG Full Thumb
Name: Roses
Created: 05/03/04
Description:
A bouquet of roses on mom's table.
PDF Full Thumb
JPG Full Thumb
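A result entry like the ones above could be rendered into HTML anchor links with something like this sketch; the URL scheme (`/doc/<id>/<size>.<fmt>`) is an assumption for illustration:

```python
import html

# Hypothetical pre-rendered formats/sizes; the files already exist on disk.
FORMATS = ("pdf", "jpg")
SIZES = ("full", "thumb")

def render_result(doc_id, name, created, description):
    """Build the HTML fragment for one search hit, with a GET link
    per pre-rendered variant of the document."""
    lines = [
        f"<p>Name: {html.escape(name)}<br>",
        f"Created: {html.escape(created)}<br>",
        f"Description: {html.escape(description)}<br>",
    ]
    for fmt in FORMATS:
        links = " ".join(
            # Each anchor is a plain HTTP GET against the gateway server.
            f'<a href="/doc/{doc_id}/{size}.{fmt}">{size.capitalize()}</a>'
            for size in SIZES
        )
        lines.append(f"{fmt.upper()}: {links}<br>")
    lines.append("</p>")
    return "\n".join(lines)

print(render_result(1, "Dandelion", "2005-04-05",
                    "A picture of dandelions in my backyard."))
```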
As weedpacket mentioned, these images would already be rendered in the various sizes and formats. In other words, they are pre-processed ahead of time to reduce the amount of processing on the server at request time. The disadvantage, of course, is the amount of disk space needed.
Think carefully about how you are going to organize the files on disk, and about what happens when you run out of disk space. Design the application so that documents can be stored across multiple physical disks. Also be aware that the filesystem will limit, or perform badly with, a large number of files in a single folder. Plan for application scaling from the start, because disk space will be at a premium.
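One common way to handle both concerns at once is to hash the document id into a short directory tree spread over several volumes, so no single folder grows without bound. A sketch, with the volume names and fan-out as assumptions:

```python
import hashlib

def shard_path(doc_id, volumes=("/vol0", "/vol1"), fanout=256):
    """Map a document id to a disk volume and a two-level directory,
    so files are spread across disks and no folder gets too large."""
    digest = hashlib.sha1(str(doc_id).encode()).hexdigest()
    volume = volumes[int(digest[:8], 16) % len(volumes)]  # pick a disk
    top = int(digest[8:10], 16) % fanout                  # first-level folder
    sub = int(digest[10:12], 16) % fanout                 # second-level folder
    return f"{volume}/{top:02x}/{sub:02x}/{doc_id}"

print(shard_path(42))
```

Because the mapping is deterministic, the path can always be recomputed from the id; adding a volume later does require a migration plan, which is part of why this deserves thought up front.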
Now when a document is added to the repository, it could be queued up. A background task would check the queue and do the conversions to the various sizes and formats you require. Note that this would likely be transaction based, so the choice of database is important; you may want to use MySQL InnoDB for this. The same goes for a delete or update.
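The queue-plus-worker idea can be sketched as below, using an in-process queue and a stubbed-out converter; in production the queue would be a database table polled by a separate worker process, and `convert` would invoke a real image tool:

```python
import queue
import threading

# Hypothetical job queue; real code would poll a DB table transactionally.
jobs = queue.Queue()
results = []

def convert(doc_id, fmt, size):
    # Stub: real code would shell out to an image library here.
    return f"{doc_id}.{size}.{fmt}"

def worker():
    """Drain the queue, rendering every size/format variant per document."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut the worker down
            break
        for fmt in ("pdf", "jpg"):
            for size in ("full", "thumb"):
                results.append(convert(job, fmt, size))
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
jobs.put(1)        # a newly added document is queued for rendering
jobs.put(None)
t.join()
print(results)
```

The point of the queue is that the upload request returns immediately; the expensive rendering happens off the request path.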
Now, one would design this so that files are never actually deleted. Instead, they would be moved to a "garbage" folder, where another background task (or a separate tool run by an administrator) would remove them later. You do this in case the delete transaction fails partway through, say only 2 of the 4 files get moved. You could then roll back and restore from the garbage folder, since the files would still exist.
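A minimal sketch of that move-to-garbage delete with rollback (the function and folder names are mine, not anything standard):

```python
import os
import shutil
import tempfile

def delete_document(paths, garbage_dir):
    """Move a document's rendered files to a garbage folder instead of
    unlinking them; if any move fails, put the already-moved files back."""
    moved = []
    try:
        for path in paths:
            dest = os.path.join(garbage_dir, os.path.basename(path))
            shutil.move(path, dest)
            moved.append((path, dest))
    except OSError:
        for original, dest in moved:   # rollback: restore moved files
            shutil.move(dest, original)
        raise
    # A background task (or an administrator) purges garbage_dir later.

# Demo with temp files standing in for the four rendered variants.
work = tempfile.mkdtemp()
garbage = tempfile.mkdtemp()
files = []
for name in ("full.pdf", "thumb.pdf", "full.jpg", "thumb.jpg"):
    p = os.path.join(work, name)
    open(p, "w").close()
    files.append(p)

delete_document(files, garbage)
print(sorted(os.listdir(garbage)))
```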
These images can exist on the gateway server or on another server entirely; probably best on another server, for performance. You don't want background queue processing on a front-end HTTP server stealing cycles while users are executing your application. The job of the HTTP gateway server should simply be to respond to search requests for your MIME objects (in your case, PDF documents and JPG images), retrieve them, and send the data object back to the client.
These are just some things off the top of my head.
Feel free to contact me privately.