Web Crawler database design
A small question about database design, concerning a table that will hold several million records containing URL information.
Let's say I have a table with 1,000+ root websites, and the crawler fetches links from each root website, building a huge Url_links table.
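For context, the Url_links table looks roughly like this (MySQL-style syntax for illustration; the column names are simplified placeholders, not my exact schema):

    -- Illustrative sketch of the Url_links table
    CREATE TABLE Url_links (
        id          BIGINT AUTO_INCREMENT PRIMARY KEY,
        website_id  INT NOT NULL,             -- references the root-websites table
        url         VARCHAR(2048) NOT NULL,   -- the discovered link
        insert_date DATETIME NOT NULL,        -- when the link was found
        is_crawled  TINYINT(1) NOT NULL DEFAULT 0  -- flag marking crawled URLs
    );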
From time to time I have to fetch the top 1,000 uncrawled URLs from this table, grouped by website_ID and ordered by insert date.
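The periodic query has roughly this shape (again illustrative; is_crawled stands for whatever flag marks a URL as crawled):

    -- Fetch the oldest uncrawled URLs, clustered per website
    SELECT website_id, url
    FROM Url_links
    WHERE is_crawled = 0
    ORDER BY website_id, insert_date
    LIMIT 1000;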
Once this table starts to grow (4M+ records), I/O becomes very heavy and the process slows down dramatically.
Are there any tweaks to the table design that could improve this process? We already have indexes on the website ID and the insert date, but it is still very slow.
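For reference, the existing indexes are along these lines (sketched as two separate single-column indexes; names are placeholders):

    -- Current indexes on the hot columns
    CREATE INDEX idx_website_id  ON Url_links (website_id);
    CREATE INDEX idx_insert_date ON Url_links (insert_date);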
I was thinking of creating a buffer table to separate the uncrawled URLs from the crawled ones, something like the sketch below.
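A rough sketch of that idea (table and index names are just placeholders): new links land in a small pending table whose index matches the query, and rows are moved out once crawled:

    -- Hot table holding only uncrawled URLs
    CREATE TABLE Url_links_pending (
        id          BIGINT PRIMARY KEY,
        website_id  INT NOT NULL,
        url         VARCHAR(2048) NOT NULL,
        insert_date DATETIME NOT NULL,
        INDEX idx_pending (website_id, insert_date)  -- matches the periodic query
    );

    -- After crawling a URL, move its row to an archive table:
    -- INSERT INTO Url_links_crawled SELECT * FROM Url_links_pending WHERE id = ?;
    -- DELETE FROM Url_links_pending WHERE id = ?;

The point would be that the pending table stays small, so the periodic query never has to scan past millions of already-crawled rows.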
But maybe you have more creative ideas.