Well, the reality of the situation is that if you use a fully normalized many-to-many link table, you will end up with 60,000,000 link entries. That is why a normalization violation (and a violation of academic orderliness!) is in order. I would probably do something like this.
a) keep your story database as is.
DB -->> Content -->> id, date, author, head, body... (and so on)
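For concreteness, the story table might look something like this in MySQL (the column types and lengths are guesses, since only the field names are given above):

    CREATE TABLE content (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- story id, referenced by the word table below
        date    DATETIME     NOT NULL,
        author  VARCHAR(100) NOT NULL,                 -- length is a guess
        head    VARCHAR(255) NOT NULL,
        body    TEXT         NOT NULL,
        PRIMARY KEY (id)
    );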
b) build your word database as follows (see the schema sketch after the field descriptions):
word_word - the word itself - after all, it is unique among all the other words in the database, no?
word_seqn - an arbitrary integer from 1 to n since there may be multiple records for a word.
These two fields together constitute the primary key, but you should also index word_word by itself to speed up the SELECTs.
Next, the data:
word_column_count - integer showing how many of the following pairs are in the record
word_story_id - key into the content database
word_story_count - total number of times the word occurs in that story.
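Here is one way to lay that out, assuming the story/count pairs are stored as repeated column pairs (only the first three of the 1,000 pairs are shown; names, types, and lengths are illustrative):

    CREATE TABLE word (
        word_word          VARCHAR(64)  NOT NULL,   -- the word itself; length is a guess
        word_seqn          INT UNSIGNED NOT NULL,   -- 1..n, since a word may need several records
        word_column_count  INT UNSIGNED NOT NULL DEFAULT 0,  -- how many pairs below are in use
        word_story_id_1    INT UNSIGNED,            -- key into content.id
        word_story_count_1 INT UNSIGNED,            -- occurrences of the word in that story
        word_story_id_2    INT UNSIGNED,
        word_story_count_2 INT UNSIGNED,
        word_story_id_3    INT UNSIGNED,
        word_story_count_3 INT UNSIGNED,
        -- ... repeated up to word_story_id_1000 / word_story_count_1000 ...
        PRIMARY KEY (word_word, word_seqn),
        INDEX idx_word (word_word)                  -- the standalone index on word_word described above
    );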
So, each pair is two integers and takes 8 bytes. Thus 500 pairs stored in a record would result in a 4,000-byte record, which is not at all unreasonable, and you could probably run a larger record size than that, say 1,000 pairs per record, for a record size of 8,000 bytes. Now, if a word appears in all 200,000 stories, there will be exactly 200 records for the word, and reading in all of them takes about 1.6 megs of memory (more, actually, once you get into the gory details of MySQL memory allocation), but still far less than reading 200,000 link records for that same word.
So, by tossing the link file and violating normalization rules, you cut your 60,000,000 records down to a maximum of 60,000, since each record holds 1,000 entries.
Dealing with arrays of pairs is really a piece of cake in MySQL, since the fetched array can be viewed as either associative or numeric. You use word_column_count to know where to add new entries in partially filled records. Or you could zero-fill unused entries and blow off the column count.
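For example, appending a pair to a partially filled record could look like this (the word, sequence number, slot, and values are all made up; it assumes the repeated-column layout sketched above, with the application having just read word_column_count = 17 for that record):

    -- Slot 18 is the first unused pair, so write the new story id / count
    -- there and bump the counter.
    UPDATE word
       SET word_story_id_18    = 123456,
           word_story_count_18 = 4,
           word_column_count   = 18
     WHERE word_word = 'widget'
       AND word_seqn = 2;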
From a speed perspective, it is much faster to read one record with 1,000 entries in it than it is to read 1,000 records, each with 1 entry. And, by blowing off the link file, you eliminate the need to search yet another table to find the links. Besides getting the database down to a manageable size, you'll see a big improvement in search speed.
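Put another way, the whole lookup for a word boils down to something like the first query below, instead of dragging one row per story out of the big link table (the link table and its column names are hypothetical, since that is the design being thrown out):

    -- Denormalized: at most a few hundred records per word, found via the word_word index.
    SELECT *
      FROM word
     WHERE word_word = 'widget';

    -- Normalized: one link row per story the word appears in, up to 200,000 of them.
    SELECT story_id, word_count
      FROM link
     WHERE word = 'widget';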
This approach does have one drawback. You can't use it to find all the words in a story via a database lookup. But I suppose one could do that by looking at the story itself, no?