High performance database

Anon

Hello, I need to create a searchable databasa via web. Basically, I will store:

date,autor,title,information.
with date,autur,title there are no problems.
The problem is with information, where the value can be 10 to 1 million words.
I have limited hardware, the system operating is FreeBSD.
Can anyone recommend me a free database system capable of handling things like this at a reasonable speed?

Any pointer is more than welcome..

Thanks!

mattee

www.postgresql.org

smarlowe

The important question right now is how are you modelling this, and what are likely to be sticking points for a given database. If you are doing text searching for hundreds of simos, then you might find it quickest to just store text files and run grep across them by group or something. Grep uses little memory and is pretty fast. Since each simo eats memory and CPU time from a finite pool, scalability may be limited by memory or CPU speed/response time.

So limiting resource usage is important. Postgresql now has no official row limit, but I haven't tested it with 1 million byte rows either. Nothings gonna be real fast at chewing through a million bytes byte by byte, which is how many programs work.

It might be fast enough, and it has a full text search engine addition. I've never used it (the full text indexing stuff) but have heard good things about it and want to try it out.

So now you probably need full text indexing, which is nothing like normal database work, but great for looking up a few works here and there in huge documents. It basically breaks down a file into the important words, and uses a reverse lookup table (i.e. we have a table that says the word "document" occurs in these documents:001x1, 004x5, 056x3, 349x7 etc...)
where x3 means the word occurs 3 times in that document.

So, maybe it would be better to just publish to flat html file, seperated by page numbering, i.e. book001-001.html through book001-999.html for one book, book 2 is book002-001 through 999.html etc... bigger numbers make for a bigger database. BSD file system handles large directories much better than linux's ext2 does (although a couple of patches make it fly on large directories, and may get put into the main tree sooner or later. If someone wants to do this on Linux, I'd recommend using Reiserfs, it has much better large directory performance.

Then, run ht/dig on top of it to index it. It does fuzzy/soundex/mutex/ending/prefix/phrase searching in one tight little C/C++ package running berkely dbms, which nearly everything unix has some version or another.

That's how I'd index stuff anyway, and you can build the documents around template or punch-through design if they need editing in place. (i.e. put a berkely db or something bigger on the backend to store all the data in big files.)

To those who think just toss it in a database and index it... (not anyone in particular, and I'm not picking on anyone really) and explanation of what doesn't work but I've seen suggeted many times. Benchmarks in most databases proves me out on this one.)

If you've got an online index of your data lookups are gonna be much faster, and indexing text is much harder than just:

create table storage (body text, id serial);
copy storage from stdin;
LOTSA DATA IN ROWS
.
create index storage_ndx on storage (body);

for postgresql, as well as most other databases, most simple indexes will only work if you seach on an exact match of the body, which isn't likely to happen in large documents. And they usually don't speed up wildcards like '%value%' at all.

Back to my post... :⁾

I can tell you searching in postgresl is gonna cost you a lot more in memory than most berkely dbms systems you can build or download.

But you need to test each to see how much delta they each use while searching your data. Write a page that searches based on a random word, and call it with apache abuse (ab is the command) try 'ab --help' to get a syntax listing for it. It's an easy program. Run anywhere from 1 to 100 simos for a few thousand times and see where it starts to bog down for you using it at the same time. Note that 100 simos seems like a lot but it's not really, since apache runs better when it can have a lot of simo backends open to each hold a tiny cache of data and provide a kind of connection oriented service. Even though http is connectionless by nature, caching and keep-alive give it a great performance boost like connections, but a backend will switch between a couple of different requestors if you start run out of backends, then eventually your backends will bog down and you won't get answers to some requests.

Anyway, while ab is running use top (there is a BSD version of that right? I think so.) or some other system analyzer to watch resource usage like CPU usage and memory usage. Lots of cache memory in use is good, as the OS will "hog" all the memory for cache use if programs aren't using it directly. You can see how many apache daemons are running by running 'ps aux|grep httpd|wc -l' from the shell and subtracting one, maybe two from that. If each apache daemon shows 4.5 megs in use, with 3 megs shared, then the delta for each is really 1.5 megs. The 3 megs is pretty much the same 3 megs for all the other httpd daemons. It's the unshared memory that costs you.

You can check on postgresql deltas (and similarly mysql and all the others out there) by doing 'ps aux|grep post|wc -l'. Deltas may be huge when running text searching using the:

select * from documents where body like ~'%nugget%';

but may be small enough to be acceptable when running a full text search.

You'll find htdig's the lightest solution for lots of simo lookups and it's already built for you, just install configure and go. But it may not do what you need it to, but remember you can reprogram it to do what you want if need be.

If they show 20 megs with only 5 megs shared, then you have a delta of 15 megs, and that's getting big. That kind of delta will bring a small box (64 meg or less, P200 and below) to it's knees under any kind of load really (5 machine simos will make it crawl.)

Look at CPU utilization, and if your machine is bogging down at 15% CPU and lots of of cache memory in use (i.e. lots of "free" memory) then it's likely your drive subsytem is configured sub par.

If you are running postgresql, you can also

Anon

Hi Abel!
I'm not so "expert" to give you a technical explanation like Scott has done!

Only this, ... in chapter 1.6 of MySQL manual (http://www.mysql.com/documentation/mysql/bychapter/manual_Introduction.html#Features), i found this sentence: "Handles large databases. We are using MySQL with some databases that contain 50,000,000 records and we know of users that
uses MySQL with 60,000 tables and about 5,000,000,000 rows".

Hehehe, i don't know if it's true or not, and i don't know if this can help you, but don't discart the idea to try it!

Good luck ... Paolo

smarlowe

I agree completely. Try all the databases you can get your hands on, just make sure you're testing them doing what you need them to do, i.e. handle really huge row sizes and index the text withing for fast retrieval.

We use Oracle at work, and on many millions of small records, the full text indexing returns all results in about a second, on a slow box. Big records, I don't know. :⁾

Anon

Hi,

The limitation of the number of rows depends on which linux distribution you use and config. I had a mysql database (3.23.32) with huge tables. I find that the limit is at the file size. Each "table" cannot exceed 2GB.

I don't know it is correct on latest versions. Please check.