PostgreSQL testing
OK, I created a test table and populated it with this script:
#!/usr/local/bin/php -q
<?php
if ($argc != 2) die("\n\nUsage: insert_rand_text_date <tablename>\n\n");
$table = $argv[1];
$conn = pg_connect("dbname=mydb user=me");

// Load the system word list; each row's text is four random words.
$filename = "/usr/share/dict/words";
$fp = fopen($filename, "r");
while (!feof($fp)) {
    $lines[] = addslashes(chop(fgets($fp, 4096)));
}
fclose($fp);
$count = count($lines);
$max = 2700000;

// Seed the Mersenne Twister from the current microtime.
function make_seed() {
    list($usec, $sec) = explode(' ', microtime());
    return (float) $sec + ((float) $usec * 100000);
}
mt_srand(make_seed());

pg_exec($conn, "truncate $table");
pg_exec($conn, "begin");
$jan12004 = mktime(0, 0, 0, 1, 1, 2004);

for ($i = 0; $i < $max; $i++) {
    // Commit in batches of 1000 rows.
    if ($i % 1000 == 0 && $i != 0) {
        print $i . "\n";
        pg_exec($conn, "commit");
        pg_exec($conn, "begin");
        usleep(10);
    }
    // Build a row: four random words plus a random timestamp in 2004.
    // The upper bound of $count-2 skips the empty last element that the
    // feof() loop appends.
    $query  = "insert into $table (info,dt) values ";
    $query .= "('";
    $query .= $lines[mt_rand(0, $count - 2)] . " ";
    $query .= $lines[mt_rand(0, $count - 2)] . " ";
    $query .= $lines[mt_rand(0, $count - 2)] . " ";
    $query .= $lines[mt_rand(0, $count - 2)] . "',";
    $rnum = mt_rand(0, 31535999);  // random offset within one year of seconds
    $date = date("Y-m-d G:i:s", $rnum + $jan12004);
    $query .= "'" . $date . "'";
    $query .= ")";
    @pg_exec($conn, $query);
    if (pg_last_error($conn)) {
        print "\n" . pg_last_error($conn) . "\n" . $query . "\n";
        exit;
    }
}
pg_exec($conn, "commit");
pg_exec($conn, "vacuum full analyze $table");
?>
which put 2.7 million rows of about 50 bytes average width into the database.
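For reference, the script assumes a table along these lines; the exact column types and index name are my assumptions, since the post only shows that the table has info and dt columns and that the planner can index-scan on dt:

```sql
-- Assumed schema: the script needs columns named info and dt,
-- and the index-scan results imply an index on dt.
CREATE TABLE test (
    info text,
    dt   timestamp
);
CREATE INDEX test_dt_idx ON test (dt);
```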
On my database, random_page_cost is set to 1.4.
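For anyone following along, the setting can be inspected and overridden per session like this (a sketch; it can also be set globally in postgresql.conf):

```sql
SHOW random_page_cost;       -- current value
SET random_page_cost = 1.4;  -- session-local override
```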
I ran a couple of queries:
For one random day:
explain analyze select * from test where dt between '2004-07-01 00:00:00' and '2004-07-01 23:59:59';
returned 7365 rows with an index scan in 2 seconds.
For two days:
explain analyze select * from test where dt between '2004-07-01 00:00:00' and '2004-07-02 23:59:59';
returned 14937 rows with an index scan in 3.5 seconds.
Five days:
explain analyze select * from test where dt between '2004-07-01 00:00:00' and '2004-07-05 23:59:59';
returned 37044 rows with an index scan in 10.5 seconds.
Ten days:
explain analyze select * from test where dt between '2004-07-01 00:00:00' and '2004-07-10 23:59:59';
returned 74513 rows with a seq scan in 17.6 seconds.
At this point, to see if my random_page_cost was a good choice, I forced the planner to use an index scan by setting random_page_cost to 0.0; the response time was 18 seconds.
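The forcing step looked something like this (a sketch; setting enable_seqscan = off is the other common way to push the planner toward an index scan):

```sql
SET random_page_cost = 0.0;  -- make random page reads look free, so the index wins
EXPLAIN ANALYZE
SELECT * FROM test
WHERE dt BETWEEN '2004-07-01 00:00:00' AND '2004-07-10 23:59:59';
RESET random_page_cost;      -- restore the configured value afterwards
```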
NOW, I am clustering the table on the dt field.
I'll post the results when the database finishes clustering. This could take a while.
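For completeness, the clustering step is a one-liner, though the syntax depends on the PostgreSQL version; the index name here is an assumption:

```sql
-- Older releases (7.x/8.x era): CLUSTER indexname ON tablename
CLUSTER test_dt_idx ON test;
-- Newer releases: CLUSTER tablename USING indexname
-- CLUSTER test USING test_dt_idx;
```

Note that CLUSTER rewrites the whole table in index order and takes an exclusive lock, which is why it can take a while on 2.7 million rows.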