I'm looking for methods for removing duplicate rows. My current approach is to add a new column containing a sequential number and use it as a unique key. If your system has a pseudo row identifier, like PostgreSQL's oid column, you can use that instead.
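For example, in PostgreSQL one way to add such a key is roughly the following (a sketch, assuming a reasonably recent PostgreSQL; mytable and id match the names used in the delete statement below):

-- add an auto-populated sequential key to the existing table
alter table mytable add column id bigserial;

-- if adding a serial column directly isn't supported, a plain column
-- filled from a sequence does the same job:
-- alter table mytable add column id bigint;
-- create sequence mytable_id_seq;
-- update mytable set id = nextval('mytable_id_seq');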
After that, I use the following statement to find and eliminate the duplicate rows:
delete from mytable
where id in (
    select l1.id
    from mytable l1
    join mytable l2 on (
        l1.field1 = l2.field1
        and l1.field2 = l2.field2
        and l1.field3 = l2.field3
        and l1.id > l2.id
    )
);
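Once the delete has run, you can sanity-check that no duplicates were missed with a grouped count (a sketch using the same column names as above):

-- any rows returned here are groups that still contain duplicates
select field1, field2, field3, count(*)
from mytable
group by field1, field2, field3
having count(*) > 1;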
In the delete statement, id is the field that is unique (added if necessary), while field1 through field3 are the columns that may contain duplicate values. In testing on a real dataset, with a single field compared for uniqueness, I got the following performance results:
Rows  | Deleted | Time (s)
1M    |     480 |    17
2M    |   1,905 |    37
4M    |   7,531 |   103
8M    |  30,000 |   440
18.8M | 170,000 | 1,061
28.6M | 390,000 | 2,472
My machine has 2 GB of RAM, and as soon as the dataset got too big to be buffered in RAM, around the 8 million row mark, there was a sharp drop-off in performance. Notice that I doubled the size of the set there but more than quadrupled the time. From that point on, though, the increase in time is only slightly worse than linear. The other methods I tried degraded much more severely as the set size grew.
select distinct * into newtable was one of those methods (sketched below), as was inserting rows one at a time into a table with a unique index.
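For reference, the select distinct variant looked roughly like this (a sketch; newtable is just a placeholder name, and the exact into / create table as syntax varies by database):

-- copy only distinct rows into a fresh table, then swap it in for the original
select distinct *
into newtable
from mytable;
-- afterwards: drop mytable and rename newtable to take its place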