I'm looking for methods for removing duplicate rows. My current approach is to add a new column containing a sequential number and use it as a unique key. If your system has a pseudo row identifier, like PostgreSQL's oid column, you can use that instead.
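For example, in PostgreSQL one way to add such a key is roughly the following (a sketch, assuming a reasonably recent PostgreSQL; mytable and id match the names used in the delete statement below):

-- add an auto-populated sequential key to the existing table
alter table mytable add column id bigserial;

-- if adding a serial column directly isn't supported, a plain column
-- filled from a sequence does the same job:
-- alter table mytable add column id bigint;
-- create sequence mytable_id_seq;
-- update mytable set id = nextval('mytable_id_seq');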
After that, I use the following statement to find and eliminate the duplicate rows:
delete from mytable
where id in (
    select l1.id
    from mytable l1
    join mytable l2 on (
        l1.field1 = l2.field1
        and l1.field2 = l2.field2
        and l1.field3 = l2.field3
        and l1.id > l2.id
    )
);
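Once the delete has run, you can sanity-check that no duplicates were missed with a grouped count (a sketch using the same column names as above):

-- any rows returned here are groups that still contain duplicates
select field1, field2, field3, count(*)
from mytable
group by field1, field2, field3
having count(*) > 1;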
In the delete statement, id is the field that is unique (added if necessary), while field1 through field3 are the columns that may contain duplicate values. In testing on a real dataset, with a single field compared for uniqueness, I got the following performance results:
Rows  | Deleted | Time (s)
1M    |     480 |    17
2M    |   1,905 |    37
4M    |   7,531 |   103
8M    |  30,000 |   440
18.8M | 170,000 | 1,061
28.6M | 390,000 | 2,472
My machine has 2 GB of RAM, and as soon as the dataset got too big to be buffered in RAM, around the 8 million row mark, there was a sharp drop-off in performance. Notice that I doubled the size of the set there but more than quadrupled the time. From that point on, though, the increase in time is only slightly worse than linear. The other methods I tried degraded much more severely as the set size grew.
select distinct * into newtable was one of those methods (sketched below), as was inserting rows one at a time into a table with a unique index.
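For reference, the select distinct variant looked roughly like this (a sketch; newtable is just a placeholder name, and the exact into / create table as syntax varies by database):

-- copy only distinct rows into a fresh table, then swap it in for the original
select distinct *
into newtable
from mytable;
-- afterwards: drop mytable and rename newtable to take its place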