Sure πŸ™‚

Using md5_file() (times in seconds):

With a 1MB file:

Hash = df1555ec0c2d7fcad3a03770f9aa238a; time = 0.005006

With a 2MB file:

Hash = 4387904830a4245a8ab767e5937d722c; time = 0.010393

With a 10MB file:

Hash = b89f948e98f3a113dc13fdbd3bdb17ef; time = 0.241907
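
For reference, a minimal sketch of how such a timing run might look (the file path here is just a placeholder):

    <?php
    // Time a single md5_file() call and report the result.
    $file = 'test-10mb.bin'; // hypothetical test file

    $start = microtime(true);
    $hash  = md5_file($file);
    $time  = microtime(true) - $start;

    printf("Hash = %s; time = %f\n", $hash, $time);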

    Eh, I meant benchmarks of the complete methods mentioned above. I just didn't feel like actually writing out the code for my method.

      The only thing actually taking more than about 0.00000x seconds is the file hash. πŸ™‚
      Of course my function could possibly be optimized further, but it was just a quick solution. πŸ™‚ Didn't bother to take the time to think about it hehe.

        Not to mention, having an array in which each cell contains 5KB of data is usually suicide on shared servers: even with a directory of just 10 files, your script would have to read and hold 50KB of string data in memory.
        100 files and you've got half a meg wasted on a PHP script. πŸ˜‰ With hashes it would take only about 3.2KB: an MD5 hex digest is 32 characters of one byte each, so 32 * 100 = 3,200 bytes, roughly 3.2KB even in the worst case. πŸ™‚
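
        A minimal sketch of that hash-keyed approach, assuming $paths already holds a flat list of file paths:

            <?php
            // Keep only 32-character hex digests in memory, never file contents.
            $seen = []; // md5 digest => first path seen with that digest
            foreach ($paths as $path) {
                $hash = md5_file($path);
                if (isset($seen[$hash])) {
                    echo "$path duplicates {$seen[$hash]}\n";
                } else {
                    $seen[$hash] = $path;
                }
            }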

          My function, tested on a directory with two nested subdirectories of 41 .jpg files totaling 63 MB, takes 24 seconds on my slow machine. As I implied, it's meant for administrative use, like on a development machine before file upload, or during low server-usage times.
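
          That function isn't reproduced in the thread; as a rough illustration, a recursive scan along these lines (using PHP's SPL iterators) could look like:

              <?php
              // Recursively hash every file under $dir and group paths by digest.
              function findDuplicates(string $dir): array
              {
                  $byHash = [];
                  $files  = new RecursiveIteratorIterator(
                      new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
                  );
                  foreach ($files as $file) {
                      if ($file->isFile()) {
                          $byHash[md5_file($file->getPathname())][] = $file->getPathname();
                      }
                  }
                  // Keep only digests shared by more than one file.
                  return array_filter($byHash, fn ($paths) => count($paths) > 1);
              }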

            Of course, if you want to be really anal about it, while it's unlikely, it's not impossible for two different images to generate an identical hash value. Perhaps two different hashing methods should be used, or use one but also compare the file size, etc., just to make such a coincidence even more remote?

              That's true. But for the chances of that happening to be more than infinitesimal (especially with large files), a person would need to be hashing an astronomical number of individual pictures; a random MD5 collision only becomes likely somewhere around 2^64 files. Even then, other problems would likely arise first, such as the cpu getting bored and walking away.

                To make it practically impossible that two different images will produce the same signature:

                Instead of md5_file($fileName), do

                md5_file($fileName) . sha1_file($fileName) . filesize($fileName)

                But only if you really need it.
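
                Wrapped up as a small helper (the function name is just illustrative), that composite signature might look like this:

                    <?php
                    // A collision would now have to hit MD5, SHA-1 and the exact
                    // byte length all at once.
                    function fileSignature(string $fileName): string
                    {
                        return md5_file($fileName) . sha1_file($fileName) . filesize($fileName);
                    }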

                  Meanwhile, of course, there is the possibility that two indistinguishable images (identical even pixel-for-pixel, let alone merely visually) are stored as different byte streams and so have different hashes πŸ™‚; such a false negative would be much more likely than a false positive.

                  As for the hashing: I have used exactly that before, storing an MD5 of each file (video) in its db record and comparing the MD5 of each incoming video against those. I didn't sweat the false negatives too much; it was mainly there to prevent accidental multiple uploads.
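
                  A minimal sketch of that upload check, assuming a PDO connection in $pdo and a hypothetical videos table with an md5 column:

                      <?php
                      // Reject an upload if a video with the same MD5 is already on record.
                      $hash = md5_file($_FILES['video']['tmp_name']);

                      $stmt = $pdo->prepare('SELECT id FROM videos WHERE md5 = ?');
                      $stmt->execute([$hash]);

                      if ($stmt->fetchColumn() !== false) {
                          echo 'That video appears to have been uploaded already.';
                      } else {
                          // ...move the file into place and INSERT its record with the hash...
                      }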

                    Thank you for all your help, I accomplished what I needed to do :-)

                      I thought of the hash solution, too. If speed were an issue, I'd grab only the first KB of each file, do a pass looking for duplicate prefixes, then compare file sizes, and only then, if necessary, do a full hash comparison.
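
                      A minimal sketch of that staged approach (the 1KB prefix length and the $paths list are assumptions):

                          <?php
                          // Stage 1: group files by a hash of their first 1KB only.
                          $byPrefix = [];
                          foreach ($paths as $path) {
                              $prefix = file_get_contents($path, false, null, 0, 1024);
                              $byPrefix[md5($prefix)][] = $path;
                          }

                          // Stages 2 and 3: within each group, confirm by size plus full hash.
                          foreach ($byPrefix as $group) {
                              if (count($group) < 2) {
                                  continue; // unique prefix, cannot be a duplicate
                              }
                              $confirmed = [];
                              foreach ($group as $path) {
                                  $confirmed[filesize($path) . ':' . md5_file($path)][] = $path;
                              }
                              foreach ($confirmed as $matches) {
                                  if (count($matches) > 1) {
                                      echo 'Duplicates: ' . implode(', ', $matches) . "\n";
                                  }
                              }
                          }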

                        And if you're gonna use that first-X-KB matching approach, make sure you unset everything you no longer need, so you don't get stuck with a megabyte of memory filled up for no reason until the end of your script.
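
                        In the staged sketch above, for example, the prefix groupings can be released as soon as the final pass is done:

                            <?php
                            // Free the buffered prefix groupings explicitly once they are
                            // no longer needed, instead of holding them until script end.
                            unset($byPrefix);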

                          13 years later

                          Hi, I know this discussion is a bit old, but I'll still respond to it: if you're looking for software to find duplicate files, try Easy Duplicate Finder. It helps you find and remove duplicate files and is easy to use.
