I have a directory full of pictures (all .jpg) with different file names. I want to find duplicates by their contents (the actual picture data, not the file name) and delete the duplicates.

How would I go about accomplishing this?

    You're searching for exact duplicates? Not sure PHP is the best language to do this in.

    What you could try and do is keep an array of data, say the first 5kb of each file. What you'd use this for is that on each iteration through the list of files, you'd check whether the first 5kb of the current file matches any entry in your data array. If it does, see if the entire contents of the previously stored file matches the entire contents of the current file. If it does, delete (the current file? the previous file? your choice) and continue. If it doesn't, store the 5kb of data in the array and continue on. You'd be searching the array using array_search(), and you could store the filename as the key.

    The only reason I suggested keeping only the first 5kb or so is to help keep memory consumption to a minimum for the script. If you had a large amount of files, storing them all into memory and trying to search that array each time would most likely bog down your server considerably (and take a little while to complete, as well).
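
    For illustration only, here's a rough, untested sketch of that prefix idea (the 5KB size, the glob() pattern, the /path/to/pictures path, and the choice to delete the current file rather than the stored one are all just assumptions):

    $prefixes = array(); // filename => first 5KB of the file's contents
    foreach (glob('/path/to/pictures/*.jpg') as $file) {
        $prefix = file_get_contents($file, false, null, 0, 5120);
        // Cheap prefix check first; only compare full contents on a prefix hit.
        $match = array_search($prefix, $prefixes, true);
        if ($match !== false && file_get_contents($match) === file_get_contents($file)) {
            unlink($file); // duplicate of $match, so delete the current file
            continue;
        }
        $prefixes[$file] = $prefix;
    }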

      This should do the trick

      function getUniqueFiles($fileNamesArray) {
          $yourArray = $fileNamesArray;

          $len = count($yourArray);
          $hashes = array();
          for ($i = 0; $i < $len; $i++) {
              $hashes[$i] = md5_file($yourArray[$i]);
          }

          // remove duplicates and save original indexes
          $newArray = array();
          $indexes = array();
          for ($i = 0; $i < $len; $i++) {
              if (!in_array($hashes[$i], $newArray)) {
                  array_push($newArray, $hashes[$i]);
                  array_push($indexes, $i);
              }
          }

          // retrieve from the original array by index of unique value
          $duplicateFree = array();
          foreach ($indexes as $index) {
              array_push($duplicateFree, $yourArray[$index]);
          }

          return $duplicateFree;
      }
      

      and then of course

      $uniqueFiles = getUniqueFiles($allFileNames);

      foreach ($allFileNames as $fileName) {
          if (!in_array($fileName, $uniqueFiles)) {
              unlink($fileName); // delete or whatever you want
          }
      }
      

        Here's a function that can delete duplicate files either from a single directory or recursively (i.e. duplicate files from different directories will be removed).
        -- I've left in a lot of testing code; if you want to use it, test it carefully first to make sure my tests didn't miss anything.
        -- It might use a lot of memory if your $path has a huge number of files under it.
        -- The $restart argument shouldn't be passed when you call the function; its default clears the static variables at the start of each new top-level call, while the recursive calls pass false so those variables are kept.
        -- It probably needs work (at the least, removal of the echo lines).
        -- As the function is, there's no telling which directory a file will be deleted from. That depends on the order in which entries are read.
        -- Allowance isn't made for file/directory permissions, so make sure first that they're correct.
        -- Not meant for frequent use on an active server.

        function rm_dup_files($path, $recurse = false, $exempt = array(), $restart = true)
        {
            if ((substr($path, -1) !== '/') && (substr($path, -1) != '\\')) {
                $path .= DIRECTORY_SEPARATOR;
            }
            static $md5_arr = array();
            static $ret_arr = array();
            if ($restart) {
                $md5_arr = array();
                $ret_arr['ct'] = 0;
                $ret_arr['removed'] = array();
                $ret_arr['not_removed'] = array();
                if (!is_dir($path)) {
                    echo '!is_dir(' . $path . ')<br />';
                    return false;
                }
            }
            $dir = dir($path);
            while (false !== ($entry = $dir->read())) {
                if (($entry == '.') || ($entry == '..')) {
                    continue;
                }
                if (!is_dir($path . $entry)) {
                    $md5 = md5(file_get_contents($path . $entry));
                    echo $path . $entry . ' => ' . $md5;
                    if (in_array($md5, $md5_arr) && !in_array($entry, $exempt)) {
                        unlink($path . $entry);
                        $ret_arr['ct']++;
                        $ret_arr['removed'][] = $path . $entry;
                        echo ' => ' . '<span style="color:red">removed</span>';
                    } else {
                        $md5_arr[] = $md5;
                        $ret_arr['not_removed'][] = $path . $entry;
                    }
                    echo '<br />';
                } elseif ($recurse) {
                    echo '<span style="color:blue">' . $path . '</span><br />';
                    $subdir_path = $path . $entry;
                    echo '<span style="color:green">recursing ' . $subdir_path . '</span><br />';
                    rm_dup_files($subdir_path, true, $exempt, false);
                }
            }
            $dir->close();
            return $ret_arr;
        }

        Some test code:

        $path = '/path';
        $exempt_files = array('file_a', 'file_2', 'file_iii');
        $remove_result = rm_dup_files($path, true, $exempt_files);
        if (!$remove_result) {
            echo 'error';
        } else {
            echo '<pre>';
            print_r($remove_result);
            echo '</pre>';
        }
        
        $remove_result = rm_dup_files($path);
        if (!$remove_result) {
            echo 'error';
        } else {
            echo '<pre>';
            print_r($remove_result);
            echo '</pre>';
        }

          D'oh - didn't think about using file hashes. I'd be curious to see benchmarks, though.

            Sure 🙂

            Using md5_file()

            With a 1MB file:

            Hash = df1555ec0c2d7fcad3a03770f9aa238a; time = 0.005006

            With a 2MB file:

            Hash = 4387904830a4245a8ab767e5937d722c; time = 0.010393

            With a 10MB file:

            Hash = b89f948e98f3a113dc13fdbd3bdb17ef; time = 0.241907
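
            (In case it's useful, a timing like that can be taken with microtime(); the little harness below is just my own minimal sketch, and the file path is a placeholder.)

            $file = '/path/to/test.jpg'; // placeholder test file
            $start = microtime(true);
            $hash = md5_file($file);
            $elapsed = microtime(true) - $start;
            printf("Hash = %s; time = %.6f\n", $hash, $elapsed);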

              Eh, I meant benchmarks of the complete approaches mentioned above. I just didn't feel like actually writing out the code for my method above.

                The only thing actually taking more than about 0.00000x seconds is the file hash. 🙂
                Of course my function could probably be optimized further, but it was just a quick solution. 🙂 I didn't bother to take the time to think about it, hehe.

                  Not to mention, keeping an array with each cell containing 5KB of data is usually suicide on shared servers: even with a directory containing just 10 files, your script would have to read and hold 50KB of string data in memory.
                  With 100 files you've got half a meg wasted on a PHP script. 😉 The same 100 files would take only about 3.2KB with hashes (32 hex characters * 100 files / 1000, at one byte per character, so roughly 3.2KB even in the worst case 🙂).

                    My function, tested on a directory with two nested subdirectories of 41 .jpg files totaling 63 MB, takes 24 seconds on my slow machine. As I implied, it's meant for administrative use, like on a development machine before file upload, or during low server-usage times.

                      Of course, if you want to be really anal about it, while it's unlikely, it's not impossible for two different images to generate an identical hash value. Perhaps two different hashing methods should be used, or use one but also compare the file size, etc., just to make such a coincidence even more remote?

                        That's true. But for the chances of that happening to be more than infinitesimal (esp. with large files), a person would need to be checking through an astronomical number of individual pictures. Even then other problems would likely arise first, such as the cpu getting bored and walking away.

                          To make it practically impossible that two different images end up with the same fingerprint:

                          instead of the md5_file($fileName) I used, do

                          md5_file($fileName).sha1_file($fileName).filesize($fileName)

                          But only if you really need it.
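
                          Dropped into the getUniqueFiles() function above, that just means replacing the single md5_file() line; a rough sketch:

                          // composite fingerprint: MD5 + SHA-1 + byte size
                          $hashes[$i] = md5_file($yourArray[$i])
                              . sha1_file($yourArray[$i])
                              . filesize($yourArray[$i]);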

                            Meanwhile, of course, there is the possibility that two indistinguishable files (even pixel-for-pixel, let alone merely visually) have different hashes 🙂 - such a false negative would be much more likely than a false positive.

                            As for the hashing, I have used that approach before: storing an MD5 of each file (video) in its db record and comparing the MD5 of each incoming video against those. I didn't sweat the false negatives too much; it was mainly to prevent accidental multiple uploads.
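
                            For what it's worth, that kind of upload check is just a lookup on a stored hash column; a minimal sketch with PDO, where the videos table, its md5 column, and the $pdo / $uploadedTmpPath variables are all made-up names:

                            $md5 = md5_file($uploadedTmpPath); // hash of the incoming file
                            $stmt = $pdo->prepare('SELECT id FROM videos WHERE md5 = ?');
                            $stmt->execute(array($md5));
                            if ($stmt->fetch()) {
                                // likely a repeat upload; reject it or warn the user
                            } else {
                                // store the file and save $md5 in its db record
                            }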

                              Thank you for all your help, I accomplished what I needed to do :-)

                                I thought of the hash solution, too. If speed were an issue, I'd only grab the first KB of each file, do a pass looking for duplicates, then compare file sizes, and then, if necessary, do a full hash comparison.
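
                                Roughly, that staged check might look like this untested sketch (the 1KB prefix size and the directory path are assumptions, and for brevity the size and full hash are combined into one key rather than compared one after the other):

                                // Pass 1: group files by a hash of their first KB (cheap).
                                $byPrefix = array();
                                foreach (glob('/path/to/pictures/*.jpg') as $file) {
                                    $byPrefix[md5(file_get_contents($file, false, null, 0, 1024))][] = $file;
                                }

                                // Pass 2: only files sharing a prefix get the expensive size + full-hash check.
                                foreach ($byPrefix as $candidates) {
                                    if (count($candidates) < 2) {
                                        continue; // unique prefix, nothing to compare
                                    }
                                    $seen = array(); // "size|full-hash" => first file seen with that key
                                    foreach ($candidates as $file) {
                                        $key = filesize($file) . '|' . md5_file($file);
                                        if (isset($seen[$key])) {
                                            unlink($file); // duplicate of $seen[$key]
                                        } else {
                                            $seen[$key] = $file;
                                        }
                                    }
                                }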

                                  And if you're gonna use that first-X-KB matching approach, make sure you unset everything you no longer need, so you won't get stuck with a megabyte of memory filled up for no reason until the end of your script.

                                    13 years later

                                     Hi, I know this discussion is a bit old, but I'll still respond: if you're looking for software to find duplicate files, try the tool Easy Duplicate Finder. It helps you find and remove your duplicate files and it is easy to use.
