I have a directory full of pictures (all .jpg) with different file names. I want to find duplicates by their contents (the actual picture data, not the file name) and delete the duplicates.

How would I go about accomplishing this?

    You're searching for exact duplicates? Not sure PHP is the best language to do this in.

    What you could try and do is keep an array of data, say the first 5kb of each file. What you'd use this for is that on each iteration through the list of files, you'd check whether the first 5kb of the current file matches any entry in your data array. If it does, see if the entire contents of the previously stored file matches the entire contents of the current file. If it does, delete (the current file? the previous file? your choice) and continue. If it doesn't, store the 5kb of data in the array and continue on. You'd be searching the array using array_search(), and you could store the filename as the key.

    The only reason I suggested keeping only the first 5kb or so is to help keep memory consumption to a minimum for the script. If you had a large amount of files, storing them all into memory and trying to search that array each time would most likely bog down your server considerably (and take a little while to complete, as well).
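
    For illustration only, here's a rough, untested sketch of that prefix idea (the 5KB size, the glob() pattern, the /path/to/pictures path, and the choice to delete the current file rather than the stored one are all just assumptions):

    $prefixes = array(); // filename => first 5KB of the file's contents
    foreach (glob('/path/to/pictures/*.jpg') as $file) {
        $prefix = file_get_contents($file, false, null, 0, 5120);
        // Cheap prefix check first; only compare full contents on a prefix hit.
        $match = array_search($prefix, $prefixes, true);
        if ($match !== false && file_get_contents($match) === file_get_contents($file)) {
            unlink($file); // duplicate of $match, so delete the current file
            continue;
        }
        $prefixes[$file] = $prefix;
    }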

      This should do the trick

      function getUniqueFiles($fileNamesArray) {
          $yourArray = $fileNamesArray;

          $len = count($yourArray);
          $hashes = array();
          for ($i = 0; $i < $len; $i++) {
              $hashes[$i] = md5_file($yourArray[$i]);
          }

          // remove duplicates and save original indexes
          $newArray = array();
          $indexes = array();
          for ($i = 0; $i < $len; $i++) {
              if (!in_array($hashes[$i], $newArray)) {
                  array_push($newArray, $hashes[$i]);
                  array_push($indexes, $i);
              }
          }

          // retrieve from the original array by index of unique value
          $duplicateFree = array();
          foreach ($indexes as $index) {
              array_push($duplicateFree, $yourArray[$index]);
          }

          return $duplicateFree;
      }
      

      and then of course

      $uniqueFiles = getUniqueFiles($allFileNames);

      foreach ($allFileNames as $fileName) {
          if (!in_array($fileName, $uniqueFiles)) {
              unlink($fileName); // delete or whatever you want
          }
      }
      

        Here's a function that can delete duplicate files either from a single directory or recursively (i.e. duplicate files from different directories will be removed).
        -- I've left in a lot of testing code; if you want to use it, test it carefully first to make sure my tests didn't miss anything.
        -- It might use a lot of memory if your $path has a huge number of files under it.
        -- The $restart argument shouldn't be passed when you call the function; its default clears the static variables at the start of each new top-level call, while the recursive calls pass false so those variables are kept.
        -- It probably needs work (at the least, removal of the echo lines).
        -- As the function is, there's no telling which directory a file will be deleted from. That depends on the order in which entries are read.
        -- Allowance isn't made for file/directory permissions, so make sure first that they're correct.
        -- Not meant for frequent use on an active server.

        function rm_dup_files($path, $recurse = false, $exempt = array(), $restart = true)
        {
            if ((substr($path, -1) !== '/') && (substr($path, -1) != '\\')) {
                $path .= DIRECTORY_SEPARATOR;
            }
            static $md5_arr = array();
            static $ret_arr = array();
            if ($restart) {
                $md5_arr = array();
                $ret_arr['ct'] = 0;
                $ret_arr['removed'] = array();
                $ret_arr['not_removed'] = array();
                if (!is_dir($path)) {
                    echo '!is_dir(' . $path . ')<br />';
                    return false;
                }
            }
            $dir = dir($path);
            while (false !== ($entry = $dir->read())) {
                if (($entry == '.') || ($entry == '..')) {
                    continue;
                }
                if (!is_dir($path . $entry)) {
                    $md5 = md5(file_get_contents($path . $entry));
                    echo $path . $entry . ' => ' . $md5;
                    if (in_array($md5, $md5_arr) && !in_array($entry, $exempt)) {
                        unlink($path . $entry);
                        $ret_arr['ct']++;
                        $ret_arr['removed'][] = $path . $entry;
                        echo ' => ' . '<span style="color:red">removed</span>';
                    } else {
                        $md5_arr[] = $md5;
                        $ret_arr['not_removed'][] = $path . $entry;
                    }
                    echo '<br />';
                } elseif ($recurse) {
                    echo '<span style="color:blue">' . $path . '</span><br />';
                    $subdir_path = $path . $entry;
                    echo '<span style="color:green">recursing ' . $subdir_path . '</span><br />';
                    rm_dup_files($subdir_path, true, $exempt, false);
                }
            }
            $dir->close();
            return $ret_arr;
        }

        Some test code:

        $path = '/path';
        $exempt_files = array('file_a', 'file_2', 'file_iii');
        $remove_result = rm_dup_files($path, true, $exempt_files);
        if (!$remove_result) {
            echo 'error';
        } else {
            echo '<pre>';
            print_r($remove_result);
            echo '</pre>';
        }
        
        $remove_result = rm_dup_files($path);
        if (!$remove_result) {
            echo 'error';
        } else {
            echo '<pre>';
            print_r($remove_result);
            echo '</pre>';
        }

          D'oh - didn't think about using file hashes. I'd be curious to see benchmarks, though.

            Sure 🙂

            Using md5_file()

            With a 1MB file:

            Hash = df1555ec0c2d7fcad3a03770f9aa238a; time = 0.005006

            With a 2MB file:

            Hash = 4387904830a4245a8ab767e5937d722c; time = 0.010393

            With a 10MB file:

            Hash = b89f948e98f3a113dc13fdbd3bdb17ef; time = 0.241907
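
            (In case it's useful, a timing like that can be taken with microtime(); the little harness below is just my own minimal sketch, and the file path is a placeholder.)

            $file = '/path/to/test.jpg'; // placeholder test file
            $start = microtime(true);
            $hash = md5_file($file);
            $elapsed = microtime(true) - $start;
            printf("Hash = %s; time = %.6f\n", $hash, $elapsed);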

              Eh, I meant benchmarks of the complete approaches mentioned above. I just didn't feel like actually writing out the code for my method above.

                The only thing actually taking more than about 0.00000x seconds is the file hash. 🙂
                Of course my function could probably be optimized further, but it was just a quick solution. 🙂 I didn't bother to take the time to think about it, hehe.

                  Not to mention, keeping an array with each cell containing 5KB of data is usually suicide on shared servers: even with a directory containing just 10 files, your script would have to read and hold 50KB of string data in memory.
                  With 100 files you've got half a meg wasted on a PHP script. 😉 The same 100 files would take only about 3.2KB with hashes (32 hex characters * 100 files / 1000, at one byte per character, so roughly 3.2KB even in the worst case 🙂).

                    My function, tested on a directory with two nested subdirectories of 41 .jpg files totaling 63 MB, takes 24 seconds on my slow machine. As I implied, it's meant for administrative use, like on a development machine before file upload, or during low server-usage times.

                      Of course, if you want to be really anal about it, while it's unlikely, it's not impossible for two different images to generate an identical hash value. Perhaps two different hashing methods should be used, or use one but also compare the file size, etc., just to make such a coincidence even more remote?

                        That's true. But for the chances of that happening to be more than infinitesimal (esp. with large files), a person would need to be checking through an astronomical number of individual pictures. Even then other problems would likely arise first, such as the cpu getting bored and walking away.

                          To make it practically impossible that two different images end up with the same fingerprint:

                          instead of the md5_file($fileName) I used, do

                          md5_file($fileName).sha1_file($fileName).filesize($fileName)

                          But only if you really need it.
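
                          Dropped into the getUniqueFiles() function above, that just means replacing the single md5_file() line; a rough sketch:

                          // composite fingerprint: MD5 + SHA-1 + byte size
                          $hashes[$i] = md5_file($yourArray[$i])
                              . sha1_file($yourArray[$i])
                              . filesize($yourArray[$i]);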

                            Meanwhile, of course, there is the possibility that two indistinguishable files (even pixel-for-pixel, let alone merely visually) have different hashes 🙂 - such a false negative would be much more likely than a false positive.

                            As for the hashing, I have used that approach before: storing an MD5 of each file (video) in its db record and comparing the MD5 of each incoming video against those. I didn't sweat the false negatives too much; it was mainly to prevent accidental multiple uploads.
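
                            For what it's worth, that kind of upload check is just a lookup on a stored hash column; a minimal sketch with PDO, where the videos table, its md5 column, and the $pdo / $uploadedTmpPath variables are all made-up names:

                            $md5 = md5_file($uploadedTmpPath); // hash of the incoming file
                            $stmt = $pdo->prepare('SELECT id FROM videos WHERE md5 = ?');
                            $stmt->execute(array($md5));
                            if ($stmt->fetch()) {
                                // likely a repeat upload; reject it or warn the user
                            } else {
                                // store the file and save $md5 in its db record
                            }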

                              Thank you for all your help, I accomplished what I needed to do :-)

                                I thought of the hash solution, too. If speed were an issue, I'd only grab the first KB of each file, do a pass looking for duplicates, then compare file sizes, and then, if necessary, do a full hash comparison.
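
                                Roughly, that staged check might look like this untested sketch (the 1KB prefix size and the directory path are assumptions, and for brevity the size and full hash are combined into one key rather than compared one after the other):

                                // Pass 1: group files by a hash of their first KB (cheap).
                                $byPrefix = array();
                                foreach (glob('/path/to/pictures/*.jpg') as $file) {
                                    $byPrefix[md5(file_get_contents($file, false, null, 0, 1024))][] = $file;
                                }

                                // Pass 2: only files sharing a prefix get the expensive size + full-hash check.
                                foreach ($byPrefix as $candidates) {
                                    if (count($candidates) < 2) {
                                        continue; // unique prefix, nothing to compare
                                    }
                                    $seen = array(); // "size|full-hash" => first file seen with that key
                                    foreach ($candidates as $file) {
                                        $key = filesize($file) . '|' . md5_file($file);
                                        if (isset($seen[$key])) {
                                            unlink($file); // duplicate of $seen[$key]
                                        } else {
                                            $seen[$key] = $file;
                                        }
                                    }
                                }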

                                  And if you're gonna use that first-X-KB matching approach, make sure you unset everything you no longer need, so you won't get stuck with a megabyte of memory filled up for no reason until the end of your script.

                                    13 years later

                                     Hi, I know this discussion is a bit old, but I'll still respond: if you're looking for software to find duplicate files, try the tool Easy Duplicate Finder. It helps you find and remove your duplicate files and it is easy to use.
