I was reminded of this by dalecosp's TIL post about glob().

Some of my operating practices can leave me with lots of empty directories (create directories, add files, remove files, rinse and repeat, sometimes a dozen levels deep with thousands of files), so I figured I'd automate the job of pruning a directory tree to remove empty directories. Of course, "empty directories" includes directories that contain only empty directories.

It's just a command-line script: nothing fancy (it doesn't, for example, cope with symlinks). [font=monospace]php <path>/prune-tree.php <root of tree to prune>[/font].

// Current directory if none other specified.
$root = realpath($argv[1] ?? '.') or die("Root not found");

$queues = [[$root]];
$dirs = [[]];
while(!empty($queues))
{
	foreach(array_pop($queues) as $d)
	{
		chdir($d);
		echo $d,"\n";
		$queues[] = $dirs[] = array_map('realpath', array_diff(glob('{,.}*', GLOB_ONLYDIR | GLOB_BRACE | GLOB_NOSORT), ['.','..']));
	}
}
$dirs = array_merge(...$dirs);
usort($dirs, function($a, $b)
{
	// Lazy way to ensure that child directories sort
	// before their parents.
	return strlen($b) - strlen($a);
});

echo "\n\n\n\n";

chdir($root);

$previous = '';
foreach($dirs as $dir)
{
	if(count(scandir($dir, SCANDIR_SORT_NONE)) == 2) // '.' & '..'
	{
		echo $dir,"\n";
		if(strncmp($dir, $previous, strlen($dir)) == 0)
		{
			// On Windows at least deleting empty
			// directories immediately after deleting
			// their contents is unreliable. At least
			// that's how it looks. So wait a little
			// after deleting the last child before
			// deleting the parent.
			usleep(10);
		}
		$previous = $dir;
		rmdir($dir);
	}
}
    21 days later

    <sigh>

    $root = realpath($argv[1] ?? '.') or die("Root not found");
    $queues = [[$root]];
    
    while(!empty($queues))
    {
    	foreach(array_shift($queues) as $d)
    	{
    		if(count(scandir($d, SCANDIR_SORT_NONE)) == 2)
    		{
    			while($d != $root && count(scandir($d, SCANDIR_SORT_NONE)) == 2)
    			{
    				echo $d,"\n";
    				chdir($d.'/..');
    				rmdir($d);
    				$d = realpath('.');
    			}
    		}
    		else
    		{
    			echo '.';
    			chdir($d);
    			$queues[] = array_map('realpath', array_diff(glob('{,.}*', GLOB_ONLYDIR | GLOB_BRACE | GLOB_NOSORT), ['.','..']));
    		}
    	}
    }
    
      5 months later

      So just to catch up here, I made the feedback while it's running a bit more useful.

      $root = realpath($argv[1] ?? '.') or die("Root not found");
      $queues = [[$root]];
      
      $oldlen = 0;
      while(!empty($queues))
      {
      	foreach(array_shift($queues) as $d)
      	{
      		if(count(scandir($d, SCANDIR_SORT_NONE)) == 2)
      		{
      			while($d != $root && count(scandir($d, SCANDIR_SORT_NONE)) == 2)
      			{
      				echo str_pad($d, $oldlen), "\n";
      				$oldlen = strlen($d);
      
      				chdir($d.'/..');
      				rmdir($d);
      				$d = realpath('.');
      			}
      		}
      		else
      		{
      			echo str_pad($d, $oldlen), "\r";
      			$oldlen = strlen($d);
      
      			chdir($d);
      			$queues[] = array_map('realpath', array_diff(glob('{,.}*', GLOB_ONLYDIR | GLOB_BRACE | GLOB_NOSORT), ['.','..']));
      		}
      	}
      }
      echo str_pad('', strlen($d)), "\n\n";

        I found myself going down a rabbit-hole looking at this yesterday. I'm not sure it's so complex that I can't wrap my head around it, but I'm not sure I can spare the CPU time, if you grok what I'm saying.

        One curious question though: between v1 and v2 you changed from popping to shifting the $queues array. Any particular reason for that?

        dalecosp between v1 and v2 you changed from popping to shifting the $queues array. Any particular reason for that?

        Um, nope. No particular reason on correctness grounds - it's just the difference between a depth-first (pop) and a breadth-first (shift) search. (besides: you shift a queue but pop a stack! I couldn't make up my mind.) Popping would be "faster" because the queue wouldn't need reindexing, but the slow part is actually unlinking the directory. Note the comment in the original code about Windows. Going breadth-first (especially on broad flat directory tree structures) would make the situation in that comment less likely to occur with the code as it stood then. But it doesn't happen with the later code and I'm not entirely sure why that should be (though I do have suspicions).
        [Edit: realised I never mentioned another difference I found. On one large tree, using depth-first traversal the stack got up to 12,752 items long; breadth-first on the same tree, 154,163. So that's a difference worth noting.]

        But that first version is way more complicated than it needed to be: I don't need a two-part "traverse then prune" (with a sort in between) process because the nature of the traversal means that I never end up planning to remove parent directories before their children anyway. (Since the first pass always goes "visit parent, visit child" and the second "remove child, remove parent", I can have just the one pass going "visit parent, visit child, remove child, remove parent").


        Just as a rough guide to what's going on in the later versions:
        The queue is an array of array of directory names (realpath()ed to ensure the traversal doesn't wander off and get lost). It could just be an array of directory names, and then the double loop would be a single loop, but that queues[] = ... would become an array_merge and PHP would end up copying and recopying it over and over and over (array_merge loops are slow).

        As a queue it's plain enough: If directory $d has child directories then they go on the queue. If it's empty (neither child directories nor files) then the innermost loop there kicks in. I go up to $d's parent from where I can safely remove $d (fun fact: you can rmdir('.') but you probably shouldn't), and make the parent the new $d in case we just removed its last child.

        The feedback is just a line showing the current directory that scrolls each time one is removed to produce a list of removed directories.

          Write a Reply...