
Monday, July 4, 2011

How to remove duplicate files without wasting time | TechRepublic


Linux and Open Source



By Marco Fioretti

June 30, 2011, 11:02 AM PDT

Takeaway: Marco Fioretti provides some code snippets to streamline the search and removal of duplicate files on your computer.

Duplicate files can end up on your computer in many ways. No matter how it happened, they should be removed as soon as possible. Waste is waste: why should you tolerate it? It’s not just a matter of principle: duplicates make your backups, not to mention indexing with Nepomuk or similar engines, take more time than is really necessary. So let’s get rid of them.

First, let’s find which files are duplicates

Whenever I want to find and remove duplicate files automatically, I run two scripts in sequence. The first is the one that actually finds which files are copies of each other. For this task I use this small gem by J. Elonen, pasted here for your convenience:

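   1 #! /usr/bin/perl
   2
   3 use strict;
   4 undef $/;                                  # slurp mode: read all input at once
   5 my $ALL = <>;
   6 my @BLOCKS = split (/\n\n/, $ALL);         # one block per group of lines
   7
   8 foreach my $BLOCKS (@BLOCKS) {
   9     my @I_FILE = split (/\n/, $BLOCKS);    # single lines of this block
  10     my $I;
  11     for ($I = 1; $I <= $#I_FILE; $I++) {   # skip element 0, the first line
  12         substr($I_FILE[$I], 0, 1) = '    ';   # replace '#' with four spaces
  13     }
  14     print join("\n", @I_FILE), "\n\n";
  15 }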

This code puts all the text received from standard input inside $ALL, and then splits it into @BLOCKS, using two consecutive newlines as the block separator (line 6). Each block is then split into an array of single lines (@I_FILE in line 9). Next, the first character of every element of that array except the first one is replaced by four white spaces. That character, if you’ve been paying attention, was the shell comment character, ‘#’. One space would be enough, but code indentation is nice, isn’t it?
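To make the transformation described above concrete, here is a minimal sketch of the same idea written with awk in a shell function, not the author's Perl. It assumes the input consists of blocks separated by one blank line, with every line starting with '#'. The function name dup_select, the rm commands, and the file names are all made up for illustration; the excerpt does not show the real input.

```shell
#!/bin/sh
# Sketch only: an awk equivalent of the block-by-block uncommenting step.
# Assumption: blocks are separated by a blank line and every line starts with "#".
dup_select() {
    awk '
        BEGIN { RS = ""; FS = "\n" }      # paragraph mode: one record per block
        {
            print $1                       # first line of the block stays commented
            for (i = 2; i <= NF; i++)      # other lines: swap "#" for four spaces
                print "    " substr($i, 2)
            print ""                       # blank line between blocks
        }
    '
}

# Hypothetical input: two groups of duplicate files (names are invented).
printf '%s\n' '#rm ./a/photo.jpg' '#rm ./b/photo.jpg' '' \
              '#rm ./notes.txt' '#rm ./old/notes.txt' '#rm ./tmp/notes.txt' |
    dup_select
```

With that sample input, the first line of each block keeps its '#', so that copy would survive, while the remaining lines become active, indented commands:

```
#rm ./a/photo.jpg
    rm ./b/photo.jpg

#rm ./notes.txt
    rm ./old/notes.txt
    rm ./tmp/notes.txt
```

Reviewing such output by eye before feeding it to a shell is a sensible precaution, since every uncommented line would delete a file.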

When you run this second script (I call it dup_selector.pl) on the output of the first one, here’s what you get:

Read More...
http://www.techrepublic.com/blog/opensource/how-to-remove-duplicate-files-without-wasting-time/2667

I need to try this out on an unimportant folder. I've been using FSlint to do this in the GUI, but it is very slow, and it seems to hang completely when I try to delete a few files after a scan. That is on about 800 GB of data, though...

Don