Description: Command-line tool to find and remove duplicate files.
Latest Version: 2022
Source Code: src/
AUR Page: rmdupes
Arch Forum Thread: 223750
rmdupes is a command-line utility to scan a directory for duplicate files and remove them. The main feature is an option to use a reference directory: all files in the target directory that are duplicates of files in the reference directory will be removed. If no reference directory is given, files in the target directory will be compared against each other.
There is also an option to move files to a backup directory with preserved relative paths.
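The relative-path preservation can be pictured as a relpath-and-move operation. The following Python sketch only illustrates the documented behaviour and is not rmdupes' actual code (move_to_backup is a made-up name):

import os
import shutil

def move_to_backup(path, target_dir, backup_dir):
    # Rebuild the file's path relative to the scanned target directory
    # so that the backup directory mirrors the original layout.
    rel = os.path.relpath(path, target_dir)
    dest = os.path.join(backup_dir, rel)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(path, dest)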
By default it will not delete any files without confirmation. There are options to perform a dry run, to delete without confirmation, and to automatically select one file to keep from a set of duplicates (oldest, newest, or first in alphabetical order). Inclusive and exclusive regular expression filters can also be used to fine-tune the search by e.g. file extensions or subdirectories.
There is also an option to generate shell scripts to remove the files after inspection. Before you ask why there is such an option, the answer is “because I can”.
The algorithm is naïve. All files are first grouped by file size. Files with the same size are then compared using Python’s filecmp.cmp function. By default, rmdupes uses deep comparisons and guarantees that no false positives will be reported. There is an option to enable shallow comparisons for faster execution.
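In outline, that amounts to a size-bucketing pass followed by pairwise comparisons within each bucket. The sketch below is my own illustration of that scheme, not the actual source (find_duplicates is a hypothetical name):

import filecmp
import os
from collections import defaultdict

def find_duplicates(directory, shallow=False):
    # Pass 1: bucket files by size; files of different sizes
    # cannot be duplicates of each other.
    by_size = defaultdict(list)
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)
    # Pass 2: compare files within each size bucket.
    # shallow=False makes filecmp.cmp read file contents, so a
    # reported duplicate is never a false positive; shallow=True
    # only compares os.stat signatures and is faster.
    for paths in by_size.values():
        for i, a in enumerate(paths):
            for b in paths[i + 1:]:
                if filecmp.cmp(a, b, shallow=shallow):
                    yield a, b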
A few quick tests on a large set of files with many duplicates (~4 GB of photos) ran in roughly the same time as fdupes. There are likely faster tools out there, but this one is good enough for me and the options are exactly what I need (what a coincidence!).
Remove all files from foo that are duplicates of files in bar:
rmdupes -r bar foo
bar can be a subdirectory of foo. For example, if you have organized your photos in /home/me/photos and you want to remove leftover copies from other subdirectories in your home directory, use
rmdupes -r /home/me/photos /home/me
Collect all duplicates in a backup directory instead of deleting them:
rmdupes -r /home/me/photos -b /home/me/photos.bak /home/me
Remove all duplicate files in foo with prompts for selecting files and confirming deletions:
rmdupes foo
Same as above but without the deletion confirmation dialogues:
rmdupes --noconfirm foo
Keep the oldest version of a file and automatically remove all others without any confirmation:
rmdupes --noconfirm --keep oldest foo
Check which files would be removed with --keep oldest (without deleting them):
rmdupes -n --keep oldest foo
Move all duplicates except for the newest ones to a backup directory:
rmdupes --keep newest -b backup foo
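The --keep policies boil down to picking one survivor per duplicate set by modification time or by name. A sketch of that selection under assumed naming (select_keeper is not from the source):

import os

def select_keeper(paths, policy):
    # Pick the file that survives; the rest of the set becomes
    # candidates for deletion or backup.
    if policy == 'oldest':
        return min(paths, key=os.path.getmtime)
    if policy == 'newest':
        return max(paths, key=os.path.getmtime)
    if policy == 'first':
        return min(paths)  # first in alphabetical order
    raise ValueError('unknown keep policy: %r' % policy)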
$ rmdupes --help
usage: rmdupes [-h] [-r <reference directory> [<reference directory> ...]]
[-i] [-b <backup directory>] [--restore] [--symlink {abs,rel}]
[--hardlink] [--copy] [-l] [--rename] [-n] [-v]
[--display {script,json}] [--noconfirm]
[--keep {oldest,newest,first}]
[-f <i|e><regex> [<i|e><regex> ...]] [--shallow]
<target directory> [<target directory> ...]
Prune duplicate files.
positional arguments:
<target directory> Directories to scan for duplicates.
options:
-h, --help show this help message and exit
-r <reference directory> [<reference directory> ...], --refdir <reference directory> [<reference directory> ...]
Directories of reference files. The target directory
will be scanned for duplicates of these files.
-i, --invert Invert file selection in the target directories.
Instead of selecting duplicates for removal, this will
select non-duplicates. This may be useful when using a
reference directory to limit files in a target
directory to a subset of files in the reference
directory.
-b <backup directory>, --bakdir <backup directory>
Move duplicates to a backup directory instead of
deleting them. Relative paths are preserved.
--restore Attempt to restore files to the target directory/-ies
from the backup directory. This will also restore
files which were affixed with suffixes. Perform a dry
run with this option to check that it does what you
want before using it. Use filters to restrict selected
files if necessary.
--symlink {abs,rel} Create absolute or relative symlinks when deleting
files. This does nothing with --invert.
--hardlink Try to create hardlinks when deleting files. This may
be combined with --symlink to create symlinks when
hardlinks are not possible. This does nothing with
--invert.
--copy When using a reference directory and a backup
directory, this will copy duplicates of the reference
files in the target directory to the backup directory
while preserving their relative subpaths. This can be
useful to copy a subset of a file hierarchy.
-l, --list List duplicates and exit.
--rename Rename duplicate files after their reference files.
This is useful for keeping file names synchronized in
unsynchronized directories.
-n, --dryrun Dry run. List actions on STDOUT.
-v, --verbose Increase the verbosity of logging messages. Pass once
for INFO messages, twice for DEBUG messages.
--display {script,json}
Display dryrun output in the chosen format. Implies
--dryrun.
--noconfirm Do not prompt for confirmation before deleting files.
--keep {oldest,newest,first}
Automatically select the file to keep in a set of
duplicates.
-f <i|e><regex> [<i|e><regex> ...]
Regular expression filters: prefix with "i" for
inclusive, "e" for exclusive. The patterns are applied
in order. The last one that matches determines if the
file is included or excluded. For example, to exclude
everything in a directory named "foo" except for a
subdirectory named "bar", use "-f e^foo/ i^foo/bar/".
--shallow Compare files by os.stat only. See Python's filecmp
library for details. This is faster than the default
mode which compares files by content, but may result
in false positives. If unsure, try a dry run first.
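For clarity, the last-match-wins behaviour of the -f filters described above can be modeled in a few lines of Python. This is a sketch of the documented semantics, not the implementation (is_included is a hypothetical helper):

import re

def is_included(path, filters, default=True):
    # filters is a list of strings such as "e^foo/" or "i^foo/bar/".
    # The last pattern that matches the path decides whether the
    # file is included ("i") or excluded ("e").
    included = default
    for f in filters:
        include, pattern = f[0] == 'i', f[1:]
        if re.search(pattern, path):
            included = include
    return included

# The example from the help text: exclude everything under "foo/"
# except for the subdirectory "foo/bar/".
assert not is_included('foo/baz.jpg', ['e^foo/', 'i^foo/bar/'])
assert is_included('foo/bar/baz.jpg', ['e^foo/', 'i^foo/bar/'])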
Changelog: fixed --keep oldest and --keep newest (they were switched).