IntegrityChecker java (icj): 'dupes' command
The dupes command finds duplicate files and produces a report listing them. Removing or cloning duplicates can reclaim large amounts of space: in Lloyd’s case, 700GB was saved by detecting duplicate folders of image files.
Run an update first so that hash information is current. If no update has been done, icj does its best to ignore any files of unknown status.
Additionally:
- Commands are emitted that can be copied/pasted to remove the duplicates, to clone them (cloning requires an APFS volume on macOS), or to symlink* them. These commands serve both as a report and as actual commands that can be used (or not) in Terminal.
- Duplicates can be found subject to a minimum size, e.g., --size 64K.
- Duplicates can be found subject to certain file types, including “smart” types like digital camera raw files, e.g., --types raw,jpg.
By default, only duplicate files larger than 32K are shown (this keeps the "noise" down).
* Symlinks are not generally recommended unless the files will always remain in the same relative locations on the same volume.
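As a rough illustration of the three kinds of commands, the sketch below prints (but does not run) a command for one duplicate. The emit_cmd function is hypothetical and the exact text icj emits may differ: cp -c is the macOS clone form that appears in the report excerpt in this document, while the rm and ln -sf forms are assumptions. It is written for sh/bash.

```shell
# Hypothetical helper: print (do not execute) a command that deals with one duplicate.
# $1 = mode (rm|clone|symlink), $2 = file to keep, $3 = duplicate to remove or replace.
# Paths containing double quotes, $ or backticks would need extra escaping.
emit_cmd() {
    case "$1" in
        # Remove the duplicate outright.
        rm)      printf 'rm "%s"\n' "$3" ;;
        # cp -c makes an APFS clone (macOS only); the clone overwrites the duplicate.
        clone)   printf 'cp -c "%s" "%s"\n' "$2" "$3" ;;
        # ln -sf replaces the duplicate with a symlink pointing at the kept file.
        symlink) printf 'ln -sf "%s" "%s"\n' "$2" "$3" ;;
        *)       printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

Because the function only prints, its output can be inspected first and pasted into Terminal later, which mirrors how the icj report is meant to be used.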
Excluding folders for comparison purposes
The dupes.ignore preference in .icj_prefs can be used to exclude folders from consideration. For example, to exclude the folder /Work/testing, add it to dupes.ignore:
[dupes.ignore] /Work/testing
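The effect of an ignore list can be pictured as a path test. The sketch below is an assumption about how exclusion behaves (matching on whole leading path components), not icj's actual logic, and the is_ignored function is hypothetical.

```shell
# Hypothetical sketch: succeed if path $1 lies inside any of the ignored folders ($2...).
# Matching on whole leading path components is an assumption, not documented icj behavior.
is_ignored() {
    path="$1"; shift
    for ignored in "$@"; do
        ignored="${ignored%/}"          # drop one trailing slash, if any
        case "$path" in
            "$ignored"|"$ignored"/*) return 0 ;;
        esac
    done
    return 1
}
```

With this test, /Work/testing/a.jpg is ignored when /Work/testing is on the list, while /Work/testing2/a.jpg is not.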
Command line usage
icj dupes [<path>]*
Like all commands, this one is recursive, scanning the entire folder hierarchy. If no path is specified, the current working directory is used.
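To make the recursive scan concrete, here is a minimal sketch of hash-based duplicate detection in the shell. It is not icj's implementation: the find_dupes name and the choice of SHA-256 are assumptions, and it rehashes every file on each run rather than relying on stored hash information.

```shell
# Hypothetical sketch: print "hash  path" lines for files whose content is duplicated.
# $1 = folder to scan recursively (default: current working directory).
find_dupes() {
    dir="${1:-.}"
    # Use whichever SHA-256 tool is installed (sha256sum on Linux, shasum on macOS).
    if command -v sha256sum >/dev/null 2>&1; then
        find "$dir" -type f -exec sha256sum {} +
    else
        find "$dir" -type f -exec shasum -a 256 {} +
    fi | sort | awk '
        { h = $1 "" }                   # force string comparison of hashes
        h == prev { if (!shown) print prevline; print; shown = 1 }
        h != prev { shown = 0 }
        { prev = h; prevline = $0 }'
}
```

Sorting places identical hashes next to each other, so the awk step only has to compare each line with the previous one. For example, find_dupes Master would scan the folder Master within the current directory.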
--size option
The --size option specifies a minimum file size below which files are ignored, e.g., --size 64K.
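A size such as 64K has to be turned into a byte count before it can be compared with file sizes. The sketch below shows one way to do that; it assumes binary multiples (K = 1024) and K/M/G suffixes, which may not match what icj actually accepts, and the to_bytes function is hypothetical.

```shell
# Hypothetical sketch: convert a size such as 64K into bytes.
# Assumes K = 1024, M = 1024*1024, G = 1024*1024*1024; icj's own rules may differ.
to_bytes() {
    num="${1%[KkMmGg]}"                 # numeric part with any suffix removed
    case "$num" in
        ''|*[!0-9]*) printf 'invalid size: %s\n' "$1" >&2; return 1 ;;
    esac
    case "$1" in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}
```

Under these assumptions, to_bytes 64K gives 65536 and the default threshold of 32K corresponds to 32768 bytes.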
--types option
The --types option specifies one or more file extensions, for example "txt", "jpg", "raw". Types are case insensitive.
The following types are special “smart” types:
- --types raw specifies all known raw file types, e.g., DNG, ARW, CR2, NEF, etc.
- --types jpg specifies ".jpg" and ".jpeg" files (case insensitive).
More than one type can be specified with a comma (no spaces!), e.g., --types doc,docx,rtf,txt,html.
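The matching rules above (case-insensitive extensions, comma-separated lists, and "smart" types) can be sketched as follows. The matches_types function is hypothetical, and its raw list contains only the four extensions named in this document, whereas icj knows more. It is written for sh/bash.

```shell
# Hypothetical sketch: succeed if file $1 matches the comma-separated type list $2.
matches_types() {
    case "$1" in
        *.*) ;;                         # has an extension, keep going
        *)   return 1 ;;                # no extension, nothing to match
    esac
    # Lower-case both sides so that matching is case insensitive.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    types=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | tr ',' ' ')
    for t in $types; do
        case "$t" in
            raw) candidates="dng arw cr2 nef" ;;   # "smart" type (partial list)
            jpg) candidates="jpg jpeg" ;;          # "smart" type: both spellings
            *)   candidates="$t" ;;
        esac
        for candidate in $candidates; do
            [ "$ext" = "$candidate" ] && return 0
        done
    done
    return 1
}
```

For instance, matches_types photo.JPEG RAW,jpg succeeds because jpg covers both spellings, while matches_types notes.txt raw,jpg fails.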
--emit=<rm|clone|symlink|nop> option
By default, icj emits commands to remove or clone duplicate files. These commands can be pasted into Terminal for execution. Use some caution, because icj cannot know which files you would prefer to keep (though it applies some logic).
Using --emit nop suppresses emission of such commands.
Examples
Lines starting with "#" are comments.
# Show duplicate files in current working directory
icj dupes
# Show duplicate files in current working directory, don't emit any commands for dealing with them
icj dupes --
# Show duplicate files on volume Master (or folder Master within current directory):
icj dupes Master
# Show duplicate RAW and jpeg files in Master (jpg includes .jpg and .jpeg)
icj dupes Master --types RAW,jpg
# Show duplicate files of all types of at least 64K in size on all mounted volumes
icj dupes --size 64K /Volumes/*
# Show duplicate files at least 4K in size of type ".txt" and ".html" on volume Master
icj dupes --size 4K --types txt,html Master
Special note on using the clone feature
Clones require an APFS volume on macOS.
When run, icj dupes will emit a report that includes appropriate commands.
To generate commands suitable for making clones for duplicate files, use the --emit clone option, like this (append the folder or volume name to operate on):
icj dupes --emit clone
Cloning files immediately reclaims the disk space used by all duplicates except one. After cloning, all files look and behave the same, and there is no way to tell whether a file is a clone (even with the Finder or other programs). Therefore, icj will report the duplicates all over again! But there is no harm in re-cloning; it just won't reclaim space that is already reclaimed.
The key decision is whether a clone is desirable, versus just removing the duplicate files. Sometimes you want a duplicate. Other times it is just a mistake. But the beauty of clones is that the decision can be deferred, and the space immediately reclaimed.
If you wish to remove the duplicate files instead, please note that while icj makes a very intelligent guess at which of the duplicates is the best one to keep, that is ultimately your own call:
icj dupes --emit rm
The report includes comment lines which start with "#". All the lines (including the comment lines) can be pasted directly into a Terminal window to execute them.
# 72177 bytes /Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-one/publish/js/jquery-1.4.2.min.js /Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-workstation/publish/js/jquery-1.4.2.min.js /Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-laptop/publish/js/jquery-1.4.2.min.js /Volumes/Master/diglloyd/DOMAINS/MPG/_diglloydTools/publish/js/jquery-1.4.2.min.js
cp -c "/Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-one/publish/js/jquery-1.4.2.min.js" "/Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-workstation/publish/js/jquery-1.4.2.min.js"
cp -c "/Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-one/publish/js/jquery-1.4.2.min.js" "/Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-laptop/publish/js/jquery-1.4.2.min.js"
cp -c "/Volumes/Master/diglloyd/DOMAINS/MPG/_defunct/_mpg-pro-one/publish/js/jquery-1.4.2.min.js" "/Volumes/Master/diglloyd/DOMAINS/MPG/_diglloydTools/publish/js/jquery-1.4.2.min.js"
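Since a report in this form is meant to be pasted into Terminal, it can be worth summarizing it before running anything. The sketch below assumes the report has been saved to a text file; the summarize_report function is hypothetical and simply counts comment lines and command lines without executing them.

```shell
# Hypothetical sketch: summarize a saved report before running it.
# $1 = file containing the report; nothing in it is executed.
summarize_report() {
    report="$1"
    # Lines starting with "#" are comments describing each group of duplicates.
    comments=$(grep -c '^#' "$report" || true)
    # Every other non-blank line is a command that would be run.
    commands=$(grep -v '^#' "$report" | grep -c '[^[:space:]]' || true)
    printf '%s comment lines, %s command lines\n' "$comments" "$commands"
}
```

For the excerpt above, which has one comment line followed by three cp -c commands, this would report 1 comment line and 3 command lines.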
Copyright © 2022 diglloyd Inc, all rights reserved