Thursday·30·August·2012
Finding similar but not identical files //at 17:10 //by abe
There are quite a few tools in Debian for finding duplicate files (Ua is not even packaged for Debian!!!1!eleven! SCNR; via Chrütertee). Depending on the task, I use hardlink (see this blog posting), fdupes (if I need output with all identical files on one line; see example below), or duff (if it has to be fast).
But for code deduplication in historically grown code you sometimes need a tool that finds not only identical files, but also files that differ only in a few blanks or blank lines.
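To illustrate what "differ only in a few blanks or blank lines" means in practice, here is a minimal sketch that hashes every file after throwing away whitespace and then groups files with equal hashes. The function names normalized_hash and list_whitespace_duplicates are made up for this example, and the approach is deliberately crude: it also deletes blanks inside strings, it assumes filenames without newlines, and it relies on the GNU uniq options -w and --all-repeated.

```shell
#!/bin/sh
# Hypothetical helper: checksum of a file's content with all spaces, tabs
# and carriage returns deleted and the resulting empty lines dropped.
normalized_hash() {
    tr -d ' \t\r' < "$1" | grep -v '^$' | sha256sum | cut -d' ' -f1
}

# Hypothetical helper: print "hash  filename" for every file below the
# given directory (default: current directory) and keep only those lines
# whose first 64 characters (the SHA-256 hex digest) occur more than once.
# Groups of matching files are separated by an empty line.
list_whitespace_duplicates() {
    find "${1:-.}" -type f | while IFS= read -r f; do
        printf '%s  %s\n' "$(normalized_hash "$f")" "$f"
    done | sort | uniq -w64 --all-repeated=separate
}
```

Calling list_whitespace_duplicates . in a source tree prints the groups of files that are identical apart from whitespace. Files that differ in anything else, even a single character, are not reported, which is where the similarity tools come in.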
I found two tools in Debian which can give you some kind of percentage of similarity: simhash (which is btw. orphaned; upstream homepage) and similarity-tester (upstream homepage).
simhash has the shorter name and hence sounds more usable on the command-line. But it seems to be able to compare only two files at a time, and only after first computing each file's similarity hash and writing it to a file. Not really usable for those one-liner cases on the command-line.
similarity-tester has the longer name (and one which made me suspect that it may be a GUI tool), but provides what I was looking for:
$ find . -type f | sim_text -ipTt 75
This lists all files in the current directory which have at least 75% (“-t 75”) in common with another file in the list of files. The option “-i” causes sim_text to read the names of the files to compare from standard input; “-p” causes sim_text to just output the similarity percentage; and “-T” suppresses the per-file list of found tokens.
I used similarity-tester’s “sim_text” tool, which compares natural-language text, because most of the files I had to test are shell scripts. But similarity-tester also provides tools to test the similarity of code in specific programming languages, namely C, Java, Pascal, Modula-2, Lisp and Miranda.
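Because “-i” makes sim_text take its file list from standard input, that list can be trimmed before the comparison. The following is only a sketch: candidate_files is a hypothetical helper, and skipping .git and .svn directories as well as empty files is an assumption about what usually should not be compared, not something sim_text requires. It also assumes filenames without newlines, since the names are passed one per line.

```shell
#!/bin/sh
# Hypothetical helper: list the regular, non-empty files below the given
# directory (default: current directory), without descending into
# version control metadata directories.
candidate_files() {
    find "${1:-.}" \( -name .git -o -name .svn \) -prune -o -type f -size +0c -print
}

# Intended use, with the same sim_text options as above:
#   candidate_files . | sim_text -ipTt 75
```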
Example output from the xen-tools project (after I already did a lot of code deduplication):
./intrepid/30-disable-gettys consists for 100 % of ./edgy/30-disable-gettys material
./edgy/30-disable-gettys consists for 100 % of ./intrepid/30-disable-gettys material
./common/90-make-fstab-rpm consists for 98 % of ./centos-5/90-make-fstab material
./centos-5/90-make-fstab consists for 98 % of ./common/90-make-fstab-rpm material
./gentoo/55-create-dev consists for 91 % of ./dapper/55-create-dev material
./dapper/55-create-dev consists for 90 % of ./gentoo/55-create-dev material
./gentoo/55-create-dev consists for 88 % of ./common/55-create-dev material
./common/90-make-fstab-deb consists for 87 % of ./common/90-make-fstab-rpm material
./common/90-make-fstab-rpm consists for 85 % of ./common/90-make-fstab-deb material
./common/30-disable-gettys consists for 81 % of ./karmic/30-disable-gettys material
./intrepid/80-install-kernel consists for 78 % of ./edgy/80-install-kernel material
./edgy/30-disable-gettys consists for 76 % of ./karmic/30-disable-gettys material
./karmic/30-disable-gettys consists for 76 % of ./edgy/30-disable-gettys material
./common/50-setup-hostname-rpm consists for 76 % of ./gentoo/50-setup-hostname material
Depending on the length of the filenames and the number of files, the sim_text output can be made more readable with the column utility from the bsdmainutils package, and reversed with tac from the coreutils package:
$ find . -type f | sim_text -ipTt 75 | tac | column -t
./common/50-setup-hostname-rpm  consists  for  76   %  of  ./gentoo/50-setup-hostname    material
./karmic/30-disable-gettys      consists  for  76   %  of  ./edgy/30-disable-gettys      material
./edgy/30-disable-gettys        consists  for  76   %  of  ./karmic/30-disable-gettys    material
./intrepid/80-install-kernel    consists  for  78   %  of  ./edgy/80-install-kernel      material
./common/30-disable-gettys      consists  for  81   %  of  ./karmic/30-disable-gettys    material
./common/90-make-fstab-rpm      consists  for  85   %  of  ./common/90-make-fstab-deb    material
./common/90-make-fstab-deb      consists  for  87   %  of  ./common/90-make-fstab-rpm    material
./gentoo/55-create-dev          consists  for  88   %  of  ./common/55-create-dev        material
./dapper/55-create-dev          consists  for  90   %  of  ./gentoo/55-create-dev        material
./gentoo/55-create-dev          consists  for  91   %  of  ./dapper/55-create-dev        material
./centos-5/90-make-fstab        consists  for  98   %  of  ./common/90-make-fstab-rpm    material
./common/90-make-fstab-rpm      consists  for  98   %  of  ./centos-5/90-make-fstab      material
./edgy/30-disable-gettys        consists  for  100  %  of  ./intrepid/30-disable-gettys  material
./intrepid/30-disable-gettys    consists  for  100  %  of  ./edgy/30-disable-gettys      material
Compared to that, fdupes only finds the two 100% identical files:
$ fdupes -r1 .
./intrepid/30-disable-gettys ./edgy/30-disable-gettys
But fdupes already helped me a lot in finding the first bunch of identical files in xen-tools. :-)
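One thing that stands out in the sim_text output from xen-tools is that most pairs show up twice, once in each direction and sometimes with slightly different percentages. As a sketch of how the list could be condensed, the hypothetical filter unique_pairs below keeps one line per unordered pair of files, with the higher of the two percentages. It assumes the "A consists for N % of B material" line format shown above and filenames without whitespace.

```shell
#!/bin/sh
# Hypothetical filter: read sim_text percentage lines on standard input and
# print each unordered pair of files once, with its highest percentage,
# sorted by descending percentage.
unique_pairs() {
    awk '
        $2 == "consists" && $3 == "for" {
            a = $1 ""; b = $7 ""            # force string comparison
            key = (a < b) ? a SUBSEP b : b SUBSEP a
            if (!(key in best) || $4 + 0 > best[key])
                best[key] = $4 + 0
        }
        END {
            for (key in best) {
                split(key, f, SUBSEP)
                printf "%3d %%  %s  %s\n", best[key], f[1], f[2]
            }
        }
    ' | sort -rn
}

# Intended use:
#   find . -type f | sim_text -ipTt 75 | unique_pairs
```

The two names of a pair are ordered alphabetically before they are used as the key, so both directions of a comparison end up in the same entry.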
Tagged as: bsdmainutils, C, cleanup, column, coreutils, Debian, deduplication, duff, duplicates, fdupes, find, hardlink, Java, Lisp, Miranda, Modula-2, Ohloh, Pascal, recursive, sim_text, similarity-tester, simhash, similarity, tac, UUUT, xen-tools