
Thursday·30·August·2012

Finding similar but not identical files //at 17:10 //by abe

from the whitespace-change dept.

There are quite a few tools for finding duplicate files in Debian (Ua is not even packaged for Debian!!!1!eleven! SCNR; via Chrütertee), and depending on the task I use either hardlink (see this blog posting), fdupes (if I need output with all identical files on one line; see the example below), or duff (if it has to be fast).

But for code deduplication in historically grown code you sometimes need a tool which does not only find identical files, but also those which just differ in a few blanks or blank lines.
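For the narrow case of files that differ only in whitespace, a rough first pass is possible with coreutils alone. The following is a minimal sketch, not one of the tools discussed here: `ws_dupes` is a hypothetical helper name, it relies on GNU `uniq` options, and it strips all whitespace, so it would also treat `a b` and `ab` as equal. Filenames containing newlines are not handled.

```shell
# Hypothetical helper (the name is an assumption, not part of any package).
# Prints groups of files whose content is identical once all whitespace
# (blanks, tabs, newlines) has been removed. Groups are separated by an
# empty line.
ws_dupes() {
    find "${1:-.}" -type f -exec sh -c '
        for f in "$@"; do
            # Hash the whitespace-free content and print "hash  filename".
            printf "%s  %s\n" "$(tr -d "[:space:]" < "$f" | md5sum | cut -d" " -f1)" "$f"
        done' sh {} + |
    sort |
    # GNU uniq: compare only the 32-character MD5 hash, keep repeated lines.
    uniq -w32 --all-repeated=separate
}
```

Called as `ws_dupes .`, it prints one block of "hash  filename" lines per group of whitespace-equivalent files. Anything beyond whitespace differences needs a real similarity tool.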

I found two tools in Debian which can give you some kind of percentage of similarity: simhash (which is btw. orphaned; upstream homepage) and similarity-tester (upstream homepage).

simhash has the shorter name and hence sounds more usable on the command line. But it seems to be able to compare only two files at a time, and only after first computing each file's similarity hash and writing it to a file. Not really usable for those one-liner cases on the command line.

similarity-tester has the longer name (and one which made me suspect that it may be a GUI tool), but provides what I was looking for:

$ find . -type f | sim_text -ipTt 75

This lists all files below the current directory which have at least 75% (“-t 75”) in common with another file in the list of files. The option “-i” causes sim_text to read the names of the files to compare from standard input; “-p” causes sim_text to just output the similarity percentage; and “-T” suppresses the per-file list of found tokens.

I used similarity-tester’s “sim_text” tool, which is meant for comparing natural language, because most of the files I had to test are shell scripts. But similarity-tester also provides tools to test the similarity of code in specific programming languages, namely C, Java, Pascal, Modula-2, Lisp and Miranda.

Example output from the xen-tools project (after I already did a lot of code deduplication):

./intrepid/30-disable-gettys consists for 100 % of ./edgy/30-disable-gettys material
./edgy/30-disable-gettys consists for 100 % of ./intrepid/30-disable-gettys material
./common/90-make-fstab-rpm consists for 98 % of ./centos-5/90-make-fstab material
./centos-5/90-make-fstab consists for 98 % of ./common/90-make-fstab-rpm material
./gentoo/55-create-dev consists for 91 % of ./dapper/55-create-dev material
./dapper/55-create-dev consists for 90 % of ./gentoo/55-create-dev material
./gentoo/55-create-dev consists for 88 % of ./common/55-create-dev material
./common/90-make-fstab-deb consists for 87 % of ./common/90-make-fstab-rpm material
./common/90-make-fstab-rpm consists for 85 % of ./common/90-make-fstab-deb material
./common/30-disable-gettys consists for 81 % of ./karmic/30-disable-gettys material
./intrepid/80-install-kernel consists for 78 % of ./edgy/80-install-kernel material
./edgy/30-disable-gettys consists for 76 % of ./karmic/30-disable-gettys material
./karmic/30-disable-gettys consists for 76 % of ./edgy/30-disable-gettys material
./common/50-setup-hostname-rpm consists for 76 % of ./gentoo/50-setup-hostname material

Depending on the length of the filenames and the number of files, this can be made more readable using the column utility from the bsdmainutils package, and reversed using tac from the coreutils package:

$ find . -type f | sim_text -ipTt 75 | tac | column -t
./common/50-setup-hostname-rpm  consists  for  76   %  of  ./gentoo/50-setup-hostname    material
./karmic/30-disable-gettys      consists  for  76   %  of  ./edgy/30-disable-gettys      material
./edgy/30-disable-gettys        consists  for  76   %  of  ./karmic/30-disable-gettys    material
./intrepid/80-install-kernel    consists  for  78   %  of  ./edgy/80-install-kernel      material
./common/30-disable-gettys      consists  for  81   %  of  ./karmic/30-disable-gettys    material
./common/90-make-fstab-rpm      consists  for  85   %  of  ./common/90-make-fstab-deb    material
./common/90-make-fstab-deb      consists  for  87   %  of  ./common/90-make-fstab-rpm    material
./gentoo/55-create-dev          consists  for  88   %  of  ./common/55-create-dev        material
./dapper/55-create-dev          consists  for  90   %  of  ./gentoo/55-create-dev        material
./gentoo/55-create-dev          consists  for  91   %  of  ./dapper/55-create-dev        material
./centos-5/90-make-fstab        consists  for  98   %  of  ./common/90-make-fstab-rpm    material
./common/90-make-fstab-rpm      consists  for  98   %  of  ./centos-5/90-make-fstab      material
./edgy/30-disable-gettys        consists  for  100  %  of  ./intrepid/30-disable-gettys  material
./intrepid/30-disable-gettys    consists  for  100  %  of  ./edgy/30-disable-gettys      material
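Most pairs show up twice in this listing, once in each direction and often with slightly different percentages. As a rough sketch, the output can be collapsed to one line per pair with awk. `sim_pairs` is a hypothetical helper name; it assumes the line format shown above ("FILE consists for N % of FILE material") and filenames without whitespace, and it keeps the higher of the two percentages, which is a design choice rather than anything sim_text prescribes.

```shell
# Hypothetical filter (the name is an assumption): reads sim_text -p style
# lines on stdin and prints each unordered file pair once, with the higher
# of its percentages, in order of first appearance.
sim_pairs() {
    awk '
    {
        a = $1; b = $7; pct = $4 + 0
        # Build a direction-independent key by ordering the two names
        # (the appended "" forces a string comparison).
        key = (a "" < b "") ? a SUBSEP b : b SUBSEP a
        if (!(key in best)) order[++n] = key
        if (!(key in best) || pct > best[key]) best[key] = pct
    }
    END {
        for (i = 1; i <= n; i++) {
            split(order[i], p, SUBSEP)
            printf "%3d %%  %s  %s\n", best[order[i]], p[1], p[2]
        }
    }'
}
```

It can be appended to the pipeline, for example `find . -type f | sim_text -ipTt 75 | sim_pairs`, and also works on the `column -t` output, since awk splits on any run of whitespace.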

Compared to that, fdupes only finds the two 100% identical files:

$ fdupes -r1 . 
./intrepid/30-disable-gettys ./edgy/30-disable-gettys 

But fdupes already helped me a lot in finding the first bunch of identical files in xen-tools. :-)

Comments

Re: Finding similar but not identical files

Posted by: Anonymous
Website: 
Time: Tue, 05 Jun 2012 22:25

While it doesn't work as a standalone tool, git has mechanisms for detecting similarity as well, and it even provides a similarity percentage when saying how much a file has changed.



About...

This is the blog or weblog of Axel Stefan Beckert (aka abe or XTaran) who thought, he would never start blogging... (He also once thought, that there is no reason to switch to this new ugly Netscape thing because Mosaïc works fine. That was about 1996.) Well, times change...

He was born 1975 at Villingen-Schwenningen, made his Abitur at Schwäbisch Hall, studied Computer Science with minor Biology at University of Saarland at Saarbrücken (Germany) and now lives in Zürich (Switzerland), working at the IT Support Group (ISG) of the Departement of Physics at ETH Zurich.

Times are CET or CEST respectively (which means GMT +0100 or +0200).


This content is licensed under a Creative Commons License (SA 3.0 DE). Some rights reserved.