
Tuesday·05·June·2012

Automatically hardlinking duplicate files under /usr/share/doc with APT //at 20:43 //by abe

from the no-space-left-on-device dept.

On my everyday netbook (a very reliable first generation ASUS EeePC 701 4G) the disk (4 GB as the product name suggests :-) is nearly always close to full.

TL;DWTR? Jump directly to the HowTo. :-)

So I came up with a few techniques to save some more disk space. Installing localepurge was one of the earliest. Another was to implement aptitude filters to do interactively what deborphan does non-interactively. Yet another is to use du and friends a lot; by now, ncdu is definitely my favourite du-like tool.

Using du and friends I often noticed how much disk space /usr/share/doc takes up. But since I value the contents of /usr/share/doc a lot, I condemn how Nokia solved that on the N900: They let APT delete all files and directories under /usr/share/doc (including the copyright files!) via some package named docpurge. I also dislike Ubuntu’s “solution” to truncate the shipped changelog files (you can still get the remainder of the files on the web somewhere) as they’re an important source of information for me.

So when aptitude showed me that some package suddenly wanted to use up quite some more disk space, I noticed that the new package version included the upstream changelog twice. So I started searching for duplicate files under /usr/share/doc.

There are quite a few tools in Debian for finding duplicate files. hardlink seemed the most appropriate for this case.

First I just looked for duplicate files per package. Even on the less than four gigabytes of installation on my EeePC, this turned up nine packages which shipped at least one file twice.
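A per-package search like this can be sketched in plain shell. This is not the command used for the post, only a minimal sketch that assumes GNU coreutils (md5sum and the -w and --all-repeated options of uniq); find_dupes is a made-up helper name.

```shell
#!/bin/sh
# List groups of byte-identical, non-empty files below a directory.
# Assumes GNU coreutils; find_dupes is a hypothetical helper name.
find_dupes() {
    find "${1:-.}" -type f ! -empty -exec md5sum {} + |
        sort |
        # An MD5 digest is 32 hex characters: compare only that prefix
        # and print every line whose digest occurs more than once.
        uniq -w32 --all-repeated=separate
}
```

Run once per package directory, for example with for d in /usr/share/doc/*/; do find_dupes "$d"; done, it prints each group of identical files separated by a blank line.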

As recommended, I opted for a corresponding Lintian check instead (see bugs). Niels Thykier kindly implemented such a check in Lintian, and its findings are reported as the tags “duplicate-changelog-files” (Severity: normal, from Lintian 2.5.2 on) and “duplicate-files” (Severity: minor, experimental, from Lintian 2.5.0 on).

Nevertheless, some source packages generate several binary packages and all of them (of course) ship the same, in some cases quite large (Debian) changelog file. So I found myself running hardlink /usr/share/doc now and then to gain some more free disk space. But as I run Sid and package upgrades happen more than daily, I came to the conclusion that I should run this command more or less after each aptitude run, i.e. automatically.

Taking localepurge’s APT hook as an example, I added the following content as /etc/apt/apt.conf.d/98-hardlink-doc to my system:

// Hardlink identical docs, changelogs, copyrights, examples, etc

DPkg
{
    Post-Invoke {"if [ -x /usr/bin/hardlink ]; then /usr/bin/hardlink -t /usr/share/doc; else exit 0; fi";};
};
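
Before enabling such a hook, it may be worth previewing what hardlink would do. The following is a minimal sketch, not part of the original setup: preview_hardlink is a made-up function name, and it relies on hardlink's -n (dry run) option in addition to the -t (ignore timestamps) option used in the hook.

```shell
#!/bin/sh
# Preview the hook's effect without changing any files.
# preview_hardlink is a hypothetical helper; -n asks hardlink for a dry run.
preview_hardlink() {
    if [ -x /usr/bin/hardlink ]; then
        /usr/bin/hardlink -n -t "${1:-/usr/share/doc}"
    else
        # Same guard as in the hook: a missing tool is not an error.
        echo "hardlink is not installed, nothing to preview" >&2
        return 0
    fi
}
```

The guard mirrors the hook above: if a Post-Invoke command exits non-zero, APT reports an error, so a missing hardlink binary is treated as "nothing to do".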

So now installing a package which contains duplicate files looks like this:

~ # aptitude install perl-tk
The following NEW packages will be installed:
  perl-tk 
0 packages upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 2,522 kB of archives. After unpacking 6,783 kB will be used.
Get: 1 http://ftp.ch.debian.org/debian/ sid/main perl-tk i386 1:804.029-1.2 [2,522 kB]
Fetched 2,522 kB in 1s (1,287 kB/s)  
Selecting previously unselected package perl-tk.
(Reading database ... 121849 files and directories currently installed.)
Unpacking perl-tk (from .../perl-tk_1%3a804.029-1.2_i386.deb) ...
Processing triggers for man-db ...
Setting up perl-tk (1:804.029-1.2) ...
Mode:     real
Files:    15423
Linked:   3 files
Compared: 14724 files
Saved:    7.29 KiB
Duration: 4.03 seconds
localepurge: Disk space freed in /usr/share/locale: 0 KiB
localepurge: Disk space freed in /usr/share/man: 0 KiB
localepurge: Disk space freed in /usr/share/gnome/help: 0 KiB
localepurge: Disk space freed in /usr/share/omf: 0 KiB

Total disk space freed by localepurge: 0 KiB

Sure, that wasn’t the most space-saving example, but on some installations I saved around 100 MB of disk space that way, and I still haven’t found a case where this caused unwanted damage. (Use this advice at your own risk, though. Pointers to potential problems are welcome. :-)
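
To get a rough idea of how much of /usr/share/doc is currently shared, one can count the files whose link count is greater than one. This is a minimal sketch with a made-up function name (count_linked); it assumes that no file name contains a newline.

```shell
#!/bin/sh
# Count regular files below a directory that share their inode with at
# least one other name (link count > 1). count_linked is a hypothetical
# helper name; file names containing newlines would be miscounted.
count_linked() {
    find "${1:-/usr/share/doc}" -type f -links +1 | wc -l | tr -d ' '
}
```

Since GNU du counts each inode only once per run, comparing du -sh /usr/share/doc before and after a hardlink run shows the space actually gained.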

Comments

Re: Automatically hardlinking duplicate files under /usr/share/doc with APT

Posted by: Fernando
Time: Sat, 14 Apr 2012 21:15

Wouldn't that solution mean that if the duplicate files stop being the same (i.e., they stop being created from the same source package, or in case of a binNMU), the most recent copy will overwrite? Unless dpkg rm's the file first, I think that would be the case.

In that sense, a better solution would be deduplication. I think there's a FUSE filesystem (or 10) for that.


Re: Automatically hardlinking duplicate files under /usr/share/doc with APT

Posted by: Julian Andres Klode
Website: mailto:jak@debian.org
Time: Sun, 15 Apr 2012 11:21

You might want to pass -m to hardlink, so it maximises the link count (it then always replaces the file with the lowest link count).

As a tip: Instead of localepurge, you can also use dpkg's integrated filtering:

path-exclude=/usr/share/locale/*
path-include=/usr/share/locale/de*/*
path-include=/usr/share/locale/en*/*
path-include=/usr/share/locale/locale.alias

# We do not want translated manual pages.
path-exclude=/usr/share/man/*
path-include=/usr/share/man/man[1-9]/*

This way, dpkg should know about the modifications, and the files do not disappear.

Fernando: No. dpkg replaces files by first creating a new temporary file and then renaming the file to the final name. This means that if a package is upgraded, the de-duplicated files first all become duplicated new files. Then hardlink gets run again, de-duplicating the new files.
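
The write-then-rename behaviour described here can be tried out in a scratch directory. This is only a simulation with plain shell tools, not dpkg itself; demo_replace and the file names are made up for illustration.

```shell
#!/bin/sh
# Simulate "write a temporary file, then rename it over the old name"
# on a hardlinked pair. demo_replace is a hypothetical helper name.
demo_replace() {
    dir=$1
    printf 'old changelog\n' > "$dir/a"
    ln "$dir/a" "$dir/b"                  # a and b now share one inode
    printf 'new changelog\n' > "$dir/a.dpkg-new"
    mv "$dir/a.dpkg-new" "$dir/a"         # rename: only the name a changes
    cat "$dir/b"                          # b still holds the old content
}
```

After the rename, a is a fresh file with the new content and b keeps the old content under its own inode, so nothing is overwritten through the shared link.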


About...

This is the blog or weblog of Axel Stefan Beckert (aka abe or XTaran), who thought he would never start blogging... (He also once thought that there was no reason to switch to this new ugly Netscape thing because Mosaïc worked fine. That was about 1996.) Well, times change...

He was born in 1975 in Villingen-Schwenningen, took his Abitur in Schwäbisch Hall, studied Computer Science with a minor in Biology at the University of Saarland in Saarbrücken (Germany), and now lives in Zürich (Switzerland), working in the Network Security Group (NSG) of the Central IT Services (Informatikdienste) at ETH Zurich.

Times are CET or CEST (that is, GMT +0100 or +0200, respectively).


This content is licensed under a Creative Commons License (SA 3.0 DE). Some rights reserved.