Stoppt die Vorratsdatenspeicherung! Jetzt klicken &handeln! Willst du auch an der Aktion teilnehmen? Hier findest du alle relevanten Infos und Materialien:
Jump to menu and information about this site.

Saturday·17·November·2012

deepgrep: grep nested archives with one command //at 02:00 //by abe

from the grep-revisited dept.

Several months ago, I wrote about grep everything and listed grep-like tools which can grep through compressed files or specific data formats. The blog posting sparked several magazine articles and talks by Frank Hofmann and me.

Frank recently noticed that we though missed one more or less mighty tool so far. We missed it, because it’s mostly unknown, undocumented and hidden behind a package name which doesn’t suggest a real recursive “grep everything”:

deepgrep

deepgrep is part of the Debian package strigi-utils, a package which contains utilities related to the KDE desktop search Strigi.

deepgrep especially eases the searching through tar balls, even nested ones, but can also search through zip files and OpenOffice.org/LibreOffice documents (which are actually zip files).

deepgrep seems to support at least the following archive and compression formats:

  • tar
  • ar, and hence deb
  • rpm (but not cpio)
  • gzip/gz
  • bzip2/bz2
  • zip, and hence jar/war and OpenOffice.org/LibreOffice documents
  • MIME messages (i.e. files attached to e-mails)

A search in an archive which is deeply nested looks like this:

$ deepgrep bar foo.ar
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:foobar
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:bar

deepgrep though neither seems to support any LZMA based compression (lzma, xz, lzip, 7z), nor does it support lzop, rzip, compress (.Z suffix), cab, cpio, xar, or rar.

Further current drawbacks of deepgrep:

  • Nearly no commandline options, especially none of the common grep options
  • No man-page or other documentation
  • Exit code not related to search results, you have to check the output to see if something has been found

deepfind

If you just need the file names of the files in nested archives, the package also contains the tool deepfind which does nothing else than to list all files and directories in a given set of archives or directories:

$ deepfind foo.ar
foo.ar
foo.ar/foo.tar
foo.ar/foo.tar/foo.tar.gz
foo.ar/foo.tar/foo.tar.gz/foo.zip
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt

As with deepgrep, deepfind does not implement any common options of it’s normal sister tool find.

[The following part has been added on 17-Nov-2012]

As with deepgrep, it also doesn’t seem to support any of the more modern or more exotic compression formats, i.e. it fails on modern debian binary packages which use xz compression on the data part:

deepfind xulrunner-18.0_18.0\~a2+20121109042012-1_amd64.deb
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/debian-binary
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/triggers
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/preinst
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/md5sums
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/postinst
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/control
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/data.tar.xz

[End of part added at 17-Nov-2012]

Dependencies

The package strigi-utils doesn’t pull in the complete Strigi framework (i.e. no daemon), just a few libraries (libstreams, libstreamanalyzer, and libclucene). On Wheezy it also pulls in some audio/video decoding libraries which may make some server administrators less happy.

Conclusion

Both tools are quite limited to some basic use cases, but can be worth a fortune if you have to work with nested archives. Nevertheless the claim in the Debian package description of strigi-utils that they’re “enhanced” versions of their well known counterparts is IMHO disproportionate.

Most of the missing features and documentation can be explained by the primary purpose of these tools: Being backend for desktop searches. I guess, there wasn’t much need for proper commandline usage yet. Until now. ;-)

42.zip

And yes, I was curious enough to let deepfind have a look at 42.zip (the one from SecurityFocus, unzip seems not able to unpack 42.zip from unforgettable.dk due a missing version compatibility) and since it just traverses the archive sequentially, it has no problem with that, needing just about 5 MB of RAM and a lot of time:

[…]
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page e.zip
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page e.zip/0.dll
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page f.zip
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page f.zip/0.dll
deepfind 42.zip  11644.12s user 303.89s system 97% cpu 3:24:02.46 total

I though won’t try deepgrep on 42.zip. ;-)

Comments

Re: deepgrep: grep nested archives with one command

Posted by: Sukhbir Mehla
Website: http://www.zeemo.com.au
Time: Tue, 14 Jan 2014 01:29

Thank you for sharing. I read the 'grep everything' post and found it very insightful, so was happy when I saw this next post.

I'm also with Trey Smith in that I would love an additional resource to get more insight.

Reply

Your Comment

Spam Protection: To post a comment, you'll have to answer the following question: What is 42 minus 19?

Name:
URL or E-Mail: [http://... or mailto:you@example.com] (optional)
Title: (optional)
Spam Protection Answer:
Comment:

Tag Cloud

2CV, aha, Apache, APT, aptitude, ASUS, Automobiles, autossh, Berlin, bijou, Blogging, Blosxom, Blosxom Plugin, Browser, BSD, CDU, Chemnitz, Citroën, CLI, CLT, Conkeror, CX, deb, Debian, Doofe Parteien, E-Mail, eBay, EeePC, Emacs, Epiphany, Etch, ETH Zürich, Events, Experimental, Firefox, Fläsch, FreeBSD, FVWM, Galeon, Gecko, git, GitHub, GNOME, GNU, GNU Coreutils, GNU Screen, Google, GPL, grep, grml, gzip, Hackerfunk, Hacks, Hardware, Heise, HTML, identi.ca, IRC, irssi, Jabber, JavaShit, Kazehakase, Lenny, Liferea, Linux, LinuxTag, LUGS, Lynx, maol, Meme, Microsoft, Mozilla, Music, mutt, Myon, München, nemo, Nokia, nuggets, Open Source, Opera, packaging, Pentium I, Perl, Planet Debian, Planet Symlink, Quiz, Rant, ratpoison, Religion, RIP, Sarcasm, Sarge, Schweiz, screen, Shell, Sid, Spam, Squeeze, SSH, Stöckchen, SuSE, Symlink, Symlink-Artikel, Tagging, Talk, taz, Text Mode, ThinkPad, Ubuntu, USA, USB, UUUCO, UUUT, VCFe, Ventilator, Vintage, Wahlen, Wheezy, Wikipedia, Windows, WML, Woody, WTF, X, Xen, zsh, Zürich, ÖPNV

Calendar

 2012 
Months
Nov
 November 
Mo Tu We Th Fr Sa Su
     
17
   

Tattletale Statistics

Blog postings by posting time
Blog posting times this month



Search


Advanced Search


Categories


Recent Postings

0 most recent of 0 postings total shown.


Recent Comments

Hackergotchi of Axel Beckert

About...

This is the blog or weblog of Axel Stefan Beckert (aka abe or XTaran) who thought, he would never start blogging... (He also once thought, that there is no reason to switch to this new ugly Netscape thing because Mosaïc works fine. That was about 1996.) Well, times change...

He was born 1975 at Villingen-Schwenningen, made his Abitur at Schwäbisch Hall, studied Computer Science with minor Biology at University of Saarland at Saarbrücken (Germany) and now lives in Zürich (Switzerland), working at the IT Support Group (ISG) of the Departement of Physics at ETH Zurich.

Links to internal pages are orange, links to related pages are blue, links to external resources are green and links to Wikipedia articles, Internet Movie Database (IMDb) entries or similar resources are bordeaux. Times are CET respective CEST (which means GMT +0100 respective +0200).


RSS Feeds


Identity Archipelago


Picture Gallery


Button Futility

Valid XHTML Valid CSS
Valid RSS Any Browser
GeoURL
This content is licensed under a Creative Commons License (SA 3.0 DE). Some rights reserved. Hacker Emblem
Get Mozilla Firefox! Powered by Linux!
Typed with GNU Emacs Listed at Tux Mobil
XFN Friendly Button Maker

Blogroll

Blog or not?


People I know personally


Other blogs I like or read


Independent News


Interesting Planets


Web comics I like and read

Stalled Web comics I liked


Blogging Software

Blosxom Plugins I use

Bedside Reading

Just read

  • Bastian Sick: Der Dativ ist dem Genitiv sein Tod (Teile 1-3)
  • Neil Gaiman and Terry Pratchett: Good Omens (borrowed from Ermel)

Currently Reading

  • Douglas R. Hofstadter: Gödel, Escher, Bach
  • Neil Gaiman: Keine Panik (borrowed from Ermel)

Yet to read

  • Neil Stephenson: Cryptonomicon (borrowed from Ermel)

Always a good snack

  • Wolfgang Stoffels: Lokomotivbau und Dampftechnik (borrowed from Ermel)
  • Beverly Cole: Trains — The Early Years (getty images)

Postponed