Saturday·17·November·2012
deepgrep: grep nested archives with one command //at 02:00 //by abe
Several months ago, I wrote about grep everything and listed grep-like tools which can grep through compressed files or specific data formats. The blog posting sparked several magazine articles and talks by Frank Hofmann and me.
Frank recently noticed that we had nevertheless missed one fairly powerful tool. We missed it because it’s mostly unknown, undocumented, and hidden behind a package name which doesn’t suggest a truly recursive “grep everything”:
deepgrep
deepgrep is part of the Debian package strigi-utils, a package which contains utilities related to the KDE desktop search Strigi.
deepgrep especially eases searching through tarballs, even nested ones, but can also search through zip files and OpenOffice.org/LibreOffice documents (which are actually zip files).
deepgrep seems to support at least the following archive and compression formats:
- tar
- ar, and hence deb
- rpm (but not cpio)
- gzip/gz
- bzip2/bz2
- zip, and hence jar/war and OpenOffice.org/LibreOffice documents
- MIME messages (i.e. files attached to e-mails)
A search in an archive which is deeply nested looks like this:
$ deepgrep bar foo.ar
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:foobar
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt:bar
deepgrep, however, seems to support neither LZMA-based compression (lzma, xz, lzip, 7z) nor lzop, rzip, compress (.Z suffix), cab, cpio, xar, or rar.
Further current drawbacks of deepgrep:
- Nearly no command-line options, especially none of the common grep options
- No man page or other documentation
- Exit code is unrelated to the search results; you have to check the output to see whether something was found
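Since the exit code says nothing about whether there were matches, a small wrapper can derive one from the output. The following is only a minimal sketch and not part of strigi-utils: the function name nonempty is made up, and it assumes that deepgrep writes its matches to standard output and prints nothing there if nothing was found.

```shell
# nonempty CMD [ARGS...]
# Hypothetical helper: run CMD, pass its non-empty output lines through,
# and return 0 only if at least one such line was printed.
# "grep ." matches every line containing at least one character, so the
# pipeline's exit status (that of grep) tells us whether there was output.
nonempty() {
    "$@" | grep .
}

# Usage (assumes deepgrep is installed and foo.ar exists):
#   if nonempty deepgrep bar foo.ar; then echo "found"; else echo "nothing found"; fi
```

Note that empty output lines are dropped, and that with "set -o pipefail" the status would additionally depend on the exit code of the wrapped command itself.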
deepfind
If you just need the names of the files in nested archives, the package also contains the tool deepfind, which does nothing but list all files and directories in a given set of archives or directories:
$ deepfind foo.ar
foo.ar
foo.ar/foo.tar
foo.ar/foo.tar/foo.tar.gz
foo.ar/foo.tar/foo.tar.gz/foo.zip
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz
foo.ar/foo.tar/foo.tar.gz/foo.zip/foo.tar.bz2/foo.txt.gz/foo.txt
As with deepgrep, deepfind does not implement any of the common options of its regular sibling find.
[The following part has been added on 17-Nov-2012]
Like deepgrep, deepfind doesn’t seem to support any of the more modern or more exotic compression formats. For example, it does not descend into the data part of recent Debian binary packages, which is compressed with xz:
$ deepfind xulrunner-18.0_18.0\~a2+20121109042012-1_amd64.deb
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/debian-binary
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/triggers
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/preinst
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/md5sums
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/postinst
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/control.tar.gz/control
xulrunner-18.0_18.0~a2+20121109042012-1_amd64.deb/data.tar.xz
[End of part added at 17-Nov-2012]
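Because deepfind has no equivalent of find’s -name test, its output has to be filtered afterwards. The following is just a rough sketch under a few assumptions: the function name name_filter is made up, deepfind is assumed to print one path per line as in the listings above, and file names containing newlines are not handled.

```shell
# name_filter GLOB
# Hypothetical helper: read paths from standard input, one per line, and
# print those whose last component matches the shell glob GLOB,
# roughly what "find -name GLOB" would do.
name_filter() {
    while IFS= read -r path; do
        # ${path##*/} strips everything up to and including the last slash
        case "${path##*/}" in
            $1) printf '%s\n' "$path" ;;
        esac
    done
}

# Usage (assumes deepfind is installed and foo.ar exists):
#   deepfind foo.ar | name_filter '*.txt'
```

Matching only the last path component keeps a pattern like '*.txt' from also selecting everything below an archive member whose name merely contains ".txt", such as foo.txt.gz.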
Dependencies
The package strigi-utils doesn’t pull in the complete Strigi framework (i.e. no daemon), just a few libraries (libstreams, libstreamanalyzer, and libclucene). On Wheezy it also pulls in some audio/video decoding libraries which may make some server administrators less happy.
Conclusion
Both tools are limited to a few basic use cases, but can be worth a fortune if you have to work with nested archives. Nevertheless, the claim in the Debian package description of strigi-utils that they’re “enhanced” versions of their well-known counterparts is IMHO exaggerated.
Most of the missing features and documentation can be explained by the primary purpose of these tools: being a backend for desktop search. I guess there wasn’t much need for proper command-line usage yet. Until now. ;-)
42.zip
And yes, I was curious enough to let deepfind have a look at 42.zip (the one from SecurityFocus; unzip seems unable to unpack 42.zip from unforgettable.dk due to a missing version compatibility), and since it just traverses the archive sequentially, it has no problem with that, needing just about 5 MB of RAM and a lot of time:
[…]
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page e.zip
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page e.zip/0.dll
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page f.zip
42.zip/lib f.zip/book f.zip/chapter f.zip/doc f.zip/page f.zip/0.dll
deepfind 42.zip 11644.12s user 303.89s system 97% cpu 3:24:02.46 total
I won’t try deepgrep on 42.zip, though. ;-)
Tagged as: 42.zip, ar, bzip2, CLI, CLucene, deb, deepfind, deepgrep, efho, find, grep, gzip, jar, KDE, LibreOffice, Lucene, odt, OpenOffice.org, Rant, rpm, strigi, tar, UUUT, war, zip