File Compression methods

Dan Egli ddavidegli at gmail.com
Thu Oct 10 02:25:03 MDT 2013


Hey pluggers, question for you all. One thing this smaller project I'm
working on is going to need is a way of automatically transferring files
over the network when a new system is added. The actual method of booting
isn't important right now (probably a flash drive or something). What I'm
facing is that I expect around 600GB of files will need to be written to
each machine when it first boots. 600GB over a 1GbE network is nearly two
hours if my back-of-the-envelope math is right: 600GB is 4800Gb, which at
a theoretical 1Gb/s is 4800 seconds, or 80 minutes, and real-world
overhead pushes that toward two hours. And that's if only the one system
is doing any network I/O at the time. Any other network traffic would slow
it down even further.

I thought I might reduce that time (and the network traffic) significantly
by having a compressed archive of the various files (like the old .tar.bz2
files). But I know bzip2 is no longer the best compressor around. It's not
too bad, but there are better ones. So I'm asking what you guys would
recommend as the compression system. The only restriction I have is that
it must either handle the peculiarities of Unix vs. DOS/Windows systems
(i.e. ownership, permissions, device files, and symlinks, like tar does)
OR be able to compress from stdin and decompress to stdout (like bzip2).
The goal here is to get the archive as tiny as possible so that it uses as
little network traffic as possible during the extraction.

A two-step process is unfortunately out of the question. The machines will
only have either 750GB or 1TB hdds, which obviously won't work for copying
the tar to disk and then extracting from it there; the extraction would
run out of space before it finished. Libraries aren't an issue, because I
could put them in the NFS directory and call the compression program with
LD_LIBRARY_PATH=<nfs path>, assuming I don't just build the program
(assuming I can get the source) as static in the first place.
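
Just so it's clear what I mean by one-step, something like this is what
I'm picturing. xz here is only a stand-in for whatever compressor you all
end up recommending, and the paths are invented for illustration:

    # one-time, on the server: pack the master image
    tar -cpf - -C /srv/master . | xz -9 > /srv/nfs/image.tar.xz

    # on each new machine: stream from the NFS mount straight into tar,
    # so the archive itself never lands on the local disk
    xz -dc < /mnt/nfs/image.tar.xz | tar -xpf - -C /

    # and if the decompressor needs shared libraries off the NFS share:
    LD_LIBRARY_PATH=/mnt/nfs/lib xz -dc < /mnt/nfs/image.tar.xz | tar -xpf - -C /

Since everything goes through the pipe, the only disk space needed on the
target is for the extracted files themselves.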



Any recommendations are welcome. The only archivers I've worked with in
the past are the DOS-style archivers (zip, arc, arj, lzh, & rar), which
don't handle permissions, ownership, or symlinks at all. If it were just
me, I'd say stick with .tar.bz2, since it works and works well. Maybe not
the tightest compression, but it gets the job done well enough. However,
the guy I'm doing this for seems to think the network will already be
running at a rather high usage rate (he claims about 75% average bandwidth
usage), so every packet is at a premium: a single machine pulling its
image could easily saturate the network. Add in a second or more, and it's
REALLY full and everything slows down.
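
If saturation turns out to be the bigger worry than total transfer time,
one thought I had is to throttle the read side so a new machine never hogs
the wire. pv can rate-limit a pipe; the 30MB/s figure below is just an
arbitrary example, not a recommendation:

    # throttle the NFS reads so other traffic still gets through
    pv -L 30m /mnt/nfs/image.tar.xz | xz -dc | tar -xpf - -C /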



Thanks folks!
--- Dan

