File Compression methods

Lloyd Brown lloyd_brown at byu.edu
Thu Oct 10 08:11:25 MDT 2013


Not sure about the exact implementation, but gzip is fast/responsive
enough to be put in a pipeline involving networks.  It won't get as
high a compression ratio as bzip2, but it will still do fairly well,
and it will be quite a lot faster than bzip2.  A great deal will depend
on how compressible your data is.
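
For example, since tar and gzip both read and write stdin/stdout, you
can stream the whole thing and never land an archive on disk.  A rough
sketch only (the hostname and paths are placeholders, and you'd run it
as root so tar can restore ownership):

  tar -cpf - -C /srv/master . \
    | gzip -c \
    | ssh root@newhost 'gzip -dc | tar -xpf - -C /target'

If ssh's encryption overhead turns out to matter, the same pipeline
works over netcat instead, and pigz is a drop-in replacement for gzip
if you want the sending side to use all its cores.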

A couple of other things to think about:

- Is it really necessary to send all that data to each device, or would
you be better served storing it in a central location and accessing it
over the network as needed?
- If you really do need to send the data out, how much variation is
there from one destination to another?  If they're identical, or even
similar, and you're imaging them at the same time, maybe something based
on multicast or peer-to-peer (e.g. bittorrent-like) traffic patterns
would be helpful (see the udpcast sketch after this list).
- How much chance is there that some of the data is already on the
device, and might just need to be updated?  In that situation, you don't
really *need* to send all 600GB, just the updates.  Rsync would be a
great choice for this (a sketch follows below), or even rdiff if you
are working with binary blobs rather than lots of small files.
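
On the multicast idea: the udpcast tools (udp-sender/udp-receiver) are
one way to push an identical image to many machines in a single pass
over the wire.  A very rough sketch (I'm going from memory on the
options, and the image path and target device are placeholders, so
check the man pages):

  # on the machine holding the master image
  udp-sender --file /srv/master.img

  # on each new machine as it comes up
  udp-receiver --file /dev/sda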
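
And to illustrate the rsync case (host and paths are placeholders; drop
-A/-X if you don't care about ACLs or extended attributes):

  rsync -aHAXz --numeric-ids --delete /srv/master/ root@newhost:/target/

Rsync only transfers the parts of files that actually differ, and -z
compresses them in transit, so repeat runs against a mostly-populated
disk are cheap.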

I know none of this is a solution per se, just a couple of things to
think about.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 10/10/2013 02:25 AM, Dan Egli wrote:
> Hey pluggers, question for you all. One thing that this smaller project I'm
> working on is going to need is some way of automatically transferring files
> across when a new system is added to the network. The actual method of
> booting isn't important right now (probably a flash drive or something).
> What I am facing is that I expect that there will be around 600GB of files
> that need to be written to each machine when it first boots. 600GB on a
> 1GbE network is nearly two hours if the math off the top of my head is
> correct. And that's if only the one system is doing any network I/O at that
> time. Any other network traffic would slow that down even further. I
> thought I might reduce that time (and network traffic) significantly by
> having a compressed archive of the various files (like the old .tar.bz2
> files). But I know that bzip2 is not the best compressor anymore. It's not
> too bad, but there are better ones. So I ask what you guys would recommend
> as the compression system? The only restriction I have on it is that it
> must be able to either handle the peculiarities of Unix vs. DOS/Windows
> systems (i.e. ownership, permissions, device files, and symlinks, like
> tar) OR it must be able to compress from/decompress to stdin/stdout (like
> bzip2). The goal here is to get the archive as tiny as possible so that it
> uses as little network traffic as possible during the extraction. And a two
> step process is unfortunately out of the question. The machines will only
> have either 750GB or 1TB HDDs, which obviously won't work for writing the
> tar to disk and then extracting from it on disk. tar's extraction
> process would run out of space before it finished. Libraries aren't an
> issue because I could put the libraries in the nfs directory and call the
> compression program with LD_LIBRARY_PATH=<nfs path>, assuming I don't just
> build the program (assuming I can get the source) as static in the first
> place.
> 
> 
> 
> Any recommendations are welcome. The only archivers I've worked with in the
> past are the DOS-style archivers (zip, arc, arj, lzh, & rar), which would
> not handle permissions, ownership, or symlinks at all. If it was just me,
> I'd say stick with .tar.bz2, as it works and works well. Maybe not the
> tightest, but it gets the job done well enough. However, for this guy, he
> seems to think that the network will be running at a rather high usage rate
> (he claims about 75% average bandwidth usage), so every packet is at a
> premium: a single machine's transfer could easily saturate the network.
> Add in a second or more, and everything REALLY slows down.
> 
> 
> 
> Thanks folks!
> --- Dan
> 

