Anyone using Backblaze B2 cloud storage for backup?

Levi Pearson levipearson at gmail.com
Thu Aug 24 19:08:13 MDT 2017


On Thu, Aug 24, 2017 at 4:19 PM, Riley Loader <riley.loader at gmail.com>
wrote:
>
>
> Anyway, to avoid going off-topic, I've never used or heard of Backblaze B2,
> but I do use Duplicity (with the Duply wrapper/frontend) and it seems to
> work great. I can't vouch for the massive amount of storage space or backup
> time that Levi brings up (I have no experience using other options to
> compare), but here are the stats from last night's backup of my home server:
>
> --------------[ Backup Statistics ]--------------
> StartTime 1503561610.92 (Thu Aug 24 02:00:10 2017)
> EndTime 1503561711.66 (Thu Aug 24 02:01:51 2017)
> ElapsedTime 100.74 (1 minute 40.74 seconds)
> SourceFiles 171773
> SourceFileSize 83407342647 (77.7 GB)
> NewFiles 15
> NewFileSize 58450408 (55.7 MB)
> DeletedFiles 4
> ChangedFiles 6
> ChangedFileSize 182407535 (174 MB)
> ChangedDeltaSize 0 (0 bytes)
> DeltaEntries 25
> RawDeltaSize 59265398 (56.5 MB)
> TotalDestinationSizeChange 11743577 (11.2 MB)
> Errors 0
> -------------------------------------------------
>
> It appears that the "TotalDestinationSizeChange" (the amount of data that
> got pushed to S3) turned out to be 11 MB for this incremental, compressed
> backup, after having added 56 MB to the source machine. I may be reading
> that wrong, but it appears to be pretty efficient, at least for my use
> case.
>
> Riley
>

The Duplicity model is based on a full snapshot followed by a sequence of
deltas from that point. The advantage is that the snapshot is an
easy-to-use standard tarball and the deltas are really compact. The catch is
that each delta depends on every previous delta since the last full snapshot,
so restoring means replaying the whole chain, and the chains get more awkward
(and more fragile) the longer they grow. So you have to periodically do a
full snapshot again.
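For concreteness, the full-plus-delta cycle with plain duplicity looks
roughly like this (the paths and the sftp target are made-up placeholders;
duply just wraps the same commands behind a profile):

# initial full snapshot: standard (optionally encrypted) tar volumes on the remote
duplicity full /home sftp://user@backuphost/backups/home

# later runs only ship rdiff-style deltas against that chain
duplicity incremental /home sftp://user@backuphost/backups/home

# eventually you start a fresh chain and drop the oldest ones
duplicity full /home sftp://user@backuphost/backups/home
duplicity remove-all-but-n-full 2 --force sftp://user@backuphost/backups/home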

The cool thing about Borg is that the differential comparison isn't
time-based, but content-based. It still keeps track of when various bits
were added or removed so you can do date-based rollback, but for storage it
just does general de-duplication along with (optional) compression. For
certain data sets with lots of redundancy (VM images are a prime example)
you get massive improvements in storage vs. a solution without
de-duplication. You can keep a rolling time-based set of snapshots without
ever having to take another full snapshot, and you can periodically prune
the set so that older snapshots are kept at coarser time resolutions, e.g.
weekly or monthly instead of daily. And to periodically make sure your
repository really matches the current state of your files (which is one of
the functions a full snapshot backup serves), you can instead run a full
verification that streams through your files and has the server reconstruct
and compare checksums. That takes some time, but typically a lot less than
actually copying everything to the backup server again.
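As a rough sketch of that workflow (the repo location, compression choice,
and retention counts below are just placeholder assumptions, not a
recommendation):

# point borg at a repository over ssh (any path you can reach works)
export BORG_REPO=ssh://user@backuphost/./backups/borg

# create an encrypted repository; keys and encryption stay client-side
borg init --encryption=repokey

# every run produces a complete-looking archive, but only new chunks get stored
borg create --stats --compression lz4 ::home-{now} /home

# thin out old archives at coarser time resolutions
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6

# periodically verify the repository, optionally re-checksumming all data
borg check --verify-data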

Here's a recent report:

Archive fingerprint: 6f768859ac8b3b6d5ef698815594bab0cb65b98765f492afb8e853364b155048
Time (start): Mon, 2017-08-14 02:05:07
Time (end):   Mon, 2017-08-14 03:46:56
Duration: 1 hours 41 minutes 48.59 seconds
Number of files: 13013800
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              880.92 GB            608.46 GB              4.57 MB
All archives:               16.25 TB             11.24 TB            396.47 GB

                       Unique chunks         Total chunks
Chunk index:                 3217832            234937655
------------------------------------------------------------------------------

So, "This archive" means the representation of that day's filesystem
contents, while "All archives" means all the currently stored
representations of the same filesystem.  So the differences amount to
4.57MB of compressed data, and the whole set of backups currently stored is
taking only 400GB, despite the current filesystem contents taking 880.92GB.
Obviously there's a *LOT* of redundant data here, but I knew this about my
data which is what led me to trying Borg out.
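(For what it's worth, a report like that is what borg create --stats prints
at the end of a run; you can also reprint it for an existing archive with
borg info. The repo and archive names here are placeholders:)

# show size/dedup stats for one stored archive
borg info ssh://user@backuphost/./backups/borg::home-2017-08-14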

There's a lot of other cool stuff. Despite the non-standard storage format,
you can actually access your backups more easily, because Borg has a FUSE
filesystem you can use to mount any of your archives and browse them like a
normal directory tree. You can back up to any drive you can mount (it only
uses very basic filesystem operations on the backup side, even though it
preserves most filesystem metadata from the source) or remotely via ssh; you
can encrypt everything, and the encryption happens on your machine before
anything leaves it. Your local backup server can have backups pushed to it,
or it can pull them from your other machines. I haven't used a lot of the
advanced features, so I can't say a whole lot about them.
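The FUSE mounting looks something like this (the mount point and archive
name are placeholders):

# see what archives the repository holds
borg list ssh://user@backuphost/./backups/borg

# mount one archive and browse or copy files out of it like a normal directory
borg mount ssh://user@backuphost/./backups/borg::home-2017-08-14 /mnt/restore
ls /mnt/restore/home
borg umount /mnt/restore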

So it doesn't have all the fancy storage back-end plugins that Duplicity
has, and Duplicity actually works great for a lot of uses--I was reasonably
happy with it for a long time. Borg is definitely worth checking out if you
have highly redundant data, and if you really don't want to have to do any
full backups after the first one. I also found it generally faster than
Duplicity, even after accounting for the different amounts of I/O each one
ends up doing.

    --Levi

