Deduplication

In the previous post we looked at performance and benchmark statistics on a home NAS/HTPC with ZFS on Linux. Here we’ll take a deeper dive into two of the more interesting features of ZFS: compression and deduplication.

Parts
Part 1: Planning and Hardware
Part 2: ZFS and Software
Part 3: Performance Benchmarks
Part 4: Compression and Deduplication
Part 5: Deduplication in Practice

When I set out to build a NAS I aimed for the lowest cost with the best utility. My primary goal was to move all my precious (and less-than-rare) files into redundant storage that is accessible, manageable and low-cost, yet reliable. The idea of extending it to an HTPC came to me after comparing prices of commercial home/small-biz NAS solutions with the cost of the bare components. One can do so much more with a generic Linux box than with a specialized black box. However, I wasn’t willing to pay a premium to turn what started out as a NAS into a general-purpose rig; it had to be low-profile.

ZFS offers enterprise-level features and performance at the cost of maintaining it. Never underestimate the overhead of maintenance, though. The biggest issue I’ve had to face with ZFS on Linux was memory starvation. I had read that ZFS has less-than-humble memory requirements and that it can always put more memory to good use. Recommendations of 1-2GB of RAM per TB of storage float around the web. However, I ended up supplying it with only 8GB for 18TB of storage, shared with the OS and other applications. The experience is worth sharing, especially the lessons learnt, which should prove worthwhile for others with plans similar to mine. But first, we need to take a look at some of the internals of ZFS.

Compression vs Deduplication

Both compression and deduplication are designed to reduce disk space usage as much as possible. One should keep in mind that these features don’t come for free and there are caveats to be aware of. The first and most important point to keep in mind before deciding which features to enable is that deduplication happens at the zpool level. The pool is at the volume management layer and dedup works at the block level, away from any file structures. The filesystem dataset sits atop the zpool and that’s where compression works, before deduplication is done. When compression is enabled, ZFS compresses each block using the chosen compression algorithm and writes it to the pool. While writing, the zpool compares the checksum of the incoming block with existing block checksums. If a match is found, the matching block’s reference count is incremented, the block reference ID is passed back to the filesystem, and no data is written to disk (except the reference count increment in the deduplication table).
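The order of operations described above (compress first, then checksum and look up in the dedup table) can be sketched with a toy model. This is only an illustration: `ToyPool` and `write_block` are made-up names, zlib stands in for the dataset's compression algorithm, and none of this mirrors the real ZFS data structures.

```python
import hashlib
import zlib


class ToyPool:
    """Toy model of a pool with a dedup table (DDT); not ZFS's real structures."""

    def __init__(self):
        self.ddt = {}     # checksum -> reference count
        self.blocks = {}  # checksum -> bytes "on disk"

    def write_block(self, data: bytes, level: int = 6) -> str:
        # Dataset layer: compress the block first.
        compressed = zlib.compress(data, level)
        # Pool layer: checksum the bytes that would actually hit the disk.
        key = hashlib.sha256(compressed).hexdigest()
        if key in self.ddt:
            # Duplicate: only the reference count changes, no data is written.
            self.ddt[key] += 1
        else:
            # Unique: record the checksum and store the block.
            self.ddt[key] = 1
            self.blocks[key] = compressed
        # The block reference is handed back to the filesystem.
        return key


pool = ToyPool()
first = pool.write_block(b"hello zfs " * 1000)
second = pool.write_block(b"hello zfs " * 1000)
print(first == second, pool.ddt[first], len(pool.blocks))
```

Writing the same block twice yields the same reference, a reference count of two, and a single stored copy.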

Neither dedup nor compression works retroactively, so a change in either setting will not propagate to existing data. When new data is written, or old data is read, modified and written back, the new settings take effect. Because compressed block data depends on the compression algorithm (and level), identical blocks compressed with different algorithms will almost certainly not match as identical, even though their uncompressed bytes match. This is only an issue when the pool has duplicate data in different datasets with different compression algorithms or levels. Datasets that hold similar data and could benefit from deduplication will lose that opportunity if they use different compression settings. The compression setting of a dataset should be changed only when one is fairly confident that the gains will outdo any loss of dedup opportunities with other datasets that use a different setting.
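A quick way to convince oneself of this is to compress the same bytes at two levels and compare the checksums of the results. The sketch below uses Python's zlib at levels 1 and 9 as a stand-in for two datasets with different gzip settings; it is an analogy, not a reproduction of what ZFS writes to disk.

```python
import hashlib
import zlib

# 128 KiB of identical logical data, "written" to two datasets.
block = b"the same logical block contents\n" * 4096

fast = zlib.compress(block, 1)  # stand-in for a dataset using gzip-1
best = zlib.compress(block, 9)  # stand-in for a dataset using gzip-9

# The uncompressed bytes are identical...
same_plain = zlib.decompress(fast) == zlib.decompress(best)
# ...but the checksums of the compressed bytes are what dedup would compare.
same_checksum = hashlib.sha256(fast).digest() == hashlib.sha256(best).digest()

print(same_plain, same_checksum)
```

The two compressed streams decompress to the same data but do not share a checksum, so a checksum-based dedup table would treat them as two unrelated blocks.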

Compression

The overhead of compression is contained and bounded: the hit is taken per block, at the time of reading or writing that block and that block alone. Compression requires no extra disk space beyond the compressed file bytes, and the memory overhead is typically negligible and of fixed size (because the block size is also fixed). Do note that a block will be written uncompressed if compression does not save at least 12.5% (1/8th) of the block size. That is, for a 128KB block, if compression doesn’t save at least 16KB, the block will be written uncompressed. Future reads will not have any decompression overhead of course, but rewriting said block will go through the same compression cycle to decide whether or not to write the compressed version. Notice that the compression algorithm and level are stored with each block, so changing these values on the dataset does not affect existing data; they only apply to future modifications and writes. Naturally, when a block is written uncompressed the metadata of the block will mark it as uncompressed. This means that incompressible media files on a compressed dataset only incur the overhead of compression at ingestion time. In virtually all cases the uncompressed data will be written to disk, with no overhead when reading (I read the code to be sure). Where compression yields a net gain of 12.5% or more, those blocks will be written compressed, and it’d only be fair to pay the penalty of decompression for them and them alone when reading.
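The 1/8th rule is simple enough to sketch. In the snippet below `stored_form` is a hypothetical helper, zlib stands in for the dataset's compressor, and random bytes play the role of an already compressed media file.

```python
import os
import zlib

BLOCK_SIZE = 128 * 1024  # 128KB block, as in the example above


def stored_form(block: bytes, level: int = 6):
    """Return ("compressed" | "uncompressed", bytes) using the 1/8th saving rule."""
    compressed = zlib.compress(block, level)
    min_saving = len(block) // 8  # 12.5% of the block size
    if len(block) - len(compressed) >= min_saving:
        return "compressed", compressed
    # Not enough saving: keep the original bytes, so reads need no decompression.
    return "uncompressed", block


# Highly repetitive text versus random bytes (which zlib cannot shrink).
text_block = (b"<row id='1'>value</row>\n" * 6000)[:BLOCK_SIZE]
media_block = os.urandom(BLOCK_SIZE)

print(BLOCK_SIZE // 8)               # minimum saving in bytes for a 128KB block
print(stored_form(text_block)[0])
print(stored_form(media_block)[0])
```

For a 128KB block the threshold works out to 16384 bytes (16KB). The repetitive block clears it easily, while the random block is kept as-is.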

lz4 is a reasonable default, unless we know we will store incompressible data (such as already compressed video or audio files) that will not be write-once read-many (WORM). If files will be modified heavily, we have to make a very educated decision (read: experiment and research before you commit). While gzip will do much better than lz4 on highly compressible data (such as source code, databases, XML dumps etc.), it also has a much higher CPU cost and, for those who care about it, higher latency. On the other hand, if you use gzip, it doesn’t matter much whether you use gzip-1 or gzip-9 (relative to lz4, that is). My preference is to default to gzip-7 for the root of the pool and either choose no compression (very rarely) or choose gzip-9 for all WORM data.

Unlike compression, deduplication has to keep track of all blocks in the pool (that belong to datasets with dedup=on). The number of blocks will be at least as large as the number of files, in the rare case that each file is no larger than a single block. In practice, the number of blocks will be orders of magnitude larger than the number of files. Each block therefore carries disk, memory and CPU overheads, and all three grow with the number of blocks. The dedup tables (DDT) are stored on disk to survive reboots and are loaded into memory to maximize the performance of finding matches when writing files, which requires more processing the more block checksums there are to compare against.

Of the three, the most worrisome is the memory overhead. On disk a few extra gigabytes will not be noticeable, and searching even trillions of checksums should take microseconds at worst. The problem is the memory needed to hold said trillions of checksums so that comparisons are fast. It is recommended to have 1-2 GB of RAM per TB of disk for good performance.

Interestingly, the more duplicate blocks in the pool, the smaller the memory requirement for the same amount of data. This is because the DDT stores the checksums of the unique blocks that are written to disk, and their number is less than or equal to the total number of blocks. So a pool that doesn’t benefit from deduplication will be wasting more resources than one that is benefiting handsomely. As such, if we do not expect a significant gain from deduplication, the overhead is not justified and should be avoided by disabling it.
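To get a feel for how the DDT scales with block count, here is a back-of-the-envelope sketch. The figure of roughly 320 bytes per in-memory DDT entry is an assumption (a commonly quoted ballpark, not something measured on this box), and `estimate_ddt_bytes` and the dedup ratios are made up for illustration.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def estimate_ddt_bytes(data_bytes: int, avg_block_bytes: int,
                       dedup_ratio: float = 1.0, entry_bytes: int = 320) -> int:
    """Rough in-memory DDT size; entry_bytes=320 is an assumed ballpark."""
    total_blocks = data_bytes // avg_block_bytes
    # Only unique blocks get a DDT entry; duplicates just bump a refcount.
    unique_blocks = int(total_blocks / dedup_ratio)
    return unique_blocks * entry_bytes


# 18 TiB of data in 128 KiB blocks, with no duplicates and with a 3x dedup ratio.
no_dupes = estimate_ddt_bytes(18 * TIB, 128 * 1024)
many_dupes = estimate_ddt_bytes(18 * TIB, 128 * 1024, dedup_ratio=3.0)

print(no_dupes / GIB, many_dupes / GIB)
```

Under these assumptions the table needs about 45 GiB when every block is unique and about 15 GiB at a 3x dedup ratio. Smaller average block sizes push the numbers up further, so treat the output as an order-of-magnitude guide only.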

Deduplication works at the block level. It computes a checksum per block and looks up the checksums in a deduplication table (DDT) to find duplicates. The default checksum is sha256. Although sha256 is a cryptographic hash (meaning it’s designed to have great avalanche characteristics and be exceedingly difficult to synthesize), there is still a negligible chance that two different data blocks might hash to the same value. For the paranoid, ZFS dedup supports verification by byte comparison when checksums match, before assuming the blocks to be identical (and discarding one for the other). However, as dedup author Jeff Bonwick has pointed out, the chance that sha256 will have a collision is less than that of an undetected ECC memory error, although the actual chance of a collision is far higher than he calculates, due to the birthday paradox. Because this miscalculation is parroted elsewhere, it might be worthwhile to point out why it’s wrong.

Checksums, Collision and the Birthday Paradox

A sha256 hash has 256 bits, and while it’s true that a single random bit flip has a chance of 1 in 2^256 of producing the same hash as one other given block, in reality we do not have just two hash values but millions. We are worried that a corrupted block will change the hash of that block such that it matches any other block’s hash, and of course all blocks are subject to the same possibility of corruption. To go back to the birthday paradox (which asks about the minimum number of people in a room so there’d be at least a 50% chance that any two share the same birthday), the question here isn’t about the chance that another person shares your birthday, but rather that anyone shares anyone else’s birthday (or in this case hash value). The chances are obviously higher than just 1 in 2^256, since we have many other blocks and any one of them is a candidate for a collision. In addition, because cryptographic hash functions (indeed, any good hash function) are designed to flip on average 50% of their output bits for every bit change in the input (i.e. a minimal change in the input results in a maximal change in the output), whenever there is corruption there will be more than a single bit change in the hash value. This does not affect the probability of collision, but it is an important distinction from the naive assumption. (This feature of good hash functions is called the avalanche effect.)

Considering that cryptographic hash functions are designed to fingerprint documents, considerable research has been done in this area. A particular attack on hashes which exploits the birthday paradox is called the birthday attack. For a random collision with 50% or better probability between any two 256-bit hashes, one needs about 4 x 10^38 hash values, or 4 followed by 38 zeros. (For the interested, with 23 people there is more than a 50% chance that any two will share the same birthday. A year has 365 days, which fits in about 9 bits, and this is why it takes so few candidates to hit 50%, compared to 256-bit hashes.) This is still huge, to be sure, but orders of magnitude more likely than the miscalculated 1 in 10^77 by Bonwick. (For comparison, at about 10^31 hashes, the chance of a random collision between any two is comparable to the chance of an undetected corruption in magnetic hard drives, which have an unrecoverable error rate of 10^-14 to 10^-15 per bit, or about one bit of error every 12-120TB of data transferred.)
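The numbers above can be checked with the standard birthday-bound approximations: a collision among n random values drawn from a space of size N has probability of about 1 - exp(-n^2 / 2N), which reaches 50% near n = sqrt(2 ln(2) N). The function names below are my own; the formulas are the textbook ones.

```python
import math


def hashes_for_half_chance(space: int) -> float:
    """Approximate n at which a collision among n random values is 50% likely."""
    return math.sqrt(2 * math.log(2) * space)


def collision_probability(n: float, space: int) -> float:
    """Approximate P(at least one collision) among n random values."""
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(n * n) / (2 * space))


def exact_birthday(n: int, days: int = 365) -> float:
    """Exact P(at least two of n people share a birthday)."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (days - i) / days
    return 1.0 - p_unique


print(hashes_for_half_chance(365))           # a little over 22, so 23 people
print(exact_birthday(23))                    # just above 0.5
print(hashes_for_half_chance(2 ** 256))      # about 4 x 10^38
print(collision_probability(1e31, 2 ** 256)) # on the order of 10^-16
```

The approximation puts the 50% mark between 22 and 23 people, and the exact product confirms that 23 is the first count that crosses one half. The same formula applied to a 256-bit space gives roughly 4 x 10^38 hashes.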

The verify option is useful when used with a less secure checksum function that is much faster but will produce more collisions on average. As I’ve written in the first part of this series, Fletcher4, which has a much smaller size, is no longer enabled for deduplication purposes. In any case, using a shorter hash (such as Fletcher4) is not recommended. Unless you know what you’re doing, using weak hash functions without verify will sooner rather than later produce collisions and so corrupt your data (hence the need for verify with weak hashing). The problem with verify on weak hash functions is that you’ll have a higher number of bit-for-bit comparisons when the hashes collide. Trading a faster hashing function for more verify comparisons will probably be counterproductive, or at least will diminish the performance advantage that fletcher4 and the like offer.
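To make the danger concrete, here is a toy dedup table with a deliberately terrible checksum (a byte sum, which is not Fletcher4 and far weaker than anything ZFS offers). `ToyDedup` and `weak_checksum` are hypothetical names; the point is only to show what verify buys when checksums collide.

```python
def weak_checksum(block: bytes) -> int:
    """Deliberately weak 16-bit checksum: byte sum modulo 65536 (not Fletcher4)."""
    return sum(block) % 65536


class ToyDedup:
    def __init__(self, verify: bool):
        self.verify = verify
        self.table = {}  # checksum -> list of stored blocks

    def write(self, block: bytes) -> bytes:
        """Store a block and return what a later read would give back."""
        stored = self.table.setdefault(weak_checksum(block), [])
        if stored and not self.verify:
            # Trust the checksum alone: this may be a different block entirely.
            return stored[0]
        for existing in stored:
            if existing == block:  # byte-for-byte comparison
                return existing
        stored.append(block)
        return block


a, b = b"ab", b"ba"  # different data, same byte sum

unsafe = ToyDedup(verify=False)
unsafe.write(a)
print(unsafe.write(b) == b)  # False: b was silently replaced by a

safe = ToyDedup(verify=True)
safe.write(a)
print(safe.write(b) == b)    # True: the collision was caught and b stored
```

Without verify the second write is "deduplicated" against the wrong block. With verify the collision costs an extra comparison but the data survives, which is exactly the trade-off described above.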

Personally, I’d stick with sha256+verify for deduplication and use gzip-7 or higher for compression by default. For incompressible data with heavy write or modify operations I would disable compression and deduplication (unless one knows that the compressed data can be deduplicated, it most probably won’t be). For highly compressible data with heavy write or modify operations (e.g. databases or virtual machine images), lz4 should prove a winner.

In the next part we will look at practical usage of compression and deduplication with analysis.