Jan 26, 2014

In the previous post we looked at performance and benchmark statistics on a home NAS/HTPC with ZFS on Linux. Here we'll take a deeper dive into two of the more interesting features of ZFS: compression and deduplication.


When I set out to build a NAS I aimed for the lowest cost with the best utility. My primary goal was to move all my precious (and the less-than-rare) files into redundant storage that is accessible, manageable, low-cost and reliable. The idea of extending it to an HTPC came to me after comparing the prices of commercial home/small-biz NAS solutions with the cost of the bare components. One can do so much more with a generic Linux box than with a specialized black-box. However, I wasn't willing to pay a premium to turn what started out as a NAS into a general-purpose rig; it had to be low-profile.

ZFS offers enterprise-level features and performance at the cost of maintaining it. Never underestimate the overhead of maintenance, though. The biggest issue I've had to face with ZFS on Linux was memory starvation. I had read that ZFS has less-than-humble memory requirements and that it can always put more of it to good use; recommendations of 1-2GB of RAM per TB of storage float around the web. However, I ended up supplying only 8GB for 18TB of storage, shared with the OS and other applications. The experience is worth sharing, especially the lessons learnt, which should prove worthwhile for others with plans similar to mine. But first, we need to take a look at some of the internals of ZFS.

Compression vs Deduplication

Both compression and deduplication are designed to reduce disk space usage as much as possible. Keep in mind that these features don't come for free and there are caveats to be aware of. The first and most important point to remember before deciding which features to enable is that deduplication happens at the zpool level. The pool is at the volume-management layer and dedup works on the block level, away from any file structures. The filesystem dataset sits atop the zpool, and that's where compression happens, before deduplication is done. When enabled, ZFS compresses each block using the chosen compression algorithm and writes it to the pool. While writing, the zpool compares the checksum of the incoming block with existing block checksums and, if a match is found, the matching block's reference count is incremented, the block reference ID is passed back to the filesystem and no data is written to disk (except the reference count increment in the deduplication table).

Neither dedup nor compression works retroactively, so a change in either setting will not propagate to existing data. The new settings take effect when new data is written, or when old data is read, modified and written back. Because the compressed bytes of a block depend on the compression algorithm (and level), identical blocks compressed with different algorithms will almost certainly not match as identical, even though their uncompressed bytes do. This is only an issue when the pool holds identical data in different datasets with different compression algorithms or levels: datasets with similar data that could benefit from deduplication lose that opportunity if they use different compression settings. Changing the compression setting of a dataset should therefore be done only when one is fairly confident that the gains will outweigh any loss of dedup opportunities with other datasets that use a different compression setting.

Compression

The overhead of compression is well contained and bounded, in that the hit is taken for each block at the time of reading or writing that block, and that block alone. Compression requires no extra disk space beyond the compressed file bytes, and the memory overhead is typically negligible and of fixed size (because the block size is also fixed). Do note that a block is written uncompressed if compression doesn't save at least 12.5% (1/8th) of the block size. That is, for a 128KB block, if compression doesn't save at least 16KB, the block is written uncompressed. Future reads of such a block will have no decompression overhead, of course, but re-writing it will go through the same compression cycle to decide whether or not to write the compressed version. Notice that the compression algorithm and level are stored with each block, so changing these values on the dataset does not affect existing data; it only applies to future writes and modifications. Naturally, when a block is written uncompressed, its metadata marks it as such. This means that keeping incompressible media files on a compressed dataset will only incur the overhead of compressing them at ingestion time and, in virtually all cases, the uncompressed data will be written to disk, with no overhead when reading (I read the code to be sure). Where compression can make a net gain of 12.5% or more, those blocks will be written compressed, and it's only fair to pay the decompression penalty for them, and them alone, when reading.
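As a rough illustration of that decision, here is a minimal sketch (not actual ZFS code) of the 12.5% rule described above:

    /* Minimal sketch, not actual ZFS code: a block is stored compressed only
     * if compression saves at least one eighth (12.5%) of the block size. */
    #include <stdbool.h>
    #include <stddef.h>

    static bool store_compressed(size_t block_size, size_t compressed_size)
    {
        /* e.g. for a 128KB block, the compressed size must be <= 112KB */
        return compressed_size <= block_size - (block_size / 8);
    }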

lz4 is a sensible default, unless we know we will store incompressible data (such as already compressed video or audio files) that will not be write-once read-many (WORM). If files will be modified heavily, we have to make a very educated decision (read: experiment/research before you commit). For highly compressible data (such as source code, databases, XML dumps etc.) gzip will do much better than lz4, but it also has a much higher CPU cost and, for those who care about it, latency. On the other hand, if you use gzip, it doesn't matter much whether you use gzip-1 or gzip-9 (relative to lz4, that is). My preference is to default to gzip-7 for the root of the pool and either choose no compression (very rarely) or gzip-9 for all WORM data.

Deduplication

Unlike compression, deduplication has to keep track of all blocks in the pool (that belong to datasets with dedup=on). The number of blocks will be at least as large as the number of files, and that only in the rare case where each file fits within a single block. In practice, the number of blocks will be orders of magnitude larger than the number of files. Each block therefore carries disk, memory and CPU overheads, and all three grow with the number of blocks. The dedup tables (DDT) are stored on disk to survive reboots and loaded into memory to maximize the performance of finding matches when writing files, which requires more processing the more block checksums there are to compare against. Of the three, the most worrisome is the memory overhead: on disk a few extra gigabytes will not be noticeable, and searching even trillions of checksums should take microseconds at worst, but the memory needed to hold said trillions of checksums for fast comparison is the problem. It is recommended to have 1-2GB of RAM per TB of disk for good performance. Interestingly, the more duplicate blocks in the pool, the smaller the memory requirement per block will be. This is because the DDT stores the checksums of the unique blocks written to disk, which is less than or equal to the total number of blocks. So a pool that doesn't benefit from deduplication will waste more resources than one that is benefiting handsomely. As such, if we do not expect a significant gain from deduplication, the overhead will not be justified and dedup should simply be left disabled.
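To get a feel for the numbers, here is a back-of-the-envelope estimate for a pool like mine, assuming roughly 320 bytes of RAM per unique DDT entry (a commonly quoted ballpark, not an exact figure) and the default 128KB recordsize, with every block unique and full-sized as the worst case:

    /* Rough DDT memory estimate; the 320 bytes/entry figure is an assumption. */
    #include <stdio.h>

    int main(void)
    {
        double pool_bytes   = 18e12;           /* 18 TB of storage           */
        double record_bytes = 128.0 * 1024;    /* default 128KB recordsize   */
        double entry_bytes  = 320.0;           /* assumed RAM per DDT entry  */

        double blocks = pool_bytes / record_bytes;   /* worst case: all unique */
        printf("~%.0fM blocks -> ~%.1f GB of DDT in RAM\n",
               blocks / 1e6, blocks * entry_bytes / 1e9);  /* ~137M, ~44 GB */
        return 0;
    }

Smaller average block sizes push this figure up even further, which is why the 1-2GB per TB rule of thumb is only a starting point.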

Deduplication works on the block level. ZFS computes a checksum per block and looks it up in the deduplication table (DDT) to find duplicates. The default checksum is sha256. Although sha256 is a cryptographic hash (meaning it's designed to have great avalanche characteristics and to be exceedingly difficult to synthesize), there is still a negligible chance that two different data blocks hash to the same value. For the paranoid, ZFS dedup supports verification by byte comparison when checksums match, before assuming the blocks to be identical (and discarding one for the other). However, as dedup author Jeff Bonwick has pointed out, the chance that sha256 will have a collision is less than that of an undetected ECC memory error, although the actual chance of a random collision is far higher than he calculates due to the birthday paradox. Because this miscalculation is parroted elsewhere, it is worthwhile to point out why it's wrong.

Checksums, Collision and the Birthday Paradox

Sha256 produces 256-bit values, and while it's true that a single random bit flip in a block has a chance of 1 in 2^256 of producing the same hash as another given block, in reality we do not have just two hash values but millions. We are worried that corruption will change the hash of a block such that it matches the hash of any other block, and of course all blocks are subject to the same possibility of corruption. To go back to the birthday paradox (which asks for the minimum number of people in a room such that there is at least a 50% chance that some two share a birthday), the question here isn't the chance that another person shares your birthday, but that anyone shares anyone else's birthday (or, in this case, hash value). The chances are obviously higher than just 1 in 2^256, since we have many other blocks and any one of them is a candidate for a collision. In addition, because cryptographic hash functions (indeed, any good hash function) are designed to flip on average 50% of their output bits for every bit change in the input (i.e. a minimum change in the input results in a maximum change in the output), whenever there is corruption there will be more than a single bit change in the hash value. This does not affect the probability of collision, but it is an important distinction from the naive assumption. (This property of good hash functions is called the avalanche effect.)

Considering that cryptographic hash functions are designed to fingerprint documents, considerable research has been done in this area. A particular attack on hashes which exploits the birthday paradox is called the birthday attack. For a random collision with 50% or better probability between any two 256-bit hashes, one needs about 4 x 10^38 hash values, or 4 followed by 38 zeros. (For the interested: with 23 people there is a better than 50% chance that some two share a birthday. A year has 365 days, which fits in 9 bits, and this is why it takes so few candidates to reach 50%, compared to 256-bit hashes.) This is still huge, to be sure, but orders of magnitude more likely than the miscalculated 1 in 10^77 by Bonwick. (For comparison, at about 10^31 hashes the chance of a random collision between any two is comparable to the chance of an undetected corruption in magnetic hard drives, which have unrecoverable error rates of 10^-14 to 10^-15, or about one bit error every 12-120TB of data transferred.)
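For the curious, the standard birthday-bound approximation for a 50% collision chance over d equally likely hash values gives these figures directly:

    n ≈ sqrt(2 x d x ln 2)
    d = 2^256:  n ≈ 1.18 x 2^128 ≈ 4 x 10^38   (the figure above)
    d = 365:    n ≈ 22.5, hence the famous 23 people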

The verify option is useful when used with a less secure checksum function that is much faster but will produce more collisions on average. As I've written in the first part of this series, fletcher4, which has a much smaller digest, is no longer enabled for deduplication purposes. In any case, using a shorter hash (such as fletcher4) is not recommended. Unless you know what you're doing, using weak hash functions without verify will sooner or later produce collisions and corrupt your data (hence the need for verify with weak hashing). The problem with verify on weak hash functions is that you'll have a higher number of bit-for-bit comparisons when the hashes collide. The gain from a faster hash function versus the cost of the extra verify comparisons will probably be counterproductive, or will at least diminish the performance advantage that fletcher4 and the like offer.

Personally, I'd stick with sha256+verify for deduplication and use gzip-7 or higher for compression by default. For incompressible data with heavy write or modify traffic I would disable both compression and deduplication (unless one knows that the compressed data will deduplicate well, which it most probably won't). For highly compressible data with heavy write or modify traffic (e.g. databases or virtual machine images), lz4 should prove a winner.

In the next part we will look at practical usage of compression and deduplication with analysis.

Sep 18, 2011

Summary (TL;DR)

This is an experimental mod of Sqlite with built-in online compression support. Design and implementation are discussed, limitations and benchmarks are provided, and source code as well as a prebuilt DLL are included. Use the TOC to jump to the topic of interest.

Background

Both Sqlite and MySql support compressed (and encrypted) databases. Well, more or less. Sqlite’s support is limited to read-only databases that are compressed offline, while MySql’s support is limited to compressing strings (as far as I can tell.)

While working on WikiDesk, a Wikipedia browser project, I knew the database could easily grow to 100s of gigabytes. The database of choice here is Sqlite because of its compactness and mobility. The English Wikipedia dump is already in the range of 100s to 1000s of gigs (depending on the dump type). WikiDesk not only supports different Wikipedia languages, but also different projects, such as Wikinews, Wikibooks and Wiktionary, among many others in all available languages, all in the same database. Theoretically, one can import all possible Wiki content into a single database.

The opportunity to compress this highly redundant wiki-code mixed with Unicode text was pretty much obvious. So it was reasonable to assume others must have had a similar case and added compression support to Sqlite. My search only yielded the aforementioned cases.

A part of me was happy to have found no precedent project. I was more than happy to roll-up my sleeves and get to hacking.

Design Survey

There are many ways to go about designing a compressed database file. My main purpose, however, was to have fully-transparent, online and realtime compression support. So the design must accommodate updates and deletions as well as any other modify operation supported by Sqlite.

An obvious approach is the one used by MySql, namely to compress the fields independently. This is simple and, relatively speaking, straightforward. However, it'd mean that LIKE couldn't be used on compressed string fields. Collation, sorting and other features would be absent as well. In fact, the fields in question couldn't be TEXT at all. In addition, one would have to explicitly compress fields, remember which are compressed and remember to decompress them before use. Very limited, I thought, and probably not worth the effort. Another approach is to do this at a low level, such that it'd be transparent to the caller. Such an extension to Sqlite exists, but it will not yield much gain on small fields. I suspect NTFS compression would give better results.

NTFS has built-in compression support. It was well worth the effort of testing it. On an English SimpleWiki dump I could compress the database file down to about 57% of its original size (see benchmarks below). Pretty decent. However, I couldn't control it at all: I couldn't set the chunk size, compression level or anything save for enabling and disabling it. In addition, the user could disable it and lose all the benefits. Database-level compression is more promising. A similar result can be achieved using FuseCompress or compFUSEd (on Linux), albeit the user must install such a filesystem first.

A major problem with database files, as far as online compression is concerned, is that the database's logical structure typically stores pointers to file offsets, such that there is a one-to-one mapping between the physical and logical structures. This is reasonable, as the database is really a large and complex data structure on disk (as opposed to in memory). The btree or rtree nodes typically hold page indexes, where all pages have a predefined, database-wide fixed size. Disrupting this structure would render the file corrupted. The purpose of the fixed-size pages is to simplify the allocation and management of space. This scheme is used by memory and disk managers alike.

If we compress a page in the database, the page would now contain two regions: data and free-space. To utilize the free-space, we could write a portion of the next page in the free-space, and the remaining in the next page, and so on for all pages. But then we’d have to keep track of each page’s fragments somehow. To avoid that, we can leave the free-space unused, but then we’d get no net saved disk space, as the free-space would still be allocated on disk.

I could store the new indexes and offsets in some allocation table appended to the file. But I'd have to do a lot of data moving, reallocation, (de)fragmentation and whatnot just to keep track of the free ranges and so on. Obviously this approach was pretty complicated and would take a much more involved design and implementation. Also, Sqlite page sizes are multiples of the disk sector size for atomicity. I would have to be thoroughly familiar with the Sqlite design and implementation to embark on such a largish project, if I wanted it finished and working.

The ‘be lazy’ motto seems to work well for programmers who are efficiency-oriented and hate repetitive and error-prone work. What would be the simplest approach that could work? Going back to NTFS one could learn a lesson or two on transparent compression. The secret is that NTFS can simply allocate any free inode on the disk, write the compressed data to it and update the index table. Inodes are linked lists, so it is very easy to insert/remove and modify the chain. Files, on the other hand, are arrays of bytes abstracted from the disk structure. Moving bits around in an array is much more complicated and time consuming than updating nodes in a linked-list.

What is needed is the advantage of a file-system applied on the level of files.

What if we could tell the file-system that these free-space regions of the file are really unused? NTFS supports sparse files in addition to compressed files. This could be used to our advantage. All we’d have to do is mark the free-space in each page as unused and the file-system will make them available to other files on the disk, reducing the net used disk space of the database.
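As a sketch of the mechanism (assuming a Win32 build; this is illustrative, not the exact code of the mod), marking a byte range as sparse boils down to two DeviceIoControl calls:

    /* Illustrative sketch (assumes Win32/NTFS): flag the file as sparse, then
     * tell NTFS that a byte range is zero data so it can deallocate any 64 KB
     * compression units that fall entirely inside that range. */
    #include <windows.h>

    static BOOL mark_range_sparse(HANDLE hFile, LONGLONG offset, LONGLONG length)
    {
        FILE_ZERO_DATA_INFORMATION zero;
        DWORD bytes = 0;

        /* Mark the file as sparse (needed once per file). */
        if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0,
                             NULL, 0, &bytes, NULL))
            return FALSE;

        /* Declare [offset, offset + length) as zero data. */
        zero.FileOffset.QuadPart      = offset;
        zero.BeyondFinalZero.QuadPart = offset + length;
        return DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zero, sizeof(zero),
                               NULL, 0, &bytes, NULL);
    }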

The Design

Sqlite supports pages of 512 to 65536 bytes. Since we can't break a single page, the smallest compression unit must be at least 64 Kbytes long. In addition, the compression unit of NTFS compression also appears to be 64 Kbytes. This means that a sparse range must be at least as large as a compression unit to be deallocated from disk and marked as free. This puts a clear limitation on the amount of saving we can achieve with this design: compression won't save any disk space unless it reduces the size in multiples of 64 Kbytes. A multiple of 64 Kbytes is therefore used as the compression unit, internally called a chunk. Indeed, a chunk size of 64 Kbytes would be totally useless, as there could be no saving at all.
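To illustrate with hypothetical numbers: with 256 Kbyte chunks, a chunk that compresses down to 100 Kbytes leaves a 156 Kbyte zeroed tail, but only the two naturally-aligned 64 Kbyte units that fall entirely within that tail can be deallocated, so the actual saving on disk is 128 Kbytes, not 156.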

When data is written, it's first written into a memory buffer. This buffer is used to track changes to the chunk and its offset in the file, and to compress the data. When the chunk needs flushing, the data is first compressed and the compressed data is written at the chunk's offset. The remainder of the chunk is marked as a sparse region. NTFS deallocates any naturally-aligned compression units that are completely sparse. Partially written units are physically allocated on disk and 0-valued bytes are written to disk.

When reading data, the complete chunk containing the requested byte offset is read and decompressed, and the requested bytes are copied back to the caller from the buffered data. The sparse bytes are transparently read in as 0-valued bytes. This is done by NTFS and relieves us from tracking sparse regions.
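Putting the write path together, a flush might look roughly like this (a sketch under my own assumptions, reusing mark_range_sparse() from the earlier snippet and zlib for the compression; error handling trimmed):

    /* Sketch of the chunk flush described above: compress the in-memory chunk,
     * write the compressed bytes at the chunk's file offset, then mark the
     * rest of the chunk as sparse so NTFS can reclaim it. */
    #include <windows.h>
    #include <stdlib.h>
    #include <zlib.h>

    extern BOOL mark_range_sparse(HANDLE hFile, LONGLONG offset, LONGLONG length);

    static BOOL flush_chunk(HANDLE hFile, LONGLONG chunk_offset,
                            const unsigned char *chunk, uLong chunk_size, int level)
    {
        uLongf comp_len = compressBound(chunk_size);
        unsigned char *comp = (unsigned char *)malloc(comp_len);
        OVERLAPPED ov = {0};
        DWORD written = 0;
        BOOL ok = FALSE;

        if (comp == NULL)
            return FALSE;
        if (compress2(comp, &comp_len, chunk, chunk_size, level) == Z_OK) {
            /* Write the compressed data at the chunk's offset in the file. */
            ov.Offset     = (DWORD)(chunk_offset & 0xFFFFFFFF);
            ov.OffsetHigh = (DWORD)(chunk_offset >> 32);
            ok = WriteFile(hFile, comp, (DWORD)comp_len, &written, &ov);

            /* The remainder of the chunk is zero data NTFS may deallocate. */
            if (ok && comp_len < chunk_size)
                ok = mark_range_sparse(hFile, chunk_offset + comp_len,
                                       chunk_size - comp_len);
        }
        free(comp);
        return ok;
    }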

Initially very fast compression libraries were used to avoid sacrificing too much performance. FastLz, Lz4 and MiniLzo were tested but the results weren’t very promising, compression-wise. As such the current build uses Zlib.

Implementation

The compression mod is written as a VFS Shim. This has the advantage of avoiding any modifications to the Sqlite code base.

Enabling compression must be done before opening any database files. A single function is defined as follows:

int sqlite3_compress(
    int trace,
    int compressionLevel
    );

trace can be a value between 0 and 7. When 0, tracing is disabled; larger values enable tracing of increasingly lower-level operations. Trace logs are written to stderr. -1 for default.

compressionLevel can be a value between 1 and 9, where 1 gives the fastest performance at the expense of compression ratio and 9 gives the best compression at the expense of performance. -1 for default, which is typically level-6.

To enable compression, this function is simply called before calling sqlite3_open. The compression level may be changed between runs; however, unless a chunk is modified, the data will not be recompressed with the new level.
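For example (the file name here is just an illustration, and the mod is assumed to be linked in):

    /* -1, -1 selects the default trace and compression level (typically 6). */
    sqlite3 *db = NULL;

    sqlite3_compress(-1, -1);
    if (sqlite3_open("wikidesk.db", &db) == SQLITE_OK) {
        /* ... use the database as usual; pages are compressed per chunk ... */
        sqlite3_close(db);
    }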

Only the main database is compressed. The journal or any other temporary files aren’t compressed.

Limitations

Besides the fact that the code is in an experimental state, there are some things unsupported or even unsupportable by this mod. First and foremost, only this mod can read compressed databases. The original Sqlite will declare compressed databases corrupted. However, this mod can and should detect uncompressed databases, silently disabling compression for them (but use at your own risk).

Since NTFS sparse file support is the key to achieving compression, the mod is literally useless on non-NTFS systems.

Sqlite is known to be quite resilient in the face of file corruption. That can no longer be supported to the same degree as with the official release. In addition, a corruption would destroy much more data than a single page. With the compression library and the new code also comes an increased risk of crashes or instability.

Among the untested and probably unsupported features of Sqlite are:

  • Online database backup.
  • Multiprocess read/write.
  • Vacuum.
  • Data recovery.
  • Shell and 3rd-party tools.

Performance-wise, there is virtually no caching implemented beyond the current chunk. This is bare-bones caching and there is a lot of room for performance improvements.

Benchmarks

An import of an English SimpleWiki dump was used as benchmark. The main table holds an auto-increment index, timestamp, the page title and the page contents (both Unicode).

256 Kbyte Chunks and Level-6 Compression (sizes in KBytes)

                     Original          Sqlite Compressed
    NTFS Normal      204,438 (100%)    73,296 (35.85%)
    NTFS Compressed  117,460 (57.45%)  57,431 (28.09%)

1024 Kbyte Chunks and Level-9 Compression (sizes in KBytes)

                     Original          Sqlite Compressed
    NTFS Normal      204,438 (100%)    67,712 (33.12%)
    NTFS Compressed  117,460 (57.45%)  66,220 (32.39%)

It's quite obvious that the savings with the modified Sqlite are substantial compared to NTFS compression on the original file. Interestingly, NTFS compression still yields gains when applied on an already compressed file. This is because of the inefficiencies of Zlib (deflate) compression (less pronounced at level 9 than at level 6) and because NTFS can deallocate at the level of clusters, which are 4096 bytes, as opposed to the sparse method's compression unit of 64 Kbytes. Since the free regions are written as zero bytes and aren't deallocated unless a complete 64 Kbyte unit is entirely zeroed out, it seems reasonable to assume NTFS compression is crunching these zero-padded regions and deallocating them, as its unit is only 4096 bytes.

It should also be noted that while statistically we should get better compression with larger chunk sizes and higher compression levels, this isn't linear. In fact, increasing the chunk size may lead to reduced net gains in file size due to the 64 Kbyte compression unit of NTFS. That is, if two chunks could each save a single unit (64 Kbytes), doubling the chunk size (such that both would be compressed together as one chunk) might not save 128 Kbytes, in which case the savings would be reduced from two units to a single one, resulting in a file 64 Kbytes larger than with the original chunk size. This heavily depends on both the data and the compression, of course.

Performance

A synthetic test was done using text generated from an alphabet of alphanumeric characters plus symbols, with random lengths of <1MB. Zlib seems to perform slowly on this random data (even though the number of possible codes is small). A chunk size of 256 Kbytes and a compression level of 6 were used. 50 random rows are generated and inserted with incremental Ids (two-column table); the 50 rows are selected using the Ids and the texts compared to the originals; new texts are generated with new lengths, this time <2MB, and the rows updated. Again the 50 rows are selected by Id and compared to the updated originals. The resultant database file is 50,686 Kbytes.

The original Sqlite code ran the test in 13.3 seconds, while with default compression and no tracing (to avoid any overhead) the same test finished in 64.7 seconds (4.86x slower), resulting in a 41,184 KByte file. Both tests ran on the same generated data. The file was on a RAMDisk to minimize disk overhead.

Considering that the data was random and synthetic and insert/update rate was equal to select rates, the results are reasonable. In practice, reads are typically more frequent than writes. With proper caching this should reduce the performance overhead significantly.

Download

The code holds the same copyright claims as Sqlite, namely none. The code is experimental. Use it at your own risk.

Download the code and prebuilt DLL. This sqlite3.dll is built from the 3.7.7.1 amalgamation with the default settings/flags, the amalgamation itself created from the original sources by the original configure and make files. The compression code is added and it's built using VS2010 SP1 and statically linked to the runtime libraries, so it has no dependencies.

Building

To build the code, first download a recent Sqlite version. The 3.7.7.1 amalgamation is perfect. The latest Zlib must also be downloaded and built.

Add the Zlib headers to the include path, copy the vfs_compress.c file next to sqlite sources and build. Next, build sqlite3.c amalgamation (or the original sources) and link the binaries of sqlite3, vfs_compress and Zlib to create the executable.

Future Plans

A good percentage of the official Sqlite tests pass successfully. But the corruption and format-validating tests unsurprisingly fail. Increasing the supported cases is a prime goal at this point. Getting the mod to “stable with known-limitation” status would be a major milestone. Improving performance is another goal that isn’t very difficult to attain. Having the ability to enable/disable compression on any database is also beneficial and will add more protection against misuse. It’d also be interesting to attempt supporting compression without NTFS sparse files support. This, while much more complicated, would work on any system and not on NTFS alone.

As a bonus, it’s almost trivial to add encryption on top of the compression subsystem.

Any comments, ideas, feedback and/or constructive criticism are more than welcome.
