In the previous post we dived into Compression and Deduplication in ZFS. Here we’ll look at Deduplication in practice based on my experience.
Part 1: Planning and Hardware
Part 2: ZFS and Software
Part 3: Performance Benchmarks
Part 4: Compression and Deduplication
Part 5: Deduplication in Practice
To Dedup or Not to Dedup
Compression almost always offers a win-win situation. The overhead of compression is contained (not storage-wide) and except where there is heavy read/modify/write, it will not have unjustified overhead. It’s only when we have heavy writes on compressed datasets that we need to benchmark and decide which compression algorithm we should use, if any.
Deduplication is more demanding and a weightier decision to make. When there is no benefit from deduplication, the overhead can bring system performance to a grinding halt. It is best to have a good idea of the data to be stored and do some research before enabling dedup by default (i.e. on the root). When the benefit from deduplication is minimal, the cost is not only higher memory usage and lower performance; the DDT (the deduplication table) will also use disk space in the dozens, if not hundreds, of GBs. This means that there is a minimum dedup gain below which the overhead simply negates the benefit, even if our hardware is powerful enough.
Regardless of the data, there are four scenarios that I can think of where deduplication makes sense. These are when:
- Duplicates are between multiple users.
- Duplicates are between multiple datasets.
- Duplicates are spread over millions of small files.
- Copy-on-write functionality is needed.
To see why the above make sense, let’s take a hypothetical case where we have one user and one dataset with thousands of GB-sized files. If we have duplicate files in this case, we can simply find them and hardlink the duplicates. This works perfectly well, and it’s practical to rediscover duplicates and hardlink them frequently, except in the above four cases. Hardlinks don’t work across users: leaving aside the issue of privacy, consider what happens when one user modifies a hardlinked file and inadvertently modifies the other user’s version as well. They don’t work across datasets (i.e. filesystems) by definition, because inodes are only meaningful within a single filesystem. And hardlinking becomes complicated and a maintenance burden if our duplicate data is mostly a large number of small files.
Finally, a hardlink simply means the filesystem is pointing multiple file names to the same data. Modifying one modifies the others rather than creating a new copy, which might not be desirable even in a single-user scenario. The ZFS dedup feature works using copy-on-write, so modifying a deduplicated block forces ZFS to write the modified data to a new block, thereby preserving the other files that still reference the original block.
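The manual alternative described above, finding identical files within a single dataset and hardlinking them, could be sketched roughly as follows. This is only an illustration under stated assumptions, not a hardened tool: the names file_digest and hardlink_duplicates are hypothetical, all files are assumed to live on one filesystem, and nothing is assumed to be writing to them while it runs.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical regular files under root with hardlinks.

    Returns the number of files that were replaced by a link.
    """
    # Group by size first: a file with a unique size cannot have a duplicate,
    # so it never needs to be hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and non-regular files
            by_size[os.path.getsize(path)].append(path)

    linked = 0
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_digest(path)].append(path)
        for paths in by_digest.values():
            keeper = paths[0]
            for dup in paths[1:]:
                if os.path.samefile(keeper, dup):
                    continue  # already hardlinked to the keeper
                tmp = dup + ".linktmp"  # hypothetical temporary name
                os.link(keeper, tmp)    # create the new link first...
                os.replace(tmp, dup)    # ...then atomically swap it in
                linked += 1
    return linked
```

Note that after this runs, editing any one of the linked names edits them all, which is exactly the limitation that copy-on-write deduplication avoids.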
If after ingesting data in a dedup=on dataset we decide to remove deduplication, we will need to create another dataset with dedup=off and copy our data over (one can also use send/receive; see below).
If you are in a situation where you have dedup enabled (at least on some of the datasets) but aren’t sure whether or not performance is suffering, there are a few tools I can suggest.
zpool iostat [interval [count]] works very much like iostat except it is on the ZFS-pool level. I use zpool iostat 10 to see the IOPS and bandwidth on the pool averaged over 10-second periods.
zpool iostat doesn’t include cache hits and misses, which are very important if we are interested in DDT thrashing. Fortunately, there is a Python script in the ZFS on Linux sources that does that.
arcstat.py [interval] dumps hit and miss statistics as well as ARC size and more (the columns are customizable, but the defaults should be sufficient). For those interested, arcstat uses raw numbers, with some fields computed on the fly, directly from /proc/spl/kstat/zfs/arcstats, which can be read and parsed as one wishes. Try cat /proc/spl/kstat/zfs/arcstats for example.
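As a rough idea of what parsing that file by hand might look like, here is a minimal sketch. It assumes the usual kstat layout of a couple of header lines followed by "name type data" rows, and the helper names parse_arcstats and miss_percent are made up for this example. Unlike arcstat.py, which prints per-interval figures, the raw counters are cumulative since the module was loaded.

```python
def parse_arcstats(text):
    """Parse the text of /proc/spl/kstat/zfs/arcstats into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like "<name> <type> <value>". Header rows either
        # have a different field count or a non-numeric last field.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def miss_percent(stats):
    """Overall ARC miss percentage from the cumulative hit/miss counters."""
    total = stats["hits"] + stats["misses"]
    return 100.0 * stats["misses"] / total if total else 0.0
```

To try it on a live system, read the file with open("/proc/spl/kstat/zfs/arcstats").read() and pass the text to parse_arcstats; sampling twice and subtracting the counters gives per-interval rates.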
Here is a sample from my zpool showing the output of both tools at the same time while reading 95,000 files with an average of 240KB each. The dataset is dedup=verify and has a large percentage of duplicates in it:
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  ----- ¦     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
tank        8.58T  7.67T    239     40  28.9M   101K ¦ 15:05:27  1.1K   238     21    14    1   224   97     7    2   3.0G  3.0G
tank        8.58T  7.67T    240     41  28.6M   103K ¦ 15:05:37  1.1K   236     21    16    1   220   98     8    3   3.0G  3.0G
tank        8.58T  7.67T    230     36  27.4M  91.7K ¦ 15:05:47  1.1K   229     21    17    2   212   97     8    3   3.0G  3.0G
tank        8.58T  7.67T    239     41  28.7M   102K ¦ 15:05:57  1.1K   234     20    15    1   218  100     7    3   3.0G  3.0G
tank        8.58T  7.67T    205     39  24.5M   101K ¦ 15:06:07  1.0K   215     21    17    2   198   99     5    2   3.0G  3.0G
tank        8.58T  7.67T    266     37  32.3M  97.3K ¦ 15:06:17  1.2K   255     21    15    1   240  100     5    1   3.0G  3.0G
tank        8.58T  7.67T    261     40  31.6M   104K ¦ 15:06:27  1.6K   272     16    13    0   259   99     7    1   3.0G  3.0G
tank        8.58T  7.67T    224     52  26.7M   239K ¦ 15:06:37  1.6K   229     14    16    1   213   96     8    1   3.0G  3.0G
tank        8.58T  7.67T    191     37  22.7M  93.0K ¦ 15:06:47  1.4K   200     14    17    1   183  100     8    2   3.0G  3.0G
tank        8.58T  7.67T    182     40  21.6M   103K ¦ 15:06:57  1.2K   180     15    15    1   165  100     7    2   3.0G  3.0G
tank        8.58T  7.67T    237     41  28.4M  98.9K ¦ 15:07:07  1.5K   241     15    13    1   227  100     7    1   3.0G  3.0G
tank        8.58T  7.67T    187     38  22.1M  96.1K ¦ 15:07:17  1.2K   183     14    16    1   166   99     7    2   3.0G  3.0G
tank        8.58T  7.67T    207     39  24.5M  99.6K ¦ 15:07:27  1.3K   201     15    18    1   182  100     9    2   3.0G  3.0G
tank        8.58T  7.67T    183     47  21.5M   227K ¦ 15:07:37  1.3K   199     15    18    1   181  100     9    2   3.0G  3.0G
tank        8.58T  7.67T    197     37  23.4M  95.4K ¦ 15:07:47  1.2K   183     15    16    1   167   97     7    2   3.0G  3.0G
tank        8.58T  7.67T    212     60  25.3M   370K ¦ 15:07:57  1.4K   210     14    16    1   193  100     8    1   3.0G  3.0G
tank        8.58T  7.67T    231     39  27.5M   104K ¦ 15:08:07  1.5K   223     15    17    1   206  100     8    1   3.0G  3.0G
The columns are as follows:
pool : The zpool name
alloc : Allocated raw bytes
free : Free raw bytes
(ops) read : Average read I/O per second
(ops) write : Average write I/O per second
(bw) read : Average read bytes per second
(bw) write : Average write bytes per second
time : Time
read : Total ARC accesses per second
miss : ARC misses per second
miss% : ARC miss percentage
dmis : Demand Data misses per second
dm% : Demand Data miss percentage
pmis : Prefetch misses per second
pm% : Prefetch miss percentage
mmis : Metadata misses per second
mm% : Metadata miss percentage
arcsz : ARC Size
c : ARC Target Size
Of particular note is the IOPS relative to the bandwidth. Specifically, if the bandwidth divided by IOPS is <128KB, it means we are reading small files, which will affect performance compared to reading large files with sequential 128KB blocks. In other words, an average of 25MB/s looks poor until we take into account that it is achieved while reading hundreds of small files (<128KB each) per second. For large files the average bandwidth is typically 10x higher.
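The bandwidth-per-operation check is simple arithmetic, sketched below for a row of zpool iostat output. The helper names are hypothetical, and the sketch assumes the K/M/G/T suffixes are powers of 1024.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a zpool-style size such as '28.9M' or '512' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)


def avg_read_kib(bandwidth, ops):
    """Average size of one read operation, in KiB."""
    return parse_size(bandwidth) / ops / 1024


# First row of the sample above: 239 read ops/s at 28.9M/s.
print("%.1f KiB per read" % avg_read_kib("28.9M", 239))
```

For that first sample row this comes to a little under 124KB per read, just below the 128KB mark.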
ARC reads are rather high, betraying the fact that we are accessing deduplicated data. With ~15% ARC misses, we can assume that about 15% of our reads are wasted on ARC lookups that miss. Prefetching is virtually useless here (if this were typical usage, we should disable prefetching altogether, but for larger file reads prefetching pays handsome dividends).
Another very important thing we learn from the above numbers is the low rate of data and metadata misses. This is extremely important, arguably more important than ARC misses and DDT thrashing, as a high metadata miss rate translates into far higher latency when statting files, and virtually any file operation, however trivial, will suffer significantly. For that reason, it’s best to make sure the metadata cache is large enough for our storage.
During the above run my /etc/modprobe.d/zfs.conf had the following options:
options zfs zfs_arc_max=3221225472
options zfs zfs_arc_meta_limit=1610612736
We can see that the ARC size above is exactly at the max I set, and I know from experience that the default ARC Meta limit of ~900MB gives poor performance on my data. For an 8GB RAM system, a 3GB ARC size and a 1.5GB ARC Meta size give the most balanced performance across my dataset.
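Both module options take plain byte counts, so the 3GB and 1.5GB figures have to be converted by hand. A tiny sketch of that conversion follows; the arc_options helper is hypothetical, and GB is taken to mean GiB, which is what the values above work out to.

```python
GIB = 1024 ** 3


def arc_options(arc_max_gib, meta_limit_gib):
    """Return modprobe option lines for the given sizes, expressed in GiB."""
    return [
        "options zfs zfs_arc_max=%d" % int(arc_max_gib * GIB),
        "options zfs zfs_arc_meta_limit=%d" % int(meta_limit_gib * GIB),
    ]


for line in arc_options(3, 1.5):
    print(line)
```

Called with 3 and 1.5, it reproduces the two lines shown above.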
If you already have dedup enabled, a very handy command that shows deduplication statistics is zdb -DD. The result is extremely useful for understanding the distribution of data and the DDT overhead. Here are two examples from my pool taken months apart and with very different data loads.
# zdb -DD
DDT-sha256-zap-duplicate: 3965882 entries, size 895 on disk, 144 in core
DDT-sha256-zap-unique: 35231539 entries, size 910 on disk, 120 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    33.6M   4.18T   4.09T   4.10T    33.6M   4.18T   4.09T   4.10T
     2    3.52M    431G    409G    412G    7.98M    976G    924G    931G
     4     250K   26.0G   23.0G   23.6G    1.20M    128G    113G    116G
     8    15.5K   1.07G    943M   1001M     143K   9.69G   8.29G   8.83G
    16    1.16K   46.0M   15.9M   23.2M    23.5K    855M    321M    470M
    32      337   7.45M   2.64M   4.86M    13.7K    285M    102M    195M
    64      127   1.20M    477K   1.33M    10.1K    104M   45.9M    116M
   128       65   1.38M   85.5K    567K    11.5K    258M   14.6M    100M
   256       22    154K   26.5K    184K    7.22K   45.5M   10.2M   61.1M
   512       11    133K   5.50K   87.9K    7.71K    101M   3.85M   61.6M
    1K        3   1.50K   1.50K   24.0K    4.42K   2.21M   2.21M   35.3M
    2K        2      1K      1K   16.0K    4.27K   2.14M   2.14M   34.1M
    4K        3   1.50K   1.50K   24.0K    12.8K   6.38M   6.38M    102M
    8K        2      1K      1K   16.0K    21.6K   10.8M   10.8M    173M
   16K        1    128K     512   7.99K    20.3K   2.54G   10.2M    162M
 Total    37.4M   4.62T   4.51T   4.52T    43.1M   5.27T   5.11T   5.13T

dedup = 1.13, compress = 1.03, copies = 1.00, dedup * compress / copies = 1.16
The first two lines give us footprint information. First we get the number of entries with at least 2 references (refcnt >= 2) and the size per entry on disk and in RAM. In my case 3.96m deduplicated entries are taking 3385MB on disk and 545MB in RAM. The second line represents the same information for unique entries (entries that are not benefiting from deduplication). I had 35.2m unique entries consuming 30575MB (29.9GB) on disk and 4032MB (3.9GB) in RAM. That’s a total of 33960MB (33.2GB) on disk and 4577MB (4.5GB) in RAM in 39.2m entries, for the benefit of saving 12-13% of 5.13TB of data, or a little over 600GB.
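The footprint figures are simply the entry count multiplied by the per-entry sizes in bytes. A small sketch that pulls them out of the first two lines of the zdb -DD output might look like this; the ddt_footprint helper and its regular expression are my own illustration, written against the line format shown above.

```python
import re

# Matches e.g. "DDT-sha256-zap-unique: 35231539 entries, size 910 on disk, 120 in core"
DDT_LINE = re.compile(
    r"(DDT-\S+):\s+(\d+) entries, size (\d+) on disk, (\d+) in core"
)


def ddt_footprint(zdb_output):
    """Return {ddt_name: (entries, disk_bytes, core_bytes)} from zdb -DD text."""
    result = {}
    for name, entries, on_disk, in_core in DDT_LINE.findall(zdb_output):
        n = int(entries)
        # Total footprint = number of entries * bytes per entry.
        result[name] = (n, n * int(on_disk), n * int(in_core))
    return result
```

Summing the second and third fields over both tables, and dividing by 1024*1024, gives the total on-disk and in-core footprint in MB.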
This is not that bad, but it’s not great either. Considering the footprint of the DDT, the benefit of saving 12-13% comes at a high cost. Notice that compression is independent of dedup gains and so I’m not accounting for it, although it’s nice to see the overall gains with compression combined. We’ll get to copies in a bit.
The histogram groups blocks by their number of references. For example, the second-to-last line covers blocks with 16-32 thousand references each: here it is a single block with a logical size (LSIZE) of 128KB that takes 512 bytes physically (PSIZE), implying that this block is highly compressed in addition to being highly redundant. I’m assuming DSIZE is the disk size, meaning the actual bytes used in the array, including parity and overhead, for this block. For this particular entry there are in fact 20.3 thousand references with a total logical size of 2.54GB; compressed, they collectively weigh only 10.2MB, and without dedup they would take 162MB of actual disk real estate (the referenced DSIZE). With dedup, the single allocated copy takes just 7.99KB, which is great considering that without dedup and compression they’d consume 2.54GB plus parity and overhead.
Powered by this valuable information, I set out to reorganize my data. The following is the result after a partial restructuring. There is certainly more room for optimization, but let’s use this snapshot for comparison.
# zdb -DD
DDT-sha256-zap-duplicate: 1382250 entries, size 2585 on disk, 417 in core
DDT-sha256-zap-unique: 11349202 entries, size 2826 on disk, 375 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    10.8M   1.33T   1.27T   1.27T    10.8M   1.33T   1.27T   1.27T
     2    1.07M    124G    108G    110G    2.50M    292G    253G    258G
     4     206K   22.5G   11.4G   12.4G    1021K    112G   50.9G   56.1G
     8    18.7K    883M    529M    634M     177K   9.37G   5.47G   6.43G
    16    31.0K   3.80G   1.67G   1.81G     587K   71.8G   31.9G   34.5G
    32      976   93.7M   41.6M   46.6M    42.2K   4.09G   1.80G   2.02G
    64      208   13.0M   5.30M   6.45M    17.0K   1.04G    433M    531M
   128       64   2.94M   1.30M   1.73M    11.1K    487M    209M    286M
   256       20    280K     25K    168K    7.17K   81.0M   10.0M   60.7M
   512        8    132K      4K   63.9K    5.55K   80.5M   2.77M   44.4M
    1K        1     512     512   7.99K    1.05K    536K    536K   8.36M
    2K        4      2K      2K   32.0K    11.9K   5.93M   5.93M   94.8M
    4K        3   1.50K   1.50K   24.0K    19.8K   9.89M   9.89M    158M
 Total    12.1M   1.47T   1.39T   1.40T    15.2M   1.81T   1.60T   1.62T

dedup = 1.16, compress = 1.13, copies = 1.01, dedup * compress / copies = 1.29
A quick glance at the numbers shows major differences. First and foremost, the DDT contains only 12.7m entries (down from 39.2m above), while the dedup gain is up at 15-16% (a net gain of ~3%). Compression is way up at 13% from 3%, with a slight overhead of extra copies at 1%. The extra copies showed up when I manually “deduplicated” identical files by hardlinking them. Normally copies “are in addition to any redundancy provided by the pool, for example, mirroring or RAID-Z. The copies are stored on different disks, if possible. The space used by multiple copies is charged to the associated file and dataset, changing the used property and counting against quotas and reservations,” according to the ZFS man page, but I’m not entirely clear as to why they showed up when I hardlinked large files and not, say, when I already had highly redundant files which could benefit from extra redundancy in case of an unrecoverable corruption.
What really counts in the above numbers is the relative effectiveness of deduplication. That is, the higher the dedup percentage, the lower the overhead of unique blocks becomes. It’s true that I reduced the number of duplicate blocks, but that’s mostly because I either deleted duplicate entries or I hardlinked them. So they weren’t really beneficial to me. Meanwhile, I reduced the number of unique entries substantially, increasing the utility of deduplication. This means the overhead, which is now far lower than it used to be, is being utilized better than before. This gives me an overall net gain that could be quantified by the dedup * compress / copies formula, which went from 16% to 29%, which is almost double.
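The three ratios on the summary line appear to follow from the Total row of the histogram. The sketch below assumes dedup is referenced DSIZE over allocated DSIZE, compress is referenced LSIZE over referenced PSIZE, and copies is referenced DSIZE over referenced PSIZE. That reading is my assumption; it reproduces the printed values for both runs to within rounding, but I have not confirmed it against the zdb sources.

```python
def dedup_ratios(alloc_dsize, ref_lsize, ref_psize, ref_dsize):
    """Return (dedup, compress, copies, combined) from the Total row.

    All four inputs must be in the same unit (e.g. TB).
    """
    dedup = ref_dsize / alloc_dsize
    compress = ref_lsize / ref_psize
    copies = ref_dsize / ref_psize
    return dedup, compress, copies, dedup * compress / copies


# Total rows of the two runs above, in TB.
before = dedup_ratios(4.52, 5.27, 5.11, 5.13)
after = dedup_ratios(1.40, 1.81, 1.60, 1.62)
print("before: dedup=%.3f compress=%.3f copies=%.3f combined=%.3f" % before)
print("after:  dedup=%.3f compress=%.3f copies=%.3f combined=%.3f" % after)
```

Under that assumption the combined figure reduces to referenced LSIZE divided by allocated DSIZE, i.e. logical data stored per byte actually allocated on disk.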
I still have more work to do in optimizing my data and deduplicated datasets. Ideally, we should only have dedup enabled on datasets that either gain at least 20% from deduplication (although some would put that number far higher) or whose unique blocks are potential duplicates pending future ingestion. Unique data that is deemed to remain unique has no place in deduplicated datasets and should be moved into a separate dataset with dedup=off. Similarly, duplicate data that is best deduplicated manually with hardlinks, or simply deleted, should be handled that way to reduce undue overhead and waste.
Once we decide that a dataset is not benefiting from deduplication (typically by finding duplicates across the full zpool or by doing other statistical analyses), we can set dedup=off on the dataset. However, this will not remove blocks from the DDT until and unless they are rewritten. A fast and easy method is to use the send and receive commands.
First, we need to either rename the dataset or create a new one with a different name (and rename after we destroy the first).
# Rename old dataset out of the way.
zfs rename tank/data tank/data_old
# Take a recursive snapshot, which is necessary for zfs send.
zfs snapshot -r tank/data_old@head
# Now let's create the new dataset without dedup.
zfs create -o dedup=off -o compression=gzip-9 tank/data
# And let's copy our data into its new home, which will not be included in the DDT.
zfs send -R tank/data_old@head | zfs recv tank/data
# Remove the snapshot from the new dataset.
zfs destroy tank/data@head
# Validate by comparing the data to be identical between data and data_old.
# After validating everything, destroy the source.
#zfs destroy -r tank/data_old
The end result of the above is to remove a dataset from the DDT.
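For the validation step mentioned in the comments above, one simple approach is to hash every file in both trees and compare the results. The sketch below is only one way to do it: the function names are hypothetical, it checks file contents and relative paths only (not ownership, permissions or timestamps), and it skips symlinks.

```python
import hashlib
import os


def _sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path, relative to root, to its digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # symlinks and special files are not compared
            digests[os.path.relpath(path, root)] = _sha256(path)
    return digests


def trees_match(old_root, new_root):
    """True if both trees hold the same files with identical contents."""
    return tree_digests(old_root) == tree_digests(new_root)
```

Assuming the datasets are mounted at /tank/data_old and /tank/data, a call such as trees_match("/tank/data_old", "/tank/data") should return True before the source is destroyed.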
Note that -F on recv forces the receive, so it will not fail if the target dataset already has some data in it! This is useful if we need to start all over after moving some data into the new dataset. There are also -d and -e options, which are useful for recreating the tree structure of the source. These are typically needed when datasets move between pools. To see the target tree with a dry run, add -nv to recv. Note that the target dataset will be locked during the receive and will appear empty in ls. The output of send can be redirected to a file or piped into gzip for backup.
Hope you found this helpful. Feel free to share your thoughts and experience in the comments.