Feb 152014

Download as ebook

In the previous post we dived into Compression and Deduplication in ZFS. Here we’ll look at Deduplication in practice based on my experience.

To Dedup or Not to Dedup

Compression almost always offers a win-win situation. The overhead of compression is contained (not storage-wide) and except where there is heavy read/modify/write, it will not have unjustified overhead. It’s only when we have heavy writes on compressed datasets that we need to benchmark and decide which compression algorithm we should use, if any.

Deduplication is more demanding and a weightier decision to make. When there is no benefit from deduplication, the overhead will bring the system performance to a grinding halt. It is best to have a good idea of the data to be stored and do some research before enabling dedup by default (i.e. on the root). When the benefit from deduplication is minimal, the overhead will not only be in terms of higher memory usage, lower performance, but the DDT will be using disk space in the dozens, if not hundreds, of GBs. This means that there is a minimum dedup gain below which the overhead will simply negate the benefit, even if our hardware is powerful enough.

Regardless on the data, there are four scenarios that I can think of where deduplication makes sense. These are when:

    1. Duplicates are between multiple users.
    2. Duplicates are between multiple datasets.
    3. Duplicates are spread over millions of small files.
    4. Need copy-on-write functionality.

To see why the above make sense, let’s take a hypothetical case where we have one user and one dataset with thousands of GB-sized files. If we have duplicate files in this case, we can simply find them and hardlink duplicates. It should be clear that this will work perfectly, and it’s practical to frequently rediscover duplicates and hardlink them, except for the above four cases. That is, hardlinks don’t work across users (I’ll leave aside the issue of privacy and consider what happens when one user decides to modify a hardlinked file and inadvertently ends up modifying the other user’s version as well, since they are hardlinked,) they don’t work across datasets (i.e. filesystems) by definition (because inodes are only meaningful in a single filesystem) and hardlinking will be more complicated and a maintenance issue if our duplicate data is mostly a large number of small files. Finally, hardlinks simply means the filesystem is pointing multiple files to the same data. Modifying one will modify the other, which might not be desirable even in a single-user scenario, rather than create a new copy. ZFS dedup feature works using copy-on-write, so modifying a deduplicated block will force ZFS to copy it first before writing the modified data, thereby preserving the other previously-duplicate files to the one being modified.

If after ingesting data in a dedup=on dataset we decide to remove deduplication, we will need to create another dataset with dedup=off, and copy our data over (one can also use send/receive, which are beyond the scope of this write up see below). If you are in a situation where you have dedup enabled (at least on some of the datasets) but aren’t sure whether or not performance is suffering, there are a few tools I can suggest.

Monitoring Tools

zpool iostat [interval [count]] works very much like iostat except it is on the ZFS-pool level. I use zpool iostat 10 to see the IOPS and bandwidth on the pool averaged at 10 second periods.

Unfortunately, zpool iostat doesn’t include cache hits and misses, which are very important if we are interested in DDT thrashing. Fortunately, there is a python script (direct link) in the ZFS on Linux sources that does that.

arcstat.py [interval] dumps hit and miss statistics as well as ARC size and more (the columns are customizable, but the default should be sufficient). For those interested, arcstat uses raw numbers with some fields computed on the fly directly from /proc/spl/kstat/zfs/arcstats, which can be read and parsed as one wishes. Try cat /proc/spl/kstat/zfs/arcstats for example.

Here is a sample from my zpool showing the output of both tools at the same time while reading 95,000 files with an average of 240KB each. The dataset is dedup=verify and has a large percentage of duplicates in it:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----    ¦        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
tank        8.58T  7.67T    239     40  28.9M   101K    ¦    15:05:27  1.1K   238     21    14    1   224   97     7    2   3.0G  3.0G
tank        8.58T  7.67T    240     41  28.6M   103K    ¦    15:05:37  1.1K   236     21    16    1   220   98     8    3   3.0G  3.0G
tank        8.58T  7.67T    230     36  27.4M  91.7K    ¦    15:05:47  1.1K   229     21    17    2   212   97     8    3   3.0G  3.0G
tank        8.58T  7.67T    239     41  28.7M   102K    ¦    15:05:57  1.1K   234     20    15    1   218  100     7    3   3.0G  3.0G
tank        8.58T  7.67T    205     39  24.5M   101K    ¦    15:06:07  1.0K   215     21    17    2   198   99     5    2   3.0G  3.0G
tank        8.58T  7.67T    266     37  32.3M  97.3K    ¦    15:06:17  1.2K   255     21    15    1   240  100     5    1   3.0G  3.0G
tank        8.58T  7.67T    261     40  31.6M   104K    ¦    15:06:27  1.6K   272     16    13    0   259   99     7    1   3.0G  3.0G
tank        8.58T  7.67T    224     52  26.7M   239K    ¦    15:06:37  1.6K   229     14    16    1   213   96     8    1   3.0G  3.0G
tank        8.58T  7.67T    191     37  22.7M  93.0K    ¦    15:06:47  1.4K   200     14    17    1   183  100     8    2   3.0G  3.0G
tank        8.58T  7.67T    182     40  21.6M   103K    ¦    15:06:57  1.2K   180     15    15    1   165  100     7    2   3.0G  3.0G
tank        8.58T  7.67T    237     41  28.4M  98.9K    ¦    15:07:07  1.5K   241     15    13    1   227  100     7    1   3.0G  3.0G
tank        8.58T  7.67T    187     38  22.1M  96.1K    ¦    15:07:17  1.2K   183     14    16    1   166   99     7    2   3.0G  3.0G
tank        8.58T  7.67T    207     39  24.5M  99.6K    ¦    15:07:27  1.3K   201     15    18    1   182  100     9    2   3.0G  3.0G
tank        8.58T  7.67T    183     47  21.5M   227K    ¦    15:07:37  1.3K   199     15    18    1   181  100     9    2   3.0G  3.0G
tank        8.58T  7.67T    197     37  23.4M  95.4K    ¦    15:07:47  1.2K   183     15    16    1   167   97     7    2   3.0G  3.0G
tank        8.58T  7.67T    212     60  25.3M   370K    ¦    15:07:57  1.4K   210     14    16    1   193  100     8    1   3.0G  3.0G
tank        8.58T  7.67T    231     39  27.5M   104K    ¦    15:08:07  1.5K   223     15    17    1   206  100     8    1   3.0G  3.0G

The columns are as follows:

       pool : The zpool name
      alloc : Allocated raw bytes
       free : Free raw bytes
 (ops) read : Average read I/O per second
(ops) write : Average write I/O per second
  (bw) read : Average read bytes per second
 (bw) write : Average write bytes per second

       time : Time
       read : Total ARC accesses per second
       miss : ARC misses per second
      miss% : ARC miss percentage
       dmis : Demand Data misses per second
        dm% : Demand Data miss percentage
       pmis : Prefetch misses per second
        pm% : Prefetch miss percentage
       mmis : Metadata misses per second
        mm% : Metadata miss percentage
      arcsz : ARC Size
          c : ARC Target Size

Of important note is the IOPS relative to the bandwidth. Specifically, if the bandwidth divided by IOPS is <128KB, it means we are reading small files, which will affect performance compared to reading large files with sequential 128KB blocks. In other words, an average 25MB/s looks poor until we take into account that it is during reading 100s of small files (<128KB each) per second. For large files the average bandwidth is typically 10x higher.

ARC reads is rather high, betraying the fact that we are accessing deduplicated data. With ~15% ARC misses, we can assume that about 15% of our reads are wasted to ARC reads. Prefetching is virtually useless here (and if this were a typical usage, we should disable prefetching altogether, but for larger file reads prefetching pays dividends handsomely).

Another very important thing we learn from the above numbers is the low data and metadata misses. This is extremely important, arguably more important than ARC misses and DDT thrashing, as having high metadata miss-rate will translate in far higher latency in statting files and virtually any file operation, however trivial, will also suffer significantly. For that, it’s best to make sure the metadata cache is large enough for our storage.

During the above run my /etc/modprobe.d/zfs.conf had the following options:
options zfs zfs_arc_max=3221225472
options zfs zfs_arc_meta_limit=1610612736

We can see that the ARC size above is exactly at the max I set and I know from experience that the default ARC Meta limit of ~900MB gives poor performance on my data. For an 8GB RAM system, 3GB ARC size and 1.5GB ARC Meta size gives the most balanced performance across my dataset.

DDT Histogram

If you already have dedup enabled at all, a very handy command that shows deduplication statistics is zdb -DD . The result is extremely useful to understand the distribution of data and DDT overhead. Here are two examples from my pool taken months apart and with very different data loads.

#zdb -DD
DDT-sha256-zap-duplicate: 3965882 entries, size 895 on disk, 144 in core
DDT-sha256-zap-unique: 35231539 entries, size 910 on disk, 120 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    33.6M   4.18T   4.09T   4.10T    33.6M   4.18T   4.09T   4.10T
     2    3.52M    431G    409G    412G    7.98M    976G    924G    931G
     4     250K   26.0G   23.0G   23.6G    1.20M    128G    113G    116G
     8    15.5K   1.07G    943M   1001M     143K   9.69G   8.29G   8.83G
    16    1.16K   46.0M   15.9M   23.2M    23.5K    855M    321M    470M
    32      337   7.45M   2.64M   4.86M    13.7K    285M    102M    195M
    64      127   1.20M    477K   1.33M    10.1K    104M   45.9M    116M
   128       65   1.38M   85.5K    567K    11.5K    258M   14.6M    100M
   256       22    154K   26.5K    184K    7.22K   45.5M   10.2M   61.1M
   512       11    133K   5.50K   87.9K    7.71K    101M   3.85M   61.6M
    1K        3   1.50K   1.50K   24.0K    4.42K   2.21M   2.21M   35.3M
    2K        2      1K      1K   16.0K    4.27K   2.14M   2.14M   34.1M
    4K        3   1.50K   1.50K   24.0K    12.8K   6.38M   6.38M    102M
    8K        2      1K      1K   16.0K    21.6K   10.8M   10.8M    173M
   16K        1    128K     512   7.99K    20.3K   2.54G   10.2M    162M
 Total    37.4M   4.62T   4.51T   4.52T    43.1M   5.27T   5.11T   5.13T

dedup = 1.13, compress = 1.03, copies = 1.00, dedup * compress / copies = 1.16

The first two lines give us footprint information. First we get the number of entries with at least 2 references (refcnt >= 2) and the size per entry on disk and in RAM. In my case 3.96m deduplicated entries are taking 3385MB on disk and 545MB in RAM. The second line represents the same information for unique entries (entries that are not benefiting from deduplication). I had 35.2m unique entries consuming 30575MB (29.9GB) on disk and 4032MB (3.9GB) in RAM. That’s a total of 33960MB (33.2GB) on disk and 4486MB (4.4GB) in RAM in 39.2m entries, for the benefit of saving 12-13% of 5.13TB of data, or a little over 600GB.

This is not that bad, but it’s not great either, considering the footprint of the DDT, the benefit of saving 12-13% comes at a high cost. Notice that compression is independent of dedup gains and so I’m not accounting for it. Although it’s nice to see the overall gains from dedup and compression combined. We’ll get to copies in a bit.

The histogram shows the number of references per block size. For example, the second to last line shows that there are 16-32 thousand references to a single block that has a logical size (LSIZE) of 128KB taking 512 bytes physically (PSIZE,) implying that this is a highly compressed block in addition to being highly redundant on disk. I’m assuming DSIZE is the disk size, meaning the actual bytes used in the array, including parity and overhead, for this block. For this particular entry there are in fact 20.3 thousand references with a total logical size of 2.54GB but compressed they collectively weigh at only 10.2MB, however, they seem to be taking 162MB of actual disk real estate (which is great, considering without dedup and compression they’d consume 2.54GB + parity and overhead).

Powered by this valuable information, I set out to reorganize my data. The following is the result after a partial restructuring. There is certainly more room for optimization, but let’s use this snapshot for comparison.

#zdb -DD
DDT-sha256-zap-duplicate: 1382250 entries, size 2585 on disk, 417 in core
DDT-sha256-zap-unique: 11349202 entries, size 2826 on disk, 375 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    10.8M   1.33T   1.27T   1.27T    10.8M   1.33T   1.27T   1.27T
     2    1.07M    124G    108G    110G    2.50M    292G    253G    258G
     4     206K   22.5G   11.4G   12.4G    1021K    112G   50.9G   56.1G
     8    18.7K    883M    529M    634M     177K   9.37G   5.47G   6.43G
    16    31.0K   3.80G   1.67G   1.81G     587K   71.8G   31.9G   34.5G
    32      976   93.7M   41.6M   46.6M    42.2K   4.09G   1.80G   2.02G
    64      208   13.0M   5.30M   6.45M    17.0K   1.04G    433M    531M
   128       64   2.94M   1.30M   1.73M    11.1K    487M    209M    286M
   256       20    280K     25K    168K    7.17K   81.0M   10.0M   60.7M
   512        8    132K      4K   63.9K    5.55K   80.5M   2.77M   44.4M
    1K        1     512     512   7.99K    1.05K    536K    536K   8.36M
    2K        4      2K      2K   32.0K    11.9K   5.93M   5.93M   94.8M
    4K        3   1.50K   1.50K   24.0K    19.8K   9.89M   9.89M    158M
 Total    12.1M   1.47T   1.39T   1.40T    15.2M   1.81T   1.60T   1.62T

dedup = 1.16, compress = 1.13, copies = 1.01, dedup * compress / copies = 1.29

A quick glance at the numbers and we see major differences. First and foremost the DDT contains only 12.7m entries (down from 39.2m above,) while the dedup ratio is up at 15-16% (a net gain of ~3%). Compression is way up at 13% from 3% with a slight overhead of extra copies at 1%. The extra copies showed up when I manually “dedpulicated” identical files by hard-linking them. Normally copies “are in addition to any redundancy provided by the pool, for example, mirroring or RAID-Z. The copies are stored on different disks, if possible. The space used by multiple copies is charged to the associated file and dataset, changing the used property and counting against quotas and reservations.” according to the ZFS man page, but I’m not entirely clear as why they showed when I hard linked large files and not, say, when I already had highly-redundant files which could benefit from extra redundancy in case of an unrecoverable corruption.

What really counts in the above numbers is the relative effectiveness of deduplication. That is, the higher the dedup percentage, the lower the overhead of unique blocks becomes. It’s true that I reduced the number of duplicate blocks, but that’s mostly because I either deleted duplicate entries or I hardlinked them. So they weren’t really beneficial to me. Meanwhile, I reduced the number of unique entries substantially, increasing the utility of deduplication. This means the overhead, which is now far lower than it used to be, is being utilized better than before. This gives me an overall net gain that could be quantified by the dedup * compress / copies formula, which went from 16% to 29%, which is almost double.

I still have more work to do in optimizing my data and deduplicated datasets. Ideally, we should only have dedup enabled on datasets that either gain at least 20% from deduplication (although some would put that number far higher) or the unique blocks are potential duplicate entries pending future ingestion. Unique data that is deemed to remain unique has no place in deduplicated datasets and should be moved into a separate dataset with dedup=off. Similarly, duplicate that that is best manually deduplicated by hardlinks, or duplicates deleted, should be done so to reduce undue overhead and waste.


Once we decide that a dataset is not benefiting from deduplication (typically by finding duplicates across the full zpool or by doing other statistical analyses,) we can set dedup=off on the dataset. However this will not remove blocks from the DDT until and unless they are re-written. A fast and easy method is to use send and receive commands.

First, we need to either rename the dataset or create a new one with a different name (and rename after we destroy the first).

#Rename old dataset out of the way.
zfs rename tank/data tank/data_old

#Take a recursive snapshot which is necessary for zfs send.
zfs snapshot -r tank/data_old@head

#Now let's create the new dataset without dedup.
zfs create -o dedup=off -o compression=gzip-9 tank/data

#And let's copy our data in its new home, which will not be included in the DDT.
zfs send -R tank/data_old@head | zfs recv tank/data

#Remove the snapshot from the new dataset.
zfs destroy tank/data@head

#Validate by comparing the data to be identical between data and data_old.

#After validating everything, destroy the source.
#zfs destroy -r tank/data_old

The end result of the above is to remove a dataset form the DDT.

Note that a -F on recv will force it, which will not fail if data has some data! This is useful if we need to start all over after moving some data into the new dataset. There is a -d and -e options, which are useful for recreating the tree structure of the source. These are typically needed when datasets move between pools. To see the target tree with a dry run, add -nv to recv. Note that the target dataset will be locked during the receive and will show empty using ls. The output of send can be redirected to a file or piped into gzip for backup.

Hope you found this helpful. Feel free to share your thoughts and experience in the comments.

Download as ebook


  2 Responses to “18TB Home NAS/HTPC with ZFS on Linux (Part 5)”

  1. I did deduplication on 4Gb/16TB Raidz2 system. What a terrible mistake. System freezes every time, i cannot download my 3Tb data set… I upgraded to 8Gb, but still no luck

    Like or Dislike: Thumb up 0 Thumb down 0 (0)

    • I have overwhelmed my system as well and know your feeling. However 4GB is extremely low. You have hardly enough room for the kernel and other applications, let alone file caching and ZFS overhead.

      Here are a few suggestions (use with care and forethought please!).

      First, you need to boot with a boot disk. Download a minimal distro (Ubuntu minimal at 30MB would do) and make a bootable USB (or CD) from it. Boot your box from the bootable drive. Once in, break into a shell.

      In /etc/modprobe.d/zfs.conf set the following and tune them to your system (cp zfs.conf zfs.conf.bak first):

      options zfs zfs_arc_max=2147483648
      options zfs zfs_arc_meta_limit=536870912
      options zfs zfs_dedup_prefetch=0
      options zfs zfs_prefetch_disable=1
      options zfs zfs_free_max_blocks=50000
      options zfs zfs_read_history=1000
      options spl spl_kmem_cache_slab_limit=16384

      You need to reboot. These settings can be set without reboot if you write to their respective entries under /proc.
      You should also have the latest ZFS modules as they have improved the performance and stability quite a bit. In fact, a couple of the settings above do not exist in 0.6.2 version.

      I have written about removing dedup from a dataset. You can look into that. Essentially create another dataset with dedup=off and send/recv the data to it. Once that is done, deleting the old dataset should not be done in one step if your memory is low. Instead, use ‘find -size -XX -delete’ to delete smaller files first, then larger ones until you are done. You should choose a size that you know beforehand will not freeze your system.

      Hope this helps.

      Like or Dislike: Thumb up 1 Thumb down 0 (+1)

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>


(required, not public)


QR Code Business Card