Monday, January 14, 2013

To DEDUP or not to DEDUP, that is the question



To Dedup or Not to Dedup??

In addition to compression, ZFS offers deduplication, which (potentially) lets you use less disk space to store the same amount of data. Nothing comes for free, as they say, and deduplication is no exception.
You need to give ZFS massive amounts of RAM to hold its deduplication tables in memory, or you can kiss any kind of write performance goodbye (and in the worst cases make your fileserver unresponsive for significant periods of time).
Even if you throw plenty of RAM at deduplication, it is still best to treat it as a “backup” device like a tape drive. That is to say, you do backups to it and:

DO NOT USE IT FOR VM STORAGE. LZJB compression is probably fine, but deduplication will kill your VMs' responsiveness.
DO NOT USE IT FOR A PRODUCTION DATABASE. Test databases, where response time/processing speed is not a factor, are a possible use.

There, I warned you.
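If plain space savings are what you are after, compression is the much cheaper knob to turn. A minimal sketch (the dataset name here is made up for the example):

root@ubuntuzfs03:~# zfs set compression=lzjb zpool1/cifs/general_storage
root@ubuntuzfs03:~# zfs get compression,compressratio zpool1/cifs/general_storage

compressratio shows how much you are actually saving, with none of the RAM demands described below.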

Now, here are some (half-assed) numbers you can use as a rough guide to pre-allocating your RAM for a deduplicating pool.

I’m testing with ZFS on Ubuntu. For ZFS on Solaris or OpenIndiana, the procedure for changing zfs_arc_meta_limit is slightly different.
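For reference, on Solaris or OpenIndiana these tunables normally go in /etc/system rather than a modprobe config. A rough sketch only (values in bytes; I have not verified the exact name of the metadata-limit tunable on those platforms, so treat that line as an assumption and check your release's docs):

* /etc/system -- ZFS ARC tuning (sketch only)
set zfs:zfs_arc_max = 3000000000
* the meta limit tunable name varies by release (arc_meta_limit vs zfs_arc_meta_limit)
set zfs:zfs_arc_meta_limit = 2900000000

A reboot is needed there as well for /etc/system changes to take effect.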

Testing Setup:

root@ubuntuzfs03:~# uname -a
Linux ubuntuzfs03 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu-zfs
spl-0.6.0-rc13
zfs-0.6.0-rc13

VM with 2 virtual CPUs and 8GB of RAM


Set our ARC max size to ~3GB
Set our ARC meta limit to ~2.9GB

Edit /etc/modprobe.d/zfs.conf; the values are in bytes:

root@ubuntuzfs03:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=3000000000 zfs_arc_meta_limit=2900000000

YOU MUST REBOOT for those changes to take effect.
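Once you are back up, a quick sanity check is to read the live module parameters straight out of sysfs (the standard location for ZFS on Linux module parameters); they should echo back the values from zfs.conf:

root@ubuntuzfs03:~# cat /sys/module/zfs/parameters/zfs_arc_max
3000000000
root@ubuntuzfs03:~# cat /sys/module/zfs/parameters/zfs_arc_meta_limit
2900000000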

I use a small script to check the status of my config and running values:
zfs_show.sh
------------------
#!/bin/bash

echo "config"
echo "---------------------"
cat /etc/modprobe.d/zfs.conf
echo " "
echo " "
echo "runtime values"
echo "---------------------"
grep -E 'arc|c_max' /proc/spl/kstat/zfs/arcstats



Now check (AFTER REBOOTING) that your settings have actually taken effect:

root@ubuntuzfs03:~# ~/zfs_show.sh
config
---------------------
options zfs zfs_arc_max=3000000000 zfs_arc_meta_limit=2900000000



runtime values
---------------------
c_max                           4    3000000000
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    132929416
arc_meta_limit                  4    2900000000
arc_meta_max                    4    1457235528
root@ubuntuzfs03:~#

The config options in /etc/modprobe.d/zfs.conf should match the
runtime values, or something went wrong.

Now, for our test, we explicitly set the record size to 4K and turn dedup on:

root@ubuntuzfs03:~# zfs set recsize=4K zpool1/cifs/dedup_storage
root@ubuntuzfs03:~# zfs set dedup=on zpool1/cifs/dedup_storage
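It is worth confirming that both properties actually stuck before copying anything in:

root@ubuntuzfs03:~# zfs get recordsize,dedup zpool1/cifs/dedup_storage

Both should come back with the values just set (4K and on).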


I’m copying the files over a 100Mbit LAN connection, so the 11.1 MB/s is reasonable:



Total size of files (same 2 large files copied multiple times)




Math
----
47698814976 bytes of data (files and folders included)
47698814976 / 4096 = 11,645,219 data blocks

1279193040 bytes arc_meta_used
1279193040 meta bytes / 11645219 data blocks           =  ~110 meta bytes per 4K data block
1279193040 meta bytes / 47698814976 bytes of data      =  0.0268181304010097 bytes of meta per byte of data  (2.7%)

So by that math, 2TB of disk space (2,199,023,255,552 bytes, depending on how the manufacturer calculates a TB)
of data would require AT LEAST 59,373,627,900 bytes, or about 55.3 gigabytes, of RAM allocated to arc_meta_limit.
(owch)

Deleting files (for some reason) pushes arc_meta_used even higher.
If I highlight all the files and delete them at once, arc_meta_used climbs to
1457235528 bytes arc_meta_used

So to recalculate:
1457235528 meta bytes / 11645219 data blocks           =  ~125 meta bytes per 4K data block
1457235528 meta bytes / 47698814976 bytes of data      =  0.0305507700502249 bytes of meta per byte of data  (3.1%)

Just to be safe, if you are going to use 4K blocks, I would go with 5%:
2,199,023,255,552 bytes of data would require AT LEAST 109,951,162,777.6 bytes, or about 102.4 gigabytes, of RAM allocated to arc_meta_limit.
(super owch)
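If you want to redo this arithmetic against your own measurements, here is a rough sketch of the same math as a script (the script name is made up for this post; the defaults are the numbers measured above, and the pool size and safety margin are the knobs you would change):

ddt_ram_estimate.sh
------------------
#!/bin/bash
# Rough arc_meta_limit sizing using the same math as above.
# Usage: ./ddt_ram_estimate.sh [data_bytes] [meta_bytes] [target_pool_bytes] [safety_percent]
DATA_BYTES=${1:-47698814976}       # bytes of test data copied in
META_BYTES=${2:-1457235528}        # arc_meta_used observed after the copy/delete
TARGET_BYTES=${3:-2199023255552}   # size of the pool you plan to dedup (2TB here)
SAFETY_PCT=${4:-5}                 # rounded-up safety margin

awk -v d="$DATA_BYTES" -v m="$META_BYTES" -v t="$TARGET_BYTES" -v s="$SAFETY_PCT" 'BEGIN {
    ratio = m / d                                   # meta bytes per byte of data
    printf "measured meta/data ratio : %.4f (%.1f%%)\n", ratio, ratio * 100
    printf "estimate for target pool : %.0f bytes (%.1f GB)\n", t * ratio, t * ratio / 1024^3
    printf "with %d%% safety margin   : %.0f bytes (%.1f GB)\n", s, t * s / 100, t * s / 100 / 1024^3
}'

Swap in your own arc_meta_used and pool size; the percentages are what matter, the per-block numbers are mostly just for intuition.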

Change the record size to 16K:
root@ubuntuzfs03:~# zfs set recsize=16K zpool1/cifs/dedup_storage

After copying the files, the dedup ratio and AVAIL space report the same values as before, but that is to be expected if you are just
copying the same files over and over again.
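If you would rather check those two numbers from the command line than from screenshots, something like this should do it (dedupratio is a pool-wide property, so it is queried against the pool rather than the dataset):

root@ubuntuzfs03:~# zpool get dedupratio zpool1
root@ubuntuzfs03:~# zfs list -o name,used,avail,refer zpool1/cifs/dedup_storage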

Same total bytes in the files, but arc_meta_used comes out different:
47698814976 bytes of data (files and folders included)
47698814976 / 16384 = 2,911,305 data blocks

522642072 bytes arc_meta_used
522642072 meta bytes / 2911305 data blocks             =  ~180 meta bytes per 16K data block
522642072 meta bytes / 47698814976 bytes of data       =  0.0109571290662666 bytes of meta per byte of data  (1.1%)

Interestingly, the arc_meta_used value did not spike as much during the delete of the 16K blocks; it only went up to 524992920 bytes.

The math I’m using is a little flawed, because it does not take into account how much of arc_meta_used is not related
to dedup. I could probably turn off dedup, copy the same files again, and use that as a baseline, but I feel fairly comfortable
saying:

TLDR summary:
  • you need 5%   of total storage (or more) for a 4K block size
  • you need 2.5% of total storage (or more) for a 16K block size
  • deletes will always be slow, even if you have enough RAM. Imagine slow, and then multiply that by 10. Luckily it happens in the background, but if you are on a quad-core or smaller machine it will slow everything to a crawl. If you don't have enough RAM, it is slow times 100. NOTE: I have experienced complete system locks deleting multi-GB files on Ubuntu. I believe this was supposed to be fixed on Solaris, but apparently hasn't made it into the Ubuntu ZFS yet, so be warned.
  • 97% of the time you will be better off using ONLY compression


Your mileage will definitely vary, so you will have to keep an eye on the reserved vs. runtime values and adjust accordingly.

ZFS supports larger block sizes, up to 128K, but the larger your block size, the lower the dedup ratio you are
going to see. (Unless you are ONLY storing multiple copies of the same large files, in which case just
go with the 128K max.)

<edit 2013-08-07> As David pointed out, you can see the memory usage reported by ZFS by using 'zpool status -D'.
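For example (the exact layout varies by ZFS version, so take this as a pointer rather than gospel):

root@ubuntuzfs03:~# zpool status -D zpool1

Look for the DDT entries summary line and the histogram under it; it breaks the dedup table down into what is held in core versus what has spilled to disk, which is a much more direct answer than my arc_meta_used math above.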


1 comment:

  1. Thanks for this.
    I have the same terrible performance when using dedup.

    You do know that 'zpool status -D' gives you DDT information broken down into what is held in memory and what has spilled over to disk.

    Oh and last bullet point... should be off not of.

    Thanks
