Hey, just a quick note... be careful what you ask for.
(This applies to the ZFS on Linux kernel-mode ZFS, not the FUSE variety.)
From a couple of different servers, it seems you need to leave (at a rough guess) 4GB of memory free for Ubuntu, or your zpool scrub can hang your system if you manually set your zfs_arc_max parameter too high.
Safe config (it has been working for me, anyway) for a 10GB VM:
root@ubuntuzfs03:~# ~/zfs_show.sh
config
---------------------
options zfs zfs_arc_max=6000000000 zfs_arc_meta_limit=4900000000 zfs_arc_min=5900000000
runtime values
---------------------
c_min 4 5900000000
c_max 4 6000000000
size 4 43729016
hdr_size 4 1011296
data_size 4 41999872
other_size 4 717848
anon_size 4 16384
mru_size 4 14393344
mru_ghost_size 4 0
mfu_size 4 27590144
mfu_ghost_size 4 16384
l2_size 4 0
l2_hdr_size 4 0
duplicate_buffers_size 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 36782200
arc_meta_limit 4 4900000000
arc_meta_max 4 36782200
root@ubuntuzfs03:~#
I was running with 8GB of ARC and my zpool scrub was crashing the box. Anyway, just wanted to share, as it's not obvious why a zpool scrub would lock up the system, but it seems to be something to do with the kernel not being able to allocate memory.
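If you want to script that rule of thumb rather than picking numbers by hand, something like this rough sketch works (the 4GB reserve is just my guess from the above, and it assumes /etc/modprobe.d/zfs.conf contains nothing else you want to keep):
#!/bin/bash
# size zfs_arc_max to "all RAM minus ~4GB for Ubuntu"
TOTAL_BYTES=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
RESERVE=$((4 * 1024 * 1024 * 1024))
ARC_MAX=$((TOTAL_BYTES - RESERVE))
echo "options zfs zfs_arc_max=${ARC_MAX}" > /etc/modprobe.d/zfs.conf
echo "wrote zfs_arc_max=${ARC_MAX}, reboot for it to take effect"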
Tuesday, July 30, 2013
Wednesday, July 3, 2013
Breaking up with your tape drive
Dear tape... it’s not you it’s me. I want to see other storage
If you're like me, backing up to tape at a small to medium sized business (SMB) just doesn't make sense anymore. The high cost of a tape drive (even higher for a tape library) and the high cost per MB for each tape make backups an expensive (but necessary) job with $0 ROI.
If you had a satellite office with a reasonably fast VPN connection between it and the main office, you could easily consider replicating your data (one way). This would have the advantage of offsite backups and disaster recovery, but it assumes you have the money for two of every piece of hardware it takes to run your production systems. And two datacenters operating 24/7 is an additional expense.
You can pay for cloud storage and replicate your VMs offsite to Amazon or one of those file hosting services, but at the price you pay per MB that's more expensive than tape, and it doesn't work well for large volumes of data anyway (SQL Server backups, mail backups, fileserver backups, etc.).
So, if you work at an SMB with one office and a limited budget, but you still want to be rid of the hassle of tapes, consider removable disk storage paired with a non-removable drive. More specifically: any machine (desktop/server) with a PCI-e 1X slot and an empty SATA drive bay will work, but a gigabit NIC as close to the production server(s) you are backing up as possible is a definite requirement. If you are backing up large quantities of data, you might go for option 2.
Backup Server O/S options:
Option 1: Ubuntu 12.04 with a gigabit NIC
· free
· rsync
· ZFS support for mirroring (www.zfsonlinux.org), send/receive replication, snapshots (a minimal send/receive sketch follows these options)
· supported by Veeam 6.5 B&R as a backup repository
Option 2: Hyper-V Server 2012
· free
· NIC teaming in either switch independent or LACP modes, with different brand NICs or even a mixture of plug-in NIC cards and motherboard NIC ports
· can run Ubuntu 12.04 as a VM (option 1) to handle backups, plus other VMs to get more use out of server grade hardware
· can also be used for VM replicas/disaster recovery
Option 3: Windows Server 2012
· not free. Standard edition (only 1 VM included) will run you $900 or so.
· NIC teaming in either switch independent or LACP modes, with different brand NICs or even a mixture of plug-in NIC cards and motherboard NIC ports
· can run in parallel with existing tape backup jobs or as a supplement to tape backup jobs (if for some reason you are not able to replace all of them)
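For the one-way replication idea mentioned above, ZFS send/receive over SSH is the usual mechanism. A minimal sketch, with made-up pool/dataset names and remote host:
# take a snapshot on the primary server
zfs snapshot tank/data@nightly-2013-07-03
# first run: send the full snapshot to the satellite/backup box
zfs send tank/data@nightly-2013-07-03 | ssh backupbox zfs receive -F tank/data
# later runs: send only the changes since the previous snapshot
zfs send -i tank/data@nightly-2013-07-02 tank/data@nightly-2013-07-03 | ssh backupbox zfs receive tank/data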
If you are at an SMB you would probably choose option 1 or option 2, as they are the most cost-effective.
The strategy here is to run a backup job (say through Veeam B&R, or Backup Exec, or whatever backup software you are using) to the internal hard drive on the backup server. Then (at a later time) rsync or robocopy any changed blocks/files to the USB external storage. The external storage can be removed for storage in a safe, offsite, or wherever, and you can even swap out external drives like you do tapes. For our experiment, we went with a 4TB USB 3.0 drive paired with a PCI-e 1X controller card. Both are recognized with no problems in Ubuntu and seem to deliver reasonable speeds. We do a “monthend” backup where we take the 4TB drive out of the safe, do a reverse incremental backup with Veeam, and put the drive back in the safe. The total cost of this setup (considering we re-used a desktop PC as the Ubuntu server, option 1) at the time of writing this blog:
$159.99 for the 4TB Seagate drive (as listed on Newegg)
$26.99 for the StarTech 2 Port PCI Express SuperSpeed USB 3.0 Card Adapter, model PEXUSB3S2
$186.98 total
Even if you wanted to buy multiple 4TB drives and rotate them out on a weekly or daily basis to a safe or offsite location, it's still cheaper than buying a server, a Windows Server license, a Backup Exec (or other) license, and a tape drive or tape library. I will update this blog at a later date if we run into any issues with this setup.
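For reference, here is roughly what the rsync step to the external drive looks like (paths are made up; adjust the mount points to your own layout):
# push only new/changed files from the internal backup disk to the USB 3.0 drive,
# and remove anything on the USB drive that no longer exists on the internal disk
rsync -avh --delete --progress /backups/veeam/ /mnt/usb4tb/veeam/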
Thursday, March 14, 2013
Using VM Replication to create a test environment
Often in IT you find yourself needing to test something. The requirements for that can range from the simple example of needing to install or upgrade software on one server to the seriously complex side of full regression tests using a software test suite that encompasses client machines, multiple servers and databases.
Before the advent of virtualization, this usually meant the system administrator had to support multiple servers for the same function. In other words, you might have 3 or 4 servers for a single database because of having to support development, QA, and user acceptance testing requirements.
Now, with Hyper-V (or ESX), Veeam Backup and Replication, and ZFS, those tasks have been made significantly easier. Additionally, you can accomplish all of those things with less hardware and lower operating expenses.
We use Veeam as our replication and backup software for our production Hyper-V virtual machines (VMs). The nice part of that setup (aside from the fact that it's a solid and reliable product) is the way Veeam licenses their software. You pay per socket on the server you are backing up from. You do not pay for replication or backup targets. So, in other words, if you license your production server, you can back it up to as many destinations as you care to. That works out especially well for making a testing/QA/acceptance-testing copy of every single one of your VMs on a (usually less expensive) test hypervisor. The only significant requirement for the test hypervisor is adequate memory. You don't even necessarily need a RAID array; a number of VMs can (slowly) run off the same single SATA drive.
Creating the replication job in Veeam is pretty straightforward. You select the VMs you want to replicate, the machine you want to replicate them to, and the suffix to add (it uses _replica by default). I would recommend changing that to _development or _qatesting or something more indicative of what you're going to use it for.
Not a requirement, but it would be very beneficial (as you will see in a minute) if you could also make the target replication directory an iSCSI target on a ZFS datastore. The reason is that snapshots on one test VM can easily be managed in Hyper-V Manager, but trying to synchronize the snapshots and perform multiple rollbacks and restarts on 25 VMs would be a pain, to say the least. If you have the option of snapshotting and rolling back the entire iSCSI target (easily done with a ZFS SAN backend), you can do multiple regression tests in a quick and painless (or at least less painful) way.
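A minimal sketch of that snapshot/rollback cycle on the ZFS side (the dataset name is made up, and this assumes the iSCSI target's backing store lives on that dataset; shut the replica VMs down and take the target offline before rolling back):
# before a regression run, snapshot the dataset backing the iSCSI target
zfs snapshot zpool1/testlab@before-run1
# ...run your tests against the replica VMs...
# roll every VM on that target back at once by rolling back the dataset
# (-r destroys any snapshots taken after before-run1)
zfs rollback -r zpool1/testlab@before-run1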
The only “gotcha” or issue to work around here is that you have to use a private virtual switch on the test hypervisor. This will keep you from having to reassign IP addresses, change computer names, leave the domain, re-join a test domain, etc. If you live in an AD (Active Directory) environment, you really need a VM copy of your domain controller to be part of the replicated VMs. Not having to change a single setting on any of the servers or client machines is really, really nice. Because you are using a private virtual switch, you have to connect to a client test VM or test server running your application or test software through Hyper-V Manager. Inconvenient, yes, but to me a small price to pay to get that much bang for your buck.
If you need further explanations or step-by-step instructions with screenshots of any part of that, leave me a comment and I can expand this blog post. No, I don't work for Veeam; I just happen to love their B&R product :)
Tuesday, January 15, 2013
compression, dedup, and compression + dedup test results
So, I ran some tests with some VHD and VHDX VM files I had from a backup, and it was interesting to see the results of deduplication vs. compression vs. both at the same time. I did 3 tests, each time copying the same set of 166 GB worth of VHD and VHDX backup files.
First option was dedup only, RECSIZE=16K
This required at least 2.6 GB of RAM in your arc_meta_limit and had a poor dedup ratio.
Second option was compression only, COMPRESSION=LZJB
This does use arc_meta_limit, obviously, but it's not imperative that you be able to fit all of it in memory at once.
Third option was dedup on AND compression on. You can see that the compression interfered with the deduplication ratio; I would assume that is partly because the parts of the VHD that are highly compressible are also the ones that are dedup-able. The interesting thing here is that turning compression AND dedup on resulted in a faster write speed than dedup alone, I would assume because it is trying to dedup 77.4G of data instead of 166G of data. The deletion time was also faster.
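For reference, the dataset settings for the three runs would look something like this (the dataset name is made up):
# test 1: dedup only, 16K recordsize
zfs set recsize=16K zpool1/test
zfs set compression=off zpool1/test
zfs set dedup=on zpool1/test
# test 2: compression only
zfs set dedup=off zpool1/test
zfs set compression=lzjb zpool1/test
# test 3: dedup AND compression
zfs set dedup=on zpool1/test
zfs set compression=lzjb zpool1/test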
You can see the detailed results from the XLS screenshot:
The conclusion here (imho) is that dedup is VERY situational and typically is not going to be worth your while compared to LZJB or GZIP-X compression.
I suppose if you are storing multiple copies of the exact same files, dedup + compression would come in handy, but I can't think of any situation where a snapshot + clone wouldn't work better.
If you have a specific situation where dedup or dedup + compression wins over just compression for you, please let me know what that was.
Monday, January 14, 2013
To DEDUP or not to DEDUP, that is the question
To Dedup or Not to Dedup??
In addition to compression, ZFS offers deduplication functionality. It (potentially) allows you to use less disk space to store the same amount of data. Nothing comes for free, as they say, and deduplication is no exception.
You need to provide ZFS with massive amounts of RAM to store its deduplication tables in memory, or you can kiss any kind of write performance goodbye (and in the worst cases make your fileserver unresponsive for significant periods of time).
Even if you throw plenty of RAM at deduplication, it is still best to treat it as a “backup” device like a tape drive. That is to say, you do backups to it and:
DO NOT USE IT FOR VM STORAGE. LZJB is probably fine, but deduplication will kill VM responsiveness.
DO NOT USE IT FOR A PRODUCTION DATABASE. Testing databases where response time/processing speed are not a factor could be a possible use.
There, I warned you.
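If you want a rough idea of how big the dedup tables would get before you commit, zdb can simulate dedup against data that is already sitting in a pool (point it at your own pool name; it can take a while and chew some CPU):
# simulate dedup on existing data and print a DDT histogram plus the would-be dedup ratio
zdb -S zpool1
# a commonly quoted rule of thumb is roughly 320 bytes of RAM per in-core DDT entry,
# so (total entries x 320) gives a ballpark RAM figure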
Now, here are some (half-assed) numbers you can use as a rough guide to pre-allocating your RAM for a deduplicating pool.
I'm testing with ZFS on Ubuntu. For ZFS on Solaris or OpenIndiana, the procedure to change the zfs_arc_meta_limit is slightly different.
Testing Setup:
root@ubuntuzfs03:~# uname -a
Linux ubuntuzfs03 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5
17:42:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu-zfs
spl-0.6.0-rc13
zfs-0.6.0-rc13
VM with 2 virtual CPUs and 8GB of RAM
Set our ARC max size to ~3 GB
Set our ARC meta limit to ~2.9 GB
Edit /etc/modprobe.d/zfs.conf; the numbers are in bytes:
root@ubuntuzfs03:~#
cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=3000000000 zfs_arc_meta_limit=2900000000
YOU MUST REBOOT for those changes to take effect.
I use a small script to check the status of my config and
running values:
zfs_show.sh
------------------
#!/bin/bash
echo "config"
echo "---------------------"
cat /etc/modprobe.d/zfs.conf
echo ""
echo ""
echo "runtime values"
echo "---------------------"
grep -E 'arc|c_max' /proc/spl/kstat/zfs/arcstats
Now check that your settings (AFTER REBOOTING) have actually taken effect:
root@ubuntuzfs03:~# ~/zfs_show.sh
config
---------------------
options zfs zfs_arc_max=3000000000 zfs_arc_meta_limit=2900000000
runtime values
---------------------
c_max 4 3000000000
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 132929416
arc_meta_limit 4 2900000000
arc_meta_max 4 1457235528
root@ubuntuzfs03:~#
The config options in /etc/modprobe.d/zfs.conf should match the
runtime values, or something went wrong.
Now, for our test, we explicitly set the recordsize to 4K:
root@ubuntuzfs03:~# zfs set recsize=4K zpool1/cifs/dedup_storage
root@ubuntuzfs03:~# zfs set dedup=on zpool1/cifs/dedup_storage
I'm copying the files over a 100 Mbit LAN connection, so the 11.1 MB/s is reasonable:
Total size of files (same 2 large files copied multiple times)
Math
----
47698814976 bytes of data (files and folders included)
47698814976 / 4096 = 11,645,219 data blocks
1279193040 bytes arc_meta_used
1279193040 meta bytes / 11,645,219 data blocks = ~109 meta bytes per 4K data block
1279193040 meta bytes / 47698814976 bytes of data = 0.0268181304010097 bytes of meta per byte of data (2.7%)
So by that math, 2TB of disk space (depending on how the manufacturer calculates 1TB), i.e. 2,199,023,255,552 bytes of data, would require AT LEAST 59,373,627,900 bytes, or about 55.3 gigabytes, of RAM allocated to arc_meta_limit. (owch)
Deleting files (for some reason) sends arc_meta_used up even higher. I highlight all the files and delete them at once, and arc_meta_used goes up to 1457235528 bytes.
So to recalculate:
1457235528 meta bytes / 11,645,219 data blocks = ~126 meta bytes per 4K data block
1457235528 meta bytes / 47698814976 bytes of data = 0.0305507700502249 bytes of meta per byte of data (3.1%)
Just to be safe, if you are going to use 4K blocks, I would go with 5%: 2,199,023,255,552 bytes of data would require AT LEAST 109,951,162,777.6 bytes, or about 102.4 gigabytes, of RAM allocated to arc_meta_limit. (super owch)
Change the record size to 16K:
root@ubuntuzfs03:~# zfs set recsize=16K zpool1/cifs/dedup_storage
After copying the files, the dedup ratio and AVAIL space report the same values, but that is to be expected if you are just copying the same files over and over again. Same total bytes in files, but arc_meta_used comes out different:
47698814976 bytes of data (files and folders included)
47698814976 / 16384 = 2,911,305 data blocks
522642072 bytes arc_meta_used
522642072 meta bytes / 2,911,305 data blocks = ~180 meta bytes per 16K data block
522642072 meta bytes / 47698814976 bytes of data = 0.0109571290662666 bytes of meta per byte of data (1.1%)
Interestingly, the arc_meta_used value did not spike as much during the delete of the 16K blocks; it reported 524992920.
The math I’m using is a
little flawed, because it does not take into account how
much of the arc_meta_used is not related
to dedup. Probably I could turn off dedup, copy the same files
again, and use that as a baseline, but I feel fairly comfortable
saying:
TLDR summary:
- you need 5% of total storage (or more) for 4K block size
- you need 2.5% of total storage (or more) for 16K block sizes
- deletes will always be slow, even if you have enough RAM. Imagine slow, and then multiply that times 10. Luckily it happens in the background, but if you are on a quad core or less machine it will slow everything to a crawl. If you don't have enough RAM it is slow times 100. NOTE: I have experienced complete system locks deleting multi GB files on Ubuntu. I believe this was supposed to be fixed on Solaris, but apparently hasn't made it into the Ubuntu ZFS yet, so be warned.
- 97% of the time you will be better off using ONLY compression
Your mileage will definitely vary, so you will have to
keep an eye on the reserved vs. runtime values and adjust accordingly.
ZFS supports larger block sizes, up to 128K, but the larger your block size, the lower the dedup ratio you are going to see. (Unless you are ONLY storing multiple copies of the same large files, in which case just go with 128K, the max.)
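If you want to turn those percentages into a number for your own pool, here is a quick sketch (the 5% and 2.5% figures are just the rough ones from above, and the 2TB pool size is an example):
#!/bin/bash
# rough arc_meta_limit sizing for a dedup pool
POOL_BYTES=$((2 * 1024**4))   # 2TB of data, change to your own number
awk -v b="$POOL_BYTES" 'BEGIN {
  printf "4K records:  reserve at least %.1f GB for arc_meta_limit\n", b * 0.05  / 2^30;
  printf "16K records: reserve at least %.1f GB for arc_meta_limit\n", b * 0.025 / 2^30;
}'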
<edit 2013-08-07> As David pointed out, you can see memory usage reported by ZFS by using 'zpool status -D'