Thursday, July 7, 2011

cwrsync and openindiana -> windows server 2008 R2

[2011-05-15]
Today I ran into an issue with cwrsync going from openindiana to windows server 2008 r2. If you’re not familiar with it, cwrsync is a port of rsync to windows by the folks at
http://www.itefix.no/i2/cwrsync
I’m sure there are other ports, but cwrsync is free, and seems to support all the functionality (including ssh transport and rsync daemon connections) that the linux/unix versions do, so I’ve decided to go with that.
If you’re not familiar with rsync, it’s well worth your effort to look into it if you do any kind of file or directory synchronization across servers. (Yes, there are a number of uses even locally, but that’s not my focus.) Windows has a rough equivalent in robocopy, but robocopy does not play as well with linux and unix. (You can use robocopy to/from linux, but it requires a samba share.)
Anyway, my issue was that I was really stressing the server by trying to do 3 massive directory synchronizations from three source openindiana servers (hosting the esxi vm’s) to a windows server 2008 r2 machine, via rsync sender / cwrsync server receiver.
I was getting a network error at seemingly random places on one or more of the OI (openindiana) boxes.
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: read error: Connection reset by peer (131)
rsync error: error in rsync protocol data stream (code 12) at io.c(759) [sender=3.0.6]
I did some research and all I could come up with was setting the timeout= in the rsyncd.conf file on the windows server 2008 r2.
rsyncd.conf
-------------------
use chroot = false
strict modes = false
hosts allow = *
log file = rsyncd.log
uid = 0
gid = 0
timeout = 3000
contimeout = 3000
[ydrive]
path = /cygdrive/y/esxi
read only = false
transfer logging = no
timeout = 3000
contimeout = 3000
[zdrive]
path = /cygdrive/z/esxi
read only = false
transfer logging = no
timeout = 3000
contimeout = 3000
I tried a value of 30 (which is supposed to be 30 seconds), which didn’t work, then 300 (which didn’t work), and then finally (keeping my fingers crossed) 3000 seconds. Don’t ask me why, if the value is in seconds, you have to set timeout= so high to get it to work, but we’ll see if that was indeed the case. (The file transfer in question takes 20+ hours to complete over gigabit)
Why on earth are we transferring all that data off OI onto 2008 R2, you ask? (Well, I would be.) We have a Neo tape drive, and Backup Exec software only installs on windows now. (Not referring to the remote agent; I mean the machine that drives the physical tape drive.) so… we need to
from each OI box:
create snapshot
clone snapshot
rsync --progress --times --update --recursive --delete -z --compress-level=1 --inplace /zpool1/nfs/vm/esxibackup/* XXX.XXX.XXX.XXX::ydrive
from the three openindiana boxes that host our VM’s. (dedicated fileservers)
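The snapshot->clone->rsync steps above can be sketched as a single script per OI box. A minimal sketch: the snapshot and clone dataset names (zpool1/nfs/vm@weeklybackup, zpool1/nfs/vm/esxibackup) and the cleanup of last week’s clone are my assumptions, not from the original commands; only the rsync line and source path are from the post.

```shell
#!/bin/sh
# Hypothetical per-server backup script; dataset names are assumptions.
SNAP=zpool1/nfs/vm@weeklybackup
CLONE=zpool1/nfs/vm/esxibackup
TARGET=XXX.XXX.XXX.XXX::ydrive

zfs destroy -r "$CLONE" 2>/dev/null  # drop last run's clone, if present
zfs destroy "$SNAP" 2>/dev/null      # and its snapshot
zfs snapshot "$SNAP"                 # point-in-time view of the running VMs
zfs clone "$SNAP" "$CLONE"           # writable, mountable copy of that snapshot

rsync --progress --times --update --recursive --delete \
      -z --compress-level=1 --inplace \
      /zpool1/nfs/vm/esxibackup/* "$TARGET"
```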
This is the only way I’m aware of to (easily) back up all your vm’s while they’re still running without manually copying VMX files, then creating snapshots, then copying the VMDK files minus deltas, then deleting the snapshot of every single VM. I realize some people have scripts to do this, but for me, trying to get that to work flawlessly on a weekly basis for 40+ VM’s is not a good solution. The snapshot->clone->rsync solution guarantees at least that we get an “exact moment in time” copy. There might be issues with a VM running a *non* journaling filesystem, but we don’t have anything without one, so it works for us.  (note: we don’t use vm’s for production databases or mail servers)
Anyway, i hope the timeout=3000 idea helps if you run into a similar situation.
I only experienced it when the server was getting really hammered from three other machines rsync’ing to it simultaneously, but your mileage may vary.
(I’m not at work right now, I’ll log on and post the rsyncd.conf bits and the results of the timeout=3000 change tomorrow)
[2011-05-17] followup:
The timeout= didn’t help. Two of the rsync’s still crashed with similar errors
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: write failed on “FHQ/FHQ_1-flat.vmdk” (in zdrive): Permission denied (13)
rsync error: error in file IO (code 11) at receiver.c(322) [receiver=3.0.8]
rsync: read error: Connection reset by peer (131)
rsync error: error in rsync protocol data stream (code 12) at io.c(759) [sender=3.0.6]
later that night. I am trying various options. It almost seems as if I am overloading either the gigabit switch we’re using or the target server. I am experimenting now with using --compress-level=9 and --sparse (--sparse and --inplace are mutually exclusive) to see if that helps, and will update this blog tomorrow. (I believe using --sparse instead of --inplace would actually make the difference. There might be an issue with cwrsync and trying to do in-place updates to existing files; we’ll see)
[2011-05-18]
still getting similar error messages even after changing to --sparse and (thus) not doing an --inplace anymore. I thought that maybe enabling compression on the target windows directory (windows built-in compression) was causing too much stress on the server (with three simultaneous rsync’s going), but that doesn’t really seem to be the case either (after re-testing with compression off). So I’m left with some problem inherent to cwrsync, possibly specific to server 2008 r2, when doing multiple inbound rsyncs at the same time. My solution for now is to stagger the backups from the three NFS servers during the week so only one is running at a time, such that we can have a weekly backup to tape on sundays. I’m sticking a fork in it, cause I’m done messing with cwrsync trying to get it to work. Shame really, cause it would be a much more elegant solution to be able to schedule all three servers to back up during overlapping time windows. c’est la vie.
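Staggering like this boils down to a crontab entry per OI box, each on a different night. A hedged sketch: the days, times, and script path are hypothetical; the only constraint from the post is that the runs don’t overlap and all finish before the sunday tape job.

```shell
# Hypothetical stagger: one 20+ hour backup at a time, done before sunday.
# On OI box 1 (crontab -e):
0 20 * * 3  /root/bin/vmbackup.sh   # wednesday 8pm
# On OI box 2:
0 20 * * 4  /root/bin/vmbackup.sh   # thursday 8pm
# On OI box 3:
0 20 * * 5  /root/bin/vmbackup.sh   # friday 8pm
```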
[2011-05-23]
No problems since staggering the backups so that only one is running at a time.
Works out to be a fairly hands free (hopefully) trouble free backup solution for all the VM’s to disk and tape backup.
[2011-05-27]
It appears --whole-file (essentially turning off the file delta algorithm built into rsync) works much better with cwrsync than trying to let it update only the parts that changed.
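So the working command ends up as the earlier rsync line with --whole-file swapped in for --inplace. This is my reconstruction of the combined flag set, not a line quoted from my scripts:

```shell
# Send whole files instead of deltas; avoids the cwrsync partial-update issue.
rsync --progress --times --update --recursive --delete \
      --whole-file -z --compress-level=1 \
      /zpool1/nfs/vm/esxibackup/* XXX.XXX.XXX.XXX::ydrive
```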
[2011-07-07]
Just a quick update. Since switching to --whole-file, the backups are working like a charm. They run from bash scripts set up in crontab. I also installed "mutt" to enable emailing myself and another system admin when the backup is finished, with a confirmation message and the rsync log file as an attachment. (cuts down on logging in to check on backup status)
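The mail step is just a couple of lines tacked onto the end of the backup script. A sketch, assuming a log path and recipient addresses that are purely illustrative:

```shell
#!/bin/sh
# Hypothetical wrapper: run the backup, then mail the log as an attachment.
LOG=/var/log/vmbackup.log
rsync --progress --times --update --recursive --delete \
      --whole-file -z --compress-level=1 \
      /zpool1/nfs/vm/esxibackup/* XXX.XXX.XXX.XXX::ydrive > "$LOG" 2>&1

# mutt -a attaches the log; "--" separates attachments from addresses.
echo "VM backup finished with exit code $?" | \
  mutt -s "VM backup complete" -a "$LOG" -- admin@example.com
```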
