Tag Archives: VMware

Bareos/Bacula VMware backup part 2

Today I added the components that create a logfile and clean up the working directory when the job is done. The idea behind the logfile is that, using the information in it, a person with no knowledge of the original backup could take the files and bring up a running restore of the VM. I may someday write a restore script, but not today. The cleanup portion is not working 100%, but it’s good enough that I’ll start using the script in production today; I’ll debug and fix it later. Here is the Bareos job log for my first full and successful run of the script/backup combo (a rough sketch of the new logfile step follows the log):

bareos-dir Job vmguest-FullImage.2013-10-21_16.02.35_06 waiting 50 seconds for scheduled start time.
bareos-dir shell command: run BeforeJob "/usr/lib/bareos/scripts/vmprep.py -v vmguest.domain.local"
bareos-dir BeforeJob: Found the VMX file and copied it to the backup location /mnt/vmbackup/
 BeforeJob: Successfully created a snapshot for your VM
 BeforeJob: successfully backed up /vmfs/volumes/datastore1/vmguest.domain.local/vmguest.domain.local.vmdk to the backup location /mnt/vmbackup/
 BeforeJob: successfully backed up /vmfs/volumes/550a2145-64112148/vmguest.domain.local/vmguest.domain.local_1.vmdk to the backup location /mnt/vmbackup/
 BeforeJob: I deleted the snapshot I took earlier, all is good.
 Start Backup JobId 226, Job=vmguest-FullImage.2013-10-21_16.02.35_06
 Using Device "FileStorage" to write.
bareos-sd Volume "VM0015" previously written, moving to end of data.
 Ready to append to end of Volume "VM0015" size=64551931043
bareos-sd User defined maximum volume capacity 107,374,182,400 exceeded on device "FileStorage" (/home/bareos/storage).
bareos-sd End of medium on Volume "VM0015" Bytes=107,374,157,986 Blocks=1,664,406 at 21-Oct-2013 16:23.
bareos-dir Created new Volume "VM0016" in catalog.
bareos-sd Labeled new Volume "VM0016" on device "FileStorage" (/home/bareos/storage).
 Wrote label to prelabeled Volume "VM0016" on device "FileStorage" (/home/bareos/storage)
 New volume "VM0016" mounted on device "FileStorage" (/home/bareos/storage) at 21-Oct-2013 16:23.
bareos-sd Elapsed time=00:17:42, Transfer rate=44.48 M Bytes/second
bareos-dir Bareos bareos-dir 12.4.4 (12Jun13):
 Build OS: x86_64-unknown-linux-gnu redhat CentOS release 6.2 (Final)
 JobId: 226
 Job: vmguest-FullImage.2013-10-21_16.02.35_06
 Backup Level: Full
 Client: "bareos-fd" 12.4.4 (12Jun13) x86_64-unknown-linux-gnu,redhat,CentOS release 6.2 (Final)
 FileSet: "VM Image Backup NFS Folder" 2013-10-19 16:56:07
 Pool: "VMImage" (From command line)
 Catalog: "MyCatalog" (From Pool resource)
 Storage: "File" (From command line)
 Scheduled time: 21-Oct-2013 16:03:25
 Start time: 21-Oct-2013 16:07:24
 End time: 21-Oct-2013 16:25:08
 Elapsed time: 17 mins 44 secs
 Priority: 10
 FD Files Written: 7
 SD Files Written: 7
 FD Bytes Written: 47,245,718,876 (47.24 GB)
 SD Bytes Written: 47,245,719,792 (47.24 GB)
 Rate: 44403.9 KB/s
 Software Compression: None
 VSS: no
 Encryption: no
 Accurate: no
 Volume name(s): VM0015|VM0016
 Volume Session Id: 18
 Volume Session Time: 1382202217
 Last Volume Bytes: 4,458,527,606 (4.458 GB)
 Non-fatal FD errors: 0
 SD Errors: 0
 FD termination status: OK
 SD termination status: OK
 Termination: Backup OK
 shell command: run AfterJob "/usr/lib/bareos/scripts/vmprep.py -v vmguest.domain.local -p"
bareos-dir AfterJob: I couldn't find file /mnt/vmbackup/vmguest.domain.local.vmdk!
 AfterJob: You may want to look at /mnt/vmbackup/
 AfterJob: Cleaned out the backup location, ready for the next round.
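
For a rough idea of what the new logfile piece does, here is a trimmed-down sketch. This is not the actual code from vmprep.py; the function name, field names and paths are just illustrative.

# Minimal sketch of the logfile step -- not the actual vmprep.py code.
# Field names and paths are illustrative only.
import time

def write_restore_log(staging_dir, guest, esx_host, vmdk_paths):
    """Leave a plain-text breadcrumb in the staging directory so the
    backed-up files can be turned back into a running VM later."""
    lines = [
        "Backup taken:   %s" % time.strftime("%Y-%m-%d %H:%M:%S"),
        "Guest name:     %s" % guest,
        "ESX host:       %s" % esx_host,
        "Original VMDKs: %s" % ", ".join(vmdk_paths),
    ]
    with open("%s/%s.backup.log" % (staging_dir, guest), "w") as fh:
        fh.write("\n".join(lines) + "\n")

# e.g.
# write_restore_log("/mnt/vmbackup", "vmguest.domain.local", "esx01.domain.local",
#                   ["/vmfs/volumes/datastore1/vmguest.domain.local/vmguest.domain.local.vmdk"])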

Per the request below I’ve attached my vmprep.py script (rename vmprep.py.txt to vmprep.py). I’m not a programmer, so don’t hate me if it blows up your stuff.

vmprep.py

VMware image backup with Bareos – More free backup

Bareos (Bacula if you like) does a great job of backing up files. In the event of a total meltdown I really would prefer the ability to restore an entire VM as opposed to rebuilding and installing agents prior to restore. Let’s see if I can make this work.

Brainstorming:

In the grand scheme, the server to be backed up will be localhost. The files will exist on an NFS volume accessible to both the VMware host VMkernel and localhost.

We will take a snapshot of the running VM, then copy the VMDK out to that NFS location using a run-before script. That puts it in a location predictable to Bareos, so the appropriate FileSet definition can go out and grab that set of files for each job/VM. We will then use a run-after script to delete the snapshot and the backed-up files out on the NFS share.

To test how realistic this is, I’m going to use a “junk” VM: copy a snapshotted VMDK and its associated vmx file and see if I can get that portion up and running.

To create the snapshot in the busybox console:

vim-cmd vmsvc/snapshot.create 17 "bareos_backup" "Temporary snapshot for Backup system. This should not exist if a backup isn't currently running."

The ’17’ in that command references a vmid. That will have to be parsed using the command:

vim-cmd vmsvc/getallvms

To be dealt with as I script it out.
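
When I do script it out, the lookup will probably boil down to something like this. A sketch only: the SSH approach and hostname are assumptions, and it relies on the usual getallvms column layout (Vmid, Name, File, Guest OS, Version, Annotation).

# Sketch only: resolve a guest name to its vmid over SSH.
import subprocess

def find_vmid(esx_host, guest_name):
    out = subprocess.check_output(["ssh", esx_host, "vim-cmd vmsvc/getallvms"]).decode()
    for line in out.splitlines()[1:]:            # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] == guest_name:
            return fields[0]                     # first column is the vmid
    return None

# e.g. find_vmid("root@esx01.domain.local", "vmguest.domain.local") -> "17"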

I started the copy of my 40GB vmdk at 1:29PM…

off for coffee…

Done by 1:54PM, possibly sooner but I wasn’t looking. Now I’ll copy the vmx file and see if I can mangle it enough to make the thing boot.

— next morning —

The bad news is that I couldn’t get the copied disk to work easily. A bit of research taught me that I should have used vmkfstools to copy the snapshotted file, so I tried again that way. Here was my command:

vmkfstools -i source.vmdk /vmfs/volumes/dst_datastore/restoretest/restoretest.vmdk -d thin

After running that command and also copying the vmx file, I imported the vmx in the new location, removed the existing disk, and added a new disk using the newly relocated vmdk – it booted. Another bonus of using vmkfstools instead of cp is that I was able to create a thin disk on the destination end, which cut the copy time down to about 4:32 and leaves me a smaller file to back up. Now that I know the whole process is relatively possible, I’ll write the pre- and post-job scripts in Python.
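
For reference, here is roughly how I expect the pre-job flow to hang together in Python. This is a sketch under assumptions, not the finished script: key-based SSH to the ESXi host, illustrative hostname/vmid/paths, and snapshot.removeall wipes every snapshot on the VM, which is only acceptable if the backup snapshot is the only one.

# Rough sketch of the pre-job flow -- not the finished vmprep.py.
import subprocess

ESX_HOST = "root@esx01.domain.local"                        # assumption
VMID = "17"                                                 # from vim-cmd vmsvc/getallvms
SRC_VMDK = "/vmfs/volumes/datastore1/vmguest/vmguest.vmdk"  # illustrative
SRC_VMX = "/vmfs/volumes/datastore1/vmguest/vmguest.vmx"
DST_DIR = "/vmfs/volumes/vmbackup"                          # the NFS share as ESXi sees it

def esx(cmd):
    """Run a command on the ESXi host over SSH, raising if it fails."""
    subprocess.check_call(["ssh", ESX_HOST, cmd])

# 1. Snapshot the VM so the base disk stops changing.
esx('vim-cmd vmsvc/snapshot.create %s bareos_backup "Temporary snapshot for backup"' % VMID)
try:
    # 2. Thin-clone the (now read-only) base disk to the NFS staging area.
    esx("vmkfstools -i %s %s/vmguest.vmdk -d thin" % (SRC_VMDK, DST_DIR))
    # 3. Grab the .vmx so the VM can be re-registered at restore time.
    esx("cp %s %s/" % (SRC_VMX, DST_DIR))
finally:
    # 4. Drop the snapshot. Note: removeall deletes *every* snapshot on the
    #    VM, which is only OK if the backup snapshot is the only one.
    esx("vim-cmd vmsvc/snapshot.removeall %s" % VMID)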

— next evening —

I spent the entire day creating the before-backup job and am right now running my first end-to-end trial. The Bareos definitions read like this:

JobDefs {
  Name = "VM"
  Type = Backup
  Level = Full
  FileSet = "VM Image Backup NFS Folder"
  Storage = File
  Messages = Standard
  Priority = 10
  Pool = VMImage
}
Job {
  Name = "vmguest1-FullImage"
  JobDefs = "VM"
  Client = bacula-srv-fd
  Schedule = "Monthly-VMImage-vmguest1"
  RunBeforeJob = "/usr/lib/bareos/scripts/vmprep.py -v vmguest1.gsellc.local"
}
FileSet {
  Name = "VM Image Backup NFS Folder"
  Include {
    Options {
      signature = MD5
    }
    File = "/mnt/vmbackup"
  }
}

/mnt/vmbackup is an NFS-mounted directory that both my ESXi hosts and my Bareos director can access. It’s the handoff point: ESXi copies the VMDKs there, then Bareos picks them up and stuffs them onto backup media. The before-backup script identifies the VM we want to back up, takes a snapshot, then copies the disk to the staging location.

Unfortunately, it seems Bareos backs up sparse files at their full apparent size rather than just the blocks actually in use. This means that while my test VM only uses about 35 GB on disk, Bareos is transferring 160 (compressed) GB to tape, so the backup will take a while. At the end of the day it takes the same amount of space on tape; it just stretches the backup window.

I have yet to write the cleanup job that will delete the staged files; that’s an important component and it’s what I’ll do next. As it stands, I have something that kind of works, ready to be polished and shined into something totally usable. The other big to-do is that I want to leave traces of what the backup is inside the backup itself: a backup logfile that can be read at restore time to see what the guest’s name was, what ESX host it lived on, where it kept its VMDKs and all that. All of that information is already gathered by the before-job script; it just needs to be put together in a pretty file and left in the staging directory. I’m also considering adding options for quiescing, but that is low on my priority list.
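
The cleanup itself should be simple; something along these lines (a sketch only, with an illustrative path):

# Sketch of the cleanup step I still need to write: empty the staging
# directory once Bareos has the files on tape. Path is illustrative.
import glob
import os

def clean_staging(staging_dir="/mnt/vmbackup"):
    for path in glob.glob(os.path.join(staging_dir, "*")):
        if os.path.isfile(path):
            os.remove(path)       # VMDK clones, vmx copy, logfile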

My first backup of my 160 GB test machine took just about 2 hours – a little more. That works out to roughly 45-50 seconds for each GB of ALLOCATED disk (a bit over two hours is ~7,200 seconds spread across 160 GB). I can tolerate this as I only plan on backing up whole VM images once a month or so, maybe once a week for VERY dynamic machines or machines that are less about data and more about application. I will not be relying on this as a substitute for traditional agent-based backups.

I think that’s enough of a knowledge dump on this topic for 1 post. More to come.

New Nagios Plugin

Last Friday, going into the weekend, I ran across a snapshot on one of my VMware hosts that was almost 160 days old. OUCH. The right tool to keep that from happening is definitely Nagios, but Nagios Exchange didn’t really have a solution for my problem that I could find. Somebody has written a snapshot-age tool in PowerShell, but I’m not interested in having plugins run on hosts that aren’t my main Nagios server. I was given a fun project to work on.

The vSphere Command Line Interface (formerly the Perl toolkit, if I’m not mistaken) was of little help; it didn’t really give me any interface into snapshot data at all. I decided the simplest solution would be to work right on the BusyBox console. I started Friday around noon and, working on it here and there over a couple of days, came up with a usable product yesterday morning:

[jrdalrymple@nagios ~]$ /usr/local/nagios/libexec/check_snapshot.py
No password specified
usage: check_snapshot.py -H hostname [-U username] <-P password | -f PasswordFile>
[jrdalrymple@nagios ~]$ sudo /usr/local/nagios/libexec/check_snapshot.py -H 172.16.100.11 -U nagioschk -f /home/nagios/.check_esxi_hw.pw -w 10 -c 20

3 VMs are CRITICAL
Guest example1.domain.local has snapshot 24 days old!
Guest example2.domain.local has snapshot 28 days old!
Guest example3.domain.local has snapshot 26 days old!

[Screenshot: the same check as seen in the Nagios GUI]

The results between my command line run and the Nagios GUI aren’t the same because I gave the Nagios check different thresholds.
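
For the curious, the business end of the plugin is just the standard Nagios threshold-and-exit-code routine. A stripped-down sketch follows; how the snapshot ages get gathered is stubbed out here (in the real script that part comes from the BusyBox console), so the data and function name are illustrative.

# Sketch of the threshold logic only. Nagios exit codes: 0=OK, 1=WARNING, 2=CRITICAL.
import sys

def check_snapshot_ages(ages, warn_days, crit_days):
    """ages: dict mapping guest name -> oldest snapshot age in days."""
    crit = [(g, a) for g, a in ages.items() if a >= crit_days]
    warn = [(g, a) for g, a in ages.items() if warn_days <= a < crit_days]
    if crit:
        print("%d VMs are CRITICAL" % len(crit))
        for guest, age in crit:
            print("Guest %s has snapshot %d days old!" % (guest, age))
        return 2
    if warn:
        print("%d VMs are WARNING" % len(warn))
        for guest, age in warn:
            print("Guest %s has snapshot %d days old!" % (guest, age))
        return 1
    print("No snapshots older than %d days" % warn_days)
    return 0

if __name__ == "__main__":
    # Illustrative data only.
    sys.exit(check_snapshot_ages({"example1.domain.local": 24}, 10, 20))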

I’ll probably put it up on Nagios Exchange at some point. For now I’ll just feel accomplished.

XIV <--> VMware LUN ID mapping

I have to dig far too hard to find this information anytime I want it, so I’m putting it here. I just one-lined a CSV export in PowerCLI to list all of the attached LUNs across my entire VMware environment (only one vCenter makes that easy). Every LUN “CanonicalName” that is attached to an XIV array starts with "eui.00173800", regardless of which array the LUN actually lives on. That leaves 8 hex digits; it appears the first 4 identify the array and the last 4 identify the LUN serial number.

Verifying that is straightforward: the serial number of the XIV that has the most LUNs presented to me shows up as hex digits 9-12 of the EUIs in my list.

The last 4 digits don’t line up with the LUN serial numbers at first, but that’s because I chose the very first entry to look at, which reads 0000 for digits 13-16. It appears that ESXi creates some sort of dummy LUN 0 for each array; the entries that follow all make sense. It’s also worth noting that my RDMs show up in this list.

So the final number looks like this:

[00000000][1111][2222]
[IBM-XIV!][ARSN][LUN#]

In the real world, if I have a LUN canonical name of eui.0017380035bc001d, it’s referencing XIV array S/N 13756 (35bc hex -> dec) and LUN S/N 29 (001d hex -> dec).
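
Decoding one of these in code is just string slicing plus a hex-to-decimal conversion; a quick sketch based on the layout above (the function name is mine):

def decode_xiv_eui(canonical_name):
    """Split an XIV eui into (array serial, LUN serial), both in decimal.
    Layout per above: 8-digit XIV prefix, 4 hex digits of array S/N,
    4 hex digits of LUN S/N."""
    eui = canonical_name.replace("eui.", "")
    if not eui.startswith("00173800"):
        raise ValueError("not an XIV eui: %s" % canonical_name)
    return int(eui[8:12], 16), int(eui[12:16], 16)

# decode_xiv_eui("eui.0017380035bc001d") -> (13756, 29)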

Last note: if you need this reference you’re probably already aware of this (and even noticed it in the example above), but in VMware land we typically reference storage addresses in hexadecimal, while the XIV GUI lists everything (including the array S/N) in decimal by default. It all has to be converted. The LUN serial numbers can be converted right in the GUI under Tools > Management, but you’ll have to convert the array S/N either by hand or in your head if you can hex like a boss.