Tuesday, June 26, 2012

Network Disruption Causes vCenter DB Corruption


First off, I am NOT a VMware expert by any stretch of the imagination. I AM, however, learning a lot by working with some smart folks on virtualized servers and desktops.

A network engineer (who shall remain nameless) was making some changes to the network infrastructure last night and unfortunately experienced an outage. Because of an ongoing network migration from Cat6500 to Nexus 7k/5k/2k, all ESX hosts are now connected to Nexus FEX, but the iSCSI storage is still on the old Cat6500. The outage cut connectivity between the Nexus-connected hosts and the iSCSI storage.

As users started trying to log in to their desktops in the morning, we started getting reports of problems. Our VDI vCenter showed 4 of our 20+ hosts as disconnected or not responding. We ended up power-cycling those one at a time, and once they came up we were able to reconnect them to vCenter.

The next big problem was that the profile server, which runs as a VM in the VDI infrastructure, was hung while attempting to migrate. We rebooted vCenter, which orphaned the profile server, but we then found we could not browse the LUN holding that VM's datastore in order to add it back into vCenter. At that point, we engaged VMware support and spent several hours on WebEx troubleshooting storage connectivity problems (tail -f /var/log/vmkernel and some other stuff). By the time I left in the early afternoon, we had identified half a dozen hosts that seemed to be having iSCSI problems, based on what VMware Support was seeing in the logs, and we rebooted those hosts one at a time to minimize end-user impact.
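For anyone who would rather not eyeball tail output on 20+ hosts, here is a minimal sketch of how that log triage could be scripted. It assumes the vmkernel log lines from each host have already been collected into a dictionary keyed by host name. The keyword list, the threshold, and the function names are my own placeholders, not what VMware Support actually searched for, and real vmkernel messages vary by ESX version.

```python
from collections import Counter

def count_matches(lines_by_host, keywords=("iscsi",)):
    """Count log lines per host that contain any keyword (case-insensitive).

    lines_by_host: dict mapping a host name to a list of log lines.
    keywords: hypothetical search terms; adjust to whatever your logs show.
    """
    counts = Counter()
    for host, lines in lines_by_host.items():
        for line in lines:
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                counts[host] += 1
    return counts

def suspect_hosts(lines_by_host, keywords=("iscsi",), threshold=1):
    """Return a sorted list of hosts with at least `threshold` matching lines."""
    counts = count_matches(lines_by_host, keywords)
    return sorted(host for host, n in counts.items() if n >= threshold)
```

Calling suspect_hosts with the collected lines gives a short list of hosts to look at first. It is only a rough filter: a keyword hit says a host logged something storage-related, not that it is actually broken.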

I had to leave before the fun was all over, but found out afterwards that a couple of the hosts apparently ended up with duplicate datastore IDs when they recovered from the outage overnight. Once that happened, the vCenter database was somehow corrupted with the wrong datastore information. It was reportedly cleared by removing those two hosts from vCenter and adding them back in, which gave them new datastore information.

Like I said, I'm not a VMware expert but I'm learning more each day. You ever experience something like this? Who else is doing VDI? Leave your comments below or find me on Twitter (@swackhap).
