Problems with VMware vMotion have been reported many times over the years. If you search Google or the VMware online communities, you will see that it is a fairly common issue, and in most cases it points to a misconfiguration in the user's environment. In other cases it has been a bug inside VMware ESX/ESXi, but those were usually addressed by VMware over time. Recently a new series of vMotion problems has shown up: vMotion fails at 9 percent. Sure enough, one of my environments was affected by this problem, too. vMotion failed or timed out at 9 percent, and it was almost impossible to free up a host. I could power off VMs and then move them, but with many VMs in a powered-on state nothing worked. Here is how I fixed it.
First of all, I have to say that this new vMotion problem is not as widespread as one might think, and I assume (at least at this point) that this has to do with ESXi 5 still being fairly new. I am not 100% sure what triggers the 9 percent vMotion problem, but I assume it is a combination of at least two factors. In my specific environment the problem became more visible during a time frame when several changes were made, including changes to both VMware and the network.
Because the problem is fairly new, I was not able to find much information about it on the Internet. Most vMotion problems being discussed relate to the well-known 10 percent vMotion problem, but the 9 percent problem is different: none of the steps recommended for the 10 percent problem actually works. I even went so far as to open a ticket with VMware, and I was really disappointed, because the technician did not look at the facts correctly and headed in completely the wrong direction. He concentrated on the wrong part of my infrastructure and asked me to make changes to the iSCSI configuration on both the ESX hosts and the network switches. At a certain point I started ignoring the technician and stalled the troubleshooting process, because it was obvious he was not going to fix the problem. I could have escalated the ticket to a higher level, but I just did not have a warm, fuzzy feeling about VMware support in this case. So I did the final troubleshooting myself and then stumbled over the solution by accident.
Before I tell you what fixed the issue, a reminder to have your environment configured correctly: follow the best practices for setting up the host management network and vMotion. Also, this problem can appear with both ESX 4.1 and ESXi 5. One common element, however, seems to be vCenter 5, and I suspect vCenter 5 is less forgiving about certain things than vCenter 4.1 was (that is a wild guess on my end).
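If you want to sanity-check the vMotion setup before digging deeper, the usual commands have to be run on the ESXi host itself. The sketch below just collects them into a printable checklist; `vmk1` and `10.0.0.2` are placeholders I made up for the vMotion vmkernel port and the peer host's vMotion IP, not values from my environment:

```shell
# Hedged checklist of common vMotion pre-checks. The esxcli/vmkping commands
# only run on an ESXi host; vmk1 and 10.0.0.2 are placeholders.
checks=$(cat <<'EOF'
esxcli network ip interface list      # confirm vMotion is on the intended vmk
vmkping -I vmk1 10.0.0.2              # basic vMotion vmkernel connectivity
vmkping -I vmk1 -s 8972 -d 10.0.0.2   # jumbo-frame path check, if MTU is 9000
EOF
)
echo "$checks"
```

The jumbo-frame check uses an 8972-byte payload (9000 minus 28 bytes of IP/ICMP headers) with the don't-fragment flag, so it fails fast if any switch in the path is not actually passing jumbo frames.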
So, how did I fix the 9 percent vMotion problem?
The fix was actually very easy. While researching the problem I stumbled across someone describing a similar issue in his environment. In his case he was using Dell R710 servers with the four built-in Broadcom NICs for host management and vMotion. My VMware hosts are Dell R710s as well. His network was built on HP ProCurve switches, while mine consists of Cisco Nexus 2000 and Nexus 5000 switches. He saw a lot of dropped packets on his switches; my network engineer reported none of that on ours, though I am not sure that can be ruled out as a "user error". Either way, the fact that Dell R710 servers with Broadcom NICs were involved intrigued me.
This was a 10-node cluster with hundreds of VMs, and it took me a while to free up a host. I then upgraded its Broadcom NICs to the latest available firmware. With some VM-admin magic and some luck I was able to work my way through the cluster and upgrade all ESX hosts with the latest Broadcom firmware. Halfway through the process I started seeing improvements in my vMotions, but it took until the very last ESX host was upgraded before all was good. I then randomly moved all VMs off selected ESX hosts, performing over 250 vMotions in total, and not one of them failed anymore.
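Knowing which hosts still needed the update was half the battle. The installed firmware version shows up in `ethtool -i vmnicX` on classic ESX, or `esxcli network nic get -n vmnicX` on ESXi 5; the small helper below only sketches the version comparison I used to audit hosts, and the version strings in the example are made-up placeholders, not the actual Broadcom firmware releases:

```shell
# Hypothetical helper: decide whether a NIC firmware version string is older
# than a required minimum, using version-aware sorting (GNU sort -V).
needs_upgrade() {
  current="$1"; minimum="$2"
  # The older of the two versions sorts first; if that is "current" and the
  # two strings differ, the host still needs the upgrade.
  [ "$(printf '%s\n%s\n' "$current" "$minimum" | sort -V | head -n1)" = "$current" ] \
    && [ "$current" != "$minimum" ]
}

# Placeholder version strings, not real Broadcom releases:
needs_upgrade "5.0.11" "6.2.14" && echo "upgrade needed"   # → upgrade needed
```

With a helper like this you can loop over a list of hosts, pull the reported firmware version from each, and only schedule maintenance-mode windows for the hosts that actually need it.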
The 9 percent vMotion problem was fixed by upgrading the firmware of the NICs on my ESX hosts.