VMWare ESXi announce High Availability (HA) for NVIDIA GRID vGPU VMs with vSphere 6.5

I was very pleased yesterday to see Pat Lee from VMware’s PM team tweet about this yesterday…

patleetweet

It’s something we knew VMware had added to vSphere 2016, vSphere 2016 supported in the GRID 4.1 (Nov 2016) release. As a VMware implemented feature this was something we at NVIDIA had to wait for them to announce. I think there have been a few problems with the documentation update staging which is why this has been a rather quiet feature release. I’ll update this blog with links to the documentation when it becomes available which should be soon!

But since Pat has let the cat out of the bag…. Probably best to answer a few basic questions straing away.

What is High Availability (HA)?

Basic HA is a feature to ensure VMs are up and running as soon as possible in the event of host failure. The VM will automatically restart as soon as possible on another host if one is available with sufficient resources. So for vGPU enabled VMs that means on a host with an appropriate GPU etc. Although the user will experience some down-time where possible this is minimized without the need for manual intervention by a system administrator.

Guaranteed High Availability…

This can be provided by HA features by allowing resources to be resourced such as RAM/CPU on hosts e.g. maybe 15% of a hosts capacity, which allows a guarantee that resource will be available to restart VMs upto a certain number of host failures. I believe that VMware’s configuration does not extend to configuring GPU resource reservation and so the support announced today will not offer guaranteed HA. It is a feature VMware could add in the future though if they saw sufficient demand, it is not a feature engineered by NVIDIA.

Can HA provide continual up-time?

No, not alone. Many hypervisors though offer Fault Tolerance (FT) which can provide such support, this is a very expensive feature to use as it relies on running essentially a duplicate VM on mirrored hardware which is phase-locked to the original (i.e. milliseconds behind), in the event of failure the user is switched to the duplicate with only a momentary glitch in user experience. It’s a feature essentially only used in a few safety / mission critical use cases as it’s so costly to implement.

So is Fault Tolerance (FT) supported for vGPU?

No not today, the technology to continually essentially snapshot a live GPU is not available. This is also a pre-requisite for live migration/motion e.g. vMotion and also regular snapshots.

The Future

NVIDIA and all the partners such as Citrix and VMware appreciate that live motion and snapshotting are key enterprise datacenter needs so we continue to work towards making such technology happen (it’s very technically hard I’m told!). We all know what you want and what you want our priorities to be!!!

NVIDIA GRID is architected with a software model which gives us the ability to add additional support for new OSs for customers existing hardware allowing them to pick up new features.