VMware announces High Availability (HA) for NVIDIA GRID vGPU VMs on ESXi with vSphere 6.5

I was very pleased to see Pat Lee from VMware’s PM team tweet about this yesterday…


It’s something we knew VMware had added to vSphere 6.5, which is supported in the GRID 4.1 (Nov 2016) release. As this is a feature implemented by VMware, we at NVIDIA had to wait for them to announce it. I think there have been a few problems with staging the documentation update, which is why this has been a rather quiet feature release. I’ll update this blog with links to the documentation when it becomes available, which should be soon!

But since Pat has let the cat out of the bag, it’s probably best to answer a few basic questions straight away.

What is High Availability (HA)?

Basic HA is a feature to ensure VMs are up and running again as soon as possible in the event of host failure. The VM will automatically restart on another host, if one is available with sufficient resources. For vGPU-enabled VMs, that means a host with an appropriate GPU that has capacity free. The user will experience some down-time, but it is kept to a minimum and no manual intervention by a system administrator is needed.
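
To make the restart rule concrete, here is a minimal Python sketch of the placement decision described above. It is an illustration only, not VMware's actual algorithm: the `Host` and `VM` classes, the `pick_restart_host` function and the first-fit strategy are all assumptions made for this example, and the vGPU profile string is just a label.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    """Hypothetical view of one host in the cluster."""
    name: str
    alive: bool
    free_ram_gb: int
    free_vcpus: int
    # Free vGPU slots per profile label, e.g. {"M60-1Q": 2} (illustrative).
    free_vgpu_slots: Dict[str, int] = field(default_factory=dict)


@dataclass
class VM:
    """Hypothetical description of a VM that needs restarting."""
    name: str
    ram_gb: int
    vcpus: int
    vgpu_profile: Optional[str] = None  # None means no vGPU is required


def can_host(host: Host, vm: VM) -> bool:
    """True if the host is up and has enough RAM, CPU and (if needed) vGPU."""
    if not host.alive:
        return False
    if host.free_ram_gb < vm.ram_gb or host.free_vcpus < vm.vcpus:
        return False
    if vm.vgpu_profile is not None:
        return host.free_vgpu_slots.get(vm.vgpu_profile, 0) >= 1
    return True


def pick_restart_host(vm: VM, hosts: List[Host]) -> Optional[Host]:
    """First-fit choice of a surviving host; None if nothing suitable exists."""
    for host in hosts:
        if can_host(host, vm):
            return host
    return None
```

With a failed host, a healthy host without a GPU and a healthy host with a free slot of the matching profile, `pick_restart_host` returns the third one for a vGPU VM. It returns `None` when no surviving host fits, which is exactly the case basic HA cannot help with.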

Guaranteed High Availability…

This can be provided by HA features that allow resources such as RAM/CPU to be reserved on hosts (e.g. maybe 15% of a host’s capacity), which guarantees that resources will be available to restart VMs up to a certain number of host failures. I believe that VMware’s configuration does not extend to GPU resource reservation, and so the support announced today will not offer guaranteed HA. It is a feature VMware could add in the future if they saw sufficient demand; it is not a feature engineered by NVIDIA.
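
As a rough illustration of the arithmetic behind such a guarantee, the sketch below assumes a cluster of identical hosts that each hold back the same percentage of RAM/CPU, and that VMs can be redistributed freely across the survivors. Under those assumptions, the cluster can absorb k host failures as long as k is no more than the number of hosts multiplied by the reserved fraction. The function name is hypothetical and this is not how any particular hypervisor computes it; GPU capacity is left out because, as noted above, it cannot currently be reserved.

```python
def tolerated_host_failures(num_hosts: int, reserved_percent: int) -> int:
    """Host failures a cluster of identical hosts can absorb.

    Each host runs at most (100 - reserved_percent)% load. After k failures,
    the (num_hosts - k) survivors must carry the load of all num_hosts hosts,
    which works out to k <= num_hosts * reserved_percent / 100.
    """
    if num_hosts < 1 or not 0 <= reserved_percent <= 100:
        raise ValueError("need at least one host and a percentage in 0..100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return (num_hosts * reserved_percent) // 100
```

For example, reserving 15% on each of 8 hosts covers one host failure, while the same 15% on only 4 hosts covers none.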

Can HA provide continual up-time?

No, not alone. Many hypervisors, though, offer Fault Tolerance (FT), which can provide such support. It is a very expensive feature to use, as it relies on running what is essentially a duplicate VM on mirrored hardware, kept phase-locked to the original (i.e. milliseconds behind). In the event of failure the user is switched to the duplicate with only a momentary glitch in user experience. It’s a feature essentially only used in a few safety- or mission-critical use cases because it’s so costly to implement.

So is Fault Tolerance (FT) supported for vGPU?

No, not today. The technology to essentially snapshot a live GPU continually is not available. That technology is also a prerequisite for live migration/motion (e.g. vMotion) and for regular snapshots.

The Future

NVIDIA and all the partners such as Citrix and VMware appreciate that live motion and snapshotting are key enterprise datacenter needs, so we continue to work towards making such technology happen (it’s technically very hard, I’m told!). We all know what you want and what you want our priorities to be!

NVIDIA GRID is architected with a software model that lets us add new features and support for new OSs on customers’ existing hardware, so they can pick up new features without replacing it.

A few FAQs on Azure N-Series, including: do Azure N-Series VMs include NVIDIA GRID software licenses?

Last week, on December 1st 2016, I was at the UK Citrix User Group in London. It was the day Azure launched a new series of graphics-focused VMs, the N-Series. There is an NVIDIA blog as well as a Microsoft one giving an overview of the technology. Whilst I was there, CTP Thomas Poppelgaard and others asked a few questions that I couldn’t answer on the spot, so this blog is a bit of an update for them.

The interest in Azure N-Series which use NVIDIA GRID GPUs via DDA (Direct Device Assignment – essentially pass-through) has been incredible and quite a few questions have come up already.

What regions have N-Series availability?

Check at the time of reading, but initially, on 1st Dec 2016, the N-Series was launched in 4 regions: South Central US, East US, West Europe and South East Asia. You can check availability and pricing here: https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/

Are NVIDIA GRID Software Licenses included in the pricing?

I went back and asked the Product Manager for this release, and I’m told that if you are using VDI (e.g. Citrix XenDesktop) then yes, they are included within the Azure pricing and there is no need to buy additional NVIDIA GRID software licenses.

However, if you are using an RDSH solution such as XenApp, end-users are required to purchase NVIDIA GRID vApps licenses per session. I believe this is because Azure doesn’t have a mechanism to bill for this in their current implementation; that may of course change in the future. You can find out more about NVIDIA GRID licensing in the guides available under “Resources->Deployment Guides” on the NVIDIA GRID home page.

I have struggled to find documentation on this though, so if anyone has a link, that would be useful! I’ll update this blog when I find one!

So I can use XenApp with Azure N-series?

Yes, although note the requirement to purchase vApps licenses. Citrix themselves are very enthusiastic about this option and have published some excellent blogs, deployment guides etc. See:

Can I use Teradici Cloud Access Software on Azure N-series?

I believe so, and NGCA Marius Sandbu has been trialing this already: http://msandbu.org/test-run-of-teradici-cloud-access-software-on-azure-n-series/. Marius works intensively with Azure and has been part of the N-Series early access program, so I’d highly recommend reading his other blogs.

Can I use fra.me with Azure N-Series?

Yes. fra.me, who specialize in CAD and graphical application delivery, have done some great work in conjunction with NVIDIA’s own performance engineering teams to investigate Esri ArcGIS and Autodesk Revit performance on N-Series. Read here: https://blog.fra.me/a-first-look-microsoft-azure-n-series-vs-aws-g2-instance-type-9930c5b1a644#.2a2yh15so