VMware announces High Availability (HA) for NVIDIA GRID vGPU VMs with vSphere 6.5

I was very pleased to see Pat Lee from VMware’s PM team tweet about this yesterday…


It’s something we knew VMware had added to vSphere 2016, which is supported in the GRID 4.1 (Nov 2016) release. As a VMware-implemented feature, this was something we at NVIDIA had to wait for them to announce. I think there have been a few problems with the staging of the documentation update, which is why this has been a rather quiet feature release. I’ll update this blog with links to the documentation when it becomes available, which should be soon!

But since Pat has let the cat out of the bag… it’s probably best to answer a few basic questions straight away.

What is High Availability (HA)?

Basic HA is a feature to ensure VMs are up and running again as soon as possible in the event of host failure. The VM will automatically restart on another host if one is available with sufficient resources; for vGPU-enabled VMs that means a host with an appropriate GPU. The user will experience some downtime, but it is minimized and needs no manual intervention by a system administrator.

Guaranteed High Availability…

This can be provided by HA features that allow resources such as RAM/CPU to be reserved on hosts (e.g. maybe 15% of a host’s capacity), which guarantees that resource will be available to restart VMs up to a certain number of host failures. I believe that VMware’s configuration does not extend to GPU resource reservation, so the support announced today will not offer guaranteed HA. It is a feature VMware could add in the future if they saw sufficient demand; it is not a feature engineered by NVIDIA.
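
To make the idea of reserved capacity concrete, here is a minimal Python sketch of the kind of check a guaranteed-HA scheme would need to make for GPUs. It is purely illustrative and is not how VMware's admission control works: the function name, the idea of counting whole "vGPU slots" per host and the sample numbers are all assumptions, and it ignores real constraints such as vGPU profile types and how slots are spread across physical GPUs.

```python
from itertools import combinations


def tolerates_host_failures(hosts, failures=1):
    """Return True if the VMs from any `failures` failed hosts fit on the survivors.

    `hosts` is a list of (capacity, used) pairs counted in hypothetical
    vGPU slots. Illustrative model only, not a VMware algorithm.
    """
    if failures > len(hosts):
        raise ValueError("cannot fail more hosts than exist")
    for failed in combinations(range(len(hosts)), failures):
        # VMs that need restarting elsewhere
        displaced = sum(hosts[i][1] for i in failed)
        # Spare slots on the hosts that are still running
        free = sum(cap - used
                   for i, (cap, used) in enumerate(hosts)
                   if i not in failed)
        if displaced > free:
            return False
    return True


if __name__ == "__main__":
    # Three hosts with 8 slots each and some headroom left on every host
    print(tolerates_host_failures([(8, 6), (8, 6), (8, 4)]))  # True
    # Same hosts run nearly full: one failure cannot be absorbed
    print(tolerates_host_failures([(8, 7), (8, 7), (8, 7)]))  # False
```

Without a reservation of this kind, HA can only restart a vGPU VM if a suitable GPU happens to be free at the time of failure.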

Can HA provide continual up-time?

No, not alone. Many hypervisors, though, offer Fault Tolerance (FT), which can provide such support. This is a very expensive feature to use, as it relies on running what is essentially a duplicate VM on mirrored hardware, phase-locked to the original (i.e. milliseconds behind). In the event of failure the user is switched to the duplicate with only a momentary glitch in user experience. It’s a feature essentially only used in a few safety- or mission-critical use cases as it’s so costly to implement.

So is Fault Tolerance (FT) supported for vGPU?

No, not today. The technology to continually snapshot a live GPU is not available. This is also a prerequisite for live migration/motion (e.g. vMotion) and for regular snapshots.

The Future

NVIDIA and all the partners such as Citrix and VMware appreciate that live motion and snapshotting are key enterprise datacenter needs so we continue to work towards making such technology happen (it’s very technically hard I’m told!). We all know what you want and what you want our priorities to be!!!

NVIDIA GRID is architected with a software model which gives us the ability to add support for new OSs on customers’ existing hardware, allowing them to pick up new features.

NVIDIA GRID: More info on vApps and VPC/vWS Licensing

Check out Luke Wignall’s blog on NVIDIA GRID licensing and other GRID topics!

I wrote a blog on RDSH (including XenApp) licensing and the options available with NVIDIA GRID vGPU and GPU-passthrough a few weeks ago, which you can read – here (including support for multi-monitor and resolutions). Since then my colleague Luke has added some more information in a blog where he outlines various case studies including many on vApps, which is worth a read here:

Luke answers how many licenses and what type you will need for various use cases, answering questions such as:

  • Q: I am deploying Citrix XenDesktop for 5000 global users, using two data centers, to meet a follow the sun productivity goal.  The data centers are also backup sites to each other.  I expect at most 1200 users at each of our three regional areas to be on during their workday, connecting to their closest data center, but there is some overlap (people working late or starting early) so I am architecting with a buffer for a total of 1500 virtual desktops.  I need to be able to run all users from either data center if one should go down. My users are all engineers and their apps require Quadro.
  • Q:  I am deploying virtual desktops but using XenApp to do so, and am looking for improved end user experience, for 1000 users.  At any given time I expect no more than 850 users to be connected.  I have no other desktop delivery method.
  • Q:  I chose to run XenApp on a bare metal host, so no hypervisor (I would question the decision to forgo the flexibility and manageability of virtualization), delivering three Microsoft Office applications.  I have 500 users but expect no more than 350 of them to be connected at any given time.  I have no virtual desktops for these users.
  • Q:  I have 250 engineers using CATIA and similar apps, they must have Quadro drivers, but usually only 200 of them are working at any given time.  I also have 1000 knowledge workers that range from sales to support, their apps do not need Quadro but perform much better with GPU (=happy users), of those I typically see 800 actively on their desktops.  I am deploying VMware Horizon.  We have a set of web apps that all 1250 employees use for time keeping, expenses, and safety training, these I am delivering with XenApp.

 

There is a lot of information on GRID licensing in our knowledge base – just search on “GRID licensing” on our KB home page here:

Highlights include:

Licensing Documentation:

Of course, among the best references are the official licensing guides on the GRID resources page (under deployment guides) here: http://www.nvidia.com/object/grid-enterprise-resources.html. In particular these two are useful:

 

Questions

Any questions – ask below or on the NVIDIA GRID support forums at https://gridforums.nvidia.com

Significant leaps in virtualized NVIDIA vGPU monitoring

Read the documentation – the User Guide provided alongside the management SDK is really comprehensive!

Today NVIDIA announced a new monitoring SDK / API incorporated into its GRID vGPU products as part of their GRID August 2016 (4.0) release. This will be available from Friday 26th August 2016 as a software release for existing hardware, greatly enhancing the functionality for existing as well as new customers. (You can read the announcement here).

NVIDIA has broken ranks with traditional hardware-only GPU models and recognized that enterprises need software to manage and monitor GPUs as a component of the data centre. Software licensing has enabled existing customers to benefit from new features with fully supported software, directly supported by NVIDIA (you wouldn’t run your Microsoft OS or CAD software unsupported!).

Optimising TCP for Citrix HDX/ICA including Netscaler

Marius Sandbu – NGCA (NVIDIA GRID Community Advisor)  aka Clever Viking!

The TCP implementation within the Citrix HDX/ICA protocol used by XenDesktop and XenApp, and also within Citrix Netscaler, is pretty vanilla with respect to the original TCP/IP standards, and the out-of-the-box configuration usually does a good job on a LAN. However, for WAN scenarios, particularly with higher latencies and certain kinds of data (file transfers), Citrix deployments can benefit greatly from some tuning.

 

One of our new NGCAs (NVIDIA GRID Community Advisors), Marius Sandbu, has written a must-read blog on how to optimize TCP with a Citrix Netscaler in the equation: http://msandbu.org/tag/netscaler-tcp-profile/

Marius highlights some of the configuration optimisations hidden away in the Netscaler documentation and you’ll probably want to refer to that documentation too (https://docs.citrix.com/en-us/netscaler/11-1/system/TCP_Congestion_Control_and_Optimization_General.html).

Citrix HDX TCP is not optimized for many WAN scenarios but at the moment it can also be tuned manually following this advice: CTX125027 – How to Optimize HDX Bandwidth Over High Latency Connections. This is one configuration I’d love to see Citrix automate as having to tune and configure the receiver is fiddly and also not possible in organisations/scenarios where the end-points and server/network infrastructure might be provided by different teams or even companies (e.g. IaaS).

 

For Citrix NVIDIA GRID vGPU customers looking at high network latency scenarios – it really is worth investigating the potential and benefits of TCP window tuning. I’d be really interested to hear feedback if you have tried this and what your experience / thoughts are too!
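
As a rough illustration of why window tuning matters at high latency, the sketch below uses the standard relationship between TCP window size, round-trip time and throughput: a single connection can send at most one window of data per round trip. The function names and the example link figures are assumptions for illustration and are not taken from the Citrix documentation or CTX125027.

```python
def bandwidth_delay_product_bytes(bandwidth_mbps, rtt_ms):
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_mbps * 1_000_000 / 8 * (rtt_ms / 1000)


def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound on single-connection throughput for a given window size."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000


if __name__ == "__main__":
    # Hypothetical WAN link: 100 Mbps with a 100 ms round-trip time
    print(bandwidth_delay_product_bytes(100, 100))
    # Classic 64 KB window (no window scaling) over the same latency
    print(max_throughput_mbps(65535, 100))
```

With these example numbers, a 64 KB window at 100 ms round-trip time caps a single connection at roughly 5 Mbps regardless of link speed, while filling a 100 Mbps link at that latency needs about 1.25 MB in flight.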

 

Marius Sandbu, who is Norwegian, was recently awarded NGCA status by NVIDIA for his work with our community, drawing on his experience with Netscaler, remoting protocols and technologies such as UDP and TCP/IP. You can follow him on twitter @msandbu and of course do follow his excellent blog on http://msandbu.org/ !!!

More Lenovo Servers Support NVIDIA GPUs Including the M60

Lenovo have recently qualified and announced support for more NVIDIA GPUs for several servers including the x3650 M5 (E5-2600 v4), details can be found on Lenovo’s site, here:

Also recently listed is the x3500 M5:

This means Lenovo have worked with NVIDIA to test and certify that both parties’ hardware, firmware and software are fully compatible and thermally and electrically stable.

Lenovo and vGPU/GPU-passthrough

Lenovo’s “redbook” site with server specifications and support also carries a wealth of information about Lenovo’s investment and joint development to support GPU technologies and virtualization including NVIDIA GRID vGPU. In particular their reference architecture designs including considerations for GPU usage are excellent and available for both VMware and Citrix infrastructures. You can read them here:

I’ve found the best place to start a search on Lenovo’s site is here: https://lenovopress.com/redpxref-system-x-reference and here:

 

Hypervisor Support

The GRID M60 card is now supported on more bare-metal/physical servers. Customers looking to use the M60 card with GRID vGPU in conjunction with a hypervisor such as Citrix XenServer or VMware ESXi should verify that the server OEM has also certified with the hypervisor by checking the VMware/Citrix HCL (Hardware Compatibility List). Details of how to do this can be found in these NVIDIA Support articles:

GPU Sizer – Community tool seeks Beta Testers

A few lucky folks at E2EVC, a couple of weeks ago in Las Vegas, got a sneak preview of a couple of new community tools for analyzing application usage of NVIDIA GPUs. I have already blogged about Jeremy Main’s GPU Profiler (read about it – here).


The other tool is one from community GPU and virtualisation expert Magnar Johnsen from Norway, who is well-known in the Virtualisation communities for his GPU-enabled deployments and tools. Magnar was in fact one of the community users who we invited to NVIDIA to speak to our engineers and product managers about the future direction of our products and user needs.

Magnar has released this tantalizing screen shot of his new tool and is actively inviting beta testers and GPU users to try it out and input into its development. You can sign up for the beta program here: http://virtualexperience.us13.list-manage.com/subscribe?u=efedd1e2c3378132102c90273&id=3875dd956b


One particularly interesting feature is the tool’s ability to monitor whether applications are using APIs to access the GPU, such as DirectX (DX9, DX10, DX11), OpenGL, OpenCL, CUDA etc.

Magnar Johnsen is an EUC solution specialist, blogger, speaker, and community tool developer with 15+ years’ experience in End User Computing. Magnar works as a consultant in Bergen, Norway. He has worked with Citrix, Microsoft and VMware products since 1999 and with NVIDIA products since 2012. Magnar has a passion for technology, computer visualization and virtual reality. He has basic experience with 3D modeling, graphic manipulation and video effects, which helps him better design and implement 3D and graphical applications in a virtual environment. He has assessed, designed, implemented and supported many virtual graphics solutions based on NVIDIA technology for small to large companies in the oil and gas industry in Norway. Magnar shares his knowledge, tools and experience on his blog http://www.virtualexperience.no and speaks at several industry conferences like Citrix Synergy, BriForum and Citrix User Group. You can follow Magnar for updates on his blog and GPU Sizer on twitter @MagnarJohnsen.

GPU Profiler – NVIDIA Community Tool

Just a quick blog to highlight a new community tool written as a hobby project by one of our GRID Solution Architects, Jeremy Main.  As a community tool this isn’t supported by NVIDIA and is provided as is. The advantage of releasing it this way is that Jeremy has provided the tool on GitHub, where partners, customers and the community can access it, discuss enhancements and report bugs.