Today NVIDIA announced a new monitoring SDK/API incorporated into its GRID vGPU products as part of the GRID August 2016 (4.0) release. It will be available from Friday 26th August 2016 as a software release for existing hardware, greatly enhancing the functionality for existing as well as new customers. (You can read the announcement here).
NVIDIA has broken ranks with the traditional hardware-only GPU model and recognized that enterprises need software to manage and monitor GPUs as a component of the data center. Software licensing lets existing customers benefit from new features in fully supported software, backed directly by NVIDIA (you wouldn’t run your Microsoft OS or CAD software unsupported!).
Hardware vendors investing in data center management and monitoring software is nothing new: think of Cisco UCS, NetApp VSC, Tivoli Monitoring, HP’s System Management…. But for some reason the GPU market has been stuck in the dark ages with an unsupported, hardware-only model. NVIDIA has long since cracked raw hardware performance, but like any data center component a GPU needs killer software to leverage and manage the hardware, and I think this release has cracked that for the GPU market.
There is a wealth of new metrics exposed, including per-VM vGPU usage, a long-requested feature. The NVIDIA Management Library (NVML) is a C-based API for monitoring and managing various states of NVIDIA GPU devices. NVML is delivered in the GRID Management SDK, which also includes a runtime version.
Provided alongside the GRID SDK is the nvidia-smi command-line tool, which calls NVML; NVML-based Python bindings are also available.
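As a rough illustration of what the Python bindings make possible, here is a minimal sketch that reads a few per-GPU values through NVML. It assumes the `pynvml` module (distributed in the `nvidia-ml-py` package) is installed and that an NVIDIA driver is present; the function name `read_gpu_metrics` and the dictionary keys are my own choices, not part of the SDK. It only covers physical-GPU queries; for the vGPU-specific calls, check the documentation in the GRID SDK.

```python
def read_gpu_metrics():
    """Return one dict per physical GPU, or [] if NVML is unavailable."""
    try:
        import pynvml  # assumption: the nvidia-ml-py bindings are installed
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no NVIDIA driver / NVML library on this machine

    metrics = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics.append({
                "index": index,
                "name": name,
                "gpu_util_pct": util.gpu,
                "mem_used_bytes": mem.used,
                "mem_total_bytes": mem.total,
            })
    except pynvml.NVMLError:
        pass  # a query was not supported; return what was collected so far
    finally:
        pynvml.nvmlShutdown()
    return metrics


if __name__ == "__main__":
    for gpu in read_gpu_metrics():
        print(gpu)
```

On a machine without an NVIDIA driver the function simply returns an empty list, which keeps it safe to call from a generic monitoring agent.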
Basically, this new release adds most of the metrics already available for physical GPUs to vGPUs. It’s worth reading the NVML API reference guide, here. Of course, some metrics, such as ECC and fan speed, don’t really make sense for vGPU and graphics, so you need to read the documentation included in the GRID SDK.
The NVIDIA team has done a very good job of exposing (and not exposing) a wide range of metrics both on the host and within VMs. An individual end user will be able to see their own framebuffer and vGPU utilization, but not metrics that could compromise cloud security by revealing information about other VMs/users, such as GPU temperature or vGPU usage by other VMs. A sysadmin, by contrast, will be able to access host-level information about every VM’s individual usage.
There is also a wide range of infrastructure queries included, giving more information about which vGPUs are available, their properties and so on: everything needed to build really good monitoring and management tools.
You can read more details about the functionality on the NVIDIA blog; it includes:
- Query the supported vGPU types, creatable vGPU types and currently active vGPU types on a physical GPU.
- vGPU Properties — Gain insight into the properties of a vGPU profile, such as name, number of displays supported, maximum resolution supported, frame buffer size, current license status and more.
- Utilization Reports — For an active vGPU/virtual machine, these metrics report average 3D engine, frame buffer, encode engine and decode engine utilization since the last monitoring cycle.
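To give a feel for how a monitoring tool might consume the per-VM utilization reports listed above, here is a minimal Python sketch. Everything in it is hypothetical: the `VgpuUtilization` record, its field names, the VM names and the 90% threshold are my own illustration of the four engines mentioned (3D, frame buffer, encode, decode), not the SDK’s actual data structures.

```python
from dataclasses import dataclass


@dataclass
class VgpuUtilization:
    """Hypothetical record for one vGPU/VM over one monitoring cycle."""
    vm_name: str
    gpu_3d_pct: int
    frame_buffer_pct: int
    encoder_pct: int
    decoder_pct: int


def busiest_engine(sample):
    """Return (engine_name, percent) for the most heavily used engine."""
    engines = {
        "3d": sample.gpu_3d_pct,
        "frame_buffer": sample.frame_buffer_pct,
        "encoder": sample.encoder_pct,
        "decoder": sample.decoder_pct,
    }
    name = max(engines, key=engines.get)
    return name, engines[name]


def flag_hot_vms(samples, threshold_pct=90):
    """List the VMs whose busiest engine is at or above the threshold."""
    hot = []
    for sample in samples:
        engine, value = busiest_engine(sample)
        if value >= threshold_pct:
            hot.append((sample.vm_name, engine, value))
    return hot


# Made-up sample data, purely for illustration.
samples = [
    VgpuUtilization("vm-cad-01", 95, 60, 0, 0),
    VgpuUtilization("vm-office-07", 12, 20, 3, 1),
]
print(flag_hot_vms(samples))
```

A real tool would fill these records from the SDK on each monitoring cycle and would probably look at trends over several cycles rather than a single sample.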
The SDK exposes metrics via WMI, and many also via PDH, allowing existing tools such as the Windows perfmon GUI and GPU-Z to pick them up automatically. In addition, the metrics have been added to the updated nvidia-smi command-line tool provided in the new release.
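If you would rather script against nvidia-smi than call NVML directly, its CSV query mode is easy to parse. The sketch below is one way to do it, assuming `nvidia-smi` is on the PATH; it uses the standard `--query-gpu` and `--format=csv,noheader,nounits` options for physical-GPU fields only. The helper names and the sample text are my own, and the vGPU-specific fields added in this release are not shown, so consult the GRID documentation for those.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(text):
    """Convert a numeric field, returning None for values like '[N/A]'."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_smi_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        name, util, used, total = (field.strip() for field in record)
        rows.append({
            "name": name,
            "gpu_util_pct": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return rows


def query_nvidia_smi():
    """Run nvidia-smi and parse its output; [] if the tool is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              check=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_smi_csv(done.stdout)


# Made-up sample output, so the parser can be tried without a GPU.
SAMPLE = "Tesla M60, 37, 512, 8192\nTesla M60, [N/A], 128, 8192\n"
print(parse_smi_csv(SAMPLE))
```

Splitting the parsing from the subprocess call means the parser can be exercised against captured output on a machine with no GPU at all.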
Third-party monitoring tools
Hypervisors such as Citrix XenServer and VMware ESXi already integrate with the underlying NVML libraries within the SDK, and I’d expect them to add the additional metrics in a future release. Details of how to build a monitoring product for XenServer against the Citrix integration with the NVIDIA SDK are covered in an earlier blog, here.
NVIDIA is working with a number of popular monitoring vendors to accelerate adoption of the new metrics into existing products, but the SDK and APIs are open, so any monitoring vendor can adopt GPU monitoring rapidly. Ask your monitoring vendor of choice about availability. The NVIDIA blog notes that NVIDIA is currently working with Citrix, VMware, eG Innovations, Lakeside Software and Liquidware Labs on integrations.
Where monitoring adds value and cuts costs
It is a bit of a no-brainer that the more insight you have into your data center infrastructure and its behavior, the quicker you can provision and optimize your hardware usage and the quicker you can debug issues and bottlenecks. The better the monitoring and management capabilities, the lower the costs of downtime and staff training. These hidden costs are rarely accounted for in a hardware-only costing model. The insight to avoid issues, plus 24/7 support if you do hit them, really should be accounted for. I read an interesting new blog from Bill Alderson (now at Apalytics and formerly a Network Architect at CA Technologies) titled “Corporate End Users Cost Effective IT Monitoring Tools?”. It quantifies, to some extent, how without decent proactive monitoring you end up effectively using your end users and their helpdesk calls as a reactive monitoring tool, with costly implications for staff productivity and also enthusiasm! I particularly liked this comment:
- What happens to a human when they have to wait long durations between attaching a document to an email, logging in for 15 minutes or when the system is down? “Mush Brain” that’s what, slow IT systems turn otherwise motivated thinking professionals into frustrated people who give up on getting things done rapidly due to slow data responses.
What this means for the future
This release adds significant functionality to aid those deploying and testing vGPU as well as managing it day to day, but it also adds another key component needed for automated, intelligent workload balancing and VM migration (vMotion/XenMotion). Going forward, I would expect GRID customers to benefit from further enhancements on their existing hardware…. Watch this space!
If you have further questions the GRID SDK forum is a good place to ask or search for answers: http://www.gridforums.nvidia.com