Monitoring NVIDIA vGPU on Citrix XenServer, including with XenCenter

Real customers setting up GRID in the GTC 2016 hands-on lab; the following week the SA team tried it out on their colleagues, including novices to GRID!

I had some fun at NVIDIA GTC 2016 taking part in a hands-on lab run by the SA (Solution Architecture) organisation, of which I am a part. These labs are proving really useful for walking users new to GRID through key operations on both the VMware and Citrix stacks. The guys running it mooted adding more on monitoring once you have got set up, and I kind of volunteered to have a crack at a bonus chapter for the hands-on, around monitoring on Citrix.

Having worked at Citrix, this was an easy one. I may well set myself the bigger challenge of getting more familiar with the VMware metrics in the future… if this proves useful, and depending on the feedback from you, the reader!

I often get asked how many vCPUs, how much RAM, etc. a VM should be provisioned with for the most random of applications, and even when I am familiar with the applications, many are used so differently by different users that it’s hard to say. For example, AutoCAD or SolidWorks have a vast range of functions, from 2D to 3D to rendering.

However, a user can tell for themselves whether they may have a provisioning problem by developing a little knowledge of the XenServer/XenCenter metrics, especially those not on by default! I’ve included some information below that I’m hoping will guide the reader to working out whether the number of vCPUs allocated is causing a problem or not… let’s see how it goes… and do look out for those hands-on labs at GTC and other NVIDIA events.

Problem

Customers aren’t always aware of all the metrics available on XenServer / within XenCenter, particularly those that help them assess whether they have provisioned resources such as vCPU and RAM for the VM optimally for the applications they are using.

Solution

This article is intended to help new users become more familiar with the metrics available and how to view them in XenCenter. I’m hoping it can be incorporated into a hands-on lab or user guide, so please add suggestions for improvements.

XenServer Monitoring

Citrix XenServer has a good number of metrics which can be accessed from a command prompt in the hypervisor or from within the XenCenter management console. Many metrics are off by default to avoid unnecessary system load where they would not normally be needed. There is a very detailed guide to which metrics are available, how to configure thresholds for alerts and how to trigger email alerts in Chapter 9 of the XenServer Administrator’s Guide. Always consult the version of the guide pertaining to the version of XenServer you are using, e.g. for XS6.5 – Citrix XenServer® 6.5 Administrator’s Guide.

The metrics for monitoring GPU usage, though, are not documented in the Administrator’s Guide, as this is a set of metrics currently associated with the NVIDIA vGPU feature rather than available for any GPU vendor. The guide does, however, contain all the information on how to add graphs for metrics such as those for GPUs and how to set up alerts etc. I’ve blogged about the availability of these metrics: https://www.citrix.com/blogs/2014/01/22/xenserverxendesktop-vgpu-new-metrics-available-to-monitor-nvidia-grid-gpus/

For NVIDIA vGPU the main metrics of interest are:

Class | Name | Units | Description | Enabled by default? | Condition for existence
Host | gpu_memory_free_<pci-bus-id> | Bytes | Unallocated framebuffer memory | No | A supported GPU is installed on the host
Host | gpu_memory_used_<pci-bus-id> | Bytes | Allocated framebuffer memory | No | A supported GPU is installed on the host
Host | gpu_power_usage_<pci-bus-id> | mW | Power usage of this GPU | No | A supported GPU is installed on the host
Host | gpu_temperature_<pci-bus-id> | °C | Temperature of this GPU | No | A supported GPU is installed on the host
Host | gpu_utilisation_compute_<pci-bus-id> | (fraction) | Proportion of time over the past sample period during which one or more kernels was executing on this GPU | No | A supported GPU is installed on the host
Host | gpu_utilisation_memory_io_<pci-bus-id> | (fraction) | Proportion of time over the past sample period during which global (device) memory was being read or written on this GPU | No | A supported GPU is installed on the host
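These data sources are off by default, and the Administrator’s Guide describes enabling recording of a data source with `xe host-data-source-record`. Here is a minimal sketch, written as a dry run that only prints the commands (the PCI bus id below is a hypothetical placeholder; substitute the id that appears in the data-source names shown by `xe host-data-source-list`):

```shell
#!/bin/sh
# Dry run: print the xe commands that would enable recording of each
# NVIDIA vGPU metric. BUS_ID is a hypothetical example value; take the
# real one from the names listed by `xe host-data-source-list`.
BUS_ID="0000:05:00.0"
for m in gpu_memory_free gpu_memory_used gpu_power_usage \
         gpu_temperature gpu_utilisation_compute gpu_utilisation_memory_io; do
  echo "xe host-data-source-record data-source=${m}_${BUS_ID}"
done
```

Remove the `echo` (or pipe the output through `sh`) to actually enable the metrics on the host.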

 

Note: GPU metrics are available in XenCenter for GPU pass-through, but because of the nature of PCIe pass-through the hypervisor has no access to the actual data (pass-through means only the VM can see/access the GPU), and so these graphs and metrics will read zero.

 

If you are troubleshooting a performance issue it is important that you identify which resource is the bottleneck. Often it may not be the GPU. Metrics that are particularly worth checking include:

 

  • Those pertaining to CPU usage on the Host

Class | Name | Description | Condition for existence | XenCenter Name
Host | cpu<cpu>-C<cstate> | Time CPU <cpu> spent in C-state <cstate> in milliseconds. | C-state exists on CPU | CPU <cpu> C-state <cstate>
Host | cpu<cpu>-P<pstate> | Time CPU <cpu> spent in P-state <pstate> in milliseconds. | P-state exists on CPU | CPU <cpu> P-state <pstate>
Host | cpu<cpu> | Utilisation of physical CPU <cpu> (fraction). Enabled by default. | pCPU <cpu> exists | CPU <cpu>
Host | cpu_avg | Mean utilisation of physical CPUs (fraction). Enabled by default. | None | Average CPU

C-state and P-state information is particularly insightful in the context of bursty applications (CAD applications often are), where peak vs. average usage can vary widely. Many servers are shipped in power-saving mode rather than configured for maximum performance. This needs to be changed in the BIOS to allow the hypervisor, and hence the application, to use the full range of P-/C-states. I wrote a guide to C-/P-states a long time ago: http://xenserver.org/partners/developing-products-for-xenserver/19-dev-help/138-xs-dev-perf-turbo.html I’m not sure whether the information is still correct with respect to the XenServer commands to optimally configure a system, but the monitoring instructions should be correct.

 

Many CAD/3D applications can be highly single-threaded and benefit from using turbo mode; Catia is one application that has often behaved like this. The highest P-state (P0) is traditionally used to indicate whether turbo is in use, but you must be careful when using XenCenter to note the convention: if turbo is available, P0 will be the turbo mode and P1 the highest non-turbo mode. The convention of labelling turbo mode with a frequency 1MHz above the normal maximum frequency means that XenCenter does not reflect the true frequency of the turbo mode, and as such users may misread it as turbo mode not occurring. E.g. on a 3400MHz Intel system, P0 will be logged as 3401MHz, while the highest non-turbo mode is P1 at 3400MHz.
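Given that +1MHz labelling convention, a tiny sanity check (a sketch; the helper name is my own invention) for deciding whether a logged P0 frequency indicates that turbo is available:

```python
def looks_like_turbo(p0_mhz: int, base_mhz: int) -> bool:
    """Apply the +1 MHz labelling convention: a P0 reported exactly
    1 MHz above the nominal maximum frequency indicates turbo mode."""
    return p0_mhz == base_mhz + 1

# On a nominal 3400 MHz part, a logged P0 of 3401 MHz signals turbo.
print(looks_like_turbo(3401, 3400))  # True
print(looks_like_turbo(3400, 3400))  # False
```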

 

  • Those pertaining to CPU usage on the VM

Class | Name | Description | Condition for existence | XenCenter Name
VM | cpu<cpu> | Utilisation of vCPU <cpu> (fraction). Enabled by default. | vCPU <cpu> exists | CPU <cpu>
VM | memory | Memory currently allocated to VM (Bytes). Enabled by default. | None | Total Memory
VM | memory_target | Target of VM balloon driver (Bytes). Enabled by default. | None | Memory target
VM | memory_internal_free | Memory used as reported by the guest agent (KiB). Enabled by default. | None | Free Memory
VM | runstate_fullrun | Fraction of time that all VCPUs are running. | None | VCPUs full run
VM | runstate_full_contention | Fraction of time that all VCPUs are runnable (i.e. waiting for CPU). | None | VCPUs full contention
VM | runstate_concurrency_hazard | Fraction of time that some VCPUs are running and some are runnable. | None | VCPUs concurrency hazard
VM | runstate_blocked | Fraction of time that all VCPUs are blocked or offline. | None | VCPUs idle
VM | runstate_partial_run | Fraction of time that some VCPUs are running and some are blocked. | None | VCPUs partial run
VM | runstate_partial_contention | Fraction of time that some VCPUs are runnable and some are blocked. | None | VCPUs partial contention

  • If you are interested in measuring vCPU overprovisioning from the point of view of the host, use the host’s cpu_avg metric and check whether it is getting too close to 1.0 (rather than staying around 0.8, i.e. 80%). If you are interested in measuring it from the point of view of a specific VM, use the VM’s runstate_* metrics, especially those measuring the runnable fractions, which should stay below roughly 0.01. These metrics can be investigated via the command line or XenCenter.
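As a rough sketch of those rules of thumb (the 0.8 and 0.01 thresholds are the guideline figures from the bullet above, not hard limits, and the function names are my own):

```python
def host_overprovisioned(cpu_avg: float, threshold: float = 0.8) -> bool:
    """Host view: mean physical CPU utilisation persistently above
    ~0.8 (heading towards 1.0) suggests too many vCPUs are competing
    for the physical cores."""
    return cpu_avg > threshold

def vm_starved(runstate_full_contention: float,
               runstate_partial_contention: float,
               threshold: float = 0.01) -> bool:
    """VM view: the runnable ('contention') fractions should stay
    below ~0.01; anything higher means vCPUs are waiting for pCPUs."""
    return (runstate_full_contention > threshold
            or runstate_partial_contention > threshold)

print(host_overprovisioned(0.95))   # True
print(vm_starved(0.002, 0.004))     # False
```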

 

XenServer metrics are stored in a round robin database (RRD), which means the amount of data stored is limited by degrading the granularity of historical data. E.g. the last 10 minutes of data can be accessed at the 5s sample interval at which it was collected; older data is binned into larger samples and so becomes increasingly averaged. This means the graphs in XenCenter become smoother over time and data on short-lived events is lost. Each archive in the database samples its particular metric at a specified granularity:

  • Every 5 seconds for the past 10 minutes
  • Every minute for the past two hours
  • Every hour for the past week
  • Every day for the past year
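To see why short-lived events disappear from the older archives, here is a small sketch of the binning: a single 100% spike that is obvious at the 5-second granularity is almost averaged away in the 1-minute archive (the utilisation trace is synthetic):

```python
def downsample(samples, bin_size):
    """Average consecutive samples into bins, as an RRD does when
    promoting data to a coarser archive."""
    return [sum(samples[i:i + bin_size]) / bin_size
            for i in range(0, len(samples) - bin_size + 1, bin_size)]

# Synthetic utilisation trace: 12 five-second samples (one minute),
# mostly idle except for a single 100% spike.
five_sec = [0.05] * 12
five_sec[6] = 1.0

one_min = downsample(five_sec, 12)   # the 1-minute archive's view
print(max(five_sec))                 # 1.0 -- spike visible at 5s granularity
print(round(one_min[0], 3))          # 0.129 -- smoothed almost away at 1 min
```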

 

XenCenter contains a very generic interface to the metric data, which means that any available metric can be graphed and plotted. Knowing the names of the GPU metrics, the exercises below show you how to add them to XenCenter graphs.

 

Exercise: Adding P-state graphs to XenCenter

Find the section “Configuring Performance Graphs” within the XenServer Administrator’s Guide and follow the steps:

To Add A New Graph

  1. On the Performance tab, click Actions and then New Graph. The New Graph dialog box will be displayed.
  2. In the Name field, enter a name for the graph.
  3. From the list of Datasources, select the check boxes for the datasources you want to include in the graph, i.e. those with the format CPU <cpu> P-state <pstate>.
  4. Add all available P-states for the first CPU
  5. What C-states are available?
  6. Click Save.
  7. Now view the graph:
  8. Can you tell whether turbo-boost is in use? (Hover over the graph.)

 

Exercise: Check whether vCPU contention is occurring using XenCenter

  • Hint: you may need to add a graph for certain runstate_* metrics
  • Hint: you may also need to check a CPU metric, which one?

 

Checking your GPU configuration

The XenServer CLI (Command Line Interface) offers many commands to probe your XenServer environment. Again, these are documented in the Administrator’s Guide, in an appendix sub-section titled “GPU Commands”. The CLI has good, if esoteric, tab completion.

Exercise: Check which vGPU types are used on each pGPU (physical GPU) in the system

Use

  • xe pgpu-list

to get a list of the pGPUs, then use the output from this as input to the xe command that reads the

  • resident-VGPUs

parameter to find out which vGPUs have been configured on each pGPU.
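Putting the two steps together as a sketch: the UUIDs below are hypothetical placeholders, and the loop only prints the `xe pgpu-param-get` invocations (a dry run) so you can inspect them before running anything on a real host.

```shell
#!/bin/sh
# Dry run: for each pGPU UUID, print the command that reports which
# vGPUs are resident on it. On a real host, populate PGPU_UUIDS with:
#   PGPU_UUIDS=$(xe pgpu-list --minimal | tr ',' ' ')
PGPU_UUIDS="11111111-aaaa 22222222-bbbb"   # hypothetical placeholders
for uuid in $PGPU_UUIDS; do
  echo "xe pgpu-param-get uuid=${uuid} param-name=resident-VGPUs"
done
```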

Caveats

4 thoughts on “Monitoring NVIDIA vGPU for Citrix XenServer including with XenCenter”

  1. As with all your articles this is all very insightful and reflects my findings in the field spot on!

    A fun fact about CAD/CAM is that the GPU is far less important than one would think. Most applications benefit more from high CPU clock speeds than from a bigger NVIDIA GRID profile.

    One thing I’d like to add is that although turbo mode is excellent, one should not size on turbo mode because it’s not guaranteed to kick in; the way I advise it is to size properly and have turbo mode as a sort of icing on the cake.

    It’s great that XenServer offers these metrics to be able to troubleshoot or assess performance. I always start with Lakeside SysTrack to make sure I advise my clients the proper CPUs.

    One thing I also learned is that IOPS are very important for engineers. Autodesk, for example, benefits enormously from fast I/O access and low-latency storage. The faster the better, so I prefer Atlantis over SSD.


  2. IOPS is indeed critical. With an SDS solution, things are massively better and we get around 90% cache read rates for our XenDesktop VMs, which is fantastic. Interestingly, I rarely see turbo mode kicking in much, even though in recent releases of XenServer it’s the recommended setting. Just got a couple of Dell R730 servers with the new Broadwell v4 Xeon CPUs so will see how they perform.
    As to GPU vs. CPU, a lot depends. I often see four CPUs kicking in for a GPU pass-through session, so in this case, the CPU is still getting a lot of the load.
    Assuming the GPU is going to take on the brunt of the work is indeed erroneous, so you’re spot on with that point, Barry. And even with vGPU, the load is split in many cases; you can’t have one without the other.


  3. Great and thorough article, Rachel! Anyone that is about to build their GRID-enabled XenServer should pay close attention, and also read the previous articles too. It’s important to get your BIOS CPU and power settings right for GRID.

    When I did an investigation into XenServer vCPU contention, I focused on the following metrics:

    -vCPU Full Contention
    -vCPU Partial Contention
    -vCPU Concurrency Hazard

    As you’ve noted above, the “vCPU Concurrency Hazard” metric is defined as the “fraction of time that some VCPUs are running and some are runnable.” If you think about this definition, 1 core could be runnable or 10 cores could be runnable. Both fit the definition of “some” cores. I hope that this metric will become more accurate in the future to understand what percentage of cores are runnable.

    I’d also like to see GPU contention metrics in the future to monitor the time-slicing of the cores.

    Thanks for putting all this info together, Rachel!

    -Richard


  4. Richard,
    The time-slice metric would indeed be interesting, but you’d of course have to do some sort of average as this happens so fast that XenCenter can only update I believe every 5 seconds. But nevertheless, this would be very useful to see how things are time-sliced at least trend-wise or if a GPU is over-loaded.
    -=Tobias

