Benchmarking virtualized NVIDIA GRID GPU cards using HPC methodologies! Don’t wear shiny green high-heels in the farmyard!

Follow GRID team’s @AmandaMSaunders and the adventures of the magical GRID shoes!

I’ve been cc’d on a clutch of strange enquiries recently where people are trying to evaluate NVIDIA GRID GPU cards using HPC (high-performance computing) methods, benchmarks and comparisons. Hypervisors aren’t usually fully compatible with HPC application architecture, and so, although the newest NVIDIA Tesla cards (M60/M6) can be switched between a “compute” mode and a specialist “graphics” mode, that graphics mode is provided for virtualized graphics and designed with hypervisors and graphical application architecture in mind.

The previous (Kepler) generation of NVIDIA GPUs for virtualized graphics (GRID K1/K2) was designed solely for graphical workloads.

Case study: An attempt to benchmark GRID K2 cards using oclHashCat

We had a user recently looking to use an application called oclHashCat to evaluate the performance of CUDA/OpenCL on a K2 card in GPU pass-through mode by brute forcing some passwords. They then queried why they found better results with a consumer graphics card in a laptop.

The choice of benchmark was highly puzzling: the application footprint of a password cracker is so different from anything a user virtualizing graphics would encounter that the results have little relevance to what any real user would experience. If you are looking to remote graphics you really should look at a more realistic benchmark, particularly when virtualizing, as hypervisors simply aren’t fully compatible with HPC applications and the graphical application footprint itself can affect hypervisor behavior. Basically, if you are planning to virtualize graphics, comparisons to physical HPC workloads are pretty much meaningless.

Nevertheless, I think it’s worth explaining why the user was right to query why a consumer card appeared to be “better”. It was probably a reasonable assumption on that user’s part that behavior with the hashCat application would correlate well with single-precision calculation performance, which is an important property of GPUs used to accelerate graphical workloads. To a large extent this is really a Tesla performance question. The GRID K2 has excellent single-precision performance (approx. 2x 1.8 TFLOPS), comparable to a Tesla K10, and it would be very reasonable to expect the K2 to outperform any GPU found in a reasonable laptop in this regard.

We like to investigate any anomaly like this and understand why a particular application behaves differently from how a user might expect, which usually involves delving into the specific algorithms of number-crunchers like hashCat to explain the differences. The Kepler GRID cards weren’t particularly well suited to this type of password cracking; we recognized this and invested in Maxwell to improve it. hashCat themselves acknowledged our now-leading technology in this announcement (I’ve added a small illustration of LOP3 after the quote below), saying:

  • “To make it short, in the past AMD had a strong advantage over NVidia with their instructions BFI_INT and BIT_ALIGN. Those instructions are very useful in crypto. NVidia draw the level with the new LOP3.LUT and the SHF instruction (that was already introduced with sm_35).”
  • “Actually, the LOP3.LUT has advantage over BFI_INT. It’s much more flexible and can be used in other cases as well. Additionally, NVidia added another instruction “IADD3” that can add 3 integers all at once and store the result in a fourth integer. This instruction is also very useful in crypto.”
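
To make that concrete, here is a minimal CUDA sketch, entirely my own illustration rather than anything from hashCat, of the sort of thing LOP3.LUT enables. The SHA-256 “majority” function Maj(a,b,c) = (a&b)|(a&c)|(b&c) normally costs several logic instructions, but on Maxwell (compile for sm_50 or newer) it can be expressed as a single lop3.b32 whose immediate 0xE8 is simply the 8-bit truth table of that expression. The function names below are hypothetical and the plain version is only there for comparison.

    // lop3_maj.cu - illustrative sketch only; compile with: nvcc -arch=sm_50 lop3_maj.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    // The "majority" function written out longhand: three ANDs and two ORs.
    __device__ unsigned int maj_plain(unsigned int a, unsigned int b, unsigned int c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    // The same function as a single three-input logic op (LOP3, Maxwell sm_50+).
    // The immediate 0xE8 is the truth table of (a&b)|(a&c)|(b&c).
    __device__ unsigned int maj_lop3(unsigned int a, unsigned int b, unsigned int c)
    {
        unsigned int d;
        asm("lop3.b32 %0, %1, %2, %3, 0xE8;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
        return d;
    }

    __global__ void check(unsigned int a, unsigned int b, unsigned int c, unsigned int *out)
    {
        out[0] = maj_plain(a, b, c);
        out[1] = maj_lop3(a, b, c);  // should match out[0]
    }

    int main()
    {
        unsigned int *out;
        cudaMallocManaged(&out, 2 * sizeof(unsigned int));
        check<<<1, 1>>>(0xDEADBEEFu, 0x01234567u, 0x89ABCDEFu, out);
        cudaDeviceSynchronize();
        printf("plain = 0x%08X, lop3 = 0x%08X\n", out[0], out[1]);
        cudaFree(out);
        return 0;
    }

hashCat’s kernels obviously do far more than this, but it gives a feel for why an extra three-input logic instruction matters enormously inside a hashing loop and not at all to a VDI graphics workload.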

As above, I don’t think this is a worthwhile study for evaluating cards intended for use in VDI graphics, but the newer M60/M6 cards, built on top of a compute platform, were architected to perform better than the K2, which was designed and configured specifically for graphics only. If you want to do password cracking or bitcoin mining etc. you probably want to look at one of the specialist cards designed for those workloads (I’m not an expert but am told a Radeon or NVIDIA Titan might be a good option). hashCat is simply a very specific benchmark, highly dependent on a few very specific bit-shift-like operations (something unlikely to be used by any graphical workload).

TCC / WDDM in the nvidia-smi

One user also noticed an option within nvidia-smi to change the driver model from WDDM to TCC (Tesla Compute Cluster). nvidia-smi is a low-level interface that really isn’t designed for end-user configuration. With the M60 cards we provided a mode-switcher tool to enable a switch between HPC (compute) mode and graphics (e.g. for VDI) mode, to ensure users could set their M60 cards up best for graphics. If you don’t understand what nvidia-smi does you probably shouldn’t touch it, and even if you do, you probably still shouldn’t, as such a change is highly likely to be untested or unsupported.
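
If you are merely curious which model a card is currently in, you can read it without changing anything. Below is a minimal, read-only sketch against NVML (the library that underpins nvidia-smi), assuming a Windows build linked against nvml.lib from the CUDA toolkit; nvmlDeviceGetDriverModel is a Windows-only NVML call, and as ever this is an illustration rather than a supported tool, so do check the call and enum names against your own nvml.h.

    // drivermodel_query.cu - read-only illustration; link against nvml.lib (Windows only).
    #include <cstdio>
    #include <nvml.h>

    int main()
    {
        if (nvmlInit() != NVML_SUCCESS) {
            printf("NVML init failed\n");
            return 1;
        }

        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            nvmlDriverModel_t current, pending;
            // Windows-only: reports the driver model now and after the next reboot.
            if (nvmlDeviceGetDriverModel(dev, &current, &pending) == NVML_SUCCESS) {
                printf("current: %s, pending: %s\n",
                       current == NVML_DRIVER_WDDM ? "WDDM (graphics)" : "TCC (compute)",
                       pending == NVML_DRIVER_WDDM ? "WDDM (graphics)" : "TCC (compute)");
            }
        }
        nvmlShutdown();
        return 0;
    }

Note that this only reports the setting; actually flipping a GRID/Tesla board between modes is exactly the sort of change that should go through the supported tools (such as the M60 mode switcher) rather than ad-hoc experimentation.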

This is really a fault on our part at NVIDIA for leaving configuration options exposed, so I think it is quite reasonable to explain what this option is and why it is there; after that it becomes clear why it’s best left alone.

The background of the TCC driver

It’s a Microsoft OS thing! Windows XP and earlier Microsoft OSs used the XPDM driver model; WDDM replaced it from Vista onwards, and with Windows 8.x support for XPDM was dropped altogether. The Microsoft driver model doesn’t allow XPDM (or similar) and WDDM drivers to co-exist, an issue that pops up for most virtualization vendors (e.g. causing unexpected yellow bangs for Citrix XenServer). Windows was never really designed for HPC and heavy CUDA workloads, where the driver can be tied up by applications doing large amounts of compute without any graphical output. Virtualized or not, the Windows OS expects the WDDM driver to answer a regular handshake (the Timeout Detection and Recovery, or TDR, mechanism); when a long compute job stops that handshake coming back, Windows assumes the driver has crashed and restarts it (not what you want a few seconds into a heavy compute workload).
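
If you want to see that watchdog behavior for yourself, the sketch below (my own illustration with made-up names, using the standard CUDA runtime API) launches a kernel that simply spins on the device clock for about five seconds, longer than the default TDR window of roughly two seconds. On a WDDM device with the watchdog enabled the launch is typically killed and the runtime reports a launch timeout; under TCC the kernel just takes its five seconds and returns success.

    // tdr_probe.cu - illustrative sketch only; exact behavior depends on the OS TDR settings.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Busy-wait on the GPU for the requested number of clock cycles, producing
    // no graphical output at all - exactly the pattern that upsets the watchdog.
    __global__ void spin_kernel(long long cycles)
    {
        long long start = clock64();
        while (clock64() - start < cycles) { /* burn time */ }
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // kernelExecTimeoutEnabled is set when an OS watchdog applies to this GPU.
        printf("%s: watchdog %s\n", prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled (WDDM-style)" : "disabled (e.g. TCC)");

        // Roughly 5 seconds of spinning at the reported clock (clockRate is in kHz).
        long long cycles = 5LL * 1000 * prop.clockRate;
        spin_kernel<<<1, 1>>>(cycles);
        cudaError_t err = cudaDeviceSynchronize();

        // Expect a launch-timeout style error under WDDM, cudaSuccess under TCC.
        printf("kernel result: %s\n", cudaGetErrorString(err));
        return 0;
    }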

For users wishing to use cards for HPC with Windows 7 we used to suggest keeping the older XPDM driver model in play by diverting the display to the on-board VGA (often a Matrox) card, leaving the NVIDIA GPU free for compute. The move to WDDM, however, overrode this and took that workaround away. So NVIDIA introduced the TCC driver: because WDDM can’t co-exist with other driver models, loading TCC prevents the loading of the problematic WDDM driver. The WDDM driver is designed for graphics, though, so if you are using NVIDIA cards for virtualized graphics it is what you should be using.

The TCC driver is designed for a very small niche of users with very bespoke HPC applications. Also, in TCC mode the cards do not drive any display; there is no WDDM support, so you need an additional adapter to drive the display (whether that be a virtual adapter or another physical GPU).

Conclusions

Really, if you are looking to virtualize graphics, please stick with recommended, supported configurations for graphics, use drivers designed for graphics, and use benchmarks that reflect graphical workloads (I maintain a list of links here; Ron Grass’s benchmark list is a good start). Otherwise it’s a bit like reviewing sparkly high-heels for fitness for puddle-jumping in a farmyard when you already have a pair of wellington boots! If, on the other hand, you are a ballroom dancer, you might want to choose accordingly!

Back at Citrix I ended up maintaining a support article (CTX202160, “HDX Benchmarking: Known incompatibilities and caveats with third-party benchmarks”), as a number of benchmarks have quirks when virtualized. It took me and the user who started all this quite a bit of time to delve into the specifics, and I’m rather minded that we should have a similar support article at NVIDIA… one for my long to-do list….

This isn’t a subject I knew much about and I’m still learning; this blog is kind of what I’ve learned or understand so far, so shout if there are gaps, weirdness etc.! You can write what I know about HPC benchmarking on a postage stamp!

 

One thought on “Benchmarking virtualized NVIDIA GRID GPU cards using HPC methodologies! Don’t wear shiny green high-heels in the farmyard!”


  1. It’s also important to note that some GPU devices that do support CUDA and other computational modes only do so in single precision, while the majority of applications that make use of such computations run them with double-precision floating point calculations. These can be emulated in many cases, but the toll taken to do so is enormous, making such hardware so inefficient as to be pretty much useless for those purposes. So be wary of a device that can do everything (see Rachel’s shoe analogy) as it’s likely to contain compromises. The M6 and M60 mentioned above are really foreseen to be used primarily for graphics applications. If you need an NVIDIA card for computations, consider rather a K40 or K80.



