Re: [libvirt] [RFC] libvirt vGPU QEMU integration

Friday, 19 August 2016

On 18.08.2016 18:41, Neo Jia wrote:
...
 Hi libvirt experts, 
Hi, welcome to the list.

...

 I am starting this email thread to discuss the potential solution / proposal of
 integrating vGPU support into libvirt for QEMU.

 Some quick background: NVIDIA is implementing a VFIO-based mediated device
 framework to allow people to virtualize their devices without SR-IOV, for
 example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing the
 VFIO API to handle memory and interrupts, as QEMU does today with passthrough
 devices.
So as far as I understand, is this solely NVIDIA's API, with other vendors
(e.g. Intel) using their own, or is this a standard that others will
comply with?

...

 The difference here is that we are introducing a set of new sysfs files for
 virtual device discovery and life cycle management due to its virtual nature.

 Here is the summary of the sysfs, when they will be created and how they should
 be used:

 1. Discover mediated device

 As part of the physical device initialization process, the vendor driver will
 register its physical devices with the mediated framework. These devices will
 then be used to create virtual devices (mediated devices, aka mdev).

 Then, the sysfs file "mdev_supported_types" will be available under the
 physical device sysfs. It will indicate the supported mdev types and
 configuration for this particular physical device. The content may change
 dynamically based on the system's current configuration, so libvirt needs to
 query this file every time before creating an mdev.
Ah, that was gonna be my question. Because in the example below, you
used "echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create". And I
was wondering where the number 20 comes from. Now what I am
wondering about is how libvirt should expose these to users. Moreover,
how it should let users choose.
We have a node device driver where I guess we could expose possible
options and then require some explicit value in the domain XML (but what
value would that be? I don't think taking vgpu_type_id-s as they are
would be a great idea).

...

 Note: different vendors might have their own specific configuration sysfs as
 well, if they don't have pre-defined types.

 For example, we have an NVIDIA Tesla M60 registered at 86:00.0 here, and below
 is the NVIDIA-specific configuration on an idle system.

 To query the "mdev_supported_types" on this Tesla M60:

 cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
 # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
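A minimal sketch of how a management tool might parse a listing like the one above. The function name parse_supported_types and the FIELDS list are hypothetical, and the column layout is assumed from this one NVIDIA sample only; the mail says other vendors may expose different configuration, so this is not a general parser.

```python
# Column names taken from the header of the sample listing.
FIELDS = ["vgpu_type_id", "vgpu_type", "max_instance", "num_heads",
          "frl_config", "framebuffer", "max_resolution"]

# Columns that hold plain integers in the sample.
INT_FIELDS = ("vgpu_type_id", "max_instance", "num_heads", "frl_config")


def parse_supported_types(text):
    """Parse the sample mdev_supported_types text into a list of dicts.

    Lines that do not look like a data row (the '#' header, blank
    lines, a wrapped header fragment) are skipped.
    """
    types = []
    for line in text.splitlines():
        cells = [cell.strip().strip('"') for cell in line.split(",")]
        if len(cells) != len(FIELDS) or not cells[0].isdigit():
            continue
        entry = dict(zip(FIELDS, cells))
        for key in INT_FIELDS:
            entry[key] = int(entry[key])
        types.append(entry)
    return types
```

Since the content may change with the system's configuration, a caller would presumably re-read the file and re-run the parse right before each create, rather than caching the result.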

 2. Create/destroy mediated device

 Two sysfs files are available under the physical device sysfs path:
 mdev_create and mdev_destroy.

 The syntax for creating an mdev is:

     echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create

 The syntax for destroying an mdev is:

     echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy

 The $mdev_UUID is a unique identifier for this mdev device to be created, and it
 is unique per system. 
Ah, so a caller (the one doing the echo - e.g. libvirt) can generate
their own UUID under which the mdev will be known? I'm asking because of
migration - we might want to preserve UUIDs when a domain is migrated to
the other side. Speaking of which, is there such a limitation, or will
the guest be able to migrate even if the UUID has changed?

...

 For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
 the Tesla M60 output above) and a VM UUID to be passed as
 "vendor_specific_argument_list".
I understand the need for vgpu_type_id, but can you shed more light on
the VM UUID? Why is that required?

...

 If no vendor-specific arguments are required, either "$mdev_UUID" or
 "$mdev_UUID:" will be acceptable as input syntax for the above two commands.

 To create a M60-4Q device, libvirt needs to do:

     echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create

 Then, you will see a virtual device show up at:

     /sys/bus/mdev/devices/$mdev_UUID/

 For NVIDIA, if a VM needs multiple virtual devices, all of them have to be
 created up front before bringing any of them online.

 Regarding error reporting and detection: on failure, a write() to the sysfs
 file through a file descriptor returns an error code, and a write to the sysfs
 file from the command prompt shows the string corresponding to that error code.
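The create step described above could be sketched in Python roughly as follows. The helper names format_mdev_request, write_sysfs and create_mdev are hypothetical, and passing the vendor arguments as keyword arguments is only a convenience chosen for this sketch. A failed write surfaces as an OSError carrying the error code, which matches the fd-based error reporting described in the mail.

```python
import os


def format_mdev_request(mdev_uuid, **vendor_args):
    """Build "$mdev_UUID:key=value,..." for mdev_create / mdev_destroy.

    With no vendor arguments the bare UUID is returned, which the
    description above says is accepted as well.
    """
    if not vendor_args:
        return mdev_uuid
    arg_list = ",".join(f"{key}={value}" for key, value in vendor_args.items())
    return f"{mdev_uuid}:{arg_list}"


def write_sysfs(path, value):
    """Write value to an existing file; raises OSError on failure."""
    fd = os.open(path, os.O_WRONLY)  # no O_CREAT: the file must already exist
    try:
        os.write(fd, value.encode())
    finally:
        os.close(fd)


def create_mdev(parent_sysfs_dir, mdev_uuid, **vendor_args):
    """Write a create request to <parent_sysfs_dir>/mdev_create."""
    request = format_mdev_request(mdev_uuid, **vendor_args)
    write_sysfs(os.path.join(parent_sysfs_dir, "mdev_create"), request)
    return request
```

A call would look like create_mdev(parent_dir, mdev_uuid, vgpu_type_id=type_id, vm_uuid=vm_uuid), where parent_dir is the physical device's sysfs directory and the remaining values are placeholders supplied by the caller.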

 3. Start/stop mediated device

 Under the virtual device sysfs, you will see a new "online" sysfs file.

 You can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current
 status of this virtual device (0 or 1), and to start or stop a virtual device
 you can do:

     echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online

 libvirt needs to query the current state before changing state.

 Note: if you have multiple devices, you need to write to the "online" file
 individually.

 For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
 them "online" before starting QEMU. 
This is a valid requirement, indeed.
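To make the start/stop rules from step 3 concrete, here is a minimal Python sketch that queries the state before changing it and writes to each device's "online" file individually. The function names set_online and bring_all_online are hypothetical, and mdev_root stands for a directory laid out like /sys/bus/mdev/devices.

```python
import os


def set_online(mdev_dir, want_online):
    """Set the "online" file under mdev_dir to 1 or 0.

    The current state is read first; the file is only written when the
    state actually has to change. Returns True if a write happened.
    """
    path = os.path.join(mdev_dir, "online")
    wanted = "1" if want_online else "0"
    with open(path) as f:
        current = f.read().strip()
    if current == wanted:
        return False
    with open(path, "w") as f:
        f.write(wanted)
    return True


def bring_all_online(mdev_root, mdev_uuids):
    """Bring every listed mdev online, one write per device.

    Intended to run to completion before QEMU is started.
    """
    for mdev_uuid in mdev_uuids:
        set_online(os.path.join(mdev_root, mdev_uuid), True)
```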

...

 4. Launch QEMU/VM

 Pass the mdev sysfs path to QEMU as vfio-pci device:

     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0 
One question here. Libvirt allows users to run qemu under different
user:group than root:root. If that's the case, libvirt sets security
labels on all files qemu can/will touch. Are we going to need to do
something in that respect here?
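For reference, the command-line fragment from step 4 could be assembled like this. It is only a sketch: vfio_pci_device_args is a hypothetical helper, and the default mdev_root simply mirrors the path shown in the quoted example. It does not address the ownership and labelling question raised above.

```python
def vfio_pci_device_args(mdev_uuid, device_id, mdev_root="/sys/bus/mdev/devices"):
    """Return the QEMU arguments that attach one mdev as a vfio-pci device."""
    sysfsdev = f"{mdev_root}/{mdev_uuid}"
    return ["-device", f"vfio-pci,sysfsdev={sysfsdev},id={device_id}"]
```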

...

 5. Shutdown sequence 

 libvirt needs to shut down QEMU, bring the virtual device offline, and then
 destroy the virtual device.
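The ordering in step 5 can be sketched as below. Everything here is illustrative: shutdown_sequence and read_sysfs are hypothetical names, stop_qemu is a caller-supplied callable that is assumed to return only once the QEMU process has exited, and destroy_request is the same "$mdev_UUID:..." string format used for mdev_destroy.

```python
import os


def read_sysfs(path):
    """Return the stripped contents of a sysfs-style file."""
    with open(path) as f:
        return f.read().strip()


def shutdown_sequence(stop_qemu, mdev_dir, parent_dir, destroy_request):
    """Tear down in the order given: QEMU, then offline, then destroy."""
    # 1. Stop QEMU first (hypothetical callable provided by the caller).
    stop_qemu()

    # 2. Bring the virtual device offline, querying the state beforehand.
    online_path = os.path.join(mdev_dir, "online")
    if read_sysfs(online_path) == "1":
        with open(online_path, "w") as f:
            f.write("0")

    # 3. Destroy the virtual device through the parent's mdev_destroy file.
    with open(os.path.join(parent_dir, "mdev_destroy"), "w") as f:
        f.write(destroy_request)
```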

 6. VM Reset

 No change or requirement for libvirt as this will be handled via VFIO reset API
 and QEMU process will keep running as before.

 7. Hot-plug

 It is optional for vendors to support hot-plug.

 The same syntax is used to create a virtual device for hot-plug.

 For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
 needs to write to the "mdev_destroy" sysfs file to complete the hot-unplug
 process.

 Since hot-plug is optional, the mdev_create or mdev_destroy operations may
 return an error if it is not supported.
Thank you for the very detailed description! In general, I like the API as
it looks usable from my POV (I'm no VFIO devel though).

Michal
