On Fri, Aug 19, 2016 at 03:22:48PM -0400, Laine Stump wrote:
On 08/18/2016 12:41 PM, Neo Jia wrote:
> Hi libvirt experts,
> I am starting this email thread to discuss the potential solution / proposal
> of integrating vGPU support into libvirt for QEMU.
Thanks for the detailed description. This is very helpful.
> Some quick background: NVIDIA is implementing a VFIO-based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing the
> VFIO API to handle memory / interrupts the same way QEMU does today with a
> passthru device.
> The difference here is that we are introducing a set of new sysfs files for
> virtual device discovery and life cycle management, due to the device's
> virtual nature. Here is a summary of those sysfs files, when they will be
> created, and how they should be used:
> 1. Discover mediated device
> As part of the physical device initialization process, the vendor driver will
> register its physical devices, which will be used to create virtual devices
> (mediated devices, aka mdev), with the mediated framework.
We've discussed this question offline, but I just want to make sure I
understood correctly - all initialization of the physical device on the host
is already handled "elsewhere", so libvirt doesn't need to be concerned with
any physical device lifecycle or configuration (setting up the number or
types of vGPUs), correct?
Hi Laine,
Yes, that is right, at least for NVIDIA vGPU.
Do you think this would also be the case for other
vendors using the same APIs? I guess this all comes down to whether or not
the setup of the physical device is defined within the bounds of the common
infrastructure/API, or if it's something that's assumed to have just
magically happened somewhere else.
I would assume that is the case for other vendors as well, although this common
infrastructure doesn't put any restrictions on physical device setup or
initialization, so a vendor actually has the option to defer some of it until
the point when the virtual device gets created.
But if we just look at it from the level of the API exposed to libvirt, it is
the vendor driver's responsibility to ensure that the virtual device will be
available in a reasonable amount of time after the "online" sysfs file is set
to 1. Where the HW setup happens is not enforced by this common API.
In the NVIDIA case, once our kernel driver registers the physical devices it
owns with the "common infrastructure", those physical devices are already fully
initialized and ready for virtual device creation.
> Then the sysfs file "mdev_supported_types" will be available under the
> physical device sysfs, and it will indicate the supported mdev types and
> configurations for this particular physical device. The content may change
> dynamically based on the system's current configuration, so libvirt needs to
> query this file every time before creating an mdev.
I had originally thought that libvirt would be setting up and managing a
pool of virtual devices, similar to what we currently do with SRIOV VFs. But
from this it sounds like the management of this pool is completely handled
by your drivers (especially since the contents of the pool can apparently
completely change at any instant). In one way that makes life easier for
libvirt, because it doesn't need to manage anything.
The pool (vGPU type availability) is only subject to change when virtual
devices get created or destroyed, as for now we don't support heterogeneous
vGPU types on the same physical GPU. Even when we add such support in the
future, the point of change will still be the same.
On the other hand, it makes things less predictable. For example, when
libvirt defines a domain, it queries the host system to see what types of
devices are legal in guests on this host, and expects those devices to be
available at a later time. As I understand it (and I may be completely
wrong), when no vGPUs are running on the hardware, there is a choice of
several different models of vGPU (like the example you give below), but when
the first vGPU is started up, that triggers the host driver to restrict the
available models. If that's the case, then a particular vGPU could be
"available" when a domain is defined, but not an option by the time the
domain is started. That's not a show stopper, but I want to make sure I am
understanding everything properly.
Yes, your understanding is correct - as I mentioned, there is no heterogeneous
vGPU support yet. But this does open up another interesting point: the vGPU
placement policy that libvirt might need to consider.
Also, is there any information about the maximum number of vGPUs that can be
handled by a particular physical device (I think that changes based on which
model of vGPU is being used, right?)
Yes, that is the "max_instance" in the example.
Or maybe what is the current "load" on
a physical device, in case there is more than one and libvirt (or
management) wants to make a decision about which one to use?
If you refer "load" as "physical GPU utilization", we do have a tool
to allow
you to find out such information, but it is not exposed via this mdev sysfs.
Here is a link to the NVIDIA NVML high-level overview:
https://developer.nvidia.com/nvidia-management-library-nvml
If you want to know more details about the integration with NVML, I am very
happy to talk to you and can connect you with our NVML experts for vGPU
related topics.
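
As a quick pointer, the NVML-based nvidia-smi tool can already report that
kind of utilization from the command line, for example:

  nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv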
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
> For example, we have an NVIDIA Tesla M60 on 86:00.0 registered here, and here
> is the NVIDIA specific configuration on an idle system.
> For example, to query the "mdev_supported_types" on this Tesla M60:
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
> 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
> 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
> 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
> 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
> 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
> 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
> 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
> 2. Create/destroy mediated device
> Two sysfs files are available under the physical device sysfs path:
> mdev_create and mdev_destroy
> The syntax for creating an mdev is:
> echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
> The syntax for destroying an mdev is:
> echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
> The $mdev_UUID is a unique identifier for this mdev device to be created, and
> it is unique per system.
Is there any reason to try to maintain the same UUID from one run to the
next? Or should we completely think of this as a cookie for this time only
(so more like a file handle, but we get to pick the value)? (Michal has
asked about this in relation to migration, but the question also applies in
the general situation of simply stopping and restarting a guest).
You don't have to maintain the same UUID from one run to the next. Yes, it is
more like a file handle, and you are going to pick the value.
Also, is it enforced that "UUID" actually be a 128 bit UUID, or can it be
any unique string?
Yes, it is enforced by the framework that the UUID is in 128-bit UUID format.
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> the Tesla M60 output above) and a VM UUID to be passed as the
> "vendor_specific_argument_list".
> If no vendor specific arguments are required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> To create an M60-4Q device, libvirt needs to do:
> echo "$mdev_UUID:vgpu_type_id=17,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> Then you will see a virtual device show up at:
> /sys/bus/mdev/devices/$mdev_UUID/
> For NVIDIA, to have multiple virtual devices per VM, they all have to be
> created upfront before bringing any of them online.
> Regarding error reporting and detection: on failure, a write() to the sysfs
> file returns an error code, and writing to the sysfs file from a command
> prompt shows the string corresponding to that error code.
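
To make the create step and its error reporting concrete, here is a small
shell sketch of what libvirt (or a test script) might do; the UUID generation
and the error handling are illustrative, not mandated by the interface:

  pf=/sys/bus/pci/devices/0000:86:00.0
  mdev_uuid=$(uuidgen)        # must be a real 128-bit UUID
  vm_uuid=$(uuidgen)          # NVIDIA-specific vendor argument in this proposal

  # Create an M60-4Q instance; a failed write() surfaces as a non-zero status.
  if ! echo "${mdev_uuid}:vgpu_type_id=17,vm_uuid=${vm_uuid}" > "$pf/mdev_create"; then
      echo "mdev_create failed for ${mdev_uuid}" >&2
      exit 1
  fi

  # On success the virtual device shows up here:
  ls -d "/sys/bus/mdev/devices/${mdev_uuid}"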
> 3. Start/stop mediated device
> Under the virtual device sysfs, you will see a new "online" sysfs file.
> You can do "cat /sys/bus/mdev/devices/$mdev_UUID/online" to get the current
> status of this virtual device (0 or 1), and to start or stop a virtual device
> you can do:
> echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> libvirt needs to query the current state before changing state.
> Note: if you have multiple devices, you need to write to each device's
> "online" file individually.
> For NVIDIA, if there are multiple mdevs per VM, libvirt needs to bring all of
> them "online" before starting QEMU.
> 4. Launch QEMU/VM
> Pass the mdev sysfs path to QEMU as a vfio-pci device:
> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0
1) I have the same question as Michal - you're passing the path to the sysfs
directory for the device to qemu, which implies that the qemu process will
need to open/close/read/write files in that directory. Since libvirt is
running as root, it can easily do that, but libvirt then runs the qemu
process under a different uid and with a different selinux context. We need
to make sure that we can change the uid/selinux labelling of the items in
sysfs without adverse effect elsewhere.
Also it's important that qemu doesn't need to access anything else outside
of this device-specific directory (each qemu process is running with
different selinux labeling and potentially a different uid:gid, so if there
is any common file/device node that must be accessed directly by qemu, it
would need to be safely globally readable/writable).
Similar response to Michal here:
As long as QEMU uses the VFIO API and doesn't do anything extra for any
particular vendor, there shouldn't be any problem on the QEMU side. So I don't
see any issues here.
But I would like to test it out with the proper settings for the NVIDIA vGPU
case. Currently all our testing uses sysfs and launches QEMU directly; if I
just mimic how libvirt launches QEMU for a normal VFIO passthru device, will
that cover the selinux label concerns?
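
To be concrete about what I mean by mimicking libvirt, something along these
lines is what I have in mind for the test; the uid, the selinux type, and the
assumption that the mdev exposes a /dev/vfio group node like a passthru device
does are all guesses on my side, not part of the proposal:

  # Guess at libvirt-style handling: hand the mdev's VFIO group node to the
  # qemu uid, relabel it, then launch QEMU unprivileged.
  group=$(basename "$(readlink /sys/bus/mdev/devices/${mdev_uuid}/iommu_group)")
  chown qemu:qemu "/dev/vfio/${group}"
  chcon -t svirt_image_t "/dev/vfio/${group}"   # libvirt would also add a per-VM MCS pair
  runuser -u qemu -- qemu-system-x86_64 \
      -machine pc,accel=kvm -m 4096 \
      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/${mdev_uuid},id=vgpu0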
How does this device show up in the guest? I guess it's a PCI device (since
you're using vfio-pci :-), and all the standard options for setting PCI
address apply. And is this device legacy PCI, or PCI Express? (Or perhaps it
changes behavior depending on the type of slot used in the guest?)
It depends on how the vendor driver emulates capabilities in config space. For
NVIDIA vGPU we are defining it as a PCI device, but another vendor could define
PCIe capabilities in config space, and that would show up as a PCIe device in
the guest. For the IBM solution, it's not even a PCI device, it is a channel
I/O device. It all depends on how the vendor driver simulates the device.
> 5. Shutdown sequence
> libvirt needs to shut down the qemu process, bring the virtual device
> offline, then destroy the virtual device.
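
In script form, that order would look roughly like this once the QEMU process
has exited (mdev_destroy takes the same syntax shown in section 2; whether it
needs vendor specific arguments is up to the vendor):

  # After the QEMU process has been shut down:
  echo 0 > "/sys/bus/mdev/devices/${mdev_uuid}/online"
  echo "${mdev_uuid}" > /sys/bus/pci/devices/0000:86:00.0/mdev_destroy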
> 6. VM Reset
> No change or requirement for libvirt, as this will be handled via the VFIO
> reset API and the QEMU process will keep running as before.
> 7. Hot-plug
> It is optional for vendors to support hot-plug.
> The syntax for creating a virtual device for hot-plug is the same.
> For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
> needs to write to the "mdev_destroy" sysfs file to complete the hot-unplug
> process.
> Since hot-plug is optional, the mdev_create or mdev_destroy operations may
> return an error if it is not supported.
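
Putting the hot-unplug side together as a sketch (going through virsh is just
one way to issue device_del; "vgpu0" matches the id used in the -device example
above, and $domain is a placeholder):

  # Detach the device from the running guest...
  virsh qemu-monitor-command "$domain" --hmp "device_del vgpu0"
  # ...then complete the hot-unplug by destroying the mdev (same destroy
  # syntax as above; bring it offline first if the vendor requires that, as
  # in the shutdown sequence).
  echo "${mdev_uuid}" > /sys/bus/pci/devices/0000:86:00.0/mdev_destroy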
From what I understand here, it sounds like what's needed from libvirt is
1) exposing enough info in the output of nodedev-dumpxml for an application
to use it to determine which devices are capable of creating vGPUs, and
which models of vGPU they can create.
2) to create+start (then later stop+destroy) individual vGPUs based on
[something] in the domain XML. So the question that remains is how to put it
in the domain config. My first instinct was to use some variation of
<hostdev> (since the backend of it is vfio-pci), but on the other hand
hostdev is usually used to take one device that could be used by the host,
take it away from the host, and give it to the guest, and that's not exactly
what's happening here. So I wonder if there would be any advantage to making
this another model of <video> instead.
hostdev can be a sysfs path now, right?
Thanks,
Neo