On Fri, 28 Oct 2016 11:25:55 -0400
Laine Stump <laine(a)redhat.com> wrote:
On 10/28/2016 07:28 AM, Henning Schild wrote:
> Hey,
>
> i am running an unusual setup where i assign pci devices behind the
> back of libvirt. I have two options to do that:
> 1. a wrapper script for qemu that takes care of suid-root and
> appends arguments for pci-assign
> 2. virsh qemu-monitor-command ... 'device_add pci-assign...'
With any reasonably modern version of Linux/qemu/libvirt, you should
not be using pci-assign, but should use vfio-pci instead. pci-assign
is old, unmaintained, and deprecated (and any other bad words you can
think of).
Also, have you done anything to lock the guest's memory in host RAM?
This is necessary so that the source/destination of DMA reads/writes
is always present. It is done automatically by libvirt as required
*when libvirt knows that a device is being assigned to the guest*,
but if you're going behind libvirt's back, you need to take care of
that yourself (or alternatively, don't go behind libvirt's back, which
is the greatly preferred alternative!)
Memory locking is taken care of with "-realtime mlock=on".
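One way to check that the locking actually took effect is to look at
the VmLck field in /proc/<pid>/status of the running qemu process. The
following is only a minimal sketch, assuming a Linux host; the helper
names locked_kib and read_locked_kib are made up for illustration.

```python
import re


def locked_kib(status_text):
    """Return the VmLck value (in kB) found in /proc/<pid>/status text.

    Returns None when the field is missing.
    """
    match = re.search(r"^VmLck:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_locked_kib(pid):
    """Read /proc/<pid>/status for a live process and extract VmLck."""
    with open(f"/proc/{pid}/status") as handle:
        return locked_kib(handle.read())
```

Calling read_locked_kib() with the PID of the qemu process should give
a value in the region of the guest's RAM size if the lock is in place,
and 0 if nothing is locked.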
>
> I know i should probably not be doing this,
Yes, that is a serious understatement :-) And I suspect that it isn't
necessary.
I know, but that was never the question ;).
> it is a workaround to
> introduce fine-grained pci-assignment in an openstack setup, where
> vendor and device id are not enough to pick the right device for a
> vm.
libvirt selects the device according to its PCI address, not vendor
and device id. Is that not "fine-grained" enough? (And does OpenStack
not let you select devices based on their PCI address?)
The workaround is indeed for the version of OpenStack we are using.
Recent versions might have support for more fine-grained assignment,
but updating OpenStack is not something i would like to do right now.
Another item on the TODO-list that i would like to keep separate from
the problem at hand.
>
> In both cases qemu will crash with the following output:
>
>> qemu: hardware error: pci read failed, ret = 0 errno = 22
> followed by the usual machine state dump. With strace i found it to
> be a failing read on the config space file of my device.
> /sys/bus/pci/devices/0000:xx:xx.x/config
> A few reads out of that file succeeded, as well as accesses on
> vendor etc.
>
> Manually launching a qemu with the pci-assign works without a
> problem, so i "blame" libvirt and the cgroup environment the qemu
> ends up in. So i put a bash into the exact same cgroup setup - next
> to a running qemu, expecting a dd or hexdump on the config-space
> file to fail. But from that bash i can read the file without a
> problem.
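The dd/hexdump check described above can also be scripted so that it
reports exactly which offsets misbehave. This is only a sketch:
probe_config is a hypothetical helper, and the 4-byte step and 256-byte
limit are assumptions. It records short reads as well as errors,
because the qemu message shows ret = 0.

```python
import os


def probe_config(path, chunk=4, limit=256):
    """Read `path` in `chunk`-byte steps up to `limit` bytes.

    Returns a list of (offset, bytes_read, errno) for every read that
    did not return a full chunk. bytes_read is -1 when the read raised
    an error; errno is 0 when the read was merely short.
    """
    problems = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for offset in range(0, limit, chunk):
            try:
                got = len(os.pread(fd, chunk, offset))
                err = 0
            except OSError as exc:
                got, err = -1, exc.errno
            if got != chunk:
                problems.append((offset, got, err))
    finally:
        os.close(fd)
    return problems
```

Pointing it at /sys/bus/pci/devices/0000:xx:xx.x/config from a shell
that runs under the same conditions as the qemu process, and again from
a shell where the read works, would show whether the two environments
see the file differently and at which offset.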
>
> Has anyone seen that problem before?
No, because nobody else (that I've ever heard of) is doing what you are
doing. You're going around behind the back of libvirt (and
OpenStack) to do device assignment with a method that was replaced
with something newer/better/etc about 3 years ago, and in the process
are likely missing a lot of the details that would otherwise be
automatically handled by libvirt.
Sure, and my question was aiming at what exactly i could be missing.
That is just to fix a system that used to work and get a better
understanding of "a lot of the details that would otherwise be
automatically handled by libvirt".
> Right now i do not know what i
> am missing, maybe qemu is hitting some limits configured for the
> cgroups or whatever. I can not use pci-assign from libvirt, but if i
> did would it configure cgroups in a different way or relax some
> limits?
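To test the "some limits" theory, the resource limits of two processes
can be compared directly, for example the qemu started by libvirt and
one launched by hand. A rough sketch, assuming the fixed-width column
layout of /proc/<pid>/limits on Linux; parse_limits and diff_limits are
illustrative names, not part of any tool.

```python
def parse_limits(text):
    """Map limit name -> (soft, hard) from /proc/<pid>/limits text.

    Assumes the fixed-width columns the kernel prints:
    25 characters for the name, then 20 each for soft and hard.
    """
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        name = line[0:25].strip()
        if name:
            limits[name] = (line[26:46].strip(), line[47:67].strip())
    return limits


def diff_limits(first, second):
    """Return {name: (value_in_first, value_in_second)} where they differ."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


def read_limits(pid):
    """Read and parse the limits of a live process."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limits(handle.read())
```

diff_limits(read_limits(pid_a), read_limits(pid_b)) then lists only the
limits that differ between the two processes. An empty result would
suggest that rlimits are not the culprit, although cgroup settings are
not covered by this check.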
>
> What would be a good next step to debug that? Right now i am
> looking at kernel event traces, but the machine is pretty big and
> so is the trace.
My recommendation would be this:
1) look at OpenStack to see if it allows selecting the device to
assign by PCI address. If so, use that (it will just tell libvirt
"assign this device", and libvirt will automatically use VFIO for the
device assignment if it's available (which it will be))
The version currently in use does not allow that.
2) if (1) is a dead end (i.e. OpenStack doesn't allow you to select
based on PCI address), use your "sneaky backdoor method" to do "virsh
attach-device somexmlfile.xml", where somexmlfile.xml has a proper
<hostdev> element to select and assign the host device you want.
Again, libvirt will automatically figure out if VFIO can be used, and
will properly set up everything necessary related to cgroups, locked
memory, etc.
Thanks! I will try the sneaky .xml method, in that case i will only
have to play tricks on OpenStack and hopefully get all the libvirt
details.
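For reference, the <hostdev> element for a PCI device is small enough
to generate from the host PCI address. The sketch below is one possible
way to build it; hostdev_xml is a hypothetical helper, and the address
0000:06:02.0 in the usage note is a made-up example, not the device
from this thread.

```python
import re
import xml.etree.ElementTree as ET


def hostdev_xml(pci_address, managed=True):
    """Build a libvirt <hostdev> element for a host PCI device.

    `pci_address` uses the sysfs form dddd:bb:ss.f (hex digits).
    """
    match = re.fullmatch(
        r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])",
        pci_address,
    )
    if match is None:
        raise ValueError(f"not a PCI address: {pci_address!r}")
    domain, bus, slot, function = match.groups()

    hostdev = ET.Element(
        "hostdev",
        mode="subsystem",
        type="pci",
        managed="yes" if managed else "no",
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain}",
        bus=f"0x{bus}",
        slot=f"0x{slot}",
        function=f"0x{function}",
    )
    return ET.tostring(hostdev, encoding="unicode")
```

Writing the result of hostdev_xml("0000:06:02.0") to somexmlfile.xml
and running "virsh attach-device <domain> somexmlfile.xml" would then
hand the device to libvirt. With managed set to yes, libvirt also takes
care of detaching the device from its host driver.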
>
> That assignment used to work and i do not know how it broke, i have
> tried combinations of several kernels, versions of libvirt and qemu.
> (kernel 3.18 and 4.4, libvirt 1.3.2 and 2.0.0, and qemu 2.2.1 and
> 2.7) All combinations show the same problem, even the ones that
> work on other machines. So when it comes to software versions the
> problem could well be caused by a software update of another
> component, that i got with the package manager and did not compile
> myself. It is a debian 8.6 with all recent updates installed. My
> guess would be that systemd could have an influence on cgroups or
> limits causing such a problem.
That you would need to think of such things points out that your
current setup is fragile and ultimately unmaintainable. Please
consider "coloring inside the lines" :-) (We'd be happy to help if
there are any hangups along the way, either on the libvirt-users
mailing list or in the #virt channel on irc.oftc.net).
It is a legacy reference/demo/proof-of-concept setup for
realtime-enabled VMs, that somehow broke. PCI assignment was used for
NICs when guests did not support virtio.
https://archive.fosdem.org/2016/schedule/event/virt_iaas_real_time_cloud/
Since it is a hack and unmaintainable and does not scale, we do not use
it anymore. But i was curious why it suddenly stopped working in that
old demo setup.
regards,
Henning