On 1/18/21 2:53 PM, Daniel Henrique Barboza wrote:
Gitlab issue #72 [1] reports that removing SR-IOVs VFs before
removing the devices from the running domains can have strange
consequences.
Resolution: "Don't do that!!" :-)
Seriously, though, there are any number of things that someone with root
access to the host could do that would completely confuse libvirt and
break everything. For example, assigning a PF to a guest after having
assigned one or more of that PF's VFs to guests.
But I guess there's no harm in trying to mitigate the damage. As long as
it doesn't complicate the code to the extent that it becomes more
confusing, difficult to maintain, and liable to other regressions.
QEMU might be able to hotunplug the device inside the
guest, but Libvirt will not be aware of that,
If qemu knows enough to unplug the device from the guest, then I'm
presuming that it must also be sending a DEVICE_DELETED event up to
libvirt. Of course libvirt isn't expecting this, and so probably throws
it away, right?
and then the guest is
now inconsistent with the domain definition.
There's also the possibility of the VFs removal not succeeding
while the domain is running but then, as soon as the domain
is shutdown, all the VFs are removed. Libvirt can't handle
the removal of the PCI devices while trying to reattach the
hostdevs, and the Libvirt daemon can be left in an inconsistent
state (see [2]).
This patch starts to address the issue related in Gitlab #72, most
notably the issue described in [2]. When shutting down a domain
with SR-IOV hostdevs that got missing, virHostdevReAttachPCIDevices()
is failing the whole process and failing to reattach all the
PCI devices, including the ones that aren't related to the VFs that
went missing. Let's make it more resilient with host changes by
changing virHostdevGetPCIHostDevice() to return an exclusive error
code '-2' for this case. virHostdevGetPCIHostDeviceList() can then
tell when virHostdevGetPCIHostDevice() failed to find the PCI
device of a hostdev and continue to make the list of PCI devices.
virHostdevReAttachPCIDevices() will now be able to proceed reattaching
all other valid PCI devices, at least. The 'ghost hostdevs' will be
handled later on.
[1]
https://gitlab.com/libvirt/libvirt/-/issues/72
[2]
https://gitlab.com/libvirt/libvirt/-/issues/72#note_459032148
Signed-off-by: Daniel Henrique Barboza <danielhb413(a)gmail.com>
---
src/hypervisor/virhostdev.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/src/hypervisor/virhostdev.c b/src/hypervisor/virhostdev.c
index bd35397f2c..dbba36193b 100644
--- a/src/hypervisor/virhostdev.c
+++ b/src/hypervisor/virhostdev.c
@@ -220,7 +220,8 @@ virHostdevManagerGetDefault(void)
* is returned.
*
* Returns: 0 on success (@pci might be NULL though),
- * -1 otherwise (with error reported).
+ * -1 otherwise (with error reported),
+ * -2 PCI device not found. @pci will be NULL
*/
static int
virHostdevGetPCIHostDevice(const virDomainHostdevDef *hostdev,
@@ -235,6 +236,9 @@ virHostdevGetPCIHostDevice(const virDomainHostdevDef *hostdev,
hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI)
return 0;
+ if (!virPCIDeviceExists(&pcisrc->addr))
+ return -2;
What happens if the device disappeared and then came back before you got
here? (i.e. by setting sriov_numvfs = 0, and then sriov_numvfs = 128 (or
whatever)?
It's too late in the day for my to process this too much, so I'm going
to come back to it tomorrow. I will continue on to see if there is other
"simple" stuff in subsequent patches though - even if we can't get all
of this pushed right away, there's a significant amount that can be pushed.
+
actual = virPCIDeviceNew(&pcisrc->addr);
if (!actual)
@@ -270,7 +274,7 @@ virHostdevGetPCIHostDeviceList(virDomainHostdevDefPtr *hostdevs, int
nhostdevs)
virDomainHostdevDefPtr hostdev = hostdevs[i];
g_autoptr(virPCIDevice) pci = NULL;
- if (virHostdevGetPCIHostDevice(hostdev, &pci) < 0)
+ if (virHostdevGetPCIHostDevice(hostdev, &pci) == -1)
return NULL;
if (!pci)