[libvirt] [PATCH 0/2] qemu: Update next usable PCI slot on domain reconnect

When libvirtd restarted and reconnected to existing domains, we didn't update nextslot, so hotplug would not start at the slot following the last used one.

Jiri Denemark (2):
  qemu: Use structured PCI address as hash payload
  qemu: Update next usable PCI slot on domain reconnect

 src/qemu/qemu_conf.c   |  118 +++++++++++++++++++++++++++++++++++++-----------
 src/qemu/qemu_conf.h   |    1 +
 src/qemu/qemu_driver.c |    2 +
 3 files changed, 94 insertions(+), 27 deletions(-)

--
1.7.2

Qemu driver stores a per-domain hash of used PCI addresses. Currently, payload of each hash entry is exactly the same as its key, i.e., string representation of a PCI address. This patch changes payload to be virDomainDevicePCIAddressPtr. Along the road, return values of all qemuPCIAddressAsString() calls are properly checked and error returned if required.
---
 src/qemu/qemu_conf.c |   78 ++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index fb85220..38d28bf 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -2093,19 +2093,33 @@ static int qemuCollectPCIAddress(virDomainDefPtr def ATTRIBUTE_UNUSED,
                                  void *opaque)
 {
     qemuDomainPCIAddressSetPtr addrs = opaque;
+    virDomainDevicePCIAddressPtr pci = NULL;
+    char *addr = NULL;
 
     if (dev->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI) {
-        char *addr = qemuPCIAddressAsString(dev);
+        if (!(addr = qemuPCIAddressAsString(dev)))
+            goto error;
+
+        if (VIR_ALLOC(pci) < 0) {
+            virReportOOMError();
+            goto error;
+        }
+        *pci = dev->addr.pci;
 
         VIR_DEBUG("Remembering PCI addr %s", addr);
 
-        if (virHashAddEntry(addrs->used, addr, addr) < 0) {
-            VIR_FREE(addr);
-            return -1;
-        }
+        if (virHashAddEntry(addrs->used, addr, pci) < 0)
+            goto error;
+
+        VIR_FREE(addr);
     }
 
     return 0;
+
+error:
+    VIR_FREE(addr);
+    VIR_FREE(pci);
+    return -1;
 }
@@ -2134,25 +2148,31 @@ error:
 int qemuDomainPCIAddressReserveAddr(qemuDomainPCIAddressSetPtr addrs,
                                     virDomainDeviceInfoPtr dev)
 {
-    char *addr;
+    char *addr = NULL;
+    virDomainDevicePCIAddressPtr pci = NULL;
 
     addr = qemuPCIAddressAsString(dev);
     if (!addr)
-        return -1;
+        goto error;
+
+    if (VIR_ALLOC(pci) < 0) {
+        virReportOOMError();
+        goto error;
+    }
+    *pci = dev->addr.pci;
 
     VIR_DEBUG("Reserving PCI addr %s", addr);
 
     if (virHashLookup(addrs->used, addr)) {
         qemuReportError(VIR_ERR_INTERNAL_ERROR,
                         _("unable to reserve PCI address %s"), addr);
-        VIR_FREE(addr);
-        return -1;
+        goto error;
     }
 
-    if (virHashAddEntry(addrs->used, addr, addr)) {
-        VIR_FREE(addr);
-        return -1;
-    }
+    if (virHashAddEntry(addrs->used, addr, pci))
+        goto error;
+
+    VIR_FREE(addr);
 
     if (dev->addr.pci.slot > addrs->nextslot) {
         addrs->nextslot = dev->addr.pci.slot + 1;
@@ -2161,6 +2181,11 @@ int qemuDomainPCIAddressReserveAddr(qemuDomainPCIAddressSetPtr addrs,
     }
 
     return 0;
+
+error:
+    VIR_FREE(addr);
+    VIR_FREE(pci);
+    return -1;
 }
 
 int qemuDomainPCIAddressReserveSlot(qemuDomainPCIAddressSetPtr addrs,
@@ -2226,11 +2251,12 @@ int qemuDomainPCIAddressSetNextAddr(qemuDomainPCIAddressSetPtr addrs,
 {
     int i;
     int iteration;
+    char *addr = NULL;
+    virDomainDevicePCIAddressPtr pci = NULL;
 
     for (i = addrs->nextslot, iteration = 0;
          iteration <= QEMU_PCI_ADDRESS_LAST_SLOT; i++, iteration++) {
         virDomainDeviceInfo maybe;
-        char *addr;
 
         if (QEMU_PCI_ADDRESS_LAST_SLOT < i)
             i = 0;
@@ -2239,7 +2265,8 @@ int qemuDomainPCIAddressSetNextAddr(qemuDomainPCIAddressSetPtr addrs,
         maybe.addr.pci.bus = 0;
         maybe.addr.pci.slot = i;
 
-        addr = qemuPCIAddressAsString(&maybe);
+        if (!(addr = qemuPCIAddressAsString(&maybe)))
+            goto error;
 
         if (virHashLookup(addrs->used, addr)) {
             VIR_DEBUG("PCI addr %s already in use", addr);
@@ -2247,17 +2274,20 @@ int qemuDomainPCIAddressSetNextAddr(qemuDomainPCIAddressSetPtr addrs,
             continue;
         }
 
+        if (VIR_ALLOC(pci) < 0) {
+            virReportOOMError();
+            goto error;
+        }
+        *pci = maybe.addr.pci;
+
         VIR_DEBUG("Allocating PCI addr %s", addr);
 
-        if (virHashAddEntry(addrs->used, addr, addr) < 0) {
-            VIR_FREE(addr);
-            return -1;
-        }
+        if (virHashAddEntry(addrs->used, addr, pci) < 0)
+            goto error;
+        VIR_FREE(addr);
 
         dev->type = VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI;
-        dev->addr.pci.domain = 0;
-        dev->addr.pci.bus = 0;
-        dev->addr.pci.slot = i;
+        dev->addr.pci = maybe.addr.pci;
 
         addrs->nextslot = i + 1;
         if (QEMU_PCI_ADDRESS_LAST_SLOT < addrs->nextslot)
@@ -2268,6 +2298,10 @@ int qemuDomainPCIAddressSetNextAddr(qemuDomainPCIAddressSetPtr addrs,
     qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                     _("No more available PCI addresses"));
+
+error:
+    VIR_FREE(addr);
+    VIR_FREE(pci);
     return -1;
 }
--
1.7.2

On 08/16/2010 11:08 AM, Jiri Denemark wrote:
Qemu driver stores a per-domain hash of used PCI addresses. Currently, payload of each hash entry is exactly the same as its key, i.e., string representation of a PCI address. This patch changes payload to be virDomainDevicePCIAddressPtr. Along the road, return values of all qemuPCIAddressAsString() calls are properly checked and error returned if required.
---
 src/qemu/qemu_conf.c |   78 ++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 56 insertions(+), 22 deletions(-)
ACK.

--
Eric Blake

Qemu driver stores a per-domain hash of used PCI addresses. Currently, payload of each hash entry is exactly the same as its key, i.e., string representation of a PCI address. This patch changes payload to be virDomainDevicePCIAddressPtr. Along the road, return values of all qemuPCIAddressAsString() calls are properly checked and error returned if required.
---
 src/qemu/qemu_conf.c |   78 ++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 56 insertions(+), 22 deletions(-)
ACK.
NACK. I'm explicitly nacking the patch here to make it clear it was intentionally not pushed. Most of the patch is not needed since 2/2 will not be applied. The important parts of this patch were sent separately as small, non-invasive patches.

Jirka

---
 src/qemu/qemu_conf.c   |   40 +++++++++++++++++++++++++++++++++++-----
 src/qemu/qemu_conf.h   |    1 +
 src/qemu/qemu_driver.c |    2 ++
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index 38d28bf..bf950f2 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -2066,6 +2066,18 @@ struct _qemuDomainPCIAddressSet {
 };
 
 
+static void
+qemuDomainPCIAddressSetUpdateNextSlot(qemuDomainPCIAddressSetPtr addrs,
+                                      const virDomainDevicePCIAddressPtr pci)
+{
+    if (pci->slot > addrs->nextslot) {
+        addrs->nextslot = pci->slot + 1;
+        if (QEMU_PCI_ADDRESS_LAST_SLOT < addrs->nextslot)
+            addrs->nextslot = 0;
+    }
+}
+
+
 static char *qemuPCIAddressAsString(virDomainDeviceInfoPtr dev)
 {
     char *addr;
@@ -2174,11 +2186,7 @@ int qemuDomainPCIAddressReserveAddr(qemuDomainPCIAddressSetPtr addrs,
     VIR_FREE(addr);
 
-    if (dev->addr.pci.slot > addrs->nextslot) {
-        addrs->nextslot = dev->addr.pci.slot + 1;
-        if (QEMU_PCI_ADDRESS_LAST_SLOT < addrs->nextslot)
-            addrs->nextslot = 0;
-    }
+    qemuDomainPCIAddressSetUpdateNextSlot(addrs, pci);
 
     return 0;
@@ -2236,6 +2244,28 @@ int qemuDomainPCIAddressReleaseAddr(qemuDomainPCIAddressSetPtr addrs,
 }
 
 
+static void
+qemuDomainPCIAddressSetUpdateIter(void *payload,
+                                  const char *name ATTRIBUTE_UNUSED,
+                                  void *data)
+{
+    qemuDomainPCIAddressSetUpdateNextSlot(data, payload);
+}
+
+
+void
+qemuDomainPCIAddressSetUpdate(qemuDomainPCIAddressSetPtr addrs)
+{
+
+    if (!addrs)
+        return;
+
+    virHashForEach(addrs->used, qemuDomainPCIAddressSetUpdateIter, addrs);
+
+    VIR_DEBUG("nextslot updated to %d", addrs->nextslot);
+}
+
+
 void qemuDomainPCIAddressSetFree(qemuDomainPCIAddressSetPtr addrs)
 {
     if (!addrs)
diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
index 1aa9d2e..fb93e0e 100644
--- a/src/qemu/qemu_conf.h
+++ b/src/qemu/qemu_conf.h
@@ -330,6 +330,7 @@ int qemuDomainPCIAddressEnsureAddr(qemuDomainPCIAddressSetPtr addrs,
 int qemuDomainPCIAddressReleaseAddr(qemuDomainPCIAddressSetPtr addrs,
                                     virDomainDeviceInfoPtr dev);
 
+void qemuDomainPCIAddressSetUpdate(qemuDomainPCIAddressSetPtr addrs);
 void qemuDomainPCIAddressSetFree(qemuDomainPCIAddressSetPtr addrs);
 int qemuAssignDevicePCISlots(virDomainDefPtr def,
                              qemuDomainPCIAddressSetPtr addrs);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index dbc3d5c..5a52549 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1461,6 +1461,8 @@ qemuReconnectDomain(void *payload, const char *name ATTRIBUTE_UNUSED, void *opaq
         if (!(priv->pciaddrs = qemuDomainPCIAddressSetCreate(obj->def)) ||
             qemuAssignDevicePCISlots(obj->def, priv->pciaddrs) < 0)
             goto error;
+
+        qemuDomainPCIAddressSetUpdate(priv->pciaddrs);
     }
 
     if (driver->securityDriver &&
--
1.7.2

On 08/16/2010 11:08 AM, Jiri Denemark wrote:
+void
+qemuDomainPCIAddressSetUpdate(qemuDomainPCIAddressSetPtr addrs)
+{
+
+    if (!addrs)
+        return;
+
+    virHashForEach(addrs->used, qemuDomainPCIAddressSetUpdateIter, addrs);
Does virHashForEach visit the hash table in the same order in which entries were inserted, or does it visit in random order based on how the hashes map to buckets? If the latter, then how are you guaranteeing that you are setting nextslot to the right value, if you don't know if you are visiting the correct table entry last?
+
+    VIR_DEBUG("nextslot updated to %d", addrs->nextslot);
+}
+
--
Eric Blake

On Tue, Aug 17, 2010 at 09:11:30AM -0600, Eric Blake wrote:
On 08/16/2010 11:08 AM, Jiri Denemark wrote:
+void
+qemuDomainPCIAddressSetUpdate(qemuDomainPCIAddressSetPtr addrs)
+{
+
+    if (!addrs)
+        return;
+
+    virHashForEach(addrs->used, qemuDomainPCIAddressSetUpdateIter, addrs);
Does virHashForEach visit the hash table in the same order in which entries were inserted, or does it visit in random order based on how the hashes map to buckets? If the latter, then how are you guaranteeing that you are setting nextslot to the right value, if you don't know if you are visiting the correct table entry last?
The hash iterator visits in an unpredictable order. It shouldn't matter because for this usage, all that is important is that 'nextslot' eventually ends up with the largest slot ID + 1.

Regards,
Daniel
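As an aside, the order-independence Daniel describes can be checked with a small standalone sketch (this is not libvirt code; the slot values are made up, and the update rule just mirrors the comparison used by qemuDomainPCIAddressSetUpdateNextSlot in the patch): feeding the same slots in two different orders produces the same result, because the rule only ever raises nextslot toward the maximum.

#include <stdio.h>

#define LAST_SLOT 31    /* stand-in for QEMU_PCI_ADDRESS_LAST_SLOT */

/* Only raise nextslot when a larger slot is seen, wrapping past the end. */
static void
update_next_slot(int *nextslot, int slot)
{
    if (slot > *nextslot) {
        *nextslot = slot + 1;
        if (LAST_SLOT < *nextslot)
            *nextslot = 0;
    }
}

int
main(void)
{
    int insertion_order[] = { 3, 7, 5 };
    int bucket_order[] = { 5, 3, 7 };   /* arbitrary hash iteration order */
    int next_a = 0, next_b = 0;
    int i;

    for (i = 0; i < 3; i++)
        update_next_slot(&next_a, insertion_order[i]);
    for (i = 0; i < 3; i++)
        update_next_slot(&next_b, bucket_order[i]);

    /* Both print 8: the largest slot (7) plus one. */
    printf("insertion order: %d, hash order: %d\n", next_a, next_b);
    return 0;
}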

On 08/17/2010 09:18 AM, Daniel P. Berrange wrote:
On Tue, Aug 17, 2010 at 09:11:30AM -0600, Eric Blake wrote:
On 08/16/2010 11:08 AM, Jiri Denemark wrote:
+void
+qemuDomainPCIAddressSetUpdate(qemuDomainPCIAddressSetPtr addrs)
+{
+
+    if (!addrs)
+        return;
+
+    virHashForEach(addrs->used, qemuDomainPCIAddressSetUpdateIter, addrs);
Does virHashForEach visit the hash table in the same order in which entries were inserted, or does it visit in random order based on how the hashes map to buckets? If the latter, then how are you guaranteeing that you are setting nextslot to the right value, if you don't know if you are visiting the correct table entry last?
The hash iterator visits in an unpredictable order. It shouldn't matter because for this usage, all that is important is that 'nextslot' eventually ends up with the largest slot ID + 1.
We don't want the first unused slot, but rather the last slot reservation that was in place before the reconnect (often the first unused slot if you haven't done a lot of hot-unplugging, but not necessarily).

The only way I can see to get at this is to revisit the reservations in the same order as they were reserved, rather than in the random order returned by the hash table, or at least to keep track of whether each pci reservation has had subsequent reservations vs. being the last reservation in the chain. qemuDomainPCIAddressSetUpdateIter currently sets nextslot to pci->slot+1 for each pci, without regard to the order in which the various pci reservations are visited; but it seems to me that it should _only_ call qemuDomainPCIAddressSetUpdateNextSlot once, on just the particular pci that is provably the last reservation in the original chain of reservations.

--
Eric Blake
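One way to express the alternative Eric sketches -- knowing which reservation was last without replaying the insertion order -- would be to record it explicitly at reservation time. The snippet below is purely illustrative; the struct, field, and function names are invented and are not part of the patch.

#include <stdio.h>

/* Hypothetical sketch, not libvirt code: remember the most recent
 * reservation explicitly, so restoring nextslot does not depend on the
 * hash table's iteration order. */
struct addr_set {
    int nextslot;        /* next slot to try for hotplug */
    int last_reserved;   /* slot of the most recent reservation, -1 if none */
};

static void
reserve_slot(struct addr_set *set, int slot)
{
    /* ... the real code would also insert the address into the hash ... */
    set->last_reserved = slot;
    set->nextslot = slot + 1;   /* wrap handling omitted for brevity */
}

static void
restore_next_slot(struct addr_set *set)
{
    if (set->last_reserved >= 0)
        set->nextslot = set->last_reserved + 1;
}

int
main(void)
{
    struct addr_set set = { 0, -1 };

    reserve_slot(&set, 5);
    reserve_slot(&set, 3);   /* an explicit address below the current max */
    set.nextslot = 0;        /* simulate the value lost across a restart */
    restore_next_slot(&set);
    printf("nextslot restored to %d\n", set.nextslot);   /* prints 4, not 6 */
    return 0;
}

Note that the result (4) differs from "largest slot + 1" (6), which is exactly the distinction Eric is drawing; and last_reserved would itself have to survive the libvirtd restart, which leads to the persistence idea raised in the next message.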

The hash iterator visits in an unpredictable order. It shouldn't matter because for this usage, all that is important is that 'nextslot' eventually ends up with the largest slot ID + 1.
We don't want the first unused slot, rather the last slot reservation that was in place before the reconnect (often the first unused slot if you haven't done a lot of hot-unplugging, but not necessarily the case). The only way I can see to get at this is to revisit the reservations in the same order as they were reserved, rather than in the random order
Actually, thinking about it, you are right that setting nextslot to the largest slot used + 1 is not always correct. Even when no PCI devices are added/removed after a guest is started, nextslot doesn't have to look like that, e.g., if there is a device with an explicit PCI address 0:0:31 in the XML. Although in normal configurations it would work.

However, if we start thinking about hot(un)plugging PCI devices, even your suggestion of revisiting reservations in the original order wouldn't help much, since the last device which influenced the nextslot value might have already been unplugged from the guest. The only solution which really updates nextslot to exactly the same value it had before restarting libvirt is storing it in the runtime XML inside <domstatus>, from where it can simply be read when the domain is being reconnected.

Jirka
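As an illustration only -- this is not libvirt's status-XML code, and the <nextslot> element is hypothetical -- the round trip Jirka describes (write the value when the status file is saved, read it back on reconnect) could look roughly like this:

#include <stdio.h>

/* Hypothetical helpers: format nextslot into a status-XML-like line and
 * parse it back. Element name and formatting are made up. */
static int
format_nextslot(char *buf, size_t buflen, int nextslot)
{
    return snprintf(buf, buflen, "  <nextslot>%d</nextslot>\n", nextslot);
}

static int
parse_nextslot(const char *line, int *nextslot)
{
    return sscanf(line, " <nextslot>%d</nextslot>", nextslot) == 1 ? 0 : -1;
}

int
main(void)
{
    char line[64];
    int restored = 0;

    format_nextslot(line, sizeof(line), 8);      /* written on status save */
    if (parse_nextslot(line, &restored) == 0)    /* read back on reconnect */
        printf("restored nextslot = %d\n", restored);   /* prints 8 */
    return 0;
}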

On Tue, Aug 17, 2010 at 11:37:34PM +0200, Jiri Denemark wrote:
The hash iterator visits in an unpredictable order. It shouldn't matter because for this usage, all that is important is that 'nextslot' eventually ends up with the largest slot ID + 1.
We don't want the first unused slot, rather the last slot reservation that was in place before the reconnect (often the first unused slot if you haven't done a lot of hot-unplugging, but not necessarily the case). The only way I can see to get at this is to revisit the reservations in the same order as they were reserved, rather than in the random order
Actually, thinking about it you are right that setting nextslot to the largest slot used + 1 is not always correct. Even when no PCI devices are added/removed after a guest is started, nextslot doesn't have to look like that. E.g., if there is a device with explicit PCI address 0:0:31 in the XML. Although in normal configurations it would work.
However, if we start thinking about hot(un)plugging PCI devices, even your suggestion of revisiting reservations in the original order wouldn't help much since the last device which influenced nextslot value might have been already unplugged from the guest. The only solution which really updates nextslot to exactly the same value it had before restarting libvirt is storing it in the runtime XML inside <domstatus>, from where it can be just read when the domain is being reconnected.
Why do we need it to be exactly the same value? nextslot is just an efficiency optimization, isn't it? I.e., instead of starting from slot 0 and iterating over 'N' already used slots till we find a free slot, we can get the next free slot in one step. As such, do we really need to worry about restoring it to the same value after restarting libvirtd? Worst case, the first PCI hotplug needs to do an O(N) loop over PCI devices to find an unused one; thereafter all the hotplug ops will be back to O(1) because nextslot will be set again.

Daniel
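The worst case Daniel describes -- one O(N) pass after the restart, O(1) afterwards -- is essentially what the existing allocation loop already does. A stripped-down, self-contained sketch of that kind of scan (made-up array and names, not the actual qemu driver code, which consults a hash of address strings) looks like this:

#include <stdio.h>

#define LAST_SLOT 31    /* stand-in for QEMU_PCI_ADDRESS_LAST_SLOT */

static int used[LAST_SLOT + 1];    /* 1 = slot already reserved */

/* Scan at most LAST_SLOT + 1 slots starting from *nextslot, wrapping at
 * the end of the bus; returns the allocated slot, or -1 if all are taken. */
static int
alloc_next_slot(int *nextslot)
{
    int i, iteration;

    for (i = *nextslot, iteration = 0;
         iteration <= LAST_SLOT; i++, iteration++) {
        if (i > LAST_SLOT)
            i = 0;
        if (used[i])
            continue;               /* worst case after a restart: O(N) */
        used[i] = 1;
        *nextslot = (i + 1 > LAST_SLOT) ? 0 : i + 1;
        return i;                   /* later calls are O(1) again */
    }
    return -1;
}

int
main(void)
{
    int nextslot = 0;   /* as if it had been reset by a libvirtd restart */
    int i;

    for (i = 0; i <= 5; i++)
        used[i] = 1;    /* slots already taken by the running guest */

    printf("first hotplug gets slot %d\n", alloc_next_slot(&nextslot));   /* 6, after scanning 0-5 */
    printf("second hotplug gets slot %d\n", alloc_next_slot(&nextslot));  /* 7, found immediately */
    return 0;
}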

Why do we need it to be exactly the same value ? nextslot is just an efficiency optimization isn't it. ie, so instead of starting from slot 0 and iterating over 'N' already used slots till we find a free slot, we can get the next free slot in 1 step. As such do we really need to worry about restoring it to the same value after restarting libvirtd.
That was my understanding too. But Eric was concerned (in an older thread) about hotplugging PCI devices in a nonmonotonic way; he thinks it could upset Windows guests. Of course, if nextslot ever wraps from QEMU_PCI_ADDRESS_LAST_SLOT back to zero, such guests would be doomed anyway, so we are only being a bit nicer to them. I don't know if this is a real issue or not, since I haven't met a Windows guest into which I'd like to hotplug a PCI device.

Jirka

On Wed, Aug 18, 2010 at 01:15:55PM +0200, Jiri Denemark wrote:
Why do we need it to be exactly the same value ? nextslot is just an efficiency optimization isn't it. ie, so instead of starting from slot 0 and iterating over 'N' already used slots till we find a free slot, we can get the next free slot in 1 step. As such do we really need to worry about restoring it to the same value after restarting libvirtd.
That was my understanding too. But Eric was concerned (in an older thread) about hotplugging PCI devices in a nonmonotonic way. He thinks it could upset Windows guests. Of course, if nextslot ever wraps from QEMU_PCI_ADDRESS_LAST_SLOT back to zero, such guests would be doomed anyway so we are only a bit nicer to them. I don't know if this is a real issue or not since I haven't met a Windows guest which I'd like to hotplug a PCI device in.
There's no requirement to plug devices in ascending slot order - we can have gaps at will with any ordering.

Daniel

On 08/18/2010 05:26 AM, Daniel P. Berrange wrote:
On Wed, Aug 18, 2010 at 01:15:55PM +0200, Jiri Denemark wrote:
Why do we need it to be exactly the same value ? nextslot is just an efficiency optimization isn't it. ie, so instead of starting from slot 0 and iterating over 'N' already used slots till we find a free slot, we can get the next free slot in 1 step. As such do we really need to worry about restoring it to the same value after restarting libvirtd.
That was my understanding too. But Eric was concerned (in an older thread) about hotplugging PCI devices in a nonmonotonic way. He thinks it could upset Windows guests. Of course, if nextslot ever wraps from QEMU_PCI_ADDRESS_LAST_SLOT back to zero, such guests would be doomed anyway so we are only a bit nicer to them. I don't know if this is a real issue or not since I haven't met a Windows guest which I'd like to hotplug a PCI device in.
There's no requirement to plug devices in ascending slot order - we can have gaps at will with any ordering.
At this point, I'm starting to think that we can just drop this 2/2 patch and not worry about nextslot being stable across libvirtd restarts.

My original concern was for a Windows VM created against an older qemu with no hot-plugging support, then being updated to a newer libvirt: there, nextslot must be incremented in the same order when firing up the VM from XML so that Windows won't complain. But that only affects start-up; once the guest is up and running, nextslot only matters for hotplugged slots, and based on my problem definition, this was for a VM defined in the days before hotplug support, now being ported to newer underlying tools. Even if nextslot is reset to 0 after a libvirtd restart, that should only have an impact on future hot-plugged devices, and not on any of the original devices reserved by the XML. And if Windows can already handle hot-plugging, then it shouldn't matter which slot a hot-plug occurs on, even if it is a different slot after the libvirtd restart than it would have been if libvirtd had been constantly running.

If anyone can prove us wrong with an actual bug report, we can deal with the issue then. But for now, let's just drop this as over-engineering a solution for a non-problem.

--
Eric Blake

There's no requirement to plug devices in ascending slot order - we can have gaps at will with any ordering.
At this point, I'm starting to think that we can just drop this 2/2 patch and not worry about nextslot being stable across libvirtd restarts.
Which means we don't even need most of 1/2 since the reason for changing the hash payload to be a structure instead of a string was this second patch. So what do you think, should I push it as is or make a smaller patch which would just fix OOM checking when PCI addresses are converted to strings?

Jirka

On Thu, Aug 19, 2010 at 04:57:17PM +0200, Jiri Denemark wrote:
There's no requirement to plug devices in ascending slot order - we can have gaps at will with any ordering.
At this point, I'm starting to think that we can just drop this 2/2 patch and not worry about nextslot being stable across libvirtd restarts.
Which means we don't even need most of 1/2 since the reason for changing the hash payload to be a structure instead of a string was this second patch. So what do you think, should I push it as is or make a smaller patch which would just fix OOM checking when PCI addresses are converted to strings?
Yep, let's do a simpler patch.

Daniel