Re: [PATCH] pci: Refuse to hotplug PCI Devices when the Guest OS is not ready

On Fri, 23 Oct 2020 11:54:40 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
On Fri, Oct 23, 2020 at 09:47:14AM +0300, Marcel Apfelbaum wrote:
Hi David,
On Fri, Oct 23, 2020 at 6:49 AM David Gibson <dgibson@redhat.com> wrote:
On Thu, 22 Oct 2020 11:01:04 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Oct 22, 2020 at 05:50:51PM +0300, Marcel Apfelbaum wrote: > [...] > > Right. After detecting just failing unconditionally it a bit too > simplistic IMHO.
There's also another factor here, which I thought I'd mentioned already, but looks like I didn't: I think we're still missing some details in what's going on.
The premise for this patch is that plugging while the indicator is in transition state is allowed to fail in any way on the guest side. I don't think that's a reasonable interpretation, because it's unworkable for physical hotplug. If the indicator starts blinking while you're in the middle of shoving a card in, you'd be in trouble.
So, what I'm assuming here is that while "don't plug while blinking" is the instruction for the operator to obey as best they can, on the guest side the rule has to be "start blinking, wait a while and by the time you leave blinking state again, you can be confident any plugs or unplugs have completed". Obviously still racy in the strict computer science sense, but about the best you can do with slow humans in the mix.
So, qemu should of course endeavour to follow that rule as though it was a human operator on a physical machine and not plug when the indicator is blinking. *But* the qemu plug will in practice be fast enough that if we're hitting real problems here, it suggests the guest is still doing something wrong.
I personally think there is a little bit of over-engineering here. Let's start with the spec:
  Power Indicator Blinking   A blinking Power Indicator indicates that the slot is powering up or powering down and that   insertion or removal of the adapter is not permitted.
What exactly is an interpretation here? As you stated, the races are theoretical, the whole point of the indicator is to let the operator know he can't plug the device just yet.
I understand it would be more user friendly if the QEMU would wait internally for the blinking to end, but the whole point of the indicator is to let the operator (human or machine) know they can't plug the device at a specific time. Should QEMU take the responsibility of the operator? Is it even correct?
Even if we would want such a feature, how is it related to this patch? The patch simply refuses to start a hotplug operation when it knows it will not succeed.  Another way that would make sense to me would be is a new QEMU interface other than "add_device", let's say "adding_device_allowed", that would return true if the hotplug is allowed at this point of time. (I am aware of the theoretical races)ÂÂ
Rather than adding_device_allowed, something like "query slot" might be helpful for debugging. That would help user figure out e.g. why isn't device visible without any races.
Would be new command useful tough? What we end up is broken guest (if I read commit message right) and a user who has no idea if device_add was successful or not. So what user should do in this case - wait till it explodes? - can user remove it or it would be stuck there forever? - poll slot before hotplug, manually? (if this is the case then failing device_add cleanly doesn't sound bad, it looks similar to another error we have "/* Check if hot-plug is disabled on the slot */" in pcie_cap_slot_pre_plug_cb) CCing libvirt, as it concerns not only QEMU.
The above will at least mimic the mechanics of the pyhs world. The operator looks at the indicator, the management software checks if adding the device is allowed. Since it is a corner case I would prefer the device_add to fail rather than introducing a new interface, but that's just me.
Thanks, Marcel
I think we want QEMU management interface to be reasonably abstract and agnostic if possible. Pushing knowledge of hardware detail to management will just lead to pain IMHO. We supported device_add which practically never fails for years,
For CPUs and RAM, device_add can fail so maybe management is also prepared to handle errors on PCI hotplug path.
at this point it's easier to keep supporting it than change all users ...
-- David Gibson <dgibson@redhat.com> Principal Software Engineer, Virtualization, Red Hat

On Fri, 23 Oct 2020 19:27:55 +0200 Igor Mammedov <imammedo@redhat.com> wrote:
On Fri, 23 Oct 2020 11:54:40 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
[...] [...] [...] [...]
Rather than adding_device_allowed, something like "query slot" might be helpful for debugging. That would help user figure out e.g. why isn't device visible without any races.
Would be new command useful tough? What we end up is broken guest (if I read commit message right) and a user who has no idea if device_add was successful or not. So what user should do in this case - wait till it explodes? - can user remove it or it would be stuck there forever? - poll slot before hotplug, manually?
(if this is the case then failing device_add cleanly doesn't sound bad, it looks similar to another error we have "/* Check if hot-plug is disabled on the slot */" in pcie_cap_slot_pre_plug_cb)
CCing libvirt, as it concerns not only QEMU.
[...] [...]
I think we want QEMU management interface to be reasonably abstract and agnostic if possible. Pushing knowledge of hardware detail to management will just lead to pain IMHO. We supported device_add which practically never fails for years,
For CPUs and RAM, device_add can fail so maybe management is also prepared to handle errors on PCI hotplug path.
There can be unarguable reasons for PCI hotplug to fail as well (attempting to plug to a bus that can't support it for one). The difference here is that it's a failure that we expect to be transitory. -- David Gibson <dgibson@redhat.com> Principal Software Engineer, Virtualization, Red Hat

On Fri, Oct 23, 2020 at 19:27:55 +0200, Igor Mammedov wrote:
On Fri, 23 Oct 2020 11:54:40 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
On Fri, Oct 23, 2020 at 09:47:14AM +0300, Marcel Apfelbaum wrote:
Hi David,
On Fri, Oct 23, 2020 at 6:49 AM David Gibson <dgibson@redhat.com> wrote:
On Thu, 22 Oct 2020 11:01:04 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Oct 22, 2020 at 05:50:51PM +0300, Marcel Apfelbaum wrote: > [...] > > Right. After detecting just failing unconditionally it a bit too > simplistic IMHO.
There's also another factor here, which I thought I'd mentioned already, but looks like I didn't: I think we're still missing some details in what's going on.
The premise for this patch is that plugging while the indicator is in transition state is allowed to fail in any way on the guest side. I don't think that's a reasonable interpretation, because it's unworkable for physical hotplug. If the indicator starts blinking while you're in the middle of shoving a card in, you'd be in trouble.
So, what I'm assuming here is that while "don't plug while blinking" is the instruction for the operator to obey as best they can, on the guest side the rule has to be "start blinking, wait a while and by the time you leave blinking state again, you can be confident any plugs or unplugs have completed". Obviously still racy in the strict computer science sense, but about the best you can do with slow humans in the mix.
So, qemu should of course endeavour to follow that rule as though it was a human operator on a physical machine and not plug when the indicator is blinking. *But* the qemu plug will in practice be fast enough that if we're hitting real problems here, it suggests the guest is still doing something wrong.
I personally think there is a little bit of over-engineering here. Let's start with the spec:
  Power Indicator Blinking   A blinking Power Indicator indicates that the slot is powering up or powering down and that   insertion or removal of the adapter is not permitted.
What exactly is an interpretation here? As you stated, the races are theoretical, the whole point of the indicator is to let the operator know he can't plug the device just yet.
I understand it would be more user friendly if the QEMU would wait internally for the blinking to end, but the whole point of the indicator is to let the operator (human or machine) know they can't plug the device at a specific time. Should QEMU take the responsibility of the operator? Is it even correct?
Even if we would want such a feature, how is it related to this patch? The patch simply refuses to start a hotplug operation when it knows it will not succeed.  Another way that would make sense to me would be is a new QEMU interface other than "add_device", let's say "adding_device_allowed", that would return true if the hotplug is allowed at this point of time. (I am aware of the theoretical races)ÂÂ
Rather than adding_device_allowed, something like "query slot" might be helpful for debugging. That would help user figure out e.g. why isn't device visible without any races.
Would be new command useful tough? What we end up is broken guest (if I read commit message right) and a user who has no idea if device_add was successful or not. So what user should do in this case - wait till it explodes? - can user remove it or it would be stuck there forever? - poll slot before hotplug, manually?
(if this is the case then failing device_add cleanly doesn't sound bad, it looks similar to another error we have "/* Check if hot-plug is disabled on the slot */" in pcie_cap_slot_pre_plug_cb)
CCing libvirt, as it concerns not only QEMU.
The only reason a separate command might make sense is if libvirt would want to provide a specific error to the user/upper management layer that the operation failed due to a transient failure (so that it can be retried later). We don't really want to go to a policy decision of how long to wait in such case, so unless qemu waits libvirt will plainly want to report an error. That said IMO 'device_add' should still fail if it's certain that the device won't be plugged in. This will fix any client which is currently in use. Adding a separate command is worth only for pre-checking for saner error handling.
The above will at least mimic the mechanics of the pyhs world. The operator looks at the indicator, the management software checks if adding the device is allowed. Since it is a corner case I would prefer the device_add to fail rather than introducing a new interface, but that's just me.
Thanks, Marcel
I think we want QEMU management interface to be reasonably abstract and agnostic if possible. Pushing knowledge of hardware detail to management will just lead to pain IMHO. We supported device_add which practically never fails for years,
For CPUs and RAM, device_add can fail so maybe management is also prepared to handle errors on PCI hotplug path.
While it was me who implemented device_add for cpu/memory I don't remmeber any more whether we are really prepared for it. We certainly want to do it if there's a possibility to do it.
participants (3)
-
David Gibson
-
Igor Mammedov
-
Peter Krempa