On Tue, Mar 12, 2019 at 16:13:20 +0100, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1623389
If a device is detached twice from the same domain the following
race condition may happen:
1) The first DetachDevice() call will issue "device_del" on qemu
monitor, but since the DEVICE_DELETED event did not arrive in
time, the API ends claiming "Device detach request sent
successfully".
2) The second DetachDevice() therefore still find the device in
the domain and thus proceeds to detaching it again. It calls
EnterMonitor() and qemuMonitorSend() trying to issue "device_del"
command again. This gets both domain lock and monitor lock
released.
3) At this point, qemu sends us the DEVICE_DELETED event which is
going to be handled by the event loop which ends up calling
qemuDomainSignalDeviceRemoval() to determine who is going to
remove the device from domain definition. Whether it is the
caller that marked the device for removal or whether it is going
to be the event processing thread.
4) Because the device was marked for removal,
qemuDomainSignalDeviceRemoval() returns true, which means the
event is to be processed by the thread that has marked the device
for removal (and is currently still trying to issue "device_del"
command)
5) The thread finally issues the "device_del" command, which
fails (obviously) and therefore it calls
qemuDomainResetDeviceRemoval() to reset the device marking and
quits immediately after, NOT removing any device from the domain
definition.
At this point, the device is still present in the domain
definition but doesn't exist in qemu anymore. Worse, there is no
way to remove it from the domain definition.
Solution is to note down that we've seen the event and if the
second "device_del" fails, not take it as a failure but carry on
with the usual execution.
Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
---
src/qemu/qemu_domain.h | 1 +
src/qemu/qemu_hotplug.c | 83 +++++++++++++++++++++++++++++------------
2 files changed, 60 insertions(+), 24 deletions(-)
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 9f468e5661..fb361515ba 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -218,6 +218,7 @@ typedef qemuDomainUnpluggingDevice *qemuDomainUnpluggingDevicePtr;
struct _qemuDomainUnpluggingDevice {
const char *alias;
qemuDomainUnpluggingDeviceStatus status;
+ bool eventSeen; /* True if DEVICE_DELETED event arrived. */
};
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 574477e916..93c0e14adf 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -70,22 +70,47 @@ unsigned long long qemuDomainRemoveDeviceWaitTime = 1000ull * 5;
/**
* qemuDomainDeleteDevice:
* @mon: qemu monitor
+ * @vm: domain object
* @alias: device to remove
*
* A simple wrapper around qemuMonitorDelDevice().
- * @mon must be locked upon entry.
+ * @mon must be locked upon entry, @vm shan't.
Please use either the more contemporary short form or the long form.
*
* Returns: 0 on success,
* -1 otherwise.
*/
static inline int
qemuDomainDeleteDevice(qemuMonitorPtr mon,
+ virDomainObjPtr vm,
const char *alias)
{
- if (qemuMonitorDelDevice(mon, alias) < 0)
- return -1;
+ qemuDomainObjPrivatePtr priv;
+ int ret = 0;
- return 0;
+ if (qemuMonitorDelDevice(mon, alias) < 0) {
+ if (vm) {
+ /* It is safe to lock and unlock both @mon and @vm
+ * here because:
+ * a) qemuDomainObjEnterMonitor() ensures @mon is
+ * ref()'d
+ * b) The API that is calling us ensures that @vm is
+ * ref()'d
Don't break the two lines.
+ */
+ virObjectUnlock(mon);
+ virObjectLock(vm);
+ priv = vm->privateData;
+ if (priv->unplug.eventSeen)
You can't do this without also clearing priv->unplug.alias. If the event
was not seen at this point you need to make sure that it will
unconditionally be handled via the separate event handler thread prior
to giving up the lock on VM. By not clearing it, you still have another
chance at the event handler thinking the unplug will be handled here
without actually doing so.
+ virResetLastError();
Additionally it would be better if qemuMonitorJSONDelDevice actually
returns the error string or object without reporting it so that we can
decide here not to report it at all rather than resetting it.
+ else
+ ret = -1;
+ virObjectLock(mon);
+ virObjectUnlock(vm);
+ } else {
+ ret = -1;
+ }
+ }
+
+ return ret;