[libvirt] [PATCH] nwfilter: fix deadlock caused by updating network device and nwfilter

Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.

The deadlock was introduced by the removal of the global QEMU driver lock, because nwfilter was counting on that lock to ensure that all driver locks were held inside nwfilter{Define,Undefine}.

This patch extends the usage of virNWFilterReadLockFilterUpdates to prevent the deadlock on all possible paths in the QEMU driver. The LXC and UML drivers still have a global lock.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1143780
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
---
This is a temporary fix for the deadlock issue, as I'm planning to create global libvirt jobs (similar to QEMU domain jobs) and use them for other drivers and, for example, for nwfilters too.

 src/qemu/qemu_driver.c    | 12 ++++++++++++
 src/qemu/qemu_migration.c |  3 +++
 src/qemu/qemu_process.c   |  4 ++++
 3 files changed, 19 insertions(+)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 6acaea8..9e6f505 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -5937,6 +5937,8 @@ qemuDomainRestoreFlags(virConnectPtr conn,
         def = tmp;
     }
 
+    virNWFilterReadLockFilterUpdates();
+
     if (!(vm = virDomainObjListAdd(driver->domains, def,
                                    driver->xmlopt,
                                    VIR_DOMAIN_OBJ_LIST_ADD_LIVE |
@@ -5978,6 +5980,7 @@ qemuDomainRestoreFlags(virConnectPtr conn,
     virFileWrapperFdFree(wrapperFd);
     if (vm)
         virObjectUnlock(vm);
+    virNWFilterUnlockFilterUpdates();
     return ret;
 }
 
@@ -7502,6 +7505,8 @@ static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
     affect = flags & (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
 
+    virNWFilterReadLockFilterUpdates();
+
     if (!(caps = virQEMUDriverGetCapabilities(driver, false)))
         goto cleanup;
 
@@ -7614,6 +7619,7 @@ static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
     virObjectUnlock(vm);
     virObjectUnref(caps);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
     return ret;
 }
 
@@ -7644,6 +7650,8 @@ static int qemuDomainUpdateDeviceFlags(virDomainPtr dom,
                   VIR_DOMAIN_AFFECT_CONFIG |
                   VIR_DOMAIN_DEVICE_MODIFY_FORCE, -1);
 
+    virNWFilterReadLockFilterUpdates();
+
     cfg = virQEMUDriverGetConfig(driver);
 
     affect = flags & (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
@@ -7760,6 +7768,7 @@ static int qemuDomainUpdateDeviceFlags(virDomainPtr dom,
     virObjectUnlock(vm);
     virObjectUnref(caps);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
     return ret;
 }
 
@@ -14510,6 +14519,8 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
      * and use of FORCE can cause multiple transitions.
      */
 
+    virNWFilterReadLockFilterUpdates();
+
     if (!(vm = qemuDomObjFromSnapshot(snapshot)))
         return -1;
 
@@ -14831,6 +14842,7 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
     virObjectUnlock(vm);
     virObjectUnref(caps);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
     return ret;
 }

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 94a4cf6..18242ae 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -2666,6 +2666,8 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
+    virNWFilterReadLockFilterUpdates();
+
     if (!(vm = virDomainObjListAdd(driver->domains, *def,
                                    driver->xmlopt,
                                    VIR_DOMAIN_OBJ_LIST_ADD_LIVE |
@@ -2825,6 +2827,7 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver,
         qemuDomainEventQueue(driver, event);
     qemuMigrationCookieFree(mig);
     virObjectUnref(caps);
+    virNWFilterUnlockFilterUpdates();
     return ret;
 
 stop:

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 26d4948..409a672 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -3422,6 +3422,8 @@ qemuProcessReconnect(void *opaque)
     VIR_FREE(data);
 
+    virNWFilterReadLockFilterUpdates();
+
     virObjectLock(obj);
 
     cfg = virQEMUDriverGetConfig(driver);
@@ -3573,6 +3575,7 @@ qemuProcessReconnect(void *opaque)
     virObjectUnref(conn);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
 
     return;
 
@@ -3608,6 +3611,7 @@ qemuProcessReconnect(void *opaque)
     }
     virObjectUnref(conn);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
 }
 
 static int
-- 
2.0.4
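[Editor's note: to make the inverted lock ordering concrete before the review below, here is a rough standalone model of the deadlock. The thread bodies are hypothetical and use raw POSIX primitives rather than libvirt code (libvirt's virRWLock wraps pthread rwlocks); it is a sketch of the ordering problem, not the actual call paths.]

#include <pthread.h>

static pthread_rwlock_t filter_updates = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;

/* Device attach/update path before this patch. */
static void *attach_path(void *opaque)
{
    (void)opaque;
    pthread_mutex_lock(&domain_lock);        /* vm object locked first */
    pthread_rwlock_rdlock(&filter_updates);  /* blocks while a define holds
                                              * the write lock */
    /* ... instantiate filters for the new interface ... */
    pthread_rwlock_unlock(&filter_updates);
    pthread_mutex_unlock(&domain_lock);
    return NULL;
}

/* nwfilterDefine/Undefine path. */
static void *define_path(void *opaque)
{
    (void)opaque;
    pthread_rwlock_wrlock(&filter_updates);  /* write lock taken first */
    pthread_mutex_lock(&domain_lock);        /* blocks on the attach thread:
                                              * deadlock */
    /* ... reinstantiate filters for each running domain ... */
    pthread_mutex_unlock(&domain_lock);
    pthread_rwlock_unlock(&filter_updates);
    return NULL;
}

/* The patch's fix is to take the filter-updates read lock before the
 * domain lock in every affected QEMU driver entry point, restoring a
 * single global ordering between the two locks. */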

On 11/05/2014 09:02 AM, Pavel Hrdina wrote:
Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.
The referenced commit has a roadmap of sorts of what the lock ordering is; perhaps you could add similar before-and-after details here. While more of a pain logistically, separating the patches into single changes may make it easier to list the bad locking order followed by the order after the commit's changes. The ones that are most odd are the places where you take the lock in the middle of a function. That seems to go against the original commit, which takes it right after the flags check. In particular, that's the qemuDomainRestoreFlags and qemuMigrationPrepareAny changes.
The deadlock was introduced by the removal of the global QEMU driver lock, because nwfilter was counting on that lock to ensure that all driver locks were held inside nwfilter{Define,Undefine}.
This patch extends the usage of virNWFilterReadLockFilterUpdates to prevent the deadlock on all possible paths in the QEMU driver. The LXC and UML drivers still have a global lock.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1143780
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
---
This is a temporary fix for the deadlock issue, as I'm planning to create global libvirt jobs (similar to QEMU domain jobs) and use them for other drivers and, for example, for nwfilters too.

 src/qemu/qemu_driver.c    | 12 ++++++++++++
 src/qemu/qemu_migration.c |  3 +++
 src/qemu/qemu_process.c   |  4 ++++
 3 files changed, 19 insertions(+)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 6acaea8..9e6f505 100644
[1] Usage around the virQEMUDriverGetConfig() calls isn't consistent. In particular, qemuDomainAttachDeviceFlags() will lock/unlock in a different order.
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -5937,6 +5937,8 @@ qemuDomainRestoreFlags(virConnectPtr conn,
         def = tmp;
     }
 
+    virNWFilterReadLockFilterUpdates();
+
So no other locks are taken up to this point? And is there perhaps no need to lock earlier, to play follow-the-leader with the original commit, where the lock was taken right after the various flags checks?
     if (!(vm = virDomainObjListAdd(driver->domains, def,
                                    driver->xmlopt,
                                    VIR_DOMAIN_OBJ_LIST_ADD_LIVE |
@@ -5978,6 +5980,7 @@ qemuDomainRestoreFlags(virConnectPtr conn,
     virFileWrapperFdFree(wrapperFd);
     if (vm)
         virObjectUnlock(vm);
+    virNWFilterUnlockFilterUpdates();
So this could get called without the lock ever having been taken, is that correct? We can get to cleanup without calling the ReadLock, which is probably not a good thing, since the behavior is indeterminate according to the man page.
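[Editor's note: as a concrete illustration of that hazard, here is a minimal standalone sketch with raw POSIX rwlocks; the function name is made up and this is not libvirt code. POSIX leaves pthread_rwlock_unlock() undefined when the calling thread does not hold the lock, so any goto-cleanup path has to be balanced.]

#include <pthread.h>

static pthread_rwlock_t updates = PTHREAD_RWLOCK_INITIALIZER;

static int restore_flags(int bad_input)
{
    int ret = -1;

    if (bad_input)
        goto cleanup;              /* jumps past the rdlock below */

    pthread_rwlock_rdlock(&updates);
    /* ... work done under the read lock ... */
    ret = 0;

 cleanup:
    /* Undefined behavior if reached via the early goto: this thread
     * never acquired the lock it is now releasing. */
    pthread_rwlock_unlock(&updates);
    return ret;
}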
     return ret;
 }
 
@@ -7502,6 +7505,8 @@ static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
     affect = flags & (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
 
+    virNWFilterReadLockFilterUpdates();
+
[1] This one is done after the GetConfig
     if (!(caps = virQEMUDriverGetCapabilities(driver, false)))
         goto cleanup;
 
@@ -7614,6 +7619,7 @@ static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
     virObjectUnlock(vm);
     virObjectUnref(caps);
     virObjectUnref(cfg);
[1] But the release is done after the cfg unref
+    virNWFilterUnlockFilterUpdates();
     return ret;
 }
 
@@ -7644,6 +7650,8 @@ static int qemuDomainUpdateDeviceFlags(virDomainPtr dom,
                   VIR_DOMAIN_AFFECT_CONFIG |
                   VIR_DOMAIN_DEVICE_MODIFY_FORCE, -1);
 
+    virNWFilterReadLockFilterUpdates();
+
[1] This one is done before the GetConfig
cfg = virQEMUDriverGetConfig(driver);
     affect = flags & (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG);
@@ -7760,6 +7768,7 @@ static int qemuDomainUpdateDeviceFlags(virDomainPtr dom,
     virObjectUnlock(vm);
     virObjectUnref(caps);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
[1] Release done after cfg unref
     return ret;
 }
 
@@ -14510,6 +14519,8 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
      * and use of FORCE can cause multiple transitions.
      */
 
+    virNWFilterReadLockFilterUpdates();
+
[1] This one is done before GetConfig
     if (!(vm = qemuDomObjFromSnapshot(snapshot)))
         return -1;
 
@@ -14831,6 +14842,7 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
     virObjectUnlock(vm);
     virObjectUnref(caps);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
[1] Release done after cfg unref
     return ret;
 }

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 94a4cf6..18242ae 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -2666,6 +2666,8 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
+    virNWFilterReadLockFilterUpdates();
+
Lots of code before we get here; what's the locking up to this point?
     if (!(vm = virDomainObjListAdd(driver->domains, *def,
                                    driver->xmlopt,
                                    VIR_DOMAIN_OBJ_LIST_ADD_LIVE |
@@ -2825,6 +2827,7 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver,
         qemuDomainEventQueue(driver, event);
     qemuMigrationCookieFree(mig);
     virObjectUnref(caps);
+    virNWFilterUnlockFilterUpdates();
Yet another case where we could reach the cleanup without having ever obtained the ReadLock
return ret;
 stop:

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 26d4948..409a672 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
This one seems OK, although I think I'd make it a separate patch and, of course, explain the lock ordering change.

John
@@ -3422,6 +3422,8 @@ qemuProcessReconnect(void *opaque)
VIR_FREE(data);
+    virNWFilterReadLockFilterUpdates();
+
     virObjectLock(obj);
     cfg = virQEMUDriverGetConfig(driver);

@@ -3573,6 +3575,7 @@ qemuProcessReconnect(void *opaque)
     virObjectUnref(conn);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
return;
@@ -3608,6 +3611,7 @@ qemuProcessReconnect(void *opaque)
     }
     virObjectUnref(conn);
     virObjectUnref(cfg);
+    virNWFilterUnlockFilterUpdates();
 }
static int

On 11/11/2014 04:13 PM, John Ferlan wrote:
On 11/05/2014 09:02 AM, Pavel Hrdina wrote:
Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.
The referenced commit has a roadmap of sorts of what the lock ordering is; perhaps you could add similar before-and-after details here.
In that case I can copy and paste that roadmap, as it's the same and this patch only extends the referenced one.
While more of a pain logistically, separating the patches into single changes may make it easier to list the bad locking order followed by the order after the commit's changes.
The ones that are most odd are the places where you take the lock in the middle of a function. That seems to go against the original commit, which takes it right after the flags check. In particular, that's the qemuDomainRestoreFlags and qemuMigrationPrepareAny changes.
The order of the nwfilter lock and the flags check could be the same in all functions, that's a good point. My intention was to take the nwfilter read lock right before the vm is locked, because the deadlock is between the vm and nwfilter objects. However, I don't quite understand why you would like to split the patch into a series. All the paths lead to the same nwfilter call, virDomainConfNWFilterInstantiate, and that's the only function callable outside of nwfilter that could cause the deadlock.

On 11/11/2014 11:34 AM, Pavel Hrdina wrote:
On 11/11/2014 04:13 PM, John Ferlan wrote:
On 11/05/2014 09:02 AM, Pavel Hrdina wrote:
Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.
The referenced commit has a roadmap of sorts of what the lock ordering is; perhaps you could add similar before-and-after details here.
In that case I can copy and paste that roadmap, as it's the same and this patch only extends the referenced one.
While more of a pain logistically, separating the patches into single changes may make it easier to list the bad locking order followed by the order after the commit's changes.
The ones that are most odd are the places where you take the lock in the middle of a function. That seems to go against the original commit, which takes it right after the flags check. In particular, that's the qemuDomainRestoreFlags and qemuMigrationPrepareAny changes.
The order of the nwfilter lock and the flags check could be the same in all functions, that's a good point. My intention was to take the nwfilter read lock right before the vm is locked, because the deadlock is between the vm and nwfilter objects.
However, I don't quite understand why you would like to split the patch into a series. All the paths lead to the same nwfilter call, virDomainConfNWFilterInstantiate, and that's the only function callable outside of nwfilter that could cause the deadlock.
If the bad locking order is the same for each, then I suppose it's not necessary to copy the commit message and make separate patches. It was mostly a suggestion, considering you're changing 6 APIs. Also, it wasn't perfectly clear what the order is, especially for the Restore/Migration changes, where the lock is taken in the middle of each function. Furthermore, for those cases we can get to cleanup without the lock and still do an unlock, which perhaps doesn't do anything, but it's been a while since I've thought about the guts of thread mutex lock/unlock. It may not occur in these paths, but I seem to remember that if your thread holds the read lock and attempts to acquire the read lock again, the thread code does a refcount++ type operation. The unlock would normally do refcount-- for each and free the lock when it reaches zero. Now, in the case where you're jumping to cleanup, you could decrement a refcount that was never incremented, which would be unexpected. My disclaimer is that this was on a different OS (HP-UX) in a prior job, where a mix of mutex locks and file locks caused all sorts of issues...

John
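[Editor's note: for reference, POSIX rwlocks do behave roughly as described above: a thread may re-acquire a read lock it already holds, and each successful rdlock needs its own matching unlock. A small standalone sketch with plain pthreads, not libvirt code:]

#include <assert.h>
#include <pthread.h>

static pthread_rwlock_t lk = PTHREAD_RWLOCK_INITIALIZER;

int main(void)
{
    assert(pthread_rwlock_rdlock(&lk) == 0);
    assert(pthread_rwlock_rdlock(&lk) == 0);  /* recursive read is allowed */

    pthread_rwlock_unlock(&lk);  /* still read-locked after this */
    pthread_rwlock_unlock(&lk);  /* now fully released */

    /* One more unlock here would be unmatched, which is exactly the
     * undefined-behavior case discussed above. */
    return 0;
}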

On 11/11/2014 06:40 PM, John Ferlan wrote:
On 11/11/2014 11:34 AM, Pavel Hrdina wrote:
On 11/11/2014 04:13 PM, John Ferlan wrote:
On 11/05/2014 09:02 AM, Pavel Hrdina wrote:
Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.
The referenced commit has a roadmap of sorts of what the lock ordering is; perhaps you could add similar before-and-after details here.
In that case I can copy and paste that roadmap, as it's the same and this patch only extends the referenced one.
While more of a pain logistically, separating the patches into single changes may make it easier to list the bad locking order followed by the order after the commit's changes.
The ones that are most odd are the places where you take the lock in the middle of a function. That seems to go against the original commit, which takes it right after the flags check. In particular, that's the qemuDomainRestoreFlags and qemuMigrationPrepareAny changes.
The order of the nwfilter lock and the flags check could be the same in all functions, that's a good point. My intention was to take the nwfilter read lock right before the vm is locked, because the deadlock is between the vm and nwfilter objects.
However, I don't quite understand why you would like to split the patch into a series. All the paths lead to the same nwfilter call, virDomainConfNWFilterInstantiate, and that's the only function callable outside of nwfilter that could cause the deadlock.
If the bad locking order is the same for each, then I suppose it's not necessary to copy the commit message and make separate patches. It was mostly a suggestion, considering you're changing 6 APIs.
Also, it wasn't perfectly clear what the order is, especially for the Restore/Migration changes, where the lock is taken in the middle of each function. Furthermore, for those cases we can get to cleanup without the lock and still do an unlock, which perhaps doesn't do anything, but it's been a while since I've thought about the guts of thread mutex lock/unlock. It may not occur in these paths, but I seem to remember that if your thread holds the read lock and attempts to acquire the read lock again, the thread code does a refcount++ type operation. The unlock would normally do refcount-- for each and free the lock when it reaches zero. Now, in the case where you're jumping to cleanup, you could decrement a refcount that was never incremented, which would be unexpected. My disclaimer is that this was on a different OS (HP-UX) in a prior job, where a mix of mutex locks and file locks caused all sorts of issues...
John
Calling unlock without a previous lock is wrong and should be fixed. I'll review the patch, move the locks to the right places, and send a v2, thanks.

Pavel
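[Editor's note: the balanced shape for such a v2 could look like the sketch below. It shows the control flow only, with a made-up function name and raw pthreads rather than libvirt code: take the read lock unconditionally before any path can jump to cleanup, so the unlock there always has a matching lock.]

#include <pthread.h>

static pthread_rwlock_t updates = PTHREAD_RWLOCK_INITIALIZER;

static int attach_device(int arg)
{
    int ret = -1;

    pthread_rwlock_rdlock(&updates);    /* taken before any goto cleanup */

    if (arg < 0)
        goto cleanup;                   /* safe: the lock is held */

    /* ... look up the vm, instantiate nwfilters, attach the device ... */
    ret = 0;

 cleanup:
    pthread_rwlock_unlock(&updates);    /* always balanced */
    return ret;
}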

On Wed, Nov 05, 2014 at 03:02:03PM +0100, Pavel Hrdina wrote:
Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.
The deadlock was introduced by the removal of the global QEMU driver lock, because nwfilter was counting on that lock to ensure that all driver locks were held inside nwfilter{Define,Undefine}.
This patch extends the usage of virNWFilterReadLockFilterUpdates to prevent the deadlock on all possible paths in the QEMU driver. The LXC and UML drivers still have a global lock.
ACK, conceptually it makes sense that we need to hold the read lock in these methods. Concurrency should still be good because it is only a read lock, not a write lock.

Regards,
Daniel
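[Editor's note: a quick demonstration of the concurrency point, in plain pthreads rather than libvirt code; the "job" names are made up. Both readers can hold the lock at once; only the writer, which models a filter define/undefine, excludes them.]

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t updates = PTHREAD_RWLOCK_INITIALIZER;

static void *reader(void *name)
{
    pthread_rwlock_rdlock(&updates);           /* many readers may enter */
    printf("%s holds the read lock\n", (const char *)name);
    pthread_rwlock_unlock(&updates);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, reader, "attach job A");
    pthread_create(&b, NULL, reader, "attach job B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    pthread_rwlock_wrlock(&updates);           /* a define would wait for
                                                * current readers, then
                                                * exclude new ones */
    pthread_rwlock_unlock(&updates);
    return 0;
}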

On 11/11/2014 04:20 PM, Daniel P. Berrange wrote:
On Wed, Nov 05, 2014 at 03:02:03PM +0100, Pavel Hrdina wrote:
Commit 6e5c79a1 tried to fix a deadlock between nwfilter{Define,Undefine} and the starting of a guest, but the same deadlock also exists for updating/attaching a network device to a domain.
The deadlock was introduced by the removal of the global QEMU driver lock, because nwfilter was counting on that lock to ensure that all driver locks were held inside nwfilter{Define,Undefine}.
This patch extends the usage of virNWFilterReadLockFilterUpdates to prevent the deadlock on all possible paths in the QEMU driver. The LXC and UML drivers still have a global lock.
ACK, conceptually it makes sense that we need to hold the read lock in these methods. Concurrency should still be good because it is only a read lock, not a write lock.
Regards, Daniel
Thanks for the review, I've sent a v2 addressing some code-ordering issues pointed out by John.

Pavel