On 09/18/2017 05:08 AM, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1448268
When migrating to a file (e.g. when doing 'virsh save file'),
a couple of things happen in the thread that is executing
the API:
1) the domain obj is locked
2) iohelper is spawned as a separate process to handle all I/O
3) the thread waits for iohelper to finish
4) the domain obj is unlocked
Now, the problem is that while the thread waits in step 3 for
iohelper to finish, this may take ages because iohelper calls
fdatasync(). And unfortunately, we are waiting the whole time
with the domain locked. So if another thread wants to jump in
and, say, copy the domain name ('virsh list' for instance), it
is stuck.
The solution is to unlock the domain object whenever we are waiting
for I/O and lock it again once the I/O has finished.
Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
---
src/qemu/qemu_driver.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)
I would think there'd be more concern about dropping the lock in the
middle and something changing that could more negatively affect things.
There's such a delicate balance between Save/Restore processing now - if
you adjust the locks, that could mean simultaneous usage, right?
Perhaps thinking differently... Does the virdomainobjlist code for
Collect, Filter, Export really need to obtain the domain obj write lock?
If Export is but a "snapshot in time", then using virObject{Lock|Unlock}
perhaps isn't necessary in any of the @vms list processing. Since
virDomainObjListCollectIterator takes a refcnt on the object, it cannot
be freed. Beyond that, any state checking done in virDomainObjListFilter
is negated because the lock isn't held again until the virGetDomain call
is made during virDomainObjListExport to fill in the returned @doms
list. It's quite possible for the 'removing' boolean to get set after
*Filter but before the virGetDomain call is made. So what's the need or
purpose of the write locks then?
If the object locks are removed during Export, then virsh list won't
freeze while other operations that take time occur. Since the RWLock is
taken on the list while @vms is created, we know nothing else can
Add/Remove to/from the list - thus *that* is our snapshot in time.
Anything done after that just reduces what's in that snapshot.
It's "interesting" that the domain export list uses a very different
model than other drivers... That loop taking a refcnt on each object
with the Read lock held is something I could see duplicated for other
drivers - conceptually speaking, of course, if someone were to write
that kind of common processing model ;-)
John
FWIW: The "older" processing model to get the Num of domains followed by
getting a list of domains could continue to do the locking... although
that code could use a cleanup too, but that's a different issue.
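For contrast, that older two-call model (ask for the count, then fill a
caller-sized array) keeps the per-object locking while copying data out.
A hedged sketch, again with invented names rather than the real
num-of-domains/list-domains code:

```c
#include <pthread.h>

struct dom {
    pthread_mutex_t lock;
    int id;
};

/* Step 1: report how many domains there are. */
static int num_domains(struct dom *doms, size_t ndoms)
{
    return (int)ndoms;
}

/* Step 2: fill the caller's array, taking each object's own lock
 * while its data is copied out - the locking the older model keeps. */
static int list_domains(struct dom *doms, size_t ndoms,
                        int *ids, size_t maxids)
{
    size_t n = 0;

    for (size_t i = 0; i < ndoms && n < maxids; i++) {
        pthread_mutex_lock(&doms[i].lock);
        ids[n++] = doms[i].id;
        pthread_mutex_unlock(&doms[i].lock);
    }
    return (int)n;
}
```

The well-known race between the two calls (the count can be stale by
the time the list is fetched) is part of why a cleanup there would be a
separate discussion.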
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index b334cf20b..f81d3def4 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -3216,6 +3216,25 @@ qemuOpenFileAs(uid_t fallback_uid, gid_t fallback_gid,
goto cleanup;
}
+
+static int
+qemuFileWrapperFDClose(virDomainObjPtr vm,
+ virFileWrapperFdPtr fd)
+{
+ int ret;
+
+ /* virFileWrapperFd uses iohelper to write data onto disk.
+ * However, iohelper calls fdatasync() which may take ages to
+ * finish. Therefore, we shouldn't be waiting with the domain
+ * object locked. */
+
+ virObjectUnlock(vm);
+ ret = virFileWrapperFdClose(fd);
+ virObjectLock(vm);
+ return ret;
+}
+
+
/* Helper function to execute a migration to file with a correct save header
* the caller needs to make sure that the processors are stopped and do all other
* actions besides saving memory */
@@ -3276,7 +3295,7 @@ qemuDomainSaveMemory(virQEMUDriverPtr driver,
goto cleanup;
}
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
goto cleanup;
if ((fd = qemuOpenFile(driver, vm, path, O_WRONLY, NULL, NULL)) < 0 ||
@@ -3827,7 +3846,7 @@ doCoreDump(virQEMUDriverPtr driver,
path);
goto cleanup;
}
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
goto cleanup;
ret = 0;
@@ -6687,7 +6706,7 @@ qemuDomainRestoreFlags(virConnectPtr conn,
ret = qemuDomainSaveImageStartVM(conn, driver, vm, &fd, data, path,
false, QEMU_ASYNC_JOB_START);
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
VIR_WARN("Failed to close %s", path);
qemuProcessEndJob(driver, vm);
@@ -6962,7 +6981,7 @@ qemuDomainObjRestore(virConnectPtr conn,
ret = qemuDomainSaveImageStartVM(conn, driver, vm, &fd, data, path,
start_paused, asyncJob);
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
VIR_WARN("Failed to close %s", path);
cleanup: