On 09/19/2017 09:56 AM, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1448268
When migrating to a file (e.g. when doing 'virsh save file'),
couple of things are happening in the thread that is executing
the API:
1) the domain obj is locked
2) iohelper is spawned as a separate process to handle all I/O
3) the thread waits for iohelper to finish
4) the domain obj is unlocked
Now, the problem is that while the thread waits in step 3 for
iohelper to finish this may take ages because iohelper calls
fdatasync(). And unfortunately, we are waiting the whole time
with the domain locked. So if another thread wants to jump in and
say copy the domain name ('virsh list' for instance), they are
stuck.
The solution is to unlock the domain whenever waiting for I/O and
lock it back again when it finished.
Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
---
diff to v1:
- check after locking the domain obj back if the domain is still alive,
since it was unlocked it might have changed its state.
src/qemu/qemu_driver.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index e1a0dd553..6095a5ec5 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -3216,6 +3216,31 @@ qemuOpenFileAs(uid_t fallback_uid, gid_t fallback_gid,
goto cleanup;
}
+
+static int
+qemuFileWrapperFDClose(virDomainObjPtr vm,
+ virFileWrapperFdPtr fd)
+{
+ int ret;
+
+ /* virFileWrapperFd uses iohelper to write data onto disk.
+ * However, iohelper calls fdatasync() which may take ages to
+ * finish. Therefore, we shouldn't be waiting with the domain
+ * object locked. */
+
+ virObjectUnlock(vm);
+ ret = virFileWrapperFdClose(fd);
+ virObjectLock(vm);
+ if (!virDomainObjIsActive(vm)) {
+ if (!virGetLastError())
+ virReportError(VIR_ERR_OPERATION_FAILED, "%s",
+ _("domain is no longer running"));
+ ret = -1;
+ }
+ return ret;
+}
+
+
/* Helper function to execute a migration to file with a correct save header
* the caller needs to make sure that the processors are stopped and do all other
* actions besides saving memory */
@@ -3276,7 +3301,7 @@ qemuDomainSaveMemory(virQEMUDriverPtr driver,
goto cleanup;
}
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
goto cleanup;
if ((fd = qemuOpenFile(driver, vm, path, O_WRONLY, NULL, NULL)) < 0 ||
@@ -3827,7 +3852,7 @@ doCoreDump(virQEMUDriverPtr driver,
path);
goto cleanup;
}
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
goto cleanup;
ret = 0;
@@ -6687,7 +6712,7 @@ qemuDomainRestoreFlags(virConnectPtr conn,
ret = qemuDomainSaveImageStartVM(conn, driver, vm, &fd, data, path,
false, QEMU_ASYNC_JOB_START);
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
VIR_WARN("Failed to close %s", path);
This and the next one - may not be completely true if the -1 return is
because the guest not active.
Furthermore for both if we had a failure because the domain was inactive
the condition wouldn't be propagated as both only care that
qemuDomainSaveImageStartVM succeeded.
qemuProcessEndJob(driver, vm);
@@ -6962,7 +6987,7 @@ qemuDomainObjRestore(virConnectPtr conn,
ret = qemuDomainSaveImageStartVM(conn, driver, vm, &fd, data, path,
start_paused, asyncJob);
- if (virFileWrapperFdClose(wrapperFd) < 0)
+ if (qemuFileWrapperFDClose(vm, wrapperFd) < 0)
VIR_WARN("Failed to close %s", path);
For this one ignoring an active failure means perhaps unlinking the
managed save file and then running qemuProcessStart perhaps unexpectedly
again...
For these two cases there's just enough fear to not be fully supportive
for an R-B. For the Save and Dump cases, they're fine, but it would
seem to me we really want these restore operations to be quite
autonomous and not need to be impacted by some outside influence causing
the domain to be inactive again. Besides, the BZ only mentions Save and
Dump.
if:
1. qemuFileWrapperFDClose is commented to indicate only using for the
"non-restore" cases
2. Using virFileWrapperFdClose for the two restore cases
then:
Reviewed-by: John Ferlan <jferlan(a)redhat.com>
else:
I need more convincing that the Restore cases can release the lock.
IIRC, those have much faster processing since we're reading from file
into buffer and not waiting for buffer to be flushed to disk on write...
Yes, it impacts 'virsh list' (and other similar commands), but for at
least qemuDomainObjRestore there's a fair amount of code afterwards that
may assume that the domain is still active.
John
cleanup: