On Wed, Jun 01, 2022 at 14:50:21 +0200, Jiri Denemark wrote:
QEMU keeps guest CPUs running even in the postcopy-paused migration state
so that processes which already have all the memory pages they need
migrated to the destination can keep running. However, this behavior may
cause unexpected delays in interprocess communication, as some processes
will be stopped until migration is recovered and their memory pages
migrated. So let's make sure all guest CPUs are paused while postcopy
migration is paused.
---
Notes:
    Version 2:
    - new patch
    - this patch does not currently work as QEMU cannot handle the "stop"
      QMP command while in postcopy-paused state... the monitor just
      hangs (see https://gitlab.com/qemu-project/qemu/-/issues/1052)
Does it then somehow self-heal? Because if not ...
    - an ideal solution to the QEMU bug would be if QEMU itself paused
      the CPUs for us and we just got notified about it via QMP events
    - but Peter Xu thinks this behavior is actually worse than keeping
      vCPUs running
    - so let's take this patch as a base for discussing what we should
      be doing with vCPUs in postcopy-paused migration state
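For reference, the hang shows up directly on the QMP monitor once
migration enters the postcopy-paused state (a hypothetical transcript
inferred from the issue linked above, not verified output):

    -> { "execute": "stop" }
    (no reply ever arrives; the monitor stays blocked)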
src/qemu/qemu_domain.c | 1 +
src/qemu/qemu_domain.h | 1 +
src/qemu/qemu_driver.c | 30 +++++++++++++++++++++++++
src/qemu/qemu_migration.c | 47 +++++++++++++++++++++++++++++++++++++++
src/qemu/qemu_migration.h | 6 +++++
src/qemu/qemu_process.c | 32 ++++++++++++++++++++++++++
6 files changed, 117 insertions(+)
[...]
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 0314fb1148..58d7009363 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -6831,6 +6831,53 @@ qemuMigrationProcessUnattended(virQEMUDriver *driver,
}
+void
+qemuMigrationUpdatePostcopyCPUState(virDomainObj *vm,
+                                    virDomainState state,
+                                    int reason,
+                                    int asyncJob)
+{
+    virQEMUDriver *driver = QEMU_DOMAIN_PRIVATE(vm)->driver;
+    int current;
+
+    if (state == VIR_DOMAIN_PAUSED) {
+        VIR_DEBUG("Post-copy migration of domain '%s' was paused, stopping guest CPUs",
+                  vm->def->name);
+    } else {
+        VIR_DEBUG("Post-copy migration of domain '%s' was resumed, starting guest CPUs",
+                  vm->def->name);
+    }
+
+    if (virDomainObjGetState(vm, &current) == state) {
+        int eventType = -1;
+        int eventDetail = -1;
+
+        if (current == reason) {
+            VIR_DEBUG("Guest CPUs are already in the right state");
+            return;
+        }
+
+        VIR_DEBUG("Fixing domain state reason");
+        if (state == VIR_DOMAIN_PAUSED) {
+            eventType = VIR_DOMAIN_EVENT_SUSPENDED;
+            eventDetail = qemuDomainPausedReasonToSuspendedEvent(reason);
+        } else {
+            eventType = VIR_DOMAIN_EVENT_RESUMED;
+            eventDetail = qemuDomainRunningReasonToResumeEvent(reason);
+        }
+        virDomainObjSetState(vm, state, reason);
+        qemuDomainSaveStatus(vm);
+        virObjectEventStateQueue(driver->domainEventState,
+                                 virDomainEventLifecycleNewFromObj(vm, eventType,
+                                                                   eventDetail));
+    } else if (state == VIR_DOMAIN_PAUSED) {
+        qemuProcessStopCPUs(driver, vm, reason, asyncJob);
Then this will obviously break our ability to control qemu. If that is
forever, then we certainly should not be doing this.

In which case, if we want to go ahead with pausing it ourselves once qemu
fixes the issue you've mentioned above, they also need to add a 'feature'
flag into QMP which we can probe, so that we avoid knowingly breaking
qemu.
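Something along these lines perhaps (a rough sketch only;
QEMU_CAPS_POSTCOPY_PAUSE_STOP is a made-up capability name used for
illustration, not an existing flag):

    } else if (state == VIR_DOMAIN_PAUSED) {
        /* Hypothetical: stop the CPUs ourselves only when this qemu
         * advertises a monitor that no longer hangs on 'stop' in the
         * postcopy-paused state. */
        if (virQEMUCapsGet(QEMU_DOMAIN_PRIVATE(vm)->qemuCaps,
                           QEMU_CAPS_POSTCOPY_PAUSE_STOP))
            qemuProcessStopCPUs(driver, vm, reason, asyncJob);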
+    } else {
+        qemuProcessStartCPUs(driver, vm, reason, asyncJob);
+    }
+}
+
+
/* Helper function called while vm is active. */
int
qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainObj *vm,