[libvirt] [PATCH v2 0/3] qemu: Restore lost shutdown reason

v1: https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html Changes since v1: Patch 1 (NEW) - Set the priv->allowReboot prior to any error processing since we could/should be using that. Patch 2 (ADJUSTED LOGIC) - Rather than open code the "reason" to use the -no-shutdown flag, let's create a qemu_domain helper for that so that both the command line building and the Reconnection failure logic can use it. Patch 3 (NEW) - Let's narrow the window for using VIR_DOMAIN_SHUTOFF_CRASHED to just the period where we try to open a monitor channel. If we cannot do so, then "assume" the reason was crashed. There are a few open failure steps that may not exactly fit the model, but those are probably splitting hairs. John Ferlan (3): qemu: Move allow reboot check setting qemu: Restore lost shutdown reason qemu: Narrow the shutdown reconnection failure reason window src/qemu/qemu_command.c | 6 +----- src/qemu/qemu_domain.c | 17 +++++++++++++++++ src/qemu/qemu_domain.h | 3 +++ src/qemu/qemu_process.c | 27 ++++++++++++++++++--------- 4 files changed, 39 insertions(+), 14 deletions(-) -- 2.17.2

Checking and setting the priv->allowReboot can be done before we start processing the job. A subsequent patch will make use of the value to make decisions in the error label, so we need to have it set properly. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9cf971808c..5232f761af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7767,6 +7767,10 @@ qemuProcessReconnect(void *opaque) cfg = virQEMUDriverGetConfig(driver); priv = obj->privateData; + /* If we are connecting to a guest started by old libvirt there is no + * allowReboot in status XML and we need to initialize it. */ + qemuProcessPrepareAllowReboot(obj); + if (!(caps = virQEMUDriverGetCapabilities(driver, false))) goto error; @@ -7783,10 +7787,6 @@ qemuProcessReconnect(void *opaque) if (qemuDomainMasterKeyReadFile(priv) < 0) goto error; - /* If we are connecting to a guest started by old libvirt there is no - * allowReboot in status XML and we need to initialize it. */ - qemuProcessPrepareAllowReboot(obj); - if (qemuHostdevUpdateActiveDomainDevices(driver, obj->def) < 0) goto error; -- 2.17.2

On 11/01/2018 05:04 PM, John Ferlan wrote:
Checking and setting the priv->allowReboot can be done before we start processing the job. A subsequent patch will make use of the value to make decisions in the error label, so we need to have it set properly.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9cf971808c..5232f761af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7767,6 +7767,10 @@ qemuProcessReconnect(void *opaque) cfg = virQEMUDriverGetConfig(driver); priv = obj->privateData;
+ /* If we are connecting to a guest started by old libvirt there is no + * allowReboot in status XML and we need to initialize it. */ + qemuProcessPrepareAllowReboot(obj);
I'm not quite sure why this happens outside of job. It doesn't look like it has to.
+ if (!(caps = virQEMUDriverGetCapabilities(driver, false))) goto error;
@@ -7783,10 +7787,6 @@ qemuProcessReconnect(void *opaque) if (qemuDomainMasterKeyReadFile(priv) < 0) goto error;
- /* If we are connecting to a guest started by old libvirt there is no - * allowReboot in status XML and we need to initialize it. */ - qemuProcessPrepareAllowReboot(obj); - if (qemuHostdevUpdateActiveDomainDevices(driver, obj->def) < 0) goto error;
Michal

On 11/6/18 4:38 AM, Michal Privoznik wrote:
On 11/01/2018 05:04 PM, John Ferlan wrote:
Checking and setting the priv->allowReboot can be done before we start processing the job. A subsequent patch will make use of the value to make decisions in the error label, so we need to have it set properly.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9cf971808c..5232f761af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7767,6 +7767,10 @@ qemuProcessReconnect(void *opaque) cfg = virQEMUDriverGetConfig(driver); priv = obj->privateData;
+ /* If we are connecting to a guest started by old libvirt there is no + * allowReboot in status XML and we need to initialize it. */ + qemuProcessPrepareAllowReboot(obj);
I'm not quite sure why this happens outside of job. It doesn't look like it has to.
Is there a reason in your opinion it needs to occur inside a job? It is a void function. It's moved to prior to the first "goto error" because of patch3 which would call qemuDomainIsUsingNoShutdown which checks priv->allowReboot which is possibly set in *AllowReboot. Without that move, the code would need to be reworked, which is fine, but understanding the reason why I wrote things the way I did is important, IMO. I can add a comment to "warn" the next person trying to move it that the error: logic uses the ->allowReboot value. The allowReboot alteration has nothing to do with a job AFAICT and whether on error: there is a job or not. Perhaps no different to what qemuDomainObjRestoreJob is doing by using @priv fields for @oldjob. Although looking at and quickly thinking about the code now, I wonder if the virQEMUDriverGetCapabilities and goto error should be inside a job too since error would then call qemuProcessStop without being in a job. If this is dropped then logic in patch3 needs to be altered in order to account for jobStarted = true... I think that got too messy when I first thought about it. Tks - John
+ if (!(caps = virQEMUDriverGetCapabilities(driver, false))) goto error;
@@ -7783,10 +7787,6 @@ qemuProcessReconnect(void *opaque) if (qemuDomainMasterKeyReadFile(priv) < 0) goto error;
- /* If we are connecting to a guest started by old libvirt there is no - * allowReboot in status XML and we need to initialize it. */ - qemuProcessPrepareAllowReboot(obj); - if (qemuHostdevUpdateActiveDomainDevices(driver, obj->def) < 0) goto error;
Michal

On 11/06/2018 01:28 PM, John Ferlan wrote:
On 11/6/18 4:38 AM, Michal Privoznik wrote:
On 11/01/2018 05:04 PM, John Ferlan wrote:
Checking and setting the priv->allowReboot can be done before we start processing the job. A subsequent patch will make use of the value to make decisions in the error label, so we need to have it set properly.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9cf971808c..5232f761af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7767,6 +7767,10 @@ qemuProcessReconnect(void *opaque) cfg = virQEMUDriverGetConfig(driver); priv = obj->privateData;
+ /* If we are connecting to a guest started by old libvirt there is no + * allowReboot in status XML and we need to initialize it. */ + qemuProcessPrepareAllowReboot(obj);
I'm not quite sure why this happens outside of job. It doesn't look like it has to.
Is there a reason in your opinion it needs to occur inside a job? It is a void function.
The type of the return value doesn't matter. qemuProcessPrepareAllowReboot() changes private data and that is potentially dangerous if done outside modify job (even though the @vm is locked at this point so I guess it is not that dangerous after all).
It's moved to prior to the first "goto error" because of patch3 which would call qemuDomainIsUsingNoShutdown which checks priv->allowReboot which is possibly set in *AllowReboot. Without that move, the code would need to be reworked, which is fine, but understanding the reason why I wrote things the way I did is important, IMO. I can add a comment to "warn" the next person trying to move it that the error: logic uses the ->allowReboot value.
The allowReboot alteration has nothing to do with a job AFAICT and whether on error: there is a job or not. Perhaps no different to what qemuDomainObjRestoreJob is doing by using @priv fields for @oldjob.
Yeah, but RestoreJob is special - we can't call it after job is acquired, we want to save currently running job to a temp variable so that we can start a new job.
Although looking at and quickly thinking about the code now, I wonder if the virQEMUDriverGetCapabilities and goto error should be inside a job too since error would then call qemuProcessStop without being in a job.
Ooops, yes.
If this is dropped then logic in patch3 needs to be altered in order to account for jobStarted = true... I think that got too messy when I first thought about it.
Michal

On 11/6/18 7:48 AM, Michal Privoznik wrote:
On 11/06/2018 01:28 PM, John Ferlan wrote:
On 11/6/18 4:38 AM, Michal Privoznik wrote:
On 11/01/2018 05:04 PM, John Ferlan wrote:
Checking and setting the priv->allowReboot can be done before we start processing the job. A subsequent patch will make use of the value to make decisions in the error label, so we need to have it set properly.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 9cf971808c..5232f761af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7767,6 +7767,10 @@ qemuProcessReconnect(void *opaque) cfg = virQEMUDriverGetConfig(driver); priv = obj->privateData;
+ /* If we are connecting to a guest started by old libvirt there is no + * allowReboot in status XML and we need to initialize it. */ + qemuProcessPrepareAllowReboot(obj);
I'm not quite sure why this happens outside of job. It doesn't look like it has to.
Is there a reason in your opinion it needs to occur inside a job? It is a void function.
The type of the return value doesn't matter. qemuProcessPrepareAllowReboot() changes private data and that is potentially dangerous if done outside modify job (even though the @vm is locked at this point so I guess it is not that dangerous after all).
Does that mean we need to cull the code looking for everywhere in the code where @priv data is modified outside a job? and start a job if so? I thought jobs were more related to monitor interactions rather than @priv modification. The answer still doesn't make sense to me. John
It's moved to prior to the first "goto error" because of patch3 which would call qemuDomainIsUsingNoShutdown which checks priv->allowReboot which is possibly set in *AllowReboot. Without that move, the code would need to be reworked, which is fine, but understanding the reason why I wrote things the way I did is important, IMO. I can add a comment to "warn" the next person trying to move it that the error: logic uses the ->allowReboot value.
The allowReboot alteration has nothing to do with a job AFAICT and whether on error: there is a job or not. Perhaps no different to what qemuDomainObjRestoreJob is doing by using @priv fields for @oldjob.
Yeah, but RestoreJob is special - we can't call it after job is acquired, we want to save currently running job to a temp variable so that we can start a new job.
Although looking at and quickly thinking about the code now, I wonder if the virQEMUDriverGetCapabilities and goto error should be inside a job too since error would then call qemuProcessStop without being in a job.
Ooops, yes.
If this is dropped then logic in patch3 needs to be altered in order to account for jobStarted = true... I think that got too messy when I first thought about it.
Michal

When qemuProcessReconnectHelper was introduced (commit d38897a5d) reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was removed in commit fe35b1ad6 the conditional state was just left at VIR_DOMAIN_SHUTOFF_CRASHED. So introduce qemuDomainIsUsingNoShutdown which will manage the condition when the domain was started with -no-shutdown so that when/if reconnection failure occurs we can restore the decision point used to determine whether CRASHED or UNKNOWN is provided. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_command.c | 6 +----- src/qemu/qemu_domain.c | 17 +++++++++++++++++ src/qemu/qemu_domain.h | 3 +++ src/qemu/qemu_process.c | 15 ++++++++++----- 4 files changed, 31 insertions(+), 10 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 6e3ff67660..5b94931af6 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6415,11 +6415,7 @@ qemuBuildPMCommandLine(virCommandPtr cmd, if (priv->allowReboot == VIR_TRISTATE_BOOL_NO) virCommandAddArg(cmd, "-no-reboot"); - /* If JSON monitor is enabled, we can receive an event - * when QEMU stops. If we use no-shutdown, then we can - * watch for this event and do a soft/warm reboot. - */ - if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES) + if (qemuDomainIsUsingNoShutdown(priv)) virCommandAddArg(cmd, "-no-shutdown"); if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_NO_ACPI)) { diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index ba3fff607a..045a7b4ac0 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -13561,3 +13561,20 @@ qemuDomainRunningReasonToResumeEvent(virDomainRunningReason reason) return VIR_DOMAIN_EVENT_RESUMED_UNPAUSED; } + + +/* qemuDomainIsUsingNoShutdown: + * @priv: Domain private data + * + * If JSON monitor is enabled, we can receive an event when QEMU stops. If + * we use no-shutdown, then we can watch for this event and do a soft/warm + * reboot. + * + * Returns: @true when -no-shutdown either should be or was added to the + * command line. + */ +bool +qemuDomainIsUsingNoShutdown(qemuDomainObjPrivatePtr priv) +{ + return priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES; +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 80bd4bde91..554435e0a9 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -1089,4 +1089,7 @@ void qemuDomainStorageIdReset(qemuDomainObjPrivatePtr priv); virDomainEventResumedDetailType qemuDomainRunningReasonToResumeEvent(virDomainRunningReason reason); +bool +qemuDomainIsUsingNoShutdown(qemuDomainObjPrivatePtr priv); + #endif /* __QEMU_DOMAIN_H__ */ diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 5232f761af..7112f3c048 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7988,11 +7988,16 @@ qemuProcessReconnect(void *opaque) if (virDomainObjIsActive(obj)) { /* We can't get the monitor back, so must kill the VM * to remove danger of it ending up running twice if - * user tries to start it again later - * If we couldn't get the monitor since QEMU supports - * no-shutdown, we can safely say that the domain - * crashed ... */ - state = VIR_DOMAIN_SHUTOFF_CRASHED; + * user tries to start it again later. + * + * If we cannot get to the monitor when the QEMU command + * line used -no-shutdown, then we can safely say that the + * domain crashed; otherwise we don't really know. */ + if (qemuDomainIsUsingNoShutdown(priv)) + state = VIR_DOMAIN_SHUTOFF_CRASHED; + else + state = VIR_DOMAIN_SHUTOFF_UNKNOWN; + /* If BeginJob failed, we jumped here without a job, let's hope another * thread didn't have a chance to start playing with the domain yet * (it's all we can do anyway). -- 2.17.2

On 11/01/2018 05:04 PM, John Ferlan wrote:
When qemuProcessReconnectHelper was introduced (commit d38897a5d) reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or VIR_DOMAIN_SHUTOFF_UNKNOWN.
When QEMU_CAPS_NO_SHUTDOWN checking was removed in commit fe35b1ad6 the conditional state was just left at VIR_DOMAIN_SHUTOFF_CRASHED.
So introduce qemuDomainIsUsingNoShutdown which will manage the condition when the domain was started with -no-shutdown so that when/if reconnection failure occurs we can restore the decision point used to determine whether CRASHED or UNKNOWN is provided.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_command.c | 6 +----- src/qemu/qemu_domain.c | 17 +++++++++++++++++ src/qemu/qemu_domain.h | 3 +++ src/qemu/qemu_process.c | 15 ++++++++++----- 4 files changed, 31 insertions(+), 10 deletions(-)
ACK Michal

The current qemuProcessReconnect logic paints a broad brush determining that the shutdown reason must be crashed if it was determined that the domain was started with -no-shutdown; however, there's many other ways to get to the error label, so let's narrow our reasoning window for using VIR_DOMAIN_SHUTOFF_CRASHED to the period where we essentially know we've tried to create to the monitor and before we were successful in opening the connection. Failures that occur outside that window would thus be considered as VIR_DOMAIN_SHUTOFF_UNKNOWN, at least for now. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7112f3c048..85f30bf9d4 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7757,6 +7757,7 @@ qemuProcessReconnect(void *opaque) bool jobStarted = false; virCapsPtr caps = NULL; bool retry = true; + bool tryMonReconn = false; VIR_FREE(data); @@ -7797,6 +7798,8 @@ qemuProcessReconnect(void *opaque) VIR_DEBUG("Reconnect monitor to def=%p name='%s' retry=%d", obj, obj->def->name, retry); + tryMonReconn = true; + /* XXX check PID liveliness & EXE path */ if (qemuConnectMonitor(driver, obj, QEMU_ASYNC_JOB_NONE, retry, NULL) < 0) goto error; @@ -7993,7 +7996,8 @@ qemuProcessReconnect(void *opaque) * If we cannot get to the monitor when the QEMU command * line used -no-shutdown, then we can safely say that the * domain crashed; otherwise we don't really know. */ - if (qemuDomainIsUsingNoShutdown(priv)) + if (!priv->mon && tryMonReconn && + qemuDomainIsUsingNoShutdown(priv)) state = VIR_DOMAIN_SHUTOFF_CRASHED; else state = VIR_DOMAIN_SHUTOFF_UNKNOWN; -- 2.17.2

On 11/01/2018 05:04 PM, John Ferlan wrote:
The current qemuProcessReconnect logic paints a broad brush determining that the shutdown reason must be crashed if it was determined that the domain was started with -no-shutdown; however, there's many other ways to get to the error label, so let's narrow our reasoning window for using VIR_DOMAIN_SHUTOFF_CRASHED to the period where we essentially know we've tried to create to the monitor and before we were successful in opening the connection.
Failures that occur outside that window would thus be considered as VIR_DOMAIN_SHUTOFF_UNKNOWN, at least for now.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_process.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
ACK Michal

On 11/1/18 12:04 PM, John Ferlan wrote:
v1: https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html
Changes since v1:
Patch 1 (NEW)
- Set the priv->allowReboot prior to any error processing since we could/should be using that.
Patch 2 (ADJUSTED LOGIC)
- Rather than open code the "reason" to use the -no-shutdown flag, let's create a qemu_domain helper for that so that both the command line building and the Reconnection failure logic can use it.
Patch 3 (NEW)
- Let's narrow the window for using VIR_DOMAIN_SHUTOFF_CRASHED to just the period where we try to open a monitor channel. If we cannot do so, then "assume" the reason was crashed. There are a few open failure steps that may not exactly fit the model, but those are probably splitting hairs.
John Ferlan (3): qemu: Move allow reboot check setting qemu: Restore lost shutdown reason qemu: Narrow the shutdown reconnection failure reason window
src/qemu/qemu_command.c | 6 +----- src/qemu/qemu_domain.c | 17 +++++++++++++++++ src/qemu/qemu_domain.h | 3 +++ src/qemu/qemu_process.c | 27 ++++++++++++++++++--------- 4 files changed, 39 insertions(+), 14 deletions(-)
After sleeping on it, although we cannot come to an agreement on patch1, I guess in the long run it doesn't matter since patch3 narrows the window that made patch1 unnecessary. In order for CRASHED to be elicited the monitor connection must've failed; otherwise, UNKNOWN would be used. Moving forward also provides the change I found necessary for Nikolay's domain reason addition patch to proceed forward, e.g.: https://www.redhat.com/archives/libvir-list/2018-October/msg00493.html So, I'll drop patch 1 and push patch2/3, then go back to Nikolay's code. Tks - John
participants (2)
-
John Ferlan
-
Michal Privoznik