Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason

8 Nov 2018

On 07.11.2018 17:29, John Ferlan wrote:
...
On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
...
Let's introduce shutdown reason "daemon" which means we have to
kill running domain ourselves as the best action we can take at
that moment. Failure to pick up domain on daemon restart is
one example of such case. Using reason "crashed" is a bit misleading
as it is used when qemu is actually crashed or qemu destroyed on
guest panic as result of user choice that is the problem was
in qemu/guest itself. So I propose to use "crashed" only for
qemu side issues and introduce "daemon" when we have to kill the qemu
for good.
There is another example where "daemon" will be useful. If we can
not reboot domain we kill it and got "crashed" reason now. Looks
like good candidate for "daemon" reason.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 include/libvirt/libvirt-domain.h |  1 +
 src/conf/domain_conf.c           |  3 ++-
 src/qemu/qemu_process.c          | 11 ++++-------
 tools/virsh-domain-monitor.c     |  3 ++-
 4 files changed, 9 insertions(+), 9 deletions(-)
[...]
Now that I've pushed commits 296e05b54 and 8f0f8425d, let's revisit
this...  Merging in that set of changes with this patch means that
instead of:
...

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index e9c7618..c4bc9ca 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
         /* We can't get the monitor back, so must kill the VM
          * to remove danger of it ending up running twice if
          * user tries to start it again later
-         * If we couldn't get the monitor since QEMU supports
-         * no-shutdown, we can safely say that the domain
-         * crashed ... */
-        state = VIR_DOMAIN_SHUTOFF_CRASHED;
-        /* If BeginJob failed, we jumped here without a job, let's hope another
+         * If BeginJob failed, we jumped here without a job, let's hope another
          * thread didn't have a chance to start playing with the domain yet
          * (it's all we can do anyway).
          */
-        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
+        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
+                        QEMU_ASYNC_JOB_NONE, stopFlags);
     }
     goto cleanup;
 }
@@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
          * is no thread that could be doing anything else with the same domain
          * object.
          */
-        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
+        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
                         QEMU_ASYNC_JOB_NONE, 0);
         qemuDomainRemoveInactiveJobLocked(src->driver, obj);
We'd have:
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 3bf84834ec..023e187729 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7995,10 +7995,14 @@ qemuProcessReconnect(void *opaque)
          *
          * If we cannot get to the monitor when the QEMU command
          * line used -no-shutdown, then we can safely say that the
-         * domain crashed; otherwise, we don't really know. */
+         * domain crashed; otherwise, if the monitor was started,
+         * then we can blame ourselves, else we failed before the
+         * monitor started so we don't really know. */
         if (!priv->mon && tryMonReconn &&
             qemuDomainIsUsingNoShutdown(priv))
             state = VIR_DOMAIN_SHUTOFF_CRASHED;
+        else if (priv->mon)
+            state = VIR_DOMAIN_SHUTOFF_DAEMON;
         else
             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
@@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
          * is no thread that could be doing anything else with the same
domain
          * object.
          */
-        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
+        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
                         QEMU_ASYNC_JOB_NONE, 0);
I think we need to report VIR_DOMAIN_SHUTOFF_UNKNOWN here. This is similar to 
qemuProcessReconnect before we try to reconnect. May be this hunk need to be 
placed in distinct patch therefore.
...
qemuDomainRemoveInactiveJobLocked(src->driver, obj);
[...]
So does this essentially meet/handle the condition you were trying to
alter?  That is - once we get beyond the monitor reconnection, then if
we end up in error that means the daemon has decided to stop things.
Leaving the UNKNOWN to be the period prior to attempting to reconnect
and CRASHED to the period during which we try to connect to the monitor.
This also captures failures after connection is successful (when
priv->mon in qemuConnectMonitor).

...
Getting more specific failures of why the connection failed is perhaps a
future task if it really matters.
This looks good enough for me too.
...
If this looks good to you let me know, I can merge and push with the
previously noted changes and a commit message of:
"This patch introduces a new shutdown reason "daemon" in order
to indicate that the daemon needed to force shutdown the domain
as the best course of action to take at the moment.
This action would occur during reconnection when processing encounters
an error once the monitor reconnection is successful."
Besides the first comment I'm fine with this change. I'm going
to send a patch to change reason to VIR_DOMAIN_SHUTOFF_DAEMON in 
appropriate reboot failure cases too. Also one day let's fix the code not to kill
unrelated process.

Nikolay