Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason

17 Oct 2018

On 16.10.2018 15:48, John Ferlan wrote:
...
On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
...
Let's introduce shutdown reason "daemon" which means we have to
kill running domain ourselves as the best action we can take at
that moment. Failure to pick up domain on daemon restart is
one example of such case. Using reason "crashed" is a bit misleading
as it is used when qemu is actually crashed or qemu destroyed on
guest panic as result of user choice that is the problem was
in qemu/guest itself. So I propose to use "crashed" only for
qemu side issues and introduce "daemon" when we have to kill the qemu
for good.
How about (although reading below may shed some more context):
This patch introduces a new shutdown reason "daemon" in order
to indicate that the daemon needed to force shutdown the domain
as the best course of action to take at the moment.
This action would occur during Reconnect processing when it's
determined that either the QEMU monitor no longer exists for
some unknown reason or creation of a thread to reconnect the
domain failed.
Mostly I'm fine with this one except "monitor no longer exists"
is condition for reason "crashed" or "unknown" (Martin's contribution)
...
...
There is another example where "daemon" will be useful. If we can
not reboot domain we kill it and got "crashed" reason now. Looks
like good candidate for "daemon" reason.
If you feel this way, then a followup patch could/should be posted;
otherwise, this'll be lost to commit message heaven where all good ideas
go to die ;-).
Sure!
...
...
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 include/libvirt/libvirt-domain.h |  1 +
 src/conf/domain_conf.c           |  3 ++-
 src/qemu/qemu_process.c          | 11 ++++-------
 tools/virsh-domain-monitor.c     |  3 ++-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index fdd2d6b..11fdab5 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -145,6 +145,7 @@ typedef enum {
     VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
     VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
                                            * taken while domain was shutoff */
+    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
s/have to/decides to/
...
# ifdef VIR_ENUM_SENTINELS
     VIR_DOMAIN_SHUTOFF_LAST
 # endif
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 9911d56..e441723 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
               "migrated",
               "saved",
               "failed",
-              "from snapshot")
+              "from snapshot",
+              "daemon")
VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
               "unknown",
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index e9c7618..c4bc9ca 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
         /* We can't get the monitor back, so must kill the VM
          * to remove danger of it ending up running twice if
          * user tries to start it again later
Since we're changing anyway, let's put the stop there, e.g.
"s/later/later./"
...
-         * If we couldn't get the monitor since QEMU supports
-         * no-shutdown, we can safely say that the domain
-         * crashed ... */
-        state = VIR_DOMAIN_SHUTOFF_CRASHED;
-        /* If BeginJob failed, we jumped here without a job, let's hope another
+         * If BeginJob failed, we jumped here without a job, let's hope another
          * thread didn't have a chance to start playing with the domain yet
          * (it's all we can do anyway).
          */
-        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
+        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
+                        QEMU_ASYNC_JOB_NONE, stopFlags);
     }
     goto cleanup;
 }
@@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
          * is no thread that could be doing anything else with the same domain
          * object.
          */
-        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
+        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
                         QEMU_ASYNC_JOB_NONE, 0);
         qemuDomainRemoveInactiveJobLocked(src->driver, obj);
When qemuProcessReconnectHelper was introduced (commit d38897a5d)
reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
removed in commit fe35b1ad6 the conditional state was just left at
VIR_DOMAIN_SHUTOFF_CRASHED.
Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
on in the code leading to the latter revert commit losing that unknown
reason. The reason for the "-no-shutdown" *did* change and that probably
should have been reflected in the qemuProcessReconnect logic.
So I think perhaps we should have an initial patch which references or
uses that first paragraph in order to change the code to:
/* We cannot get the monitor back, so kill the VM to
         * remove possibility of it ending up running twice if
         * user tries to start it again later.
         *
         * If we cannot get to the monitor when the QEMU command
         * line used -no-shutdown, then we can safely say that the
         * domain crashed; otherwise we don't really know. */
        if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
            state = VIR_DOMAIN_SHUTOFF_CRASHED;
        else
            state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
and uses the short commit msg of "qemu: Restore lost shutdown reason".
Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
CRASHED otherwise.
Then we will get "crashed" reason always for modern qemu and domains
that can reboot which is not true.

However I see now that the issue is more complicated then I thought. I agree
with Martin too but we have to additionally check that connect(2) for domain
monitor socket was called and error was ECONNREFUSED. So finally state is
resolved like this:

if (connect(2) was not called) {
    state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
else if (connect(2) result is NOT ECONNREFUSED) {
    state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
} else if (connect(2) result is ECONNREFUSED) {
    if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
        state = VIR_DOMAIN_SHUTOFF_CRASHED;
    else
        state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
} else {
    state = VIR_DOMAIN_SHUTOFF_DAEMON;
}

Moreover we should not later kill process by remembered pid in all cases except
last one in the above code because we can kill some other process.

Looks like a bit complicated, not sure should we pursuit this one or just left
VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).

Nikolay
...
...
diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
index 3a26363..f0ad558 100644
--- a/tools/virsh-domain-monitor.c
+++ b/tools/virsh-domain-monitor.c
@@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
               N_("migrated"),
               N_("saved"),
               N_("failed"),
-              N_("from snapshot"))
+              N_("from snapshot"),
+              N_("daemon"))
VIR_ENUM_DECL(virshDomainCrashedReason)
 VIR_ENUM_IMPL(virshDomainCrashedReason,