On 11/8/18 8:10 AM, Nikolay Shirokovskiy wrote:
On 08.11.2018 15:58, John Ferlan wrote:
>
>
> On 11/8/18 2:03 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 07.11.2018 17:29, John Ferlan wrote:
>>>
>>>
>>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>>> Let's introduce shutdown reason "daemon" which means we
have to
>>>> kill running domain ourselves as the best action we can take at
>>>> that moment. Failure to pick up domain on daemon restart is
>>>> one example of such case. Using reason "crashed" is a bit
misleading
>>>> as it is used when qemu is actually crashed or qemu destroyed on
>>>> guest panic as result of user choice that is the problem was
>>>> in qemu/guest itself. So I propose to use "crashed" only for
>>>> qemu side issues and introduce "daemon" when we have to kill
the qemu
>>>> for good.
>>>>
>>>> There is another example where "daemon" will be useful. If we
can
>>>> not reboot domain we kill it and got "crashed" reason now.
Looks
>>>> like good candidate for "daemon" reason.
>>>>
>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy(a)virtuozzo.com>
>>>> ---
>>>> include/libvirt/libvirt-domain.h | 1 +
>>>> src/conf/domain_conf.c | 3 ++-
>>>> src/qemu/qemu_process.c | 11 ++++-------
>>>> tools/virsh-domain-monitor.c | 3 ++-
>>>> 4 files changed, 9 insertions(+), 9 deletions(-)
>>>>
>
> [...]
>
>>> @@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>> * is no thread that could be doing anything else with the same
>>> domain
>>> * object.
>>> */
>>> - qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>> + qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>> QEMU_ASYNC_JOB_NONE, 0);
>>
>> I think we need to report VIR_DOMAIN_SHUTOFF_UNKNOWN here. This is similar to
>> qemuProcessReconnect before we try to reconnect. May be this hunk need to be
>> placed in distinct patch therefore.
>>
>
> Perhaps keeping "FAILED" would be better - that way (and I can add this
> to the qemuProcessReconnect description):
>
> * Failure to reconnect will generate the following reasons:
> * VIR_DOMAIN_SHUTOFF_FAILED => Tried to create this thread, but failed
> * VIR_DOMAIN_SHUTOFF_UNKNOWN => Thread started, but failed before reconnect
> * to the monitor
> * VIR_DOMAIN_SHUTOFF_CRASHED => Attempted to reconnect to the monitor, but
> * failed to open, assume domain crash
> * VIR_DOMAIN_SHUTOFF_DAEMON => Reconnected to the monitor, but decided
> * afterwards we needed to stop the process
>
> Look good?
Unfortunately meaning of each is fixed in API:
VIR_DOMAIN_SHUTOFF_UNKNOWN = 0, /* the reason is unknown */
VIR_DOMAIN_SHUTOFF_CRASHED = 3, /* domain crashed */
VIR_DOMAIN_SHUTOFF_FAILED = 6, /* domain failed to start */
So this is definetly not FAILED. I guess we can't change meaning now also, so
we have to use UNKNOWN and explaining reasons again is qemuProcessReconnect is
needless.
By the way how from reconnection POV failing to spawn a thread differ from any
error in qemuProcessReconnect before monitor connectiton.
True, but let's walk through going through history
The FAILED reason was introduced in commit b046c55d4 (v0.9.2).
Then, commit d38897a5d (v0.9.5) introduced qemuProcessReconnectHelper to
run qemuProcessReconnect in a thread. This commit used FAILED for both
APIs because that kept the same reason.
Eventually commit bda2f17d7 (v0.9.12) changed FAILED to CRASHED and
UNKNOWN based on whether the domain was started w/ -no-shutdown. With
this commit, failure to start a thread would still use FAILED.
That stayed that way until commit fe35b1ad6 (v4.3.0) in inadvertently
removed UNKNOWN while removing the capability used to determine whether
-no-shutdown was used. This left CRASHED as the only reason.
That was repaired as a result of this patch by commit 296e05b54 (4.10.0).
So now the proposal which seems fair to me is to create a category
DAEMON which reduces the UNKNOWN window to better describe why we're
stopping the domain after reconnection.
So, going back to FAILED in qemuProcessReconnectHelper IMO is the right
thing to do at this point. If some patch comes along afterwards to
changed FAILED to UNKNOWN it can be debated then.
So, I'll restore the original FAILED reason, but not document the
various values since the reader of the code could figure it out.
John