On 11/30/18 9:53 AM, Wang King wrote:
> The issue occurs when I make repeated calls to virDomainReboot
with
> VIR_DOMAIN_REBOOT_DEFAULT flag. In the first call to reboot domain,
> the qemu driver chose ACPI path, and set priv->fakeReboot to true.
> Then in a second call, qemu driver chose agent to reboot which set
> fakeReboot to false. But because the guest already responded to ACPI
> shut down, libvirtd daemon will process a SHUTDOWN event in
> qemuProcessShutdownOrReboot and checks priv->fakeReboot. Since the
> fakeReboot flag is now false, qemu process is unexpectedly killed.
>
This sounds fishy. Looking at the code libvirt decides whether to use agent or ACPI based
on:
a) flags (but since you're passing 0 this is out of the picture),
b) guest agent being available,
This means that agent must have connected between two virDomainReboot() calls. Otherwise
libvirt would make the same choice.
> I have no idea how to fix it.
Well, the qemuDomainSetFakeReboot(false) call was added in b0c144c5792 which points to:
https://www.redhat.com/archives/libvir-list/2015-April/msg00732.html
I think the patch proposed there is actually right and not the one that was merged.
Michal
The problem:
Consider this scenario: firstly, call virDomainReboot() with mode ACPI to a domain, then
immediately call virDomainReboot() again with mode “agent”. Supposedly, this sequence will
reboot the domain (hereinafter referred to as DOM1). Instead, however, the current version
of libvirt code will *KILL* the QEMU process associated with the domain.
How to reproduce this problem:
Simply make a ACPI-mode reboot call to the domain, and immediately make an agent-mode
reboot call. You will see that the domain is killed straight away.
Root cause:
The unexpected SIGKILL comes from Libvirt seeing fakeReboot flag set to *FALSE* when it
received an SHUTDOWN event as a result of calling ACPI virDomainReboot(). This happens
when DOM1 is already rebooting, yet its guest agent is still available, thus the second
agent-mode virDomainReboot() was able to pass the check and set fakeReboot flag to *FALSE*
(its “guest-shutdown” command is also sent, which is out of the scope of this email).
Finally, DOM1 shut down as part of the reboot process, its QEMU process sent SHUTDOWN
event to Libvirt. Libvirt checked fakeReboot flag and saw it being *FALSE*, forcefully
kills the QEMU process as a result.
Analysis:
Say that we make two calls to shutdown/reboot DOM1. DOM1 will eventually respond to only
one of the two calls. For another call, only two scenarios happen: DOM1 accepts the second
call but do not execute it (meaning only fakeReboot flag is modified on Libvirt side), or
DOM1 accepts the second call and overrides the first one. In the second scenario, it is
the same situation as first scenario but reversed (DOM1 responds to first call and sets
flag with second call vs DOM1 responds to second call and sets flag with first call).
Therefore, we hereby simplify the problem to that DOM1 always responds to first call, and
the second call is also accepted, but it only serves to change the fakeReboot flag. As you
will see, there is what we expect to happen versus what actually happens. The questionable
result is marked with tilde(~) and the faulted result is marked with emphasis(*).
If you see the chart below, you will see that how the domain should react to
shutdown/reboot calls in different order:
-----------------------------------------------------------------------------------------------------------
|First Command |Second Command(set flags only) |Expected Result |Actual Result
|After Fix |
-----------------------------------------------------------------------------------------------------------
|ACPI SHUTDOWN |ACPI SHUTDOWN |SHUTDOWN |SHUTDOWN |SHUTDOWN |
|ACPI REBOOT |ACPI SHUTDOWN |REBOOT |~SHUTDOWN~ |~SHUTDOWN~ |
|AGENT SHUTDOWN |ACPI SHUTDOWN |SHUTDOWN |SHUTDOWN |SHUTDOWN |
|AGENT REBOOT |ACPI SHUTDOWN |REBOOT |REBOOT |REBOOT
|
|ACPI SHUTDOWN |ACPI REBOOT |SHUTDOWN |~REBOOT~ |~REBOOT~
|
|ACPI REBOOT |ACPI REBOOT |REBOOT |REBOOT |REBOOT
|
|AGENT SHUTDOWN |ACPI REBOOT |SHUTDOWN |~REBOOT~ |~REBOOT~
|
|AGENT REBOOT |ACPI REBOOT |REBOOT |REBOOT |REBOOT
|
|ACPI SHUTDOWN |AGENT SHUTDOWN |SHUTDOWN |SHUTDOWN |SHUTDOWN |
|ACPI REBOOT |AGENT SHUTDOWN |REBOOT |~SHUTDOWN~ |~SHUTDOWN~
|
|AGENT SHUTDOWN |AGENT SHUTDOWN |SHUTDOWN |SHUTDOWN |SHUTDOWN |
|AGENT REBOOT |AGENT SHUTDOWN |REBOOT |REBOOT |REBOOT
|
|ACPI SHUTDOWN |AGENT REBOOT |SHUTDOWN |SHUTDOWN |SHUTDOWN
|
|ACPI REBOOT |AGENT REBOOT |REBOOT |**SHUTDOWN**
|**REBOOT** |
|AGENT SHUTDOWN |AGENT REBOOT |SHUTDOWN |SHUTDOWN |SHUTDOWN |
|AGENT REBOOT |AGENT REBOOT |REBOOT |REBOOT |REBOOT
|
----------------------------------------------------------------------------------------------------------
As you can see, when DOM1 reboots with ACPI and immediately receives agent-mode reboot. It
will *SHUTDOWN* instead. (Certainly, there are questionable results in other combos, but
we think they cause less confusions in real-life situations so we are not discussing them
here)
Fix:
The fix is to revert the changes added by b0c144c5792 commit, which introduced setting
fakeReboot to False in agent-mode reboot. This should, at least, avoid killing QEMU
process during successive reboot calls. Nonetheless, shutdown/reboot process still needs
some optimizations to be done. I will leave it to the analysis above. What do you think
about this? Should we revert the commit first or find an all-fix solution to this
problem?