On 03/05/2018 12:43 PM, Cordius Wu wrote:
>>> On 03/05/2018 03:20 AM, Wuzongyong (Euler Dept) wrote:
>>>> Hi,
>>>>
>>>> We unregister qemu monitor after sending QEMU_PROCESS_EVENT_MONITOR_EOF
>>>> to workerPool:
>>>>
>>>> static void
>>>> qemuProcessHandleMonitorEOF(qemuMonitorPtr mon,
>>>> virDomainObjPtr vm,
>>>> void *opaque) {
>>>> virQEMUDriverPtr driver = opaque;
>>>> qemuDomainObjPrivatePtr priv;
>>>> struct qemuProcessEvent *processEvent; ...
>>>> processEvent->eventType = QEMU_PROCESS_EVENT_MONITOR_EOF;
>>>> processEvent->vm = vm;
>>>>
>>>> virObjectRef(vm);
>>>> if (virThreadPoolSendJob(driver->workerPool, 0, processEvent) < 0) {
>>>> ignore_value(virObjectUnref(vm));
>>>> VIR_FREE(processEvent);
>>>> goto cleanup;
>>>> }
>>>>
>>>> /* We don't want this EOF handler to be called over and over while the
>>>> * thread is waiting for a job.
>>>> */
>>>> qemuMonitorUnregister(mon);
>>>> ...
>>>> }
>>>>
>>>> Then we handle QEMU_PROCESS_EVENT_MONITOR_EOF in processMonitorEOFEvent
>>>> function:
>>>>
>>>> static void
>>>> processMonitorEOFEvent(virQEMUDriverPtr driver,
>>>> virDomainObjPtr vm) {
>>>> ...
>>>> if (qemuProcessBeginStopJob(driver, vm, QEMU_JOB_DESTROY, true) < 0)
>>>> return;
>>>> ...
>>>> }
>>>>
>>>> Here, libvirt will keep showing the vm state as running if
>>>> qemuProcessBeginStopJob returns -1, even though qemu may terminate or be
>>>> killed later.
>>>>
>>>> So, maybe we should re-register the monitor when
>>>> qemuProcessBeginStopJob fails?
>>>
>>> The fact that processMonitorEOFEvent() failed to grab DESTROY job means
>>> that we screwed up earlier and now you're just seeing effects of it.
>>> Threads should be able to acquire DESTROY job at any point, regardless of
>>> other jobs set on the domain object.
>>>
>>> Can you please:
>>> a) try to turn on debug logs [1] and tell us why acquiring DESTROY job
>>> failed? You should see an error message like this:
>>>
>>> error: cannot acquire state change lock ..
>>>
>>> b) tell us what is your libvirt version and if you're able to reproduce
>>> this with the latest git HEAD?
>>>
>>
>> By "qemuProcessBeginStopJob failed" I mean that:
>
> Oh, I thought that the message you've sent earlier is related to this:
>
> https://www.redhat.com/archives/libvir-list/2018-March/msg00148.html
>
> So you are not accidentally sending SIGKILL to qemu then?
Yep, I send SIGKILL to qemu from outside. The 'accident' means that the
scenario where libvirt indicates the vm is in running state all the time is
hard to reproduce. In the past month, I have reproduced it only twice.
>> we failed to kill the qemu process within 15 seconds (refer to
>> virProcessKillPainfully).
>> IOW, we send SIGTERM and SIGKILL but the qemu process doesn't exit within
>> 15s, and then libvirt will think qemu is still in running state even though
>> qemu does exit after the 15s loop in virProcessKillPainfully.
>
> What state is qemu process in then? I mean, how can we see EOF if the
> process still exists?
>
I sent SIGKILL to the qemu process, but the qemu process didn't exit
immediately. The command 'ps -ef | grep qemu' showed that the qemu process
was in the defunct state.

Ah, so you can find the process, but it is in D state. I had read the
email linked above as saying that qemu was gone.

Then, about 20s-30s after sending the SIGKILL, the qemu process exited and I
could no longer find it with the ps command.
So libvirt still thinks the qemu process is alive during the 15s loop in
virProcessKillPainfully.

Ah, so IIUC, qemu has closed the monitor but right after that it went to
the D state instead of quitting. Meanwhile, libvirt sees EOF on the
monitor but is unable to kill the process.
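
To make the failure mode concrete, here is a minimal sketch of a
"SIGTERM, then SIGKILL, then give up" loop. It is not the libvirt code:
kill_painfully_sketch() and process_exists() are hypothetical names, and the
poll interval and escalation point are assumptions. Only the idea of a bounded
wait (15s in the case discussed here) comes from this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: kill(pid, 0) delivers no signal, it only checks
 * whether the PID can be signalled. EPERM still means "exists". */
static bool
process_exists(pid_t pid)
{
    return kill(pid, 0) == 0 || errno != ESRCH;
}

/* Send SIGTERM, escalate to SIGKILL after term_ms, give up after total_ms.
 * Returns 0 if the process is gone, -1 on timeout or error. */
static int
kill_painfully_sketch(pid_t pid, int term_ms, int total_ms)
{
    const struct timespec step = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int waited_ms = 0;
    bool sent_kill = false;

    assert(pid > 1);

    if (kill(pid, SIGTERM) < 0)
        return errno == ESRCH ? 0 : -1;

    while (waited_ms < total_ms) {
        nanosleep(&step, NULL);
        waited_ms += 100;

        if (!process_exists(pid))
            return 0;

        if (!sent_kill && waited_ms >= term_ms) {
            kill(pid, SIGKILL);
            sent_kill = true;
        }
    }

    /* The PID is still there: the caller concludes the kill failed, even
     * if the process disappears a few seconds later. */
    return -1;
}
```

A process that is slow to go away (uninterruptible sleep, or a zombie that
has not been reaped yet) keeps process_exists() true, so a loop like this
times out and returns -1 although the signal was delivered.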
Well, registering the EOF handler back would only be a workaround: if you
do that, the event loop will busy-wait (it will see EOF in each
iteration) until virProcessKillPainfully() eventually sees the process
gone and qemuProcessBeginStopJob() is able to return successfully.
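
The busy wait can be shown outside of libvirt with plain poll(). This is
only a sketch under my own assumptions: count_eof_wakeups() is a hypothetical
helper, and a socketpair stands in for the monitor socket. Once the peer has
closed its end, every poll() call returns immediately, no matter how long the
timeout is.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Poll a socket whose peer has already closed and count how many of the
 * iterations wake up with EOF/hangup pending. Returns -1 on setup error. */
static int
count_eof_wakeups(int iterations)
{
    int sv[2];
    int wakeups = 0;

    assert(iterations >= 0);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    close(sv[1]); /* the "qemu" side closes the monitor */

    for (int i = 0; i < iterations; i++) {
        struct pollfd pfd = { .fd = sv[0], .events = POLLIN };

        /* 1s timeout, but with EOF pending we never actually sleep. */
        if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & (POLLIN | POLLHUP)))
            wakeups++;
    }

    close(sv[0]);
    return wakeups;
}
```

Every iteration counts as a wakeup, which is why a re-registered EOF handler
would be called over and over until the fd is removed from the loop again.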
I'm unsure what the right fix might be though. Maybe, at EOF we can
check what state the qemu process is in and, if it's in D state, not try
to kill it and continue with the BeginJob() call.
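
Checking the process state could look roughly like the sketch below. It is
Linux-only, and proc_state() and is_uninterruptible() are hypothetical names,
not existing libvirt helpers. It reads the state field from /proc/<pid>/stat
and returns the raw character, because ps prints <defunct> for zombies (Z),
which is a different state from uninterruptible sleep (D).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h> /* getpid() for the usage example */

/* Return the state character from /proc/<pid>/stat ('R', 'S', 'D', 'Z',
 * ...), or '\0' if the file cannot be read or parsed. */
static char
proc_state(pid_t pid)
{
    char path[64];
    char buf[512];
    FILE *fp;
    char *p;
    size_t n;

    assert(pid > 0);

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long) pid);
    if (!(fp = fopen(path, "r")))
        return '\0';
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    /* The command name is wrapped in parentheses and may itself contain
     * spaces or ')', so look for the last ')' and take the next field. */
    p = strrchr(buf, ')');
    if (!p || p[1] != ' ' || p[2] == '\0')
        return '\0';
    return p[2];
}

/* True if the process is currently in uninterruptible sleep. */
static bool
is_uninterruptible(pid_t pid)
{
    return proc_state(pid) == 'D';
}
```

For example, proc_state(getpid()) reports 'R' for the calling process. The
check is inherently racy, since the state can change right after it is read,
so it could only ever be a hint for the EOF handling.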
Michal