On 10/7/21 20:45, Jim Fehlig wrote:
On 10/4/21 08:55, Michal Prívozník wrote:
> On 9/30/21 7:15 PM, Jim Fehlig wrote:
>> On 9/29/21 21:29, Jim Fehlig wrote:
>>> Hi All,
>>>
>>> Likely Christian received a bug report that motivated commit
>>> aeda1b8c56, which was later reverted by Michal with commit 72adaf2f10.
>>> In the past, I recall being asked about "internal error: End of file
>>> from qemu monitor" on normal VM shutdown and gave a hand wavy response
>>> using some of Michal's words from the revert commit message.
>>>
>>> I recently received a bug report (sorry, but no public link) from a
>>> concerned user about this error and wondered if there is some way to
>>> improve it? I went down some dead ends before circling back to
>>> Christian's patch. When rebased to latest master, I cannot reproduce
>>> the hangs reported by Michal [1]. Perhaps Nikolay's series to resolve
>>> hangs/crashes of libvirtd [2] has now made Christian's patch viable?
>>
>> Hmm, Nikolay's series improves thread management at daemon shutdown and
>> doesn't touch VM shutdown logic. But there has been some behavior change
>> from the time aeda1b8c56 was committed (3.4.0 dev cycle) to current git
>> master. At least I don't see libvirtd hanging when running Michal's test
>> on master + rebased aeda1b8c56.
After reworking the tests a bit, I still don't see libvirtd hanging, but I do
see VM shutdown stuck in the "in shutdown" state. Attaching gdb shows the
following thread blocked waiting for a response from the monitor that will
never come, since EOF has already occurred on the socket:

Thread 21 (Thread 0x7fdb557fa700 (LWP 9110) "rpc-worker"):
#0  0x00007fdbc922a4dc in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007fdbccee2310 in virCondWait (c=0x7fdba403cd40, m=0x7fdba403cd18) at ../src/util/virthread.c:156
#2  0x00007fdb87150a8b in qemuMonitorSend (mon=0x7fdba403cd00, msg=0x7fdb557f95b0) at ../src/qemu/qemu_monitor.c:964
#3  0x00007fdb8715fbf1 in qemuMonitorJSONCommandWithFd (mon=0x7fdba403cd00, cmd=0x7fdb4015bae0, scm_fd=-1, reply=0x7fdb557f9678) at ../src/qemu/qemu_monitor_json.c:327
#4  0x00007fdb8715fda0 in qemuMonitorJSONCommand (mon=0x7fdba403cd00, cmd=0x7fdb4015bae0, reply=0x7fdb557f9678) at ../src/qemu/qemu_monitor_json.c:352
#5  0x00007fdb87174b71 in qemuMonitorJSONGetIOThreads (mon=0x7fdba403cd00, iothreads=0x7fdb557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_monitor_json.c:7838
#6  0x00007fdb8715d059 in qemuMonitorGetIOThreads (mon=0x7fdba403cd00, iothreads=0x7fdb557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_monitor.c:4083
#7  0x00007fdb870e8ae3 in qemuDomainGetIOThreadsMon (driver=0x7fdb6c06a4f0, vm=0x7fdb6c063940, iothreads=0x7fdb557f9798, niothreads=0x7fdb557f9790) at ../src/qemu/qemu_driver.c:4941
#8  0x00007fdb871129bf in qemuDomainGetStatsIOThread (driver=0x7fdb6c06a4f0, dom=0x7fdb6c063940, params=0x7fdb401c0cc0, privflags=1) at ../src/qemu/qemu_driver.c:18292
#9  0x00007fdb871130ee in qemuDomainGetStats (conn=0x7fdb9c006760, dom=0x7fdb6c063940, stats=1023, record=0x7fdb557f98d0, flags=1) at ../src/qemu/qemu_driver.c:18504
#10 0x00007fdb87113526 in qemuConnectGetAllDomainStats (conn=0x7fdb9c006760, doms=0x0, ndoms=0, stats=1023, retStats=0x7fdb557f9990, flags=0) at ../src/qemu/qemu_driver.c:18598
#11 0x00007fdbcd163e4e in virConnectGetAllDomainStats (conn=0x7fdb9c006760, stats=0, retStats=0x7fdb557f9990, flags=0) at ../src/libvirt-domain.c:11975
...
So indeed, reporting the error when processing monitor IO is needed to prevent
other threads from subsequently writing to the socket. One idea I had was to
postpone reporting the error until someone tries to write to the socket,
although not reporting an error when it is encountered seems distasteful. I've
been testing such a hack (attached). Along with squelching the error message,
it leaves no VMs stuck in the "in shutdown" state after 32 iterations of the
test. A simple rebase of aeda1b8c56 on current git master never survived more
than a dozen iterations. I'll let the test continue to run.
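
To make the idea concrete, here is a minimal sketch of deferred error reporting,
written in Python rather than libvirt's C. It is not the attached patch: the
ToyMonitor class, the MonitorEOF exception and every method name are
hypothetical. The I/O handler only records that EOF was seen and wakes any
waiter; the error is raised later, when a caller tries to send a command, so
the caller fails fast instead of blocking on a reply that will never arrive.

```python
import threading


class MonitorEOF(Exception):
    """Raised when a command is sent after the peer closed the socket."""


class ToyMonitor:
    """Hypothetical stand-in for a monitor object; not libvirt code."""

    def __init__(self):
        self._cond = threading.Condition()
        self._eof_seen = False  # set by the I/O handler, reported lazily
        self._reply = None

    def handle_eof(self):
        # I/O thread saw EOF: remember it and wake waiters, but stay quiet.
        with self._cond:
            self._eof_seen = True
            self._cond.notify_all()

    def handle_reply(self, reply):
        # I/O thread received a reply for the pending command.
        with self._cond:
            self._reply = reply
            self._cond.notify_all()

    def send(self, cmd, timeout=None):
        with self._cond:
            # Deferred report: the error surfaces only on an attempted write.
            if self._eof_seen:
                raise MonitorEOF(f"cannot send {cmd!r}: monitor is closed")
            self._reply = None
            # A real monitor would write cmd to the socket here.
            while self._reply is None and not self._eof_seen:
                if not self._cond.wait(timeout):
                    raise TimeoutError(f"no reply to {cmd!r}")
            if self._reply is None:
                # EOF arrived while waiting; do not block forever.
                raise MonitorEOF(f"monitor closed while waiting on {cmd!r}")
            return self._reply
```

The waiter's loop condition checks the EOF flag as well as the reply, which is
what keeps a thread like the one in the backtrace above from sitting in the
condition wait indefinitely.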
The test has now survived 1134 iterations. I adjusted the patch to be more
upstream friendly and submitted it as an RFC:
https://listman.redhat.com/archives/libvir-list/2021-October/msg00484.html
I restarted the tests with this version of the patch, and they are now at
iteration 75.
Regards,
Jim