On 12/15/23 16:07, Daniel P. Berrangé wrote:
On Fri, Dec 15, 2023 at 01:48:09PM +0000, Efim Shevrin wrote:
> Hello,
>
> here are call traces with two threads generated by python script
> // ==============================
> [root@dandreev-vz-6-0-0-243-master libvirt]# gdb -p 288985
> (gdb) t a a bt
>
> Thread 2 (Thread 0x7f2112862640 (LWP 288986) "python3"):
> #0 0x00007f2121d4296f in poll () at /lib64/libc.so.6
> #1 0x00007f211444650c in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
> #2 0x00007f21143f1483 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> #3 0x00007f211406800b in virNetClientIOEventLoop () at /lib64/libvirt.so.0
> #4 0x00007f2114068a0a in virNetClientIO () at /lib64/libvirt.so.0
> #5 0x00007f21140692c1 in virNetClientSendInternal () at /lib64/libvirt.so.0
> #6 0x00007f211406936d in virNetClientSendWithReply () at /lib64/libvirt.so.0
> #7 0x00007f2114061a1d in virNetClientProgramCall () at /lib64/libvirt.so.0
> #8 0x00007f21140f79ac in callFull () at /lib64/libvirt.so.0
> #9 0x00007f21140f7a2f in call () at /lib64/libvirt.so.0
> #10 0x00007f21140d8435 in remoteDomainCreate () at /lib64/libvirt.so.0
> #11 0x00007f21141dd60c in virDomainCreate () at /lib64/libvirt.so.0
> #12 0x00007f21145c8114 in libvirt_virDomainCreate () at /usr/lib64/python3.9/site-packages/libvirtmod.cpython-39-x86_64-linux-gnu.so
Ok, this thread is the primary one responsible for I/O. It is
waiting, while not holding any mutex.
Correct.
> Thread 1 (Thread 0x7f21223cf740 (LWP 288985) "python3"):
> #0 0x00007f2121c9c39a in __futex_abstimed_wait_common () at /lib64/libc.so.6
> #1 0x00007f2121c9eba0 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libc.so.6
> #2 0x00007f2113f4982a in virCondWait () at /lib64/libvirt.so.0
> #3 0x00007f2113f1fee3 in virObjectWait () at /lib64/libvirt.so.0
> #4 0x00007f211406882a in virNetClientIO () at /lib64/libvirt.so.0
> #5 0x00007f21140692c1 in virNetClientSendInternal () at /lib64/libvirt.so.0
> #6 0x00007f211406936d in virNetClientSendWithReply () at /lib64/libvirt.so.0
> #7 0x00007f2114061a1d in virNetClientProgramCall () at /lib64/libvirt.so.0
> #8 0x00007f21140f79ac in callFull () at /lib64/libvirt.so.0
> #9 0x00007f21140f7a2f in call () at /lib64/libvirt.so.0
> #10 0x00007f21140f24eb in remoteNodeDeviceNumOfCaps () at /lib64/libvirt.so.0
> #11 0x00007f2114207a00 in virNodeDeviceNumOfCaps () at /lib64/libvirt.so.0
This is a second API call that has arrived and has queued its outgoing
message, and is waiting either for the other thread to pass ownership
of the socket to it, or for the other thread to provide its reply (whichever
comes first).
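(As an aside for anyone following the traces: the hand-off being
described can be sketched with plain pthreads roughly as below. This
is my own simplified illustration, not the actual libvirt source;
names such as client_io() and have_io_owner are invented, and in the
real client the owning thread also completes the queued calls of the
waiters.)

/* Minimal hand-off sketch: one thread owns the socket and does the I/O,
 * another queues its call and waits for ownership or for its reply. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool have_io_owner = false;     /* some thread is running the I/O loop */

struct call { bool done; const char *name; };

static void client_io(struct call *mycall)
{
    pthread_mutex_lock(&lock);

    /* Our message is "queued"; wait until we are either handed the socket
     * or somebody else delivers our reply (whichever comes first). */
    while (have_io_owner && !mycall->done)
        pthread_cond_wait(&cond, &lock);   /* thread 1 in the traces sits here */

    if (mycall->done) {                    /* reply delivered by the owner */
        pthread_mutex_unlock(&lock);
        return;
    }

    have_io_owner = true;                  /* ownership passed to us */
    pthread_mutex_unlock(&lock);

    printf("%s: owning the socket, doing the I/O\n", mycall->name);
    sleep(1);                              /* stands in for poll()ing the socket */

    pthread_mutex_lock(&lock);
    mycall->done = true;
    have_io_owner = false;
    pthread_cond_broadcast(&cond);         /* wake the waiters: hand-off */
    pthread_mutex_unlock(&lock);
}

static void *worker(void *name)
{
    struct call c = { false, name };
    client_io(&c);
    printf("%s: call complete\n", c.name);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)"slow-call");
    pthread_create(&t2, NULL, worker, (void *)"fast-call");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}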
This is all totally normal, and working as expected, so I still
don't see what the actual problem is ?
No. This is not normal.
The test is written in the following way:
1. One thread (2) has started a very long operation, e.g. a VM migration.
2. Another thread (1) endlessly executes simple, fast requests.
   Normally these requests are executed quickly and we see progress
   in the log. There is, however, a chance that thread (1) gets
   stuck until the request being processed by thread (2) completes.
3. It should be noted that this is not a permanent "deadlock": once
   thread (2) completes its request, processing in thread (1)
   resumes normally.
Such behavior is a real pain. We detected it when OpenStack Nova
(which works in exactly this way) was reported as unavailable.
We have traced this problem down and come to the conclusion
that the root cause of this behavior is a missed wakeup. In the
debugger, if I call g_main_loop_quit(), thread (1) is unfrozen
and requests start being processed normally.
We have observed this problem on libvirt 8.5 and compared the
behavior against the ancient libvirt 5.6. The ancient one does
*NOT* have this problem, so this looks like a regression.
The difference between them is the wakeup method. In
libvirt 5.6 thread (2) is sleeping in a poll() and is
woken up by a write to a pipe. Thus a sequence like
    wakeup()
    waiting()
(called from the same thread, for simplicity) behaves
differently in libvirt 5.6 and libvirt 8.5, and this is
what causes the trouble.
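To make that difference concrete, below is a small standalone C
program (my own illustration, not libvirt code; the build line is
just an assumption) which runs the wakeup()/waiting() sequence
against both primitives. A byte written to a pipe before poll()
stays readable, so poll() returns at once; a g_main_loop_quit()
issued before g_main_loop_run() is overwritten when the loop starts,
so the loop only exits when the safety-net timeout fires:

/*
 * Illustration of the "wakeup() before waiting()" difference.
 * Not libvirt code. Build with something like:
 *   gcc wakeup-demo.c $(pkg-config --cflags --libs glib-2.0) -o wakeup-demo
 */
#include <glib.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static gboolean on_timeout(gpointer data)
{
    /* Safety net: without it the demo would hang, because the early
     * g_main_loop_quit() in main() has been lost. */
    g_print("glib case: still running 2s later - the early quit was lost\n");
    g_main_loop_quit(data);
    return G_SOURCE_REMOVE;
}

int main(void)
{
    /* Case 1: pipe + poll(), as in libvirt 5.6. The wakeup byte is
     * written before we start waiting, but it stays in the pipe, so
     * poll() returns immediately. */
    int fds[2];
    char c = 'x';
    struct pollfd pfd;

    if (pipe(fds) < 0 || write(fds[1], &c, 1) != 1)   /* wakeup() first ...   */
        return 1;
    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    printf("pipe case: poll() returned %d (1 == woke up at once)\n",
           poll(&pfd, 1, 2000));                      /* ... waiting() second */

    /* Case 2: GMainLoop, as in the newer client code. The quit is issued
     * before the loop runs; g_main_loop_run() marks the loop as running
     * again, so the early wakeup is lost. */
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);
    g_main_loop_quit(loop);                           /* wakeup() first ...   */
    g_timeout_add_seconds(2, on_timeout, loop);
    g_main_loop_run(loop);                            /* ... waiting() second */
    g_main_loop_unref(loop);
    close(fds[0]);
    close(fds[1]);
    return 0;
}

If I read GLib's g_main_loop_run() correctly, it unconditionally marks
the loop as running again, so the quit issued beforehand has no effect
and only the timeout ends the loop - which is exactly the missed
wakeup described above.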
> // =========================================================================
>
> just in case here is python script
>
> [root@dandreev-vz-6-0-0-243-master ~]# cat a.py
> import libvirt
> import time
> from threading import Thread
>
> def startVM(connection, vm_name):
>     try:
>         # Find the virtual machine by name
>         print('starting VM')
>         connection.lookupByName(vm_name).create()
>         print('done starting VM')
>     except libvirt.libvirtError as e:
>         print(f'Libvirt Error: {e}')
>
> # Replace 'qemu+tcp://10.34.66.13/system' with your actual connection URI
> connection_uri = 'qemu+tcp://localhost/system'
> connection = libvirt.open(connection_uri)
> if connection is None:
>     print(f'Failed to open connection to {connection_uri}')
>     exit(1)
>
> try:
>     # Replace 'your_vm_name' with the actual name of your virtual machine
>     # startVM(connection, 'instance-00000002')
>     thread = Thread(target=lambda: startVM(connection, 'instance-00000002'))
>     thread.start()
>     time.sleep(3)
>     devs = connection.listAllDevices()
>     for i in range(100000):
>         for dev in devs:
>             print('listing caps for dev %i %s' % (i, dev))
>             try:
>                 dev.listCaps()
>             except Exception as e:
>                 print('ERROR: %s' % e)
>     print('DONE listing caps')
>
>     connection.lookupByName('instance-00000002').destroy()
> except:
>     connection.lookupByName('instance-00000002').destroy()
> finally:
>     thread.join()
>     # Close the connection outside the function
>     connection.close()
What behaviour are you expecting this demo program to show ?
The expected behavior is the following:
1. Thread (2) is processing a very, very long operation; the
   VM has been specifically crafted by us to block during
   start. This was used to emulate a VM migration procedure.
2. Thread (1) should process its small requests timely
   and gracefully.
We observe that after a while thread (1) gets stuck.
The test stand was kindly provided to us by Dmitry Andreev.
Sorry for so many words here; I believe the situation has
finally become clearer. The issue is really serious, but it
can stay hidden until somebody actually checks service
availability. It took a while before I myself was convinced
that this is a real problem and that it really should be fixed.
Thank you in advance,
Den