On Mon, Dec 13, 2021 at 09:42:04AM -0700, Jim Fehlig wrote:
On 12/12/21 14:15, Martin Kletzander wrote:
> On Sun, Dec 12, 2021 at 10:40:46AM -0700, Jim Fehlig wrote:
>> On 12/11/21 03:28, Martin Kletzander wrote:
>>> On Sat, Dec 11, 2021 at 11:16:13AM +0100, Martin Kletzander wrote:
>>>> On Fri, Dec 10, 2021 at 05:48:03PM -0700, Jim Fehlig wrote:
>>>>> Hi Martin!
>>>>>
>>>>> I recently received a bug report (sorry, not public) about simple
>>>>> operations like 'virsh list' hanging when invoked with an internal test
>>>>> tool. I found this commit to be the culprit.
>>>>>
>>>
>>> OK, one more thing though, the fact that pkttyagent is spawned cannot
>>> cause virsh to hang. If authentication is not required, then it
>>> will just wait there for a while and then be killed. If authentication
>>> *is* required, then either you already have an agent running and that
>>> one should be used since we're starting pkttyagent with `--fallback` or
>>> you do not have any agent running in which case virsh list would fail
>>> to connect. Where does virsh hang, and what's the backtrace?
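
For reference, this is roughly what the agent spawn amounts to (a plain POSIX
sketch, not the actual virCommand-based code in virpolkit.c; the helper name is
made up, and the option names match the debug output quoted further down):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Made-up helper name; shows the command virPolkitAgentCreate() ends up
 * running.  --fallback means an already-registered agent keeps priority,
 * so this one is only consulted when nothing else is available. */
static pid_t
spawn_pkttyagent(int notify_fd)
{
    char procarg[32];
    char fdarg[32];
    pid_t child;

    snprintf(procarg, sizeof(procarg), "%d", (int)getpid());
    snprintf(fdarg, sizeof(fdarg), "%d", notify_fd);

    child = fork();
    if (child == 0) {
        execl("/usr/bin/pkttyagent", "pkttyagent",
              "--process", procarg,
              "--notify-fd", fdarg,
              "--fallback", (char *)NULL);
        _exit(127);  /* exec failed */
    }
    return child;    /* -1 on fork failure, otherwise the agent's PID */
}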
>>
>> The last scenario you describe appears to be the case. virsh fails to connect,
>> then gets stuck trying to kill off pkttyagent:
>>
>> #0 0x00007f9f07530241 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
>> #1 0x00007f9f07535ad3 in nanosleep () from /lib64/libc.so.6
>> #2 0x00007f9f07f478af in g_usleep () from /usr/lib64/libglib-2.0.so.0
>> #3 0x00007f9f086694fa in virProcessAbort (pid=367) at
>> ../src/util/virprocess.c:187
>> #4 0x00007f9f0861ed9b in virCommandAbort (cmd=cmd@entry=0x55a798660c50) at
>> ../src/util/vircommand.c:2774
>> #5 0x00007f9f08621478 in virCommandFree (cmd=0x55a798660c50) at
>> ../src/util/vircommand.c:3061
>> #6 0x00007f9f08668581 in virPolkitAgentDestroy (agent=0x55a7986426e0) at
>> ../src/util/virpolkit.c:164
>> #7 0x000055a797836d93 in virshConnect (ctl=ctl@entry=0x7ffc551dd980, uri=0x0,
>> readonly=readonly@entry=false) at ../tools/virsh.c:187
>> #8 0x000055a797837007 in virshReconnect (ctl=ctl@entry=0x7ffc551dd980,
>> name=name@entry=0x0, readonly=<optimized out>, readonly@entry=false,
>> force=force@entry=false) at ../tools/virsh.c:223
>> #9 0x000055a7978371e0 in virshConnectionHandler (ctl=0x7ffc551dd980) at
>> ../tools/virsh.c:325
>> #10 0x000055a797880172 in vshCommandRun (ctl=ctl@entry=0x7ffc551dd980,
>> cmd=0x55a79865f580) at ../tools/vsh.c:1308
>> #11 0x000055a7978367b7 in main (argc=2, argv=<optimized out>) at
>> ../tools/virsh.c:907
>>
>> Odd thing is, I attached gdb to this virsh process several minutes after
>> invoking the test tool that calls 'virsh list'. I can't explain why the
>> process is still blocked in g_usleep, which should only have slept for 10
>> milliseconds.
>> Even odder, detaching from the process appears to awaken g_usleep and allows
>> process shutdown to continue. The oddness can also be seen in the debug output
>>
>> 2021-12-12 16:35:38.783+0000: 5912: debug : virCommandRunAsync:2629 : About to
>> run /usr/bin/pkttyagent --process 5912 --notify-fd 4 --fallback
>>
>> 2021-12-12 16:35:38.787+0000: 5912: debug : virCommandRunAsync:2632 : Command
>> result 0, with PID 5914
>>
>> ...
>> 2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:177 : aborting child
>> process 5914
>>
>> 2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:185 : trying SIGTERM
>> to child process 5914
>>
>>
>> Attach gdb to the process, observe above backtrace, quit gdb.
>>
>> 2021-12-12 16:44:18.059+0000: 5912: debug : virProcessAbort:195 : trying SIGKILL
>> to child process 5914
>>
>> 2021-12-12 16:44:18.061+0000: 5912: debug : virProcessAbort:201 : process has
>> ended: fatal signal 9
>>
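
The "trying SIGTERM" / "trying SIGKILL" lines above correspond to the
escalation in virProcessAbort(); stripped down it amounts to something like
the following (a simplified sketch, not the exact code in
src/util/virprocess.c, which does its waiting via waitpid() and g_usleep()):

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simplified escalation: SIGTERM, a short grace period, then SIGKILL.
 * The 10 ms figure mirrors the sleep the backtrace above is stuck in. */
static void
abort_child(pid_t pid)
{
    int status;

    kill(pid, SIGTERM);                       /* "trying SIGTERM" */
    usleep(10 * 1000);                        /* the g_usleep() in frame #2 */
    if (waitpid(pid, &status, WNOHANG) == pid)
        return;                               /* child exited in time */

    kill(pid, SIGKILL);                       /* "trying SIGKILL" */
    waitpid(pid, &status, 0);                 /* reap: "fatal signal 9" */
}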
>
> The stuck g_usleep() is weird. Isn't there a tremendous load on the
> machine? I can't think of much else.
The machine is mostly idle.
> Also, I am looking at the pkttyagent code and it looks like it blocks the
> first SIGTERM, so sending two of them should help in that case, but if
> we want to wait for a few ms between them, then we'll be in the same
> pickle.
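
In other words, the workaround would be something along these lines
(illustrative only; the gap between the two signals is exactly the wait we
would rather avoid):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of the "two SIGTERMs" idea; the usleep() in the middle is the
 * problematic delay mentioned above. */
static void
terminate_twice(pid_t pid)
{
    kill(pid, SIGTERM);     /* reportedly blocked/deferred by pkttyagent */
    usleep(5 * 1000);       /* some small gap between the two signals */
    kill(pid, SIGTERM);     /* the second one should actually take effect */
}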
>
>>
>>> Anyway, if just adding:
>>>
>>> if (!isatty(STDIN_FILENO))
>>>     return false;
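
Presumably that check lands in the helper that decides whether virsh should
spawn the agent at all; a minimal sketch (function name assumed, not the
actual libvirt code):

#include <stdbool.h>
#include <unistd.h>     /* isatty, STDIN_FILENO */

/* Assumed helper name.  Without a terminal on stdin (test harness, cron,
 * CI), a text-mode polkit agent cannot prompt anyway, so skip it. */
static bool
virshCanUsePkttyagent(void)
{
    if (!isatty(STDIN_FILENO))
        return false;

    return true;
}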
>>
>> This indeed fixes the regression in the test tool.
>>
>
> That just means that it won't start the agent. Let's do this, but I
> would really, *really* like to figure out what the heck is happening
> there, because there has to be something wrong and it might just be
> waiting around the corner for us and bite us in the back in a year or
> so. Although I understand how improbable that is.
Do you have additional suggestions that may help us gain a better understanding
of the problem?
Unfortunately, I cannot think of anything other than trying to figure out
whether the kernel is scheduling that task at all and, if not, why. How to
do that, however, is unfortunately out of my territory :(
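
One low-tech way to check that (illustrative; these are standard procfs
files, and schedstat needs scheduler statistics enabled in the kernel) is to
sample the task's state and scheduler counters while virsh sits in the sleep:

#include <stdio.h>

/* Dump /proc/<pid>/status (the "State:" line shows R/S/D) and
 * /proc/<pid>/schedstat (cumulative on-CPU time, runqueue wait time,
 * timeslice count).  If the counters never move while virsh is stuck
 * in g_usleep(), the kernel is not scheduling the task. */
static void
dump_file(const char *path)
{
    FILE *fp = fopen(path, "r");
    int c;

    if (!fp) {
        perror(path);
        return;
    }
    printf("==> %s <==\n", path);
    while ((c = fgetc(fp)) != EOF)
        putchar(c);
    fclose(fp);
}

int
main(int argc, char **argv)
{
    char path[64];
    const char *pid = argc > 1 ? argv[1] : "self";

    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    dump_file(path);
    snprintf(path, sizeof(path), "/proc/%s/schedstat", pid);
    dump_file(path);
    return 0;
}

Running it against the stuck PID a few times in a row would show whether the
on-CPU time in schedstat changes at all.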
Regards,
Jim