On 2022-05-15 12:13, Digimer wrote:
On 2022-05-15 12:07, Laine Stump wrote:
On 5/15/22 11:48 AM, Digimer wrote:
Hi all,

   I've got a series of programs that monitor various things on a CentOS Stream 8 VM host. All of these scripts work when called directly. However, when I have a parent program that calls all the little programs in series, I found that some virsh calls hang.

Is your script being called from a libvirt "hook" script? (https://libvirt.org/hooks.html) If so, that won't work: a libvirt hook script is called from within libvirt, and can't call back into libvirt.

Other than that, is there anything different about the context the script is being run from vs. the context you're directly running virsh from?

It's a Perl script making a shell (system) call. So it's basically:

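# Run virsh and read its output through a pipe, one line at a time.
open(my $fh, '-|', '/usr/bin/virsh', 'list', '--all') or die "Failed to run virsh: $!";
while (my $line = <$fh>)
{
    chomp $line;
    # Do things
}
close $fh;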

  There are about 15 programs sitting in a given directory. When the parent program runs, it looks at the scripts in the directory and runs them (again as simple shell calls), one after the other. This is where things fail. I'm happy to provide more detail or add debugging if you'd like.

I just did a test where I reversed the order that the scripts were called, so that the problematic one was called first (in case it was a connection limit being hit or something), and I had a new failure mode...

When the parent program ran, it hung hard. I call the child scripts with 'timeout 30 /path/to/child/script', but the timeout never fired and the program stayed hung. In journald, I saw:

May 15 12:18:43 nr-a03n01.nray.ca libvirtd[1643714]: internal error: connection closed due to keepalive timeout
May 15 12:25:31 nr-a03n01.nray.ca libvirtd[1643714]: Cannot recv data: Connection reset by peer
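
For reference, here is a minimal sketch of how the 'timeout 30' call could be wrapped so that a child which ignores or never receives SIGTERM is still forced to exit. The function name run_agent and its argument order are hypothetical and not part of scancore; the only real pieces are GNU timeout and its --kill-after option. It has not been tested against the hang described here.

```shell
#!/bin/sh
# Hypothetical wrapper; run_agent and its argument order are illustrative only.
# usage: run_agent LIMIT_SECONDS GRACE_SECONDS COMMAND [ARGS...]
run_agent() {
    limit=$1
    grace=$2
    shift 2
    # Send SIGTERM after $limit seconds. If the command is still alive
    # $grace seconds later, send SIGKILL, which cannot be caught or ignored.
    timeout --kill-after="$grace" "$limit" "$@" 2>&1
    rc=$?
    # GNU timeout exits with 124 when the limit was hit, or 128+9 if it
    # had to fall back to SIGKILL.
    echo "return_code:$rc"
    return "$rc"
}
```

A call such as 'run_agent 30 5 /path/to/child/script' would then print the same 'return_code:' line the parent already parses, with 124 indicating that the limit was reached.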

I had to kill the parent program by pressing ctrl + c twice:

====
time scancore --run-once
scancore has started.
Running the scan agent: [scan-storcli] with a timeout of: [30] seconds now...
- Scan agent: [scan-storcli] exited after: [4] seconds with the return code: [0].
Running the scan agent: [scan-server] with a timeout of: [30] seconds now...
^C

Process with PID: [1705068] exiting on SIGINT.
^C

Process with PID: [1705068] exiting on SIGINT.

real    13m43.899s
user    0m5.253s
sys    0m2.097s
====

I checked 'ps aux' and found that, even after the ctrl + c, the processes were still running...

====
# scancore --run-once
scancore has started.
Running the scan agent: [scan-storcli] with a timeout of: [30] seconds now...
- Scan agent: [scan-storcli] exited after: [5] seconds with the return code: [0].
Running the scan agent: [scan-server] with a timeout of: [30] seconds now...
^C

Process with PID: [1708093] exiting on SIGINT.
^C

Process with PID: [1708093] exiting on SIGINT.
[root@nr-a03n01 ~]# ps aux | grep scan
root     1708900  0.0  0.0  12732  3132 pts/1    S    12:45   0:00 sh -c /usr/bin/timeout 30 /usr/sbin/scancore-agents/scan-server/scan-server 2>&1; /usr/bin/echo return_code:$?
root     1708901  0.0  0.0  11592   976 pts/1    S    12:45   0:00 /usr/bin/timeout 30 /usr/sbin/scancore-agents/scan-server/scan-server
root     1708902  5.8  0.0 249400 91960 pts/1    T    12:45   0:01 /usr/bin/perl /usr/sbin/scancore-agents/scan-server/scan-server
====
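
Note that in the 'ps aux' output above, the perl process is in state 'T', which ps documents as stopped by a job-control signal, while the 'sh' and 'timeout' processes are in 'S' (sleeping). A minimal sketch for checking that state from a script follows. The helper names proc_state and is_stopped are hypothetical. Whether the stop is related to terminal access is an untested assumption, although GNU timeout does document a --foreground option for commands that need the TTY when timeout is not run directly from a shell prompt.

```shell
#!/bin/sh
# Hypothetical helper; proc_state is an illustrative name.
# Prints the ps STAT field for one PID, e.g. "S", "S+", "T", "D".
proc_state() {
    ps -o stat= -p "$1" | tr -d ' '
}

# Hypothetical helper: succeeds if the PID is currently stopped,
# i.e. its STAT field starts with an uppercase T.
is_stopped() {
    case "$(proc_state "$1")" in
        T*) return 0 ;;
        *)  return 1 ;;
    esac
}
```

For example, 'is_stopped 1708902 && echo stopped' would report whether the scan-server process from the listing above is still in the stopped state.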

While this is hanging, _other_ programs call 'virsh list --all' just fine. And as mentioned, if I call the problem script directly, it runs just fine (confirmed by watching the logs, 'virsh list --all' returns and logic runs fine)...

====
[root@nr-a03n01 ~]# ps aux | grep scan
root     1709321  0.0  0.0  12144  1108 pts/1    S+   12:47   0:00 grep --color=auto scan

[root@nr-a03n01 ~]# /usr/sbin/scancore-agents/scan-server/scan-server

[root@nr-a03n01 ~]# ps aux | grep scan
root     1709716  0.0  0.0  12144  1112 pts/1    S+   12:48   0:00 grep --color=auto scan
[root@nr-a03n01 ~]#
====

I am so confused...

digimer