On 2022-05-15 12:07, Laine Stump wrote:
On 5/15/22 11:48 AM, Digimer wrote:
Hi all,
I've got a series of programs that monitor various things on a CentOS Stream 8 VM host. All of these scripts work when called directly. However, when I have a parent program that calls all the little programs in series, I found that some virsh calls hang.
Is your script being called from a libvirt "hook" script? (https://libvirt.org/hooks.html) If so, that won't work - a libvirt hook script is called from within libvirt, and can't call back into libvirt.
Other than that, is there anything different about the context the script is being run from vs. the context you're directly running virsh from?
It's a perl script making a shell (system) call. So it's basically;
open (my $fh, "/usr/bin/virsh list --all |") or die;
while (<$fh>)
{
chomp;
# Do things
}
close $fh;
There are about 15 programs sitting in a given directory. When the parent program runs, it looks at the scripts in the directory and runs them (again as simple shell calls), one after the other. This is where things fail. I'm happy to provide more detail or add debugging if you'd like.
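To make the parent's behaviour concrete, here is a rough shell sketch of that loop. The real parent is not shown here, so this is only an approximation: it assumes each child is wrapped in timeout(1), and the function name run_agents and its arguments are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: run every executable in a directory, one at a time,
# each wrapped in timeout(1). The name run_agents is illustrative only.
run_agents() {
    local dir="$1" limit="${2:-30}" agent rc
    for agent in "$dir"/*; do
        # Skip anything that is not an executable regular file.
        [ -f "$agent" ] && [ -x "$agent" ] || continue
        echo "Running: [$agent] with a timeout of: [$limit] seconds"
        timeout "$limit" "$agent" 2>&1
        rc=$?
        # timeout(1) exits with 124 when the time limit was reached.
        echo "return_code:$rc"
    done
}
```

It would be called as 'run_agents /path/to/agents 30'. A return code of 124 means timeout(1) had to stop the child itself.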
I just did a test where I reversed the order that the scripts were called, so that the problematic one was called first (in case it was a connection limit being hit or something), and I had a new failure mode...
When the parent program ran, it hung hard. I call the child scripts with 'timeout 30 /path/to/child/script', but the timeout never fired and the program stayed hung. In journald, I saw:
May 15 12:18:43 nr-a03n01.nray.ca libvirtd[1643714]: internal error: connection closed due to keepalive timeout
May 15 12:25:31 nr-a03n01.nray.ca libvirtd[1643714]: Cannot recv data: Connection reset by peer
I had to kill the parent program with two 'ctrl + c' entries;
====
time scancore --run-once
scancore has started.
Running the scan agent: [scan-storcli] with a timeout of: [30] seconds now...
- Scan agent: [scan-storcli] exited after: [4] seconds with the return code: [0].
Running the scan agent: [scan-server] with a timeout of: [30] seconds now...
^C
Process with PID: [1705068] exiting on SIGINT.
^C
Process with PID: [1705068] exiting on SIGINT.
real 13m43.899s
user 0m5.253s
sys 0m2.097s
====
I checked 'ps aux' and found that, even after the ctrl + c, the processes were still running...
====
# scancore --run-once
scancore has started.
Running the scan agent: [scan-storcli] with a timeout of: [30] seconds now...
- Scan agent: [scan-storcli] exited after: [5] seconds with the return code: [0].
Running the scan agent: [scan-server] with a timeout of: [30] seconds now...
^C
Process with PID: [1708093] exiting on SIGINT.
^C
Process with PID: [1708093] exiting on SIGINT.
[root@nr-a03n01 ~]# ps aux | grep scan
root 1708900 0.0 0.0 12732 3132 pts/1 S 12:45 0:00 sh -c /usr/bin/timeout 30 /usr/sbin/scancore-agents/scan-server/scan-server 2>&1; /usr/bin/echo return_code:$?
root 1708901 0.0 0.0 11592 976 pts/1 S 12:45 0:00 /usr/bin/timeout 30 /usr/sbin/scancore-agents/scan-server/scan-server
root 1708902 5.8 0.0 249400 91960 pts/1 T 12:45 0:01 /usr/bin/perl /usr/sbin/scancore-agents/scan-server/scan-server
====
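One detail in that listing: the STAT column shows 'T' for the perl process, which ps(1) documents as stopped (by a job control signal, or because it is being traced). The output alone doesn't say why it would be stopped. As a small sketch for re-checking the state of a given PID, assuming the procps 'ps' that ships with CentOS (the helper name proc_state is made up):

```shell
#!/bin/bash
# Hypothetical helper: print only the STAT field for one PID.
# 'T' means stopped, 'S' means sleeping, 'R' means running.
proc_state() {
    ps -o stat= -p "$1"
}
```

For the listing above, that would be 'proc_state 1708902'.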
While this is hanging, _other_ programs call 'virsh list --all' just fine. And as mentioned, if I call the problem script directly, it runs just fine (confirmed by watching the logs, 'virsh list --all' returns and logic runs fine)...
====
[root@nr-a03n01 ~]# ps aux | grep scan
root 1709321 0.0 0.0 12144 1108 pts/1 S+ 12:47 0:00 grep --color=auto scan
[root@nr-a03n01 ~]# /usr/sbin/scancore-agents/scan-server/scan-server
[root@nr-a03n01 ~]# ps aux | grep scan
root 1709716 0.0 0.0 12144 1112 pts/1 S+ 12:48 0:00 grep --color=auto scan
[root@nr-a03n01 ~]#
====
I am so confused...
digimer