[libvirt] strange stale qemu processes after domain shutdown

I have 58 active domains in the running state, but 62 qemu-system-x86_64 processes. After investigating, I found the problem domains. How can I fix this without losing these qemu processes?

ps auxww:

root 29561 0.2 0.2 1599628 743796 ? Sl Aug13 224:44 qemu-system-x86_64 -enable-kvm -name 29953 -S -machine pc-i440fx-1.7,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 7ca8e593-29f7-6389-9b35-000071cc3e1e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/29953.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,num_queues=1,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/dev/vg3/29953,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,discard=unmap,aio=native,iops=5000 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0 -netdev tap,fd=353,id=hostnet0,vhost=on,vhostfd=354 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:00:40:25,bus=pci.0,addr=0x3,rombar=0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/29953.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-mouse,id=input0 -device usb-kbd,id=input1 -vnc [::]:23,password -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,max-bytes=1024,period=2000,bus=pci.0,addr=0x7 -msg timestamp=on

The libvirt log contains:

2015-10-13 06:52:05.504+0000: starting up libvirt version: 1.2.16, qemu version: 2.3.0 (Debian 2.3.0-2+0~20150518103251.26+wheezy~1.gbp820cc6)
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin HOME=/root USER=root LOGNAME=root QEMU_AUDIO_DRV=none /usr/bin/kvm -name 29953 -S -machine pc-i440fx-1.7,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 7ca8e593-29f7-6389-9b35-000071cc3e1e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/29953.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,num_queues=1,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/dev/vg3/29953,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,discard=unmap,aio=native,iops=5000 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0 -netdev tap,fd=92,id=hostnet0,vhost=on,vhostfd=109 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:00:40:25,bus=pci.0,addr=0x3,rombar=0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/29953.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-mouse,id=input0 -device usb-kbd,id=input1 -vnc [::]:46,password -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,max-bytes=1024,period=2000,bus=pci.0,addr=0x7 -msg timestamp=on
Domain id=3262 is tainted: high-privileges
char device redirected to /dev/pts/45 (label charserial0)
qemu: terminating on signal 15 from pid 14945
2015-10-14 19:32:06.672+0000: shutting down

--
Vasiliy Tolstov, e-mail: v.tolstov@selfip.ru
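
The log above records which PID sent the terminating signal ("terminating on signal 15 from pid 14945"). As a rough sketch only, not something posted in this thread, that information could be pulled out of a per-domain qemu log like this. The function name `termination_events` is made up for the example, and the pattern assumes the message format is exactly the one shown in the log excerpt.

```python
import re

# Pattern for the line qemu writes when it is killed by a signal,
# in the form seen in the log excerpt quoted in this thread.
TERM_RE = re.compile(r"qemu: terminating on signal (\d+) from pid (\d+)")


def termination_events(log_text):
    """Return a list of (signal, sender_pid) tuples found in a qemu log."""
    return [(int(sig), int(pid)) for sig, pid in TERM_RE.findall(log_text)]
```

The sender PID can then be compared by hand with the PID of the libvirt daemon or of any other process that might have sent the signal.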

2015-10-15 21:41 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> I have 58 active domains in the running state, but 62 qemu-system-x86_64 processes. After investigating, I found the problem domains. How can I fix this without losing these qemu processes?

Does anybody know? I think this is a libvirt bug, because libvirt says the domain died, yet it is still running (in sleep state, but...).

--
Vasiliy Tolstov, e-mail: v.tolstov@selfip.ru

On 15.10.2015 20:41, Vasiliy Tolstov wrote:
> I have 58 active domains in the running state, but 62 qemu-system-x86_64 processes. After investigating, I found the problem domains. How can I fix this without losing these qemu processes?

These are probably left over after a daemon restart. When the daemon restarts, it reloads the state XML. However, if something is wrong with it (e.g. an unknown device), the state XML is ignored and libvirt thinks the domain is shut off, leaving the qemu process behind. An unknown device can show up if you run an older daemon than the one the domain was started with.

You can check the daemon debug logs and see how it loads the domain state XMLs. If there is an error somewhere, this is possibly the case.

Of course, the other option would be that we have a bug somewhere that leaves the qemu process behind. But the former is more likely than the latter.

Michal

2015-10-22 17:38 GMT+03:00 Michal Privoznik <mprivozn@redhat.com>:
> These are probably left over after a daemon restart. When the daemon restarts, it reloads the state XML. However, if something is wrong with it (e.g. an unknown device), the state XML is ignored and libvirt thinks the domain is shut off, leaving the qemu process behind. An unknown device can show up if you run an older daemon than the one the domain was started with.

The daemon was not restarted. Also note that the domain was started without being defined. The libvirt version has not changed.

> You can check the daemon debug logs and see how it loads the domain state XMLs. If there is an error somewhere, this is possibly the case.

Since the domain is not defined, enabling debug in libvirt and restarting it does not help: the qemu process is already stalled, and libvirt does not know about this domain because it is not defined.

> Of course, the other option would be that we have a bug somewhere that leaves the qemu process behind. But the former is more likely than the latter.

I will try to add this to monitoring (the difference between virsh list --state-running --name and ps auxww | grep qemu-system | grep -v grep | wc -l) and investigate the issue.

--
Vasiliy Tolstov, e-mail: v.tolstov@selfip.ru
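
The monitoring check mentioned above (running domains reported by virsh versus qemu-system processes seen by ps) could be sketched roughly as follows. This is an illustration only, not code from the thread. It assumes every qemu process carries a plain `-name <domain>` argument as in the ps output quoted earlier (newer setups may format that argument differently), and that `virsh list --state-running --name` prints one domain name per line. The helper names `qemu_names_from_ps`, `stale_qemu` and `check_host` are hypothetical.

```python
import re
import subprocess

# Matches the "-name <domain>" argument on a qemu command line.
# Assumption: the name is a single token without spaces.
NAME_RE = re.compile(r"\s-name\s+(\S+)")


def qemu_names_from_ps(ps_output):
    """Return {pid: domain name} for each qemu-system line of `ps auxww` output."""
    found = {}
    for line in ps_output.splitlines():
        if "qemu-system" not in line:
            continue
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        match = NAME_RE.search(line)
        if len(fields) < 11 or match is None:
            continue  # e.g. the "grep qemu-system" process itself
        found[int(fields[1])] = match.group(1)
    return found


def stale_qemu(ps_output, running_names):
    """Return sorted (pid, name, reason) tuples for suspicious qemu processes."""
    by_name = {}
    for pid, name in qemu_names_from_ps(ps_output).items():
        by_name.setdefault(name, []).append(pid)
    running = set(running_names)
    suspects = []
    for name, pids in by_name.items():
        if name not in running:
            suspects += [(pid, name, "not running in libvirt") for pid in pids]
        elif len(pids) > 1:
            # Same domain name on more than one process: at most one is live.
            suspects += [(pid, name, "duplicate process") for pid in pids]
    return sorted(suspects)


def check_host():
    """Run ps and virsh on the local host and print suspicious processes."""
    ps_out = subprocess.run(["ps", "auxww"], capture_output=True,
                            text=True, check=True).stdout
    virsh_out = subprocess.run(["virsh", "list", "--state-running", "--name"],
                               capture_output=True, text=True, check=True).stdout
    names = [l.strip() for l in virsh_out.splitlines() if l.strip()]
    for pid, name, reason in stale_qemu(ps_out, names):
        print(pid, name, reason)
```

Calling `check_host()` from cron or a monitoring agent would list candidate PIDs. Comparing names rather than only counts also catches a leftover process whose domain name is in use again by a newer process.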

On 23.10.2015 00:40, Vasiliy Tolstov wrote:
> 2015-10-22 17:38 GMT+03:00 Michal Privoznik <mprivozn@redhat.com>:
>> These are probably left over after a daemon restart. When the daemon restarts, it reloads the state XML. However, if something is wrong with it (e.g. an unknown device), the state XML is ignored and libvirt thinks the domain is shut off, leaving the qemu process behind. An unknown device can show up if you run an older daemon than the one the domain was started with.
> The daemon was not restarted. Also note that the domain was started without being defined. The libvirt version has not changed.
>> You can check the daemon debug logs and see how it loads the domain state XMLs. If there is an error somewhere, this is possibly the case.
> Since the domain is not defined, enabling debug in libvirt and restarting it does not help: the qemu process is already stalled, and libvirt does not know about this domain because it is not defined.

That does not matter. We keep a state XML for all running domains, regardless of whether they are persistent or transient. But since the daemon was not restarted, I suspect we have a bug somewhere.

BTW: you can check whether the state XML for the domain still exists. We pass -name $domname to qemu, and the state XML should then be:

/var/run/libvirt/qemu/$domname.xml

The PID of the qemu process is recorded there as well. Can you check whether they match?

>> Of course, the other option would be that we have a bug somewhere that leaves the qemu process behind. But the former is more likely than the latter.
> I will try to add this to monitoring (the difference between virsh list --state-running --name and ps auxww | grep qemu-system | grep -v grep | wc -l) and investigate the issue.

Michal
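
The check suggested above (does /var/run/libvirt/qemu/$domname.xml still exist, and does its PID match the running qemu process) might be scripted along these lines. This is a sketch under assumptions, not libvirt code: it assumes the PID is stored as a `pid` attribute on the root element of the state XML, and the function name `check_state` and its `run_dir` parameter are invented for the example.

```python
import os
import xml.etree.ElementTree as ET


def check_state(domname, qemu_pid, run_dir="/var/run/libvirt/qemu"):
    """Compare libvirt's state XML for `domname` with an observed qemu PID.

    Returns "missing" if no state XML exists, "match" if the recorded PID
    equals `qemu_pid`, and "mismatch" otherwise.
    """
    path = os.path.join(run_dir, domname + ".xml")
    if not os.path.exists(path):
        return "missing"
    # Assumption: the root element carries the qemu PID as a `pid` attribute.
    recorded = ET.parse(path).getroot().get("pid")
    if recorded is not None and int(recorded) == qemu_pid:
        return "match"
    return "mismatch"
```

For the process in this thread the call would be `check_state("29953", 29561)`, run as a user allowed to read the directory.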

2015-10-23 11:37 GMT+03:00 Michal Privoznik <mprivozn@redhat.com>:
> That does not matter. We keep a state XML for all running domains, regardless of whether they are persistent or transient. But since the daemon was not restarted, I suspect we have a bug somewhere. BTW: you can check whether the state XML for the domain still exists. We pass -name $domname to qemu, and the state XML should then be:
> /var/run/libvirt/qemu/$domname.xml
> The PID of the qemu process is recorded there as well. Can you check whether they match?

The XML is absent, which is not strange, because the libvirt log says the domain was shut down. So I think libvirt cleaned up the XML and PID files for these domains.

--
Vasiliy Tolstov, e-mail: v.tolstov@selfip.ru

2015-10-23 11:57 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> 2015-10-23 11:37 GMT+03:00 Michal Privoznik <mprivozn@redhat.com>:
>> That does not matter. We keep a state XML for all running domains, regardless of whether they are persistent or transient. But since the daemon was not restarted, I suspect we have a bug somewhere. BTW: you can check whether the state XML for the domain still exists. We pass -name $domname to qemu, and the state XML should then be:
>> /var/run/libvirt/qemu/$domname.xml
>> The PID of the qemu process is recorded there as well. Can you check whether they match?
> The XML is absent, which is not strange, because the libvirt log says the domain was shut down. So I think libvirt cleaned up the XML and PID files for these domains.

Now I have the same issue again. I did a live migration to another node, and the qemu log says the domain received signal 15 and shut down. But this is what I see in the process list:

root 29561 0.1 0.2 1599628 743796 ? Sl Aug13 227:33 qemu-system-x86_64 -enable-kvm -name 29953 -S -machine pc-i440fx-1.7,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 7ca8e593-29f7-6389-9b35-000071cc3e1e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/29953.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,num_queues=1,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/dev/vg3/29953,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,discard=unmap,aio=native,iops=5000 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0 -netdev tap,fd=353,id=hostnet0,vhost=on,vhostfd=354 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:00:40:25,bus=pci.0,addr=0x3,rombar=0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/29953.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-mouse,id=input0 -device usb-kbd,id=input1 -vnc [::]:23,password -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,max-bytes=1024,period=2000,bus=pci.0,addr=0x7 -msg timestamp=on

In strace, qemu polls some fds. lsof output:

https://gist.githubusercontent.com/vtolstov/7ba49e8193c4ac9e9da0/raw/f077529...

virsh version
Compiled against library: libvirt 1.2.16
Using library: libvirt 1.2.16
Using API: QEMU 1.2.16
Running hypervisor: QEMU 2.3.0

uname -r
3.19-3-amd64

What can I do next to debug this issue? Also, as I said before, libvirt thinks the domain died successfully and cleaned up the XML and PID files.

--
Vasiliy Tolstov, e-mail: v.tolstov@selfip.ru