
"Daniel P. Berrange" <berrange@redhat.com> wrote:
A number of bugs conspired together to cause some nasty problems when a QEMU VM failed to start:
- vm->monitor was not initialized to -1, so when a VM failed to start, vm->monitor was left at 0, and thus we closed FD 0 (libvirtd's stdin)
- The next client to connect got FD 0 as its socket
- The first bug struck again, causing the client to be closed even though libvirt thought it was still open
- libvirtd now polled on FD 0, which gave back POLLNVAL because it was closed
- event.c was not looking for POLLNVAL, so it spun at 100% CPU when this happened instead of invoking the callback with an error code
- virsh was not cleaning up the priv->waitDispatch call upon I/O errors, so virsh then hung when doing virConnectClose
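
For illustration, here is a minimal sketch of that first bug and its fix; the struct and function names are hypothetical, not libvirt's actual code. Zero-initialized memory leaves an fd field at 0, which is a perfectly valid descriptor (stdin), so "unset" has to be -1:

  #include <unistd.h>

  struct vm {
      int monitor;            /* 0 is a valid fd, so 0 must not mean "unset" */
  };

  static void vm_init(struct vm *vm)
  {
      vm->monitor = -1;       /* the fix: mark "no fd" explicitly */
  }

  static void vm_cleanup(struct vm *vm)
  {
      if (vm->monitor >= 0) { /* never close(0) by accident */
          close(vm->monitor);
          vm->monitor = -1;
      }
  }

  int main(void)
  {
      struct vm vm;
      vm_init(&vm);
      vm_cleanup(&vm);        /* no-op: monitor is -1, stdin survives */
      return 0;
  }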
It could also segfault, and it was easy to make it do that for me, every third client call. For reference, here's what I did:

LIBVIRT_DEBUG=1 qemud/libvirtd > log 2>&1 &

cat <<\EOF > e.xml
<domain type='qemu'>
  <name>E</name>
  <uuid>d7a5fdbd-cdaf-9455-926a-d65c16db1809</uuid>
  <memory>219200</memory>
  <currentMemory>219200</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    <boot dev='cdrom'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='cdrom'>
      <source file='NO_SUCH_FILE'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes'/>
  </devices>
</domain>
EOF

$ src/virsh create e.xml
libvir: Remote error : no call waiting for reply with serial 3
error: failed to connect to the hypervisor
[Exit 1]

$ src/virsh create e.xml
libvir: Remote error : no call waiting for reply with serial 0
error: failed to connect to the hypervisor
[Exit 1]

$ src/virsh create e.xml
libvir: Remote error : server closed connection
error: Failed to create domain from e.xml
zsh: segmentation fault  src/virsh create e.xml

FYI, that was due to this code at remote_internal.c:6319,

  while (tmp && tmp->next)

where "tmp" is bogus because priv->waitDispatch was freed. Note that this was probably easier for me than most, since I have this in my environment:

  export MALLOC_PERTURB_=$(($RANDOM % 255 + 1))
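
To make the crash mechanics concrete, here is a hedged sketch of the pattern such a fix needs; the struct and function names are illustrative, not the actual remote_internal.c code:

  #include <stddef.h>

  struct remote_call {
      struct remote_call *next;
      int err;                      /* non-zero once the call has failed */
  };

  /* On an I/O error, fail every pending call and clear the list head
   * first, so nothing can walk freed nodes the way the
   * "while (tmp && tmp->next)" append loop did. */
  static void error_all_calls(struct remote_call **head, int err)
  {
      struct remote_call *call = *head;
      *head = NULL;                 /* leave no dangling head pointer */
      while (call != NULL) {
          struct remote_call *next = call->next;
          call->err = err;
          call->next = NULL;        /* unlink before waking its waiter */
          /* signal the thread waiting on this call here */
          call = next;
      }
  }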
This patch does 3 things:

- Treats POLLNVAL as VIR_EVENT_HANDLE_ERROR, so the callback gets to see the error & de-registers the client from the event loop
- Adds the missing initialization of vm->monitor
- Fixes remote_internal.c's handling of I/O errors
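
As a sketch of the first fix (the VIR_EVENT_HANDLE_* flags mirror libvirt's public virEventHandleType values; the helper itself is illustrative, not the actual event.c code):

  #include <poll.h>

  /* Local mirror of libvirt's virEventHandleType bit flags. */
  enum {
      VIR_EVENT_HANDLE_READABLE = (1 << 0),
      VIR_EVENT_HANDLE_WRITABLE = (1 << 1),
      VIR_EVENT_HANDLE_ERROR    = (1 << 2),
      VIR_EVENT_HANDLE_HANGUP   = (1 << 3),
  };

  static int poll_revents_to_handle_events(short revents)
  {
      int events = 0;
      if (revents & POLLIN)
          events |= VIR_EVENT_HANDLE_READABLE;
      if (revents & POLLOUT)
          events |= VIR_EVENT_HANDLE_WRITABLE;
      if (revents & (POLLERR | POLLNVAL)) /* the fix: NVAL is an error too */
          events |= VIR_EVENT_HANDLE_ERROR;
      if (revents & POLLHUP)
          events |= VIR_EVENT_HANDLE_HANGUP;
      return events;
  }

Without the POLLNVAL case, a closed fd makes every poll() return immediately with an event nobody consumes, which is the 100% CPU spin described above.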
ACK.