"Daniel P. Berrange" <berrange(a)redhat.com> wrote:
A number of bugs conspired together to cause some nasty problems
when a QEMU VM failed to start:
- vm->monitor was not initialized to -1, so when a VM failed to
  start, vm->monitor was just '0', and thus we closed FD 0 (libvirtd's
  stdin); see the sketch after this list
- The next client to connect got FD 0 as its socket
- The first bug struck again, causing the client to be closed even
  though libvirt thought it was still open
- libvirtd now polled on FD 0, which gave back POLLNVAL because the FD
  was closed
- event.c was not looking for POLLNVAL, so it spun at 100% CPU when
  this happened, instead of invoking the callback with an error code
- virsh was not cleaning up the priv->waitDispatch call upon I/O
  errors, so virsh then hung when doing virConnectClose
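To make the first bullet concrete, here is a minimal sketch of the
sentinel-and-guard pattern involved; the struct layout and helper
names are assumptions for illustration, not the actual libvirt code:

#include <unistd.h>

/* Illustrative sketch -- field and function names are assumptions.
 * The point: use -1, never 0, as the "no monitor" sentinel, so a
 * failed start can never close FD 0 (the daemon's stdin). */
struct qemud_vm {
    int monitor;                  /* monitor FD, or -1 if not open */
};

static void vm_init(struct qemud_vm *vm) {
    vm->monitor = -1;             /* the missing initialization */
}

static void vm_cleanup(struct qemud_vm *vm) {
    if (vm->monitor != -1) {      /* guard against closing FD 0 */
        close(vm->monitor);
        vm->monitor = -1;
    }
}

With the field left zero-initialized instead, vm_cleanup() on a failed
start closes FD 0, and the next accepted client inherits it, which is
exactly the cascade described above.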
It could also segfault; for me that was easy to trigger, on every
third client call. For reference, here's what I did:
LIBVIRT_DEBUG=1 qemud/libvirtd > log 2>&1 &
cat <<\EOF > e.xml
<domain type='qemu'>
  <name>E</name>
  <uuid>d7a5fdbd-cdaf-9455-926a-d65c16db1809</uuid>
  <memory>219200</memory>
  <currentMemory>219200</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    <boot dev='cdrom'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='cdrom'>
      <source file='NO_SUCH_FILE'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes'/>
  </devices>
</domain>
EOF
$ src/virsh create e.xml
libvir: Remote error : no call waiting for reply with serial 3
error: failed to connect to the hypervisor
[Exit 1]
$ src/virsh create e.xml
libvir: Remote error : no call waiting for reply with serial 0
error: failed to connect to the hypervisor
[Exit 1]
$ src/virsh create e.xml
libvir: Remote error : server closed connection
error: Failed to create domain from e.xml
zsh: segmentation fault src/virsh create e.xml
FYI, that was due to this code at remote_internal.c:6319:

    while (tmp && tmp->next)

where "tmp" was bogus because priv->waitDispatch had already been
freed.
Note that this was probably easier for me to reproduce than for most,
since I have this in my environment:

export MALLOC_PERTURB_=$(($RANDOM % 255 + 1))

(A non-zero MALLOC_PERTURB_ makes glibc overwrite freed memory with a
known byte pattern, so use-after-free bugs like this surface quickly.)
This patch does three things:
- Treats POLLNVAL as VIR_EVENT_HANDLE_ERROR, so the callback gets to
  see the error and de-registers the client from the event loop (see
  the sketch after this list)
- Adds the missing initialization of vm->monitor
- Fixes remote_internal.c's handling of I/O errors
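For the POLLNVAL change, the dispatch loop in event.c would map
revents to event-loop flags with roughly this shape; this is a hedged
sketch, not the actual patch, with the enum values mirroring libvirt's
public virEventHandleType and the helper name made up for
illustration:

#include <poll.h>

enum {
    VIR_EVENT_HANDLE_READABLE = 1,
    VIR_EVENT_HANDLE_WRITABLE = 2,
    VIR_EVENT_HANDLE_ERROR    = 4,
    VIR_EVENT_HANDLE_HANGUP   = 8,
};

/* Illustrative helper: translate poll(2) revents into libvirt
 * event-loop flags. */
static int poll_to_handle_events(short revents) {
    int events = 0;
    if (revents & POLLIN)
        events |= VIR_EVENT_HANDLE_READABLE;
    if (revents & POLLOUT)
        events |= VIR_EVENT_HANDLE_WRITABLE;
    if (revents & POLLHUP)
        events |= VIR_EVENT_HANDLE_HANGUP;
    /* The fix: a closed FD reports POLLNVAL; surface it as an error so
     * the callback runs and de-registers the FD, instead of the loop
     * re-polling an invalid descriptor at 100% CPU. */
    if (revents & (POLLERR | POLLNVAL))
        events |= VIR_EVENT_HANDLE_ERROR;
    return events;
}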
ACK.