Howdy, all.
I maintain a test infrastructure which makes heavy use of virDomainSave
and virDomainRestore, and have been seeing occasional cases where my
saved images are for some reason not restored correctly -- and, indeed,
the incoming migration streams are not even read in their entirety.
While this generally appears to be caused by issues outside of libvirt's
purview, one unfortunate consequence is that libvirt can report success
for a restore even when the operation has, in effect, failed outright.
Consider the following snippet, taken from one of my
/var/log/libvirt/qemu/<domain>.log files:
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin USER=root LOGNAME=root
/usr/bin/qemu-kvm -S -M pc-0.11 -m 512 -smp 1 <...lots of arguments
here...> -incoming exec:cat
cat: write error: Broken pipe
This leaves a running qemu hosting a catatonic guest -- but the libvirt
client (connecting through the Python bindings) was told that the
restore had succeeded.
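As a stopgap on the client side, something like the following could catch this particular failure. It is only a sketch: the helper names are made up for illustration, the one pattern is simply the line from the log above, and it assumes the client can read the domain's log file.

```python
import os

# Failure signatures to look for. Only this one comes from the log above;
# anything added to the tuple would be a guess.
FAILURE_PATTERNS = ("write error: Broken pipe",)


def log_offset(log_path):
    """Return the current size of the log, or 0 if it does not exist yet."""
    try:
        return os.path.getsize(log_path)
    except OSError:
        return 0


def restore_errors_since(log_path, offset, patterns=FAILURE_PATTERNS):
    """Return log lines written after `offset` that contain any pattern."""
    with open(log_path, "rb") as f:
        f.seek(offset)
        tail = f.read().decode("utf-8", errors="replace")
    return [line for line in tail.splitlines()
            if any(p in line for p in patterns)]
```

The idea is to call log_offset() just before the restore (conn.restore(path) in the Python bindings) and restore_errors_since() just after it returns, treating a non-empty result as a failure. This is racy, since nothing guarantees cat has written its complaint by the time the restore call returns, so it is a diagnostic aid rather than a fix.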
libvirt's mechanism for validating a successful restore consists of
running a "cont" command on the guest, and then checking
virGetLastError(); AIUI, it is expected that the "cont" will not be able
to run until the restore is completed, as the monitor should not be
responsive until that time. Browsing through qemudMonitorSendCont (and
qemudMonitorCommandWithHandler, which it calls), I don't see anything
which looks at the log file with the stderr output from qemu to
determine whether an error actually occurred. (As an aside, "info
history" on the guest's monitor socket indicates that it was indeed
issued this "cont").
Should the existing cont+virGetLastError() approach be sufficient to
handle this class of error? If not, is there any guidance on what would
constitute a better system? (I suppose we could add something to the exec:
to affirmatively indicate on stderr that the decompressor [or cat, if
not using one] exited successfully, and check for that marker in the log
file... but that seems quite a dirty hack).
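To make the marker idea concrete, here is roughly what I have in mind. The marker string and both function names are hypothetical (nothing like them exists in libvirt today), and the sketch assumes the exec: string is run through a shell so that "&&" works.

```python
import shlex

# Hypothetical marker; libvirt defines no such string.
RESTORE_OK_MARKER = "restore-stream-complete"


def incoming_exec_command(decompressor="cat", marker=RESTORE_OK_MARKER):
    """Build a shell command that echoes the marker to stderr only if the
    decompressor (or plain cat) exits with status 0."""
    return "%s && echo %s >&2" % (decompressor, shlex.quote(marker))


def stream_completed(log_text, marker=RESTORE_OK_MARKER):
    """Return True if the marker appears on a line of its own in the log."""
    return any(line.strip() == marker for line in log_text.splitlines())
```

With this, the argument would become "-incoming exec:cat && echo restore-stream-complete >&2", and the restore would be considered failed if the marker never shows up in the log. Note that the marker only says the decompressor wrote everything and exited cleanly; it says nothing about whether qemu loaded the state correctly.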
Thanks!