[libvirt] Silently ignored virDomainRestore failures

Howdy, all. I maintain a test infrastructure which makes heavy use of virDomainSave and virDomainRestore, and I have been seeing occasional cases where my saved images are not restored correctly -- indeed, the incoming migration streams are not even read in their entirety.

While this generally appears to be caused by issues outside of libvirt's purview, one unfortunate issue is that libvirt can report success for a restore even when the operation is effectively an abject failure. Consider the following snippet, taken from one of my /var/log/libvirt/qemu/<domain>.log files:

    LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin USER=root LOGNAME=root /usr/bin/qemu-kvm -S -M pc-0.11 -m 512 -smp 1 <...lots of arguments here...> -incoming exec:cat
    cat: write error: Broken pipe

This leaves a running qemu hosting a catatonic guest -- but the libvirt client (connecting through the Python bindings) received a status of success for the operation given here.

libvirt's mechanism for validating a successful restore consists of running a "cont" command on the guest and then checking virGetLastError(); AIUI, the expectation is that the "cont" cannot run until the restore has completed, as the monitor should not be responsive until that time. Browsing through qemudMonitorSendCont (and qemudMonitorCommandWithHandler, which it calls), I don't see anything which looks at the log file containing qemu's stderr output to determine whether an error actually occurred. (As an aside, "info history" on the guest's monitor socket confirms that the "cont" was indeed issued.)

Should the existing cont+virGetLastError() approach be sufficient to handle this class of error? If not, is there any guidance on what a better system would look like? (I suppose we could add something to the exec: command to affirmatively indicate on stderr that the decompressor [or cat, if not using one] exited successfully, and check for that marker in the log file... but that seems quite a dirty hack.)

Thanks!

There appears to be a race condition wherein a 'cont' command sent immediately on qemu startup can prevent an inbound migration specified via -incoming from occurring. libvirt's process for starting qemu domains with an incoming migration issues a 'cont' command at the end of qemudInitCpus, shortly after a successful connection to the monitor is made. While the monitor is generally unresponsive while an inbound migration is in progress, which forces the 'cont' to take effect only after the migration has completed, this is not always true (as demonstrated below). I strongly suspect this is responsible for an occasional failure I'm seeing when loading libvirt domains from file.

This is highly reproducible using qemu-kvm-0.11.0-rc2, and straightforward to demonstrate by the following means:

[ONE-TIME SETUP]
- Build an appropriate ramsave file by migrating a stopped guest to disk.
- Mark any backing store used by this guest read-only.

[COMMON STEPS]
- Create an empty qcow2 file backed by the read-only store, if your guest has any disks.
- Invoke qemu with arguments appropriate to the VM being resumed, plus the following:
    -S -monitor stdio -incoming 'exec:echo START_DELAY >&2 && sleep 5 && echo END_DELAY >&2 && cat <ramsave.raw && echo LOAD_DONE >&2'

[VALIDATING CORRECT OPERATION]
- Wait until 'LOAD_DONE' is displayed, then run 'cont'.
- The VM resumes correctly.

[REPRODUCING THE BUG]
- Run 'cont' after START_DELAY is displayed, but before END_DELAY.
- 'cat: write error: Broken pipe' is displayed.
- The guest VM reboots, enters a catatonic state, or otherwise fails to load correctly.

[REPRODUCING WITHOUT ARTIFICIAL DELAY]
As the 'sleep 5' used above may be considered cheating, the issue can also be reproduced without any delay by removing the 'sleep' and terminating the shell command used to invoke qemu with <<<$'cont\n'

[REPRODUCING OVER A UNIX SOCKET]
Included for completeness, since libvirt 0.7.x uses UNIX sockets here:
- Invoke the following in a separate window: socat - UNIX-LISTEN:/tmp/test.monitor <<<$'cont\n'
- Invoke qemu as above, but with -monitor unix:/tmp/test.monitor instead of -monitor stdio.

I have a work-in-progress patch which modifies libvirt to use -daemonize for startup; waiting for the guest to detach before attempting to interact with the monitor may avoid this issue. However, as this patch is against libvirt master, and the master branch has other issues which expose themselves on virDomainRestore, I am unable to test it here. Thoughts (and workarounds) welcome.

On Fri, Sep 25, 2009 at 06:47:07PM -0500, Charles Duffy wrote:
Howdy, all.
I maintain a test infrastructure which makes heavy use of virDomainSave and virDomainRestore, and have been seeing occasional cases where my saved images are for some reason not restored correctly -- and, indeed, the incoming migration streams are not even read in their entirety.
While this generally appears to be caused by issues outside of libvirt's purview, one unfortunate issue is that libvirt can report success performing a restore even when the operation is effectively an abject failure.
Consider the following snippet, taken from one of my /var/log/libvirt/qemu/<domain>.log files:
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin USER=root LOGNAME=root /usr/bin/qemu-kvm -S -M pc-0.11 -m 512 -smp 1 <...lots of arguments here...> -incoming exec:cat
cat: write error: Broken pipe
This leaves a running qemu hosting a catatonic guest -- but the libvirt client (connecting through the Python bindings) received a status of success for the operation given here.
Urgh, more QEMU badness. QEMU spawned 'cat', so if 'cat' exits with a non-zero exit status, QEMU should see that it failed and thus exit itself, rather than pretending everything was OK with migration. The flaw in QEMU is depressingly obvious:

    static int stdio_pclose(void *opaque)
    {
        QEMUFileStdio *s = opaque;
        pclose(s->stdio_file);
        qemu_free(s);
        return 0;
    }

Notice how it completely discards the exit status returned by pclose() and just pretends everything always worked :-( If this was handling errors correctly, you'd at least see QEMU exiting rather than hanging around broken.
libvirt's mechanism for validating a successful restore consists of running a "cont" command on the guest and then checking virGetLastError(); AIUI, the expectation is that the "cont" cannot run until the restore has completed, as the monitor should not be responsive until that time. Browsing through qemudMonitorSendCont (and qemudMonitorCommandWithHandler, which it calls), I don't see anything which looks at the log file containing qemu's stderr output to determine whether an error actually occurred. (As an aside, "info history" on the guest's monitor socket confirms that the "cont" was indeed issued.)
Hmm, this does look problematic - we need the monitor to be responsive in order to do things like CPU pinning. We need the monitor to be non-responsive to ensure 'cont' doesn't run until migration has finished. We can't have it both ways, and the former wins, since we need that to be done before ever letting QEMU start allocating guest RAM pages. So relying on 'cont' to block is not good.

Is the 'cont' even necessary? I remember seeing somewhere that QEMU unconditionally started its CPUs after an incoming migration finished.

Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, Sep 28, 2009 at 6:43 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
The flaw in QEMU is depressingly obvious
    static int stdio_pclose(void *opaque)
    {
        QEMUFileStdio *s = opaque;
        pclose(s->stdio_file);
        qemu_free(s);
        return 0;
    }
Notice how it completely discards the exit status returned by pclose() and just pretends everything always worked :-(
If this was handling errors correctly, you'd at least see QEMU exiting rather than hanging around broken.
Ugh, indeed. I'll submit a patch for that later today, if nobody beats me to it.
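Something like the following is the shape I have in mind -- a sketch only, since whether QEMU's close callbacks are expected to return a negative errno is my assumption, and <errno.h> and <sys/wait.h> would need to be in scope:

    static int stdio_pclose(void *opaque)
    {
        QEMUFileStdio *s = opaque;
        int ret = pclose(s->stdio_file);

        /* Sketch: propagate failure instead of discarding it.  Map a
         * pclose() error or a non-zero child exit status to an error
         * return, assuming callers treat negative values as failure. */
        if (ret == -1)
            ret = -errno;                                 /* pclose itself failed */
        else if (WIFEXITED(ret) && WEXITSTATUS(ret) != 0)
            ret = -EIO;                                   /* child exited non-zero */
        else
            ret = 0;

        qemu_free(s);
        return ret;
    }

With that in place, 'cat: write error: Broken pipe' should at least cause the incoming migration to be reported as failed rather than silently succeeding.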
Hmm, this does look problematic - we need the monitor to be responsive in order to do things like CPU pinning. We need the monitor to be non-responsive to ensure 'cont' doesn't run until migration has finished. We can't have it both ways, and the former wins, since we need that to be done before ever letting QEMU start allocating guest RAM pages. So relying on 'cont' to block is not good. Is the 'cont' even necessary? I remember seeing somewhere that QEMU unconditionally started its CPUs after an incoming migration finished.
I've seen patches to change that behavior, so IMHO it's probably not safe to depend on it being one way or the other across the versions of qemu libvirt supports.

What I'm tempted to do is add a command which sends a sigil to stderr to the end of the exec: migration lines specified by libvirt, and wait for either that sigil or an error to show up in the log for that domain before issuing the cont; if my memory is at all correct, libvirt should already have some helper functions useful for that purpose. Does this sound like a reasonable approach?
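To make that concrete, here is a rough sketch of the wait. The helper name and the LIBVIRT_RESTORE_OK sigil are invented for illustration; in practice I'd expect to reuse libvirt's existing log-reading helpers rather than raw read() as below:

    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical sketch: scan the domain log for a success sigil
     * (appended to the exec: command as "... && echo LIBVIRT_RESTORE_OK >&2")
     * or an error, before allowing 'cont' to be issued. */
    static int
    wait_for_restore_sigil(int logfd, int timeout_secs)
    {
        char buf[16384];
        size_t len = 0;
        time_t deadline = time(NULL) + timeout_secs;

        while (time(NULL) < deadline && len < sizeof(buf) - 1) {
            ssize_t got = read(logfd, buf + len, sizeof(buf) - 1 - len);
            if (got > 0) {
                len += got;
                buf[len] = '\0';
                if (strstr(buf, "LIBVIRT_RESTORE_OK"))
                    return 0;      /* stream was consumed successfully */
                if (strstr(buf, "write error"))
                    return -1;     /* e.g. "cat: write error: Broken pipe" */
            } else {
                usleep(200 * 1000);    /* no new output yet; poll again */
            }
        }
        return -1;    /* timed out, or log grew past our buffer */
    }

The sigil only appears if every command in the exec: pipeline before it exited successfully, which is exactly the property we can't currently observe.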

Charles Duffy wrote:
What I'm tempted to do is add a command which sends a sigil to stderr to the end of the exec: migration lines specified by libvirt, and wait for either that sigil or an error to show up in the log for that domain before issuing the cont; if my memory is at all correct, libvirt should have some helper functions useful for that purpose already available.
Ugh, bad idea -- I was thinking only of the migrate-from-file case, and not of migration from any alternate source; any non-exec source doesn't allow for this hack.

Anthony suggested dropping -S from the command line on incoming migrations and using "info status" to poll for when the guest resumes itself; however, this doesn't work for cases where the VM was saved or migrated in a paused state and the end-user expectation is for it to remain paused.

I'm tempted to add additional output to "info migrate" indicating whether an inbound migration is ongoing; however, this wouldn't help versions of qemu without the patch applied.

On Mon, Sep 28, 2009 at 09:03:37PM -0500, Charles Duffy wrote:
Charles Duffy wrote:
What I'm tempted to do is add a command which sends a sigil to stderr to the end of the exec: migration lines specified by libvirt, and wait for either that sigil or an error to show up in the log for that domain before issuing the cont; if my memory is at all correct, libvirt should have some helper functions useful for that purpose already available.
Ugh, bad idea -- I was thinking only of the migrate-from-file case, and not of migration from any alternate source; any non-exec source doesn't allow for this hack.
Anthony suggested dropping -S from the command line on incoming migrations, and using "info status" to poll for when the guest resumes itself; however, this doesn't work for cases when the VM was saved or migrated in a paused state and the end-user expectation is for it to remain paused.
We can't drop -S, because we need to be able to run 'info cpus' and then do sched_setaffinity on each vCPU before it ever starts executing; otherwise memory will end up pinned to the wrong NUMA nodes.
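(To make the ordering constraint concrete, an illustrative sketch -- not libvirt's actual code -- of the pinning that has to happen while the guest is still paused:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Illustrative only: with -S, every vCPU thread can be pinned before
     * it executes a single instruction, so guest RAM is faulted in on
     * the intended NUMA nodes.  cpupids[] would come from parsing the
     * monitor's 'info cpus' output. */
    static int pin_vcpus(const pid_t *cpupids, int nvcpus, const cpu_set_t *mask)
    {
        int i;
        for (i = 0; i < nvcpus; i++) {
            if (sched_setaffinity(cpupids[i], sizeof(*mask), mask) < 0)
                return -1;   /* pinning failed; bail out before 'cont' */
        }
        return 0;
    }

If the guest were started without -S, its vCPUs could touch RAM before this loop runs.)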
I'm tempted to add additional output to "info migrate" indicating whether an inbound migration is ongoing; however, this wouldn't help versions of qemu without the patch applied.
Yeah, I've been thinking much the same this morning. I think we should consider what the optimal setup is for our needs long term, and try to do whatever we can for that in QEMU now. I think it'd definitely be worthwhile to have an 'info migrate' impl for incoming migration, and even to make '-incoming' optional and add a 'migrate-incoming' command to the monitor, which like 'migrate' could be run in either blocking or non-blocking mode.

For existing QEMU, it might be sufficient to just put an arbitrary 'sleep(5)' before issuing 'cont', which would at least give it a reasonable chance of avoiding the race condition.

Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Tue, Sep 29, 2009 at 4:16 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
Yeah, I've been thinking much the same this morning. I think we should consider what the optimal setup is for our needs long term and try and do whatever we can for that in QEMU now. I think it'd definitely be worthwhile to have an 'info migrate' impl for incoming migration, and even to make '-incoming' optional, and add a 'migrate-incoming' command to the monitor, which like 'migrate' could be run in either blocking or non-blocking mode.
'migrate-incoming' is heavier surgery than I'm comfortable doing myself, but I'll try my hand at the info command. The only thing that bugs me about this is that if the output is added to the existing 'info migrate', an explicit negative output ("Incoming: none") will be necessary to distinguish it from the case where reporting incoming migration is simply unsupported; as such, the current behavior of printing no output at all when no migration is ongoing would need to change.
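Roughly what I have in mind, assuming the existing do_info_migrate() handler is the right place to hang this (illustrative only; the incoming_active flag is invented here, and the exact wording is up for discussion):

    /* Illustrative sketch, not a tested patch: 'info migrate' always
     * prints an explicit incoming line, so "Incoming: none" is
     * distinguishable from a QEMU too old to report inbound state
     * (which prints nothing at all when idle today). */
    void do_info_migrate(Monitor *mon)
    {
        /* incoming_active is a hypothetical flag that would be set
         * while an -incoming migration is being consumed. */
        if (incoming_active)
            monitor_printf(mon, "Incoming: active\n");
        else
            monitor_printf(mon, "Incoming: none\n");

        /* ... existing outgoing-migration reporting would follow ... */
    }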
For existing QEMU, it might be sufficient to just put an arbitrary 'sleep(5)' before issuing 'cont', which would at least give it a reasonable chance of avoiding the race condition.
Well -- I wasn't going to submit the patch I'm now using internally (using and waiting for a sigil on stderr when migrate_url is "stdin"), but I suppose that with an else clause doing a sleep, it might actually be the closest available option to a Right Thing for preexisting qemu. I'll wait to hear back today from the contractors testing with it (they were hitting this race condition frequently) and post an amended copy here if it gets their thumbs-up.
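For reference, the branch I'm describing looks roughly like this (illustrative only, not the patch itself; wait_for_restore_sigil is the hypothetical helper sketched earlier, variable names are approximate, and the timeout is arbitrary):

    /* Sketch of the described logic: wait for the stderr sigil when
     * restoring via stdin, otherwise fall back to a crude delay before
     * issuing 'cont' to dodge the startup race. */
    if (migrate_url && strcmp(migrate_url, "stdin") == 0) {
        if (wait_for_restore_sigil(logfd, 60) < 0)
            goto error;      /* the stream failed; don't report success */
    } else if (migrate_url) {
        sleep(5);            /* give -incoming a chance to finish */
    }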

Charles Duffy wrote:
For existing QEMU, it might be sufficient to just put an arbitrary 'sleep(5)' before issuing 'cont', which would at least give it a reasonable chance of avoiding the race condition.
Well -- I wasn't going to submit the patch I'm now using internally (using and waiting for a sigil on stderr when migrate_url is "stdin"), but I suppose that with an else clause doing a sleep it might actually be the closest available option to a Right Thing for preexisting qemu.
I'll wait to hear back today from the contractors testing with it (they were hitting this race condition frequently) and post an amended copy here if it gets their thumbs-up.
Attached is what I'm presently using as a workaround until I have a proper ("info incoming") approach implemented. It is perhaps excessively paranoid (the extra .5sec of delay on the stdin case may not be needed), but we haven't been able to reproduce the issue with it applied.