Re: [libvirt] [RFC] Add a new feild to qemud_save_header to tell if the saved image is corrupt

19 Aug 2011

      于 2011年08月17日 23:50, Eric Blake 写道:
...
On 08/17/2011 08:59 AM, Daniel P. Berrange wrote:
...
On Wed, Aug 17, 2011 at 09:10:41PM +0800, Osier Yang wrote:
...
If one tries to restore a domain from a corrupt save image, we blindly
goes forward to restore from it, this can cause many different errors,
depending on how much the image is saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking if we can introduce a new feild to struct 
qemud_save_header,
such as "bool complete;", and set it true if succeeded to save the 
image,
false if not. So that could do some checking while trying to open 
the image
(qemuDomainSaveImageOpen), and quit early if "complete" is false, with
a sensiable error message.
Thought?
I assume you mean that when saving the guest, we'd do the following 
sequence
1. Write out basic header with complete=false
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If success, update header with complete=true
And then on restore we'd do
1. If complete == false, then quit with error
2. Run QEMU with -incoming to restore
Yes, I think we need that.
...
At which point I wonder what's wrong with:
1. Write out basic header
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If not success, unlink save fail
The only case where there's any difference, is if libvirtd itself 
crashes
between steps 1& 4.
Remember, that 'migrate' is a long-running async job command, and can 
be interrupted. That is, 'service libvirtd restart' is a legal action 
to take during step 3, and it is not as severe as a libvirtd crash, 
and we have already recently added patches to remember async job 
status across libvirtd restarts with the intention of making it legal 
to restart libvirtd in the middle of an async job (whether the async 
job should still succeed, or should remove the save file, is a 
slightly different question; but removing the save file would require 
that we save in the XML the name of the file to remove if libvirtd is 
restarted).
Hmm, how about restart libvirtd during the process of managed saving?

Domain will be restored from the corrupt save image automatically. We
report an error like "image is corrupt" and quite the domain starting 
simply?
This might be not good, as one will see a running domain fails to start
after libvirtd restarting.

Or we want to the managed saving still succeed? If so, we might need:

1) continue the managed saving job, (Per we are already support remeber
the async job status across libvirtd restarting)
2) restore from the saved image finished in 1).
...
Also, in the case of 'virsh save', the file is user-visible, and 
there's no telling if a user does 'virsh save' in one window, then 
asynchronously types 'cp file elsewhere; virsh restore elsewhere' in 
another window before the first one has finished. Using the longer 
algorithm of marking 'complete=false' then revisiting it to write 
'complete=true' provides more protection against users accessing their 
file system asynchronously, where they can otherwise get a corrupt 
image even without a libvirtd crash or restart.