[libvirt] [RFC] Add a new field to qemud_save_header to tell if the saved image is corrupt

If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g. https://bugzilla.redhat.com/show_bug.cgi?id=730750

So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.

Thoughts?

Thanks
Osier
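To make the proposal concrete, here is a rough sketch of what it could look like. The surrounding header fields are recalled from src/qemu/qemu_driver.c and the helper is invented for illustration, so treat both as assumptions rather than an actual patch:

    /* Sketch only: existing field names are from memory, and the added
     * "complete" member is the proposal, not existing code. */
    #include <stdbool.h>

    #define QEMUD_SAVE_MAGIC "LibvirtQemudSave"

    struct qemud_save_header {
        char magic[sizeof(QEMUD_SAVE_MAGIC)-1];
        int version;
        int xml_len;
        int was_running;
        int compressed;
        bool complete;   /* proposed: true only once the image is fully written */
        int unused[15];
    };

    /* Hypothetical check for qemuDomainSaveImageOpen(): refuse to go any
     * further if the save was never finished. Returns 0 if usable, -1 if not,
     * and the caller would report "save image is incomplete". */
    static int check_save_complete(const struct qemud_save_header *header)
    {
        return header->complete ? 0 : -1;
    }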

On 08/17/2011 07:10 AM, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Almost. I think we can reuse one of the existing spare fields in the header (that is, change unused[15] to instead be unused[14] and make the new field a uint32_t), and I also think we need to have a tri-state value:

0 - save image was created with older libvirt; no idea if the image is sane
1 - save image created by current libvirt, but not yet marked complete; attempts to restore from this image should fail with a sensible message suggesting nuking the save image since it is broken - value written at start of the save process
2 - save image created by current libvirt and completed - value written at end of the save process

And of course, we have to update the bswap_header routine to treat it with the same endianness as the rest of the header.
Thoughts?
Sounds reasonable. I don't even think we have to bump the qemud_save_header version (that is, older libvirt will ignore the field, and the size is not changing, and newer libvirt will correctly handle the situation where the field was not set).

The toughest part will be figuring out how to modify the field when using an O_DIRECT fd. I guess here, we write the entire file with O_DIRECT, then when qemu completes the save, we close the fd and open a new one that is not O_DIRECT; since we would only be polluting the cache with one sector of the file, that beats having to worry about O_DIRECT pickiness (the main point of O_DIRECT was for the benefit of the much larger portion of the file written by qemu).

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
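For what it's worth, a sketch of the tri-state variant being described here, continuing the sketch from the first mail; the enum names are invented for illustration and the existing bswap_header body is paraphrased from memory, so treat all of it as an assumption:

    #include <stdint.h>
    #include <byteswap.h>

    enum {
        QEMUD_SAVE_COMPLETE_UNKNOWN = 0,  /* header written by an older libvirt */
        QEMUD_SAVE_INCOMPLETE       = 1,  /* written at the start of the save */
        QEMUD_SAVE_COMPLETE         = 2,  /* rewritten once the save finished */
    };

    struct qemud_save_header {
        char magic[sizeof(QEMUD_SAVE_MAGIC)-1];
        int version;
        int xml_len;
        int was_running;
        int compressed;
        uint32_t complete;   /* one of the values above */
        int unused[14];      /* was unused[15] */
    };

    /* bswap_header grows by one line; the existing swaps are paraphrased. */
    static void bswap_header(struct qemud_save_header *hdr)
    {
        hdr->version     = bswap_32(hdr->version);
        hdr->xml_len     = bswap_32(hdr->xml_len);
        hdr->was_running = bswap_32(hdr->was_running);
        hdr->compressed  = bswap_32(hdr->compressed);
        hdr->complete    = bswap_32(hdr->complete);   /* new */
    }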

On Wed, Aug 17, 2011 at 08:12:03AM -0600, Eric Blake wrote:
On 08/17/2011 07:10 AM, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Almost. I think we can reuse one of the existing spare fields in the header (that is, change unused[15] to instead be unused[14] and make the new field a uint32_t), and I also think we need to have a tri-state value:
0 - save image was created with older libvirt; no idea if the image is sane
1 - save image created by current libvirt, but not yet marked complete; attempts to restore from this image should fail with a sensible message suggesting nuking the save image since it is broken - value written at start of the save process
2 - save image created by current libvirt and completed - value written at end of the save process
And of course, we have to update the bswap_header routine to treat it with the same endianness as the rest of the header.
The downside to adding a new header field is that old libvirt won't look for it. A slightly more evil approach is to:

1. Write the header, but with 'magic' set to all zeros
2. Do the migration
3. Re-write the header to set the correct 'magic'

On the plus side, old libvirt will refuse to restore from this. On the downside, new libvirt will give a less helpful error ("not a known save file" instead of "save file is corrupt").

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 08/17/2011 10:01 AM, Daniel P. Berrange wrote:
The downside to adding a new header field is that old libvirt won't look for it. A slightly more evil approach is to:

1. Write the header, but with 'magic' set to all zeros
Or even to a specific value, different from all zeros, which new libvirt recognizes as an incomplete save file.
2. Do the migration
3. Re-write the header to set the correct 'magic'
On the plus side, old libvirt will refuse to restore from this. On the downside, new libvirt will give a less helpful error ("not a known save file" instead of "save file is corrupt").
Old libvirt won't recognize the file at all (still a good point, since it shouldn't be trying to load an incomplete file), but by having two different magic numbers, newer libvirt can make more sensible error reports and decisions on how to handle an incomplete file, in contrast to a file that is not even a partial save file. I like the idea of reusing the magic rather than burning a new field, since that buys us protection for older libvirt as well.

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
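A sketch of the two-magic idea. QEMUD_SAVE_MAGIC matches my recollection of the existing constant, while QEMUD_SAVE_PARTIAL is a name invented here for the "save still in progress" marker (deliberately the same length as the real magic); both should be treated as assumptions:

    #include <string.h>

    #define QEMUD_SAVE_MAGIC   "LibvirtQemudSave"
    #define QEMUD_SAVE_PARTIAL "LibvirtQemudPart"   /* hypothetical second magic */

    enum save_image_state { SAVE_IMAGE_OK, SAVE_IMAGE_PARTIAL, SAVE_IMAGE_UNKNOWN };

    /* The save path writes QEMUD_SAVE_PARTIAL first, runs the migration, then
     * rewrites just the magic bytes with QEMUD_SAVE_MAGIC. On restore, newer
     * libvirt can tell the two failure modes apart: */
    static enum save_image_state classify_magic(const char magic[sizeof(QEMUD_SAVE_MAGIC) - 1])
    {
        if (memcmp(magic, QEMUD_SAVE_MAGIC, sizeof(QEMUD_SAVE_MAGIC) - 1) == 0)
            return SAVE_IMAGE_OK;        /* complete save image */
        if (memcmp(magic, QEMUD_SAVE_PARTIAL, sizeof(QEMUD_SAVE_MAGIC) - 1) == 0)
            return SAVE_IMAGE_PARTIAL;   /* "save was interrupted" */
        return SAVE_IMAGE_UNKNOWN;       /* not a libvirt save image at all */
    }

Older libvirt only knows the first magic, so it refuses to touch a partial file, which is the protection Daniel described.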

On 08/17/2011 10:12 PM, Eric Blake wrote:
On 08/17/2011 07:10 AM, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Almost. I think we can reuse one of the existing spare fields in the header (that is, change unused[15] to instead be unused[14] and make the new field a uint32_t), and I also think we need to have a tri-state value:
0 - save image was created with older libvirt; no idea if the image is sane
1 - save image created by current libvirt, but not yet marked complete; attempts to restore from this image should fail with a sensible message suggesting nuking the save image since it is broken - value written at start of the save process
2 - save image created by current libvirt and completed - value written at end of the save process
And of course, we have to update the bswap_header routine to treat it with the same endianness as the rest of the header.
Thoughts?
Sounds reasonable. I don't even think we have to bump the qemud_save_header version (that is, older libvirt will ignore the field, and the size is not changing, and newer libvirt will correctly handle the situation where the field was not set).
Agreed on using "unused[14]"; with this we don't need to care about the client/server (C/S) communication between older and newer versions, or newer and older. Thanks for this good thought. :)
The toughest part will be figuring out how to modify the field when using an O_DIRECT fd. I guess here, we write the entire file with O_DIRECT, then when qemu completes the save, we close the fd and open a new one that is not O_DIRECT; since we would only be polluting the cache with one sector of the file, that beats having to worry about O_DIRECT pickiness (the main point of O_DIRECT was for the benefit of the much larger portion of the file written by qemu).
I don't quite understand here. What I gather is that opening the file with O_DIRECT causes the dirty cache to be flushed to disk before every subsequent read/write. What pollutes the cache? I agree on reopening the save file without O_DIRECT to modify the field, though.

Osier
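To illustrate the "reopen without O_DIRECT just to finish the header" idea being agreed on here, a simplified sketch reusing the header struct sketched earlier and plain POSIX calls; the real code would go through libvirt's own file and security helpers, and the function name is made up:

    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: rewrite the header at the start of the save file
     * once qemu has finished writing the rest of it. */
    static int qemud_save_finish_header(const char *path,
                                        const struct qemud_save_header *header)
    {
        /* Deliberately no O_DIRECT: only one sector of the file goes through
         * the page cache, so there is no need for O_DIRECT alignment rules. */
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;

        if (pwrite(fd, header, sizeof(*header), 0) != (ssize_t)sizeof(*header) ||
            fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }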

On Wed, Aug 17, 2011 at 09:10:41PM +0800, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Thoughts?
I assume you mean that when saving the guest, we'd do the following sequence:

1. Write out basic header with complete=false
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If success, update header with complete=true

And then on restore we'd do:

1. If complete == false, then quit with error
2. Run QEMU with -incoming to restore

At which point I wonder what's wrong with:

1. Write out basic header
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If not success, unlink the save file

The only case where there's any difference is if libvirtd itself crashes between steps 1 & 4.

Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
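As a sketch of that first sequence, continuing the earlier struct/enum sketches; every helper below is a placeholder standing in for the real libvirt plumbing, not an actual libvirt function:

    #include <string.h>

    /* Placeholder prototypes for the real plumbing. */
    int write_header_and_xml(const char *path, struct qemud_save_header *hdr, const char *xml);
    int run_qemu_migrate_to_file(const char *path);
    int rewrite_header(const char *path, const struct qemud_save_header *hdr);

    static int save_domain_sketch(const char *path, const char *xml)
    {
        struct qemud_save_header header;

        memset(&header, 0, sizeof(header));
        memcpy(header.magic, QEMUD_SAVE_MAGIC, sizeof(header.magic));
        header.xml_len  = strlen(xml) + 1;
        header.complete = QEMUD_SAVE_INCOMPLETE;          /* step 1: mark "in progress" */

        if (write_header_and_xml(path, &header, xml) < 0) /* steps 1-2 */
            return -1;

        if (run_qemu_migrate_to_file(path) < 0)           /* step 3 */
            return -1;

        header.complete = QEMUD_SAVE_COMPLETE;            /* step 4: mark complete, */
        return rewrite_header(path, &header);             /* e.g. via the non-O_DIRECT reopen */
    }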

On 08/17/2011 08:59 AM, Daniel P. Berrange wrote:
On Wed, Aug 17, 2011 at 09:10:41PM +0800, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Thoughts?
I assume you mean that when saving the guest, we'd do the following sequence
1. Write out basic header with complete=false
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If success, update header with complete=true
And then on restore we'd do
1. If complete == false, then quit with error
2. Run QEMU with -incoming to restore
Yes, I think we need that.
At which point I wonder what's wrong with:
1. Write out basic header
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If not success, unlink the save file
The only case where there's any difference is if libvirtd itself crashes between steps 1 & 4.
Remember that 'migrate' is a long-running async job command, and can be interrupted. That is, 'service libvirtd restart' is a legal action to take during step 3, and it is not as severe as a libvirtd crash; we have already recently added patches to remember async job status across libvirtd restarts with the intention of making it legal to restart libvirtd in the middle of an async job (whether the async job should still succeed, or should remove the save file, is a slightly different question; but removing the save file would require that we save in the XML the name of the file to remove if libvirtd is restarted).

Also, in the case of 'virsh save', the file is user-visible, and there's no telling whether a user does 'virsh save' in one window, then asynchronously types 'cp file elsewhere; virsh restore elsewhere' in another window before the first one has finished. Using the longer algorithm of marking 'complete=false' and then revisiting it to write 'complete=true' provides more protection for users accessing their file system asynchronously, who could otherwise get a corrupt image even without a libvirtd crash or restart.

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org

On 08/17/2011 11:50 PM, Eric Blake wrote:
On 08/17/2011 08:59 AM, Daniel P. Berrange wrote:
On Wed, Aug 17, 2011 at 09:10:41PM +0800, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Thoughts?
I assume you mean that when saving the guest, we'd do the following sequence
1. Write out basic header with complete=false
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If success, update header with complete=true
And then on restore we'd do
1. If complete == false, then quit with error
2. Run QEMU with -incoming to restore
Yes, I think we need that.
At which point I wonder what's wrong with:
1. Write out basic header
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If not success, unlink the save file
The only case where there's any difference is if libvirtd itself crashes between steps 1 & 4.
Remember that 'migrate' is a long-running async job command, and can be interrupted. That is, 'service libvirtd restart' is a legal action to take during step 3, and it is not as severe as a libvirtd crash; we have already recently added patches to remember async job status across libvirtd restarts with the intention of making it legal to restart libvirtd in the middle of an async job (whether the async job should still succeed, or should remove the save file, is a slightly different question; but removing the save file would require that we save in the XML the name of the file to remove if libvirtd is restarted).
Hmm, what about restarting libvirtd during the process of a managed save? The domain would be restored from the corrupt save image automatically. Do we simply report an error like "image is corrupt" and abort starting the domain? That might not be good, as one would see a previously running domain fail to start after a libvirtd restart.

Or do we want the managed save to still succeed? If so, we might need to:

1) continue the managed save job (since we already support remembering the async job status across libvirtd restarts)
2) restore from the saved image finished in 1)
Also, in the case of 'virsh save', the file is user-visible, and there's no telling whether a user does 'virsh save' in one window, then asynchronously types 'cp file elsewhere; virsh restore elsewhere' in another window before the first one has finished. Using the longer algorithm of marking 'complete=false' and then revisiting it to write 'complete=true' provides more protection for users accessing their file system asynchronously, who could otherwise get a corrupt image even without a libvirtd crash or restart.

On 08/18/2011 11:42 PM, Osier Yang wrote:
Remember that 'migrate' is a long-running async job command, and can be interrupted. That is, 'service libvirtd restart' is a legal action to take during step 3, and it is not as severe as a libvirtd crash; we have already recently added patches to remember async job status across libvirtd restarts with the intention of making it legal to restart libvirtd in the middle of an async job (whether the async job should still succeed, or should remove the save file, is a slightly different question; but removing the save file would require that we save in the XML the name of the file to remove if libvirtd is restarted).
Hmm, what about restarting libvirtd during the process of a managed save? The domain would be restored from the corrupt save image automatically. Do we simply report an error like "image is corrupt" and abort starting the domain? That might not be good, as one would see a previously running domain fail to start after a libvirtd restart.

Or do we want the managed save to still succeed? If so, we might need to:

1) continue the managed save job (since we already support remembering the async job status across libvirtd restarts)
2) restore from the saved image finished in 1)
I think the easiest approach is: if we restart libvirtd and see that an async job for save-to-file was in progress, then we abort the job (leaving the file marked unfinished, whether it was a managed save or a user save) and log the error.

On managed restore (virDomainCreate or autostart), if the save file exists but is incomplete, then log the fact that the file is unusable, unlink() the file, and proceed to do a normal boot (there is nothing we can do to recover the lost autosave, but we can at least clean up on the user's behalf).

On user restore (virDomainRestore), if the save file exists but is incomplete, report the error to the user. No unlink(), and no rebooting the guest; it's up to the user to decide how to handle the failed save.

But if we can figure out how to do better, by making a libvirtd restart able to complete the save process rather than ditch it, then that would be nicer. It's just that I don't know how easy that would be, and we have to start this patch somewhere.

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
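A sketch of that restore-side policy, continuing the earlier sketches; the logging and error-reporting helpers here are placeholders, not actual libvirt APIs:

    #include <stdbool.h>
    #include <unistd.h>

    /* Placeholders for libvirt's logging/error reporting. */
    void log_warn(const char *msg, const char *path);
    void report_error(const char *msg, const char *path);

    /* Sketch: returns 0 if the image is usable, 1 if the caller should fall
     * back to a fresh boot, -1 on error. 'managed' distinguishes managed
     * restore (virDomainCreate/autostart) from user restore (virDomainRestore). */
    static int check_save_image_sketch(const char *path, bool managed,
                                       const struct qemud_save_header *header)
    {
        if (header->complete == QEMUD_SAVE_COMPLETE)
            return 0;                               /* fully written image */

        if (managed) {
            /* Incomplete managed save: log it, discard the file, normal boot. */
            log_warn("discarding incomplete managed save image", path);
            unlink(path);
            return 1;
        }

        /* Incomplete user save: report and bail out; leave the file alone. */
        report_error("save image is incomplete", path);
        return -1;
    }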

On 08/17/2011 10:59 PM, Daniel P. Berrange wrote:
On Wed, Aug 17, 2011 at 09:10:41PM +0800, Osier Yang wrote:
If one tries to restore a domain from a corrupt save image, we blindly go forward and restore from it, which can cause many different errors depending on how much of the image was saved. E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=730750
So I'm thinking we could introduce a new field to struct qemud_save_header, such as "bool complete;", and set it to true if the image was saved successfully, false if not. We could then do some checking while opening the image (qemuDomainSaveImageOpen), and quit early with a sensible error message if "complete" is false.
Thoughts?

I assume you mean that when saving the guest, we'd do the following sequence:
1. Write out basic header with complete=false
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If success, update header with complete=true
And then on restore we'd do
1. If complete == false, then quit with error
2. Run QEMU with -incoming to restore
Yes, exactly.
At which point I wonder what's wrong with:
1. Write out basic header
2. Write out XML doc
3. Run 'migrate' in QEMU to write save state
4. If not success, unlink the save file
The only case where there's any difference is if libvirtd itself crashes between steps 1 & 4.
Daniel