[libvirt-users] Can not restore domain from a shared state file

Hi,

I have two KVM hosts, h1 and h2, and both of them mount the same NFS directory as shared storage. On h1 I can save a domain to a state file on the shared storage (virsh save <domain> <file>) successfully, but restoring it on h2 fails with the following error message:

# virsh restore testRes.dat
error: Failed to restore domain from testRes.dat
error: operation failed: failed to start VM

Restoring on h1 always works. On h2 it only works sometimes: if I wait for a while, the "virsh restore" command may succeed. My guess is that the state file generated by "virsh save" is not intact from h2's point of view. Could this be caused by the cache of the NFS server?

Any suggestions would be really appreciated. Thanks in advance!

Regards,
Qian

I found that simply waiting for a while does not help, but if I retry the restore on h2 several times, it eventually works. For example, if I save a domain that has 512MB of memory and restore it on h2 repeatedly, the restore succeeds on the third attempt; for a 256MB domain, it succeeds on the second attempt.

2010/5/11 Zhang Qian <zhq527725@gmail.com>:
> I can always restore it from h1, but sometimes works for h2 (wait for a while, then "virsh restore" command may succeed in h2).

On 05/11/2010 04:40 AM, Zhang Qian wrote:
> I guess the state file generated by "virsh save" command is not intact from h2 point view, may be cause by the cache of NFS server?

There is a race condition in qemu when restarting a saved domain: it is possible for qemu to start the CPU before the domain image has been read from the file (this is regardless of where the file is stored). This may or may not be your problem (the error condition I saw due to this race was different from what you are seeing).

It is easy to test for, though. Before saving your domain, first suspend it with "virsh suspend <domain>", then save it. After you've restored the domain with "virsh restore <image-file>", resume it with "virsh resume <domain>". If the domain resumes successfully, your problem was the race I describe. If not, you have found a different problem.

I'm interested to know if this solves your problem.
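The test sequence described above can be sketched as two small shell functions, one for each host. This is only an illustration of the steps named in the message: the function names and the VIRSH variable are hypothetical (VIRSH is just a hook so the steps can be dry-run with VIRSH="echo virsh"), and only the four virsh subcommands come from the thread.

```shell
# Hypothetical helpers, not part of libvirt. $VIRSH is an assumed override
# hook for dry runs; by default the real virsh binary is used.
VIRSH="${VIRSH:-virsh}"

# Run on the source host: pause the guest first, then save its state.
save_paused() {
    domain=$1
    image=$2
    $VIRSH suspend "$domain" && $VIRSH save "$domain" "$image"
}

# Run on the target host: restore the image, then let the CPUs run again.
restore_and_resume() {
    domain=$1
    image=$2
    $VIRSH restore "$image" && $VIRSH resume "$domain"
}
```

On the saving host one would call save_paused <domain> <file>, and on the restoring host restore_and_resume <domain> <file>. If the restore step itself fails, the resume step is never reached, which points to a problem other than the race.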

Hi Laine,

Thanks for your reply. I think I have found a different problem.

I tried what you suggested (suspend the domain with "virsh suspend <domain>" before saving it, then after "virsh restore <image-file>" resume it with "virsh resume <domain>"). When I restored the domain on another host, it failed with the error I mentioned before:

error: Failed to restore domain from testRes.dat
error: operation failed: failed to start VM

For now I can work around my problem in an "ugly" way: before restoring the domain on another host, I read the whole suspend image completely on that host. Here is my code to do that read; it is very simple:

    if ((fd = open(suspendImage, O_RDONLY)) < 0) {
        goto error;
    }
    /* Read until end of file; the data itself is discarded. */
    while ((size = read(fd, buf, MAXLINELEN))) {
        if (size == -1) {
            goto error;
        }
    }
    close(fd);

After this code has been executed, restoring the domain on that host succeeds! This workaround (before restoring a domain on another host, read the suspend image completely on that host) has worked every time in my environment so far.

I am not sure why it works. Maybe the read operation triggers a refresh of the NFS cache, so that the complete suspend image can be accessed on the target host. I don't know...

Regards,
Qian

2010/5/12 Laine Stump <laine@laine.org>:
> I'm interested to know if this solves your problem.
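For anyone who wants to try Qian's read-before-restore workaround without compiling anything, a minimal shell sketch follows. It assumes that reading the file once with cat has the same effect as the C read loop in the message, which has not been verified here. The function names are hypothetical, and ${VIRSH:-virsh} is only a dry-run hook.

```shell
# Read the whole image once and throw the data away. Returns non-zero if
# the file cannot be opened or read.
preread() {
    cat -- "$1" > /dev/null
}

# Hypothetical wrapper: pre-read the image on this host, then restore it.
# ${VIRSH:-virsh} falls back to the real virsh unless VIRSH is set.
preread_then_restore() {
    image=$1
    preread "$image" && ${VIRSH:-virsh} restore "$image"
}
```

On the restoring host this would be called as preread_then_restore <image-file>. As in the original message, why the extra read helps is not established, so this should be treated as a workaround to experiment with rather than a fix.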
participants (2)
- Laine Stump
- Zhang Qian