On 04/07/2010 03:43 PM, Chris Lalancette wrote:
> Hm, this really doesn't seem like it's the way to fix this.
You are correct that it isn't what should be done in the long term.
Short term, though, it definitely fixes bad behavior that I wouldn't
want to see in an official release (on my hardware, restores will
basically always fail unless the guest was paused prior to saving).
> We really should investigate what is going on in qemu, and see if it's
> a bug in qemu itself (in which case we should fix qemu), or if it's a
> bug in the way we communicate with qemu (in which case we should fix
> that).
I'm operating on information I learned in an IRC chat. Perhaps Dan
Berrange can pipe up here to repeat/expand on what he said, but
basically it sounds like the problem is that qemu will happily start the
CPUs for us before the restore operation has begun, and there's no way
for us to verify whether or not it has begun - for that, qemu will need
to make 'info migrate' work on the incoming side, and that's not likely
to happen very quickly (of course it will take even longer if I don't
whine about it, I just haven't gotten there yet ;-)
> A sleep is just hiding the problem
Yes, I dislike this solution, and I'd love it if someone could tell me
of an alternative. But if there is no other way to fix it entirely
within libvirt, I don't think we should just report the problem to qemu
and let users suffer until it gets fixed there. Especially if that fix
requires a new interface in qemu that must then be supported by libvirt,
the path to reliably working domain restores could be very long indeed.
In the meantime we'd be left with delivered code that may fail in a
rather bad way for someone, especially in the case of a managed save,
where the image is deleted as soon as the domain is started - if it
fails once, the image is gone, so you can't even try again.
> (which means it can still pop up on machines slower, or more busy,
> than yours!).
I'm doubtful that slower VT-capable machines exist (although I haven't
checked - possibly this same problem exists when doing software
emulation too). I hadn't considered whether this would pop up on faster
hardware that was also busier - a very good point.
(I did just do some more testing, and found that even 50msec is enough
to make things work. 10msec isn't enough, though...)