[libvirt] libvirt-0.9.1 to 0.9.3-r1: managedsave/save won't start/restore at saved state

I'm seeing strange behaviour here. Any guest saved with either the managedsave or the save command from virsh won't restore to its saved state: a new full boot sequence happens.

- Tested against libvirt v0.9.1, v0.9.2, v0.9.3-r1 (Gentoo)
- Confirmed on three different Gentoo amd64 host systems.
- Tested with Gentoo and Ubuntu guests.
- Nothing relevant in /var/log/libvirt/libvirt.log or /var/log/libvirt/qemu/<dom>.log

The "state file" /var/lib/libvirt/qemu/save/<dom>.save exists and is deleted when 'virsh start' is called.

The new boot sequence is confirmed by:
- VNC console checks
- previous screen sessions being lost
- uptime

I've opened a bug at https://bugs.gentoo.org/show_bug.cgi?id=376333 but have had no answer. Any idea of what could be happening or how to investigate it?

-- 
Nicolas Sebrecht
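For reference, a minimal way to exercise both code paths mentioned above looks like this; the domain name 'guest' and the /tmp path are placeholders rather than names from the report:

  # Path 1: managed save, then start; libvirt is expected to pick the
  # saved image up automatically on the next start.
  virsh managedsave guest
  virsh start guest

  # Path 2: explicit save to a file, then explicit restore.
  virsh save guest /tmp/guest.state
  virsh restore /tmp/guest.state

In both cases, 'uptime' inside the guest should keep counting across the cycle if the RAM state was really restored.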

On 07/27/2011 02:37 AM, Nicolas Sebrecht wrote:
I'm seeing strange behaviour here. Any guest saved with either the managedsave or the save command from virsh won't restore to its saved state: a new full boot sequence happens.
- Tested against libvirt v0.9.1, v0.9.2, v0.9.3-r1 (Gentoo)
- Confirmed on three different Gentoo amd64 host systems.
- Tested with Gentoo and Ubuntu guests.
- Nothing relevant in /var/log/libvirt/libvirt.log or /var/log/libvirt/qemu/<dom>.log
The "state file" /var/lib/libvirt/qemu/save/<dom>.save exists and is deleted when 'virsh start' is called.
The new boot sequence is confirmed by:
- VNC console checks
- previous screen sessions being lost
- uptime
I've opened a bug at https://bugs.gentoo.org/show_bug.cgi?id=376333 but have had no answer.
Any idea of what could be happening or how to investigate it?
Does /var/log/libvirt/qemu/<dom>.log show the qemu process getting started with the -incoming fd:nnn flag? While you claim that nothing appeared to be relevant in that log, it might actually help to post a few lines of it for confirmation.

It's working for me with libvirt 0.9.3 on RHEL 6, so I'm not sure what to suggest that you try next.

-- 
Eric Blake   eblake@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org
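A quick way to look for that flag (assuming libvirt's default per-domain log location, as used elsewhere in this thread; <dom> stands for the domain name) is:

  grep -e '-incoming' /var/log/libvirt/qemu/<dom>.log

A restore attempt logs a qemu command line containing something like "-incoming fd:13", while a plain cold boot does not.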

The 27/07/11, Eric Blake wrote:
On 07/27/2011 02:37 AM, Nicolas Sebrecht wrote:
I'm seeing strange behaviour here. Any guest saved with either the managedsave or the save command from virsh won't restore to its saved state: a new full boot sequence happens.
- Tested against libvirt v0.9.1, v0.9.2, v0.9.3-r1 (Gentoo)
- Confirmed on three different Gentoo amd64 host systems.
- Tested with Gentoo and Ubuntu guests.
- Nothing relevant in /var/log/libvirt/libvirt.log or /var/log/libvirt/qemu/<dom>.log
The "state file" /var/lib/libvirt/qemu/save/<dom>.save exists and is deleted when 'virsh start' is called.
The new boot sequence is confirmed by:
- VNC console checks
- previous screen sessions being lost
- uptime
I've opened a bug at https://bugs.gentoo.org/show_bug.cgi?id=376333 but have had no answer.
Any idea of what could be happening or how to investigate it?
Does /var/log/libvirt/qemu/<dom>.log show the qemu process getting started with the -incoming fd:nnn flag? While you claim that nothing appeared to be relevant in that log, it might actually help to post a few lines of it for confirmation.
Here is a fresh test. Hostnames are:

  nicolas-desktop: my desktop
  homer: guest (logged in as root)
  xenon: host (logged in as root)

nicolas@nicolas-desktop> ssh homer.test.lan
root@homer> uptime
 10:06:44 up 3 min, 1 user, load average: 0.10, 0.24, 0.11
root@homer> exit

nicolas@nicolas-desktop> ssh xenon.test.lan
xenon ~ # virsh managedsave homer
Domain homer state saved by libvirt
xenon ~ # cd /var/lib/libvirt/qemu/save
xenon save # ls -l
total 195M
-rw------- 1 root root 195M Jul 28 10:08 homer.save

<waiting a bit>

xenon save # virsh start homer
Domain homer started
xenon save # ls -l
total 0
xenon save # exit

nicolas@nicolas-desktop> ssh homer.test.lan
root@homer> uptime
 10:22:42 up 0 min, 1 user, load average: 0.00, 0.00, 0.00
root@homer>

nicolas@nicolas-desktop> ssh xenon.test.lan
xenon ~ # tail /var/log/libvirt/qemu/homer.log
2011-07-28 10:03:07.718: shutting down
2011-07-28 10:03:41.103: starting up
LC_ALL=C PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.4.5:/root/bin HOME=/root USER=root QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.11 -enable-kvm -m 512 -smp 2,sockets=2,cores=1,threads=1 -name homer -uuid 90b87fd0-6add-c7c8-e6f8-b8245bae8329 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/homer.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/home/piing/libvirt/images/piing/homer.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,bus=pci.0,multifunction=on,addr=0x5.0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/root/virtuals/images/piing/homer-lun0.raw,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,bus=pci.0,multifunction=on,addr=0x8.0x0,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=17,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c3:7b:da,bus=pci.0,multifunction=on,addr=0x4.0x0 -netdev tap,fd=18,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:30:15:24,bus=pci.0,multifunction=on,addr=0x3.0x0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -incoming fd:13 -device virtio-balloon-pci,id=balloon0,bus=pci.0,multifunction=on,addr=0x6.0x0
Domain id=19 is tainted: high-privileges
char device redirected to /dev/pts/4
2011-07-28 10:08:11.024: shutting down
2011-07-28 10:22:48.203: starting up
LC_ALL=C PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.4.5:/root/bin HOME=/root USER=root QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.11 -enable-kvm -m 512 -smp 2,sockets=2,cores=1,threads=1 -name homer -uuid 90b87fd0-6add-c7c8-e6f8-b8245bae8329 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/homer.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/home/piing/libvirt/images/piing/homer.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,bus=pci.0,multifunction=on,addr=0x5.0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/root/virtuals/images/piing/homer-lun0.raw,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,bus=pci.0,multifunction=on,addr=0x8.0x0,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=18,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:c3:7b:da,bus=pci.0,multifunction=on,addr=0x4.0x0 -netdev tap,fd=20,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:30:15:24,bus=pci.0,multifunction=on,addr=0x3.0x0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -incoming fd:15 -device virtio-balloon-pci,id=balloon0,bus=pci.0,multifunction=on,addr=0x6.0x0
Domain id=20 is tainted: high-privileges
char device redirected to /dev/pts/5
xenon ~ #
It's working for me with libvirt 0.9.3 on RHEL 6, so I'm not sure what to suggest that you try next.
Yes, I'm pretty sure it works almost always for everybody out there. I suspect this issue is somewhat subtle. Thanks, Eric, for your help.

-- 
Nicolas Sebrecht
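One additional check that might narrow things down (a sketch only, not something suggested in the thread): a managed save image should use the same on-disk format as a file written by 'virsh save', so the explicit restore path can be tried against a copy of it. The backup path below is a placeholder.

  # Copy the image first: a later 'virsh start' deletes the managed copy.
  cp /var/lib/libvirt/qemu/save/homer.save /root/homer.save.bak
  virsh start homer                     # consumes (and removes) the managed image
  virsh destroy homer                   # stop the freshly booted instance again
  virsh restore /root/homer.save.bak    # does the explicit restore path behave differently?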

On 07/27/2011 04:37 AM, Nicolas Sebrecht wrote:
I'm seeing strange behaviour here. Any guest saved with either the managedsave or the save command from virsh won't restore to its saved state: a new full boot sequence happens.
- Tested against libvirt v0.9.1, v0.9.2, v0.9.3-r1 (Gentoo)
- Confirmed on three different Gentoo amd64 host systems.
- Tested with Gentoo and Ubuntu guests.
- Nothing relevant in /var/log/libvirt/libvirt.log or /var/log/libvirt/qemu/<dom>.log
The "state file" /var/lib/libvirt/qemu/save/<dom>.save exists and is deleted when 'virsh start' is called.
The new boot sequence is confirmed by:
- VNC console checks
- previous screen sessions being lost
- uptime
I've opened a bug at https://bugs.gentoo.org/show_bug.cgi?id=376333 but have had no answer.
Any idea of what could be happening or how to investigate it?
If you pause the guest ("virsh suspend") before the save, can you then successfully start the domain? (It should "start" in a paused state, and when the virsh command line returns to the prompt, you can resume it with "virsh resume".) If that is the case, it could be a race condition within qemu.

The 27/07/11, Laine Stump wrote:
If you pause the guest ("virsh suspend") before the save, can you then successfully start the domain? (It should "start" in a paused state, and when the virsh command line returns to the prompt, you can resume it with "virsh resume".) If that is the case, it could be a race condition within qemu.
This is why I like help from others: I would never have had the idea to do this check on my own. Unfortunately, it didn't help:

root@homer> uptime
 11:05:34 up 8 min, 1 user, load average: 0.00, 0.00, 0.00
root@homer>

virsh # suspend homer
Domain homer suspended

virsh # list
 Id Name                 State
----------------------------------
 21 homer                paused

virsh # managedsave homer
Domain homer state saved by libvirt

virsh # list
 Id Name                 State
----------------------------------

virsh # start homer
Domain homer started

virsh # list
 Id Name                 State
----------------------------------
 22 homer                paused

virsh # resume homer
Domain homer resumed

virsh #

root@homer> uptime
 11:06:51 up 0 min, 1 user, load average: 0.00, 0.00, 0.00
root@homer>

Thank you.

-- 
Nicolas Sebrecht

The 27/07/11, Nicolas Sebrecht wrote:
I'm seeing strange behaviour here. Any guest saved with either the managedsave or the save command from virsh won't restore to its saved state: a new full boot sequence happens.
- Tested against libvirt v0.9.1, v0.9.2, v0.9.3-r1 (Gentoo)
- Confirmed on three different Gentoo amd64 host systems.
- Tested with Gentoo and Ubuntu guests.
- Nothing relevant in /var/log/libvirt/libvirt.log or /var/log/libvirt/qemu/<dom>.log
The "state file" /var/lib/libvirt/qemu/save/<dom>.save exists and is deleted when 'virsh start' is called.
The new boot sequence is confirmed by:
- VNC console checks
- previous screen sessions being lost
- uptime
I've opened a bug at https://bugs.gentoo.org/show_bug.cgi?id=376333 but have had no answer.
Any idea of what could be happening or how to investigate it?
I've found another Gentoo host system with the same version of libvirt deployed where managedsave works. I'll investigate on my side to understand what differs between them.

-- 
Nicolas Sebrecht

The 27/07/11, Nicolas Sebrecht wrote:
I'm seeing strange behaviour here. Any guest saved with either the managedsave or the save command from virsh won't restore to its saved state: a new full boot sequence happens.
- Tested against libvirt v0.9.1, v0.9.2, v0.9.3-r1 (Gentoo)
- Confirmed on three different Gentoo amd64 host systems.
- Tested with Gentoo and Ubuntu guests.
- Nothing relevant in /var/log/libvirt/libvirt.log or /var/log/libvirt/qemu/<dom>.log
The "state file" /var/lib/libvirt/qemu/save/<dom>.save exists and is deleted when 'virsh start' is called.
The new boot sequence is confirmed by:
- VNC console checks
- previous screen sessions being lost
- uptime
I've opened a bug at https://bugs.gentoo.org/show_bug.cgi?id=376333 but have had no answer.
I'm stuck!

As told before, I have one working (in production) system and other failing Gentoo systems (including the testing machine).

I've checked the working system against the testing machine and looked for differences. I removed the differences one by one (luckily the systems are very close to each other) but couldn't get the testing machine to work.

I've checked the Linux kernel configuration (first) and the whole system for installed packages and compilation options. For each difference found I've done:
- compilation and reinstallation of ALL the packages;
- reboot;
- tests.

Now I have two almost identical systems behaving differently. Some minor differences remain in installed packages (missing on testing):
- lshw
- pv
- colorgcc
- autofs
- iperf

The hardware isn't the same, though. The main differences are:
- Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (cpu family: 6), hardware RAID
- Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz (cpu family: 6), software RAID

Ouch!

-- 
Nicolas Sebrecht
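For comparing two Gentoo hosts like this, one possible sketch (assuming portage-utils is installed and the kernels were built with CONFIG_IKCONFIG_PROC; "hostA"/"hostB" are placeholders):

  # Run on each host:
  qlist -Iv | sort > /tmp/packages.$(hostname)     # installed packages with versions
  zcat /proc/config.gz > /tmp/kconfig.$(hostname)  # running kernel configuration
  emerge --info > /tmp/emerge.$(hostname)          # USE flags, CFLAGS, profile

  # Then copy the files to one machine and diff:
  diff -u /tmp/packages.hostA /tmp/packages.hostB
  diff -u /tmp/kconfig.hostA /tmp/kconfig.hostB
  diff -u /tmp/emerge.hostA /tmp/emerge.hostB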

The 02/08/11, Nicolas Sebrecht wrote:
I'm stuck!
As told before, I have one working (in production) system and other failing Gentoo systems (including the testing machine).
I've checked the working system against the testing machine and looked for differences. I removed the differences one by one (luckily the systems are very close to each other) but couldn't get the testing machine to work.
I've checked the Linux kernel configuration (first) and the whole system for installed packages and compilation options. For each difference found I've done: compilation and reinstallation of ALL the packages; reboot; tests.
Now I have two almost identical systems behaving differently. Some minor differences remain in installed packages (missing on testing): lshw, pv, colorgcc, autofs, iperf.
The hardware isn't the same, though. The main differences are: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (cpu family: 6) with hardware RAID, vs. Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz (cpu family: 6) with software RAID.
Ouch!
So, I've tried yet another thing. I made a tarball of the whole working system and installed it on the testing bare metal (Core i3-2100). The 'start' command after 'managedsave' still fails. Then I tried to change the KVM-related kernel compilation options and built kvm-intel as a module: it fails again.

Did the hardware requirements change for libvirt/qemu-kvm? I can't understand why the _exact same_ system works on one piece of hardware and not on another (where it previously worked perfectly well).

-- 
Nicolas Sebrecht
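A few generic sanity checks on the new hardware that might be worth running before digging deeper (standard tools only; nothing here is specific to this report):

  grep -cE 'vmx|svm' /proc/cpuinfo   # non-zero: hardware virtualization is exposed to the OS
  lsmod | grep kvm                   # kvm / kvm_intel should be loaded when built as modules
  dmesg | grep -i kvm                # look for messages such as "kvm: disabled by bios"
  virsh capabilities | grep -i kvm   # libvirt should still report kvm domain support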

On 08/03/2011 06:44 AM, Nicolas Sebrecht wrote:
The 02/08/11, Nicolas Sebrecht wrote:
I'm stuck!
As told before, I have one working (in production) system and other failing Gentoo systems (including the testing machine).
I've checked the working system against the testing machine and looked for differences. I removed the differences one by one (luckily the systems are very close to each other) but couldn't get the testing machine to work.
I've checked the Linux kernel configuration (first) and the whole system for installed packages and compilation options. For each difference found I've done: compilation and reinstallation of ALL the packages; reboot; tests.
Now I have two almost identical systems behaving differently. Some minor differences remain in installed packages (missing on testing): lshw, pv, colorgcc, autofs, iperf.
The hardware isn't the same, though. The main differences are: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (cpu family: 6) with hardware RAID, vs. Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz (cpu family: 6) with software RAID.
Ouch! So, I've tried yet another thing. I made a tarball of the whole working system and installed it on the testing bare metal (Core i3-2100).
The 'start' command after 'managedsave' still fails. Then I tried to change the KVM-related kernel compilation options and built kvm-intel as a module: it fails again.
Did the hardware requirements change for libvirt/qemu-kvm? I can't understand why the _exact same_ system works on one piece of hardware and not on another (where it previously worked perfectly well).
One likely possibility is some sort of race condition where CPU speed (and other hardware-related) differences cause one thread/process to win the race on one of the machines, and another process/thread to win on the other. (The test I suggested earlier was inspired by just such a race that I previously encountered, but apparently the qemu you're running already has the fix for that one. :-( )

The 05/08/11, Laine Stump wrote:
So, I've tried yet another thing. I made a tarball of the whole working system and installed it on the testing bare metal (Core i3-2100).
The 'start' command after 'managedsave' still fails. Then I tried to change the KVM-related kernel compilation options and built kvm-intel as a module: it fails again.
Did the hardware requirements change for libvirt/qemu-kvm? I can't understand why the _exact same_ system works on one piece of hardware and not on another (where it previously worked perfectly well).
One likely possibility is some sort of race condition where CPU speed (and other hardware-related) differences cause one thread/process to win the race on one of the machines, and another process/thread to win on the other. (The test I suggested earlier was inspired by just such a race that I previously encountered, but apparently the qemu you're running already has the fix for that one. :-( )
Hmm. I wonder if it is that simple. I tried on yet another CPU, and it fails there, too. So, the system was tested on:
- Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz (fail)
- Intel(R) Xeon(R) CPU E5310 @ 1.60GHz (fail)
- Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz (fail)
- Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (works)

The only difference I can find is that all the failing systems use the Linux RAID10 module. The working system has hardware RAID10 and the Linux RAID10 module is not compiled in. I'll investigate in that direction. Thanks for the heads up.

-- 
Nicolas Sebrecht
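Since the per-domain qemu logs show nothing useful, one more avenue (a sketch; the log file name is arbitrary and the OpenRC service name is assumed for a Gentoo host) is to raise libvirtd's own log level and retry the failing cycle:

  # /etc/libvirt/libvirtd.conf -- temporary debug settings, revert afterwards:
  log_level = 1
  log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"

  # Restart the daemon and reproduce:
  /etc/init.d/libvirtd restart
  virsh managedsave homer && virsh start homer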
participants (3): Eric Blake, Laine Stump, Nicolas Sebrecht