[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)

6 Feb 2018

      Hi everyone,

I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.

I have a nested virtualization setup that looks as follows:

- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default

The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".

This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
Then, in the L0 guest we immediately see this:

[Tue Feb  6 07:00:37 2018] ------------[ cut here ]------------
[Tue Feb  6 07:00:37 2018] kernel BUG at ../arch/x86/kvm/x86.c:328!
[Tue Feb  6 07:00:37 2018] invalid opcode: 0000 [#1] SMP
[Tue Feb  6 07:00:37 2018] Modules linked in: fuse vhost_net vhost
macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 xt_tcpudp tun br_netfilter bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
ip_tables x_tables vboxpci(O) vboxnetadp(O) vboxnetflt(O) af_packet
iscsi_ibft iscsi_boot_sysfs vboxdrv(O) kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
hid_generic usbhid jitterentropy_rng drbg ansi_cprng ppdev parport_pc
floppy parport joydev aesni_intel processor button aes_x86_64
virtio_balloon virtio_net lrw gf128mul glue_helper pcspkr serio_raw
ablk_helper cryptd i2c_piix4 ext4 crc16 jbd2 mbcache ata_generic
[Tue Feb  6 07:00:37 2018]  virtio_blk ata_piix ahci libahci cirrus(O)
drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O)
drm(O) virtio_pci virtio_ring virtio uhci_hcd ehci_hcd usbcore
usb_common libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc
scsi_dh_alua scsi_mod autofs4
[Tue Feb  6 07:00:37 2018] CPU: 2 PID: 2041 Comm: CPU 0/KVM Tainted: G
       W  O     4.4.104-39-default #1
[Tue Feb  6 07:00:37 2018] Hardware name: OpenStack Foundation
OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[Tue Feb  6 07:00:37 2018] task: ffff880037108d80 ti: ffff88042e964000
task.ti: ffff88042e964000
[Tue Feb  6 07:00:37 2018] RIP: 0010:[<ffffffffa04f20e5>]
[<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb  6 07:00:37 2018] RSP: 0018:ffff88042e967d70  EFLAGS: 00010246
[Tue Feb  6 07:00:37 2018] RAX: 0000000000000000 RBX: ffff88042c4f0040
RCX: 0000000000000000
[Tue Feb  6 07:00:37 2018] RDX: 0000000000006820 RSI: 0000000000000282
RDI: ffff88042c4f0040
[Tue Feb  6 07:00:37 2018] RBP: ffff88042c4f00d8 R08: ffff88042e964000
R09: 0000000000000002
[Tue Feb  6 07:00:37 2018] R10: 0000000000000004 R11: 0000000000000000
R12: 0000000000000001
[Tue Feb  6 07:00:37 2018] R13: 0000021d34fbb21d R14: 0000000000000001
R15: 000055d2157cf840
[Tue Feb  6 07:00:37 2018] FS:  00007f7c52b96700(0000)
GS:ffff88043fd00000(0000) knlGS:0000000000000000
[Tue Feb  6 07:00:37 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Feb  6 07:00:37 2018] CR2: 00007f823b15f000 CR3: 0000000429334000
CR4: 0000000000362670
[Tue Feb  6 07:00:37 2018] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[Tue Feb  6 07:00:37 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[Tue Feb  6 07:00:37 2018] Stack:
[Tue Feb  6 07:00:37 2018]  ffffffffa07939b1 ffffffffa0787875
ffffffffa0503a60 ffff88042c4f0040
[Tue Feb  6 07:00:37 2018]  ffffffffa04e5ede ffff88042c4f0040
ffffffffa04e6f0f ffff880037108d80
[Tue Feb  6 07:00:37 2018]  ffff88042c4f00e0 ffff88042c4f00e0
ffff88042c4f0040 ffff88042e968000
[Tue Feb  6 07:00:37 2018] Call Trace:
[Tue Feb  6 07:00:37 2018]  [<ffffffffa07939b1>]
intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb  6 07:00:37 2018] DWARF2 unwinder stuck at
intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb  6 07:00:37 2018] Leftover inexact backtrace:
[Tue Feb  6 07:00:37 2018]  [<ffffffffa0787875>] ?
vmx_interrupt_allowed+0x15/0x30 [kvm_intel]
[Tue Feb  6 07:00:37 2018]  [<ffffffffa0503a60>] ?
kvm_arch_vcpu_runnable+0xa0/0xd0 [kvm]
[Tue Feb  6 07:00:37 2018]  [<ffffffffa04e5ede>] ?
kvm_vcpu_check_block+0xe/0x60 [kvm]
[Tue Feb  6 07:00:37 2018]  [<ffffffffa04e6f0f>] ?
kvm_vcpu_block+0x8f/0x310 [kvm]
[Tue Feb  6 07:00:37 2018]  [<ffffffffa0503c17>] ?
kvm_arch_vcpu_ioctl_run+0x187/0x400 [kvm]
[Tue Feb  6 07:00:37 2018]  [<ffffffffa04ea6d9>] ?
kvm_vcpu_ioctl+0x359/0x680 [kvm]
[Tue Feb  6 07:00:37 2018]  [<ffffffff81016689>] ? __switch_to+0x1c9/0x460
[Tue Feb  6 07:00:37 2018]  [<ffffffff81224f02>] ? do_vfs_ioctl+0x322/0x5d0
[Tue Feb  6 07:00:37 2018]  [<ffffffff811362ef>] ?
__audit_syscall_entry+0xaf/0x100
[Tue Feb  6 07:00:37 2018]  [<ffffffff8100383b>] ?
syscall_trace_enter_phase1+0x15b/0x170
[Tue Feb  6 07:00:37 2018]  [<ffffffff81225224>] ? SyS_ioctl+0x74/0x80
[Tue Feb  6 07:00:37 2018]  [<ffffffff81634a02>] ?
entry_SYSCALL_64_fastpath+0x16/0xae
[Tue Feb  6 07:00:37 2018] Code: d7 fe ff ff 8b 2d 04 6e 06 00 e9 c2
fe ff ff 48 89 f2 e9 65 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00
00 00 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 55 89 ff 48 89
[Tue Feb  6 07:00:37 2018] RIP  [<ffffffffa04f20e5>]
kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb  6 07:00:37 2018]  RSP <ffff88042e967d70>
[Tue Feb  6 07:00:37 2018] ---[ end trace e15c567f77920049 ]---

We only hit this kernel bug if we have a nested VM running. The exact
same setup, sent into managed save after shutting down the nested VM,
wakes up just fine.

Now I am aware of https://bugzilla.redhat.com/show_bug.cgi?id=1076294,
which talks about live migration — but I think the same considerations
apply.

I am also aware of
https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM,
which strongly suggests to use host-passthrough or host-model. I have
tried both, to no avail. The stack trace persists. I have also tried
running a 4.15 kernel in the L0 guest, from
https://kernel.opensuse.org/packages/stable, but again, the stack
trace persists.

What does fix things, of course, is to switch from the nested guest
from KVM to Qemu — but that also makes things significantly slower.

So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?

I'd be extraordinarily grateful for any suggestions. Thanks!

Cheers,
Florian

Florian Haas

Kashyap Chamarthy

David Hildenbrand

Florian Haas

David Hildenbrand

Florian Haas

David Hildenbrand

Florian Haas

David Hildenbrand

Daniel P. Berrangé

David Hildenbrand

Kashyap Chamarthy

Florian Haas

Kashyap Chamarthy

Florian Haas

Kashyap Chamarthy

Florian Haas

Kashyap Chamarthy

Kashyap Chamarthy

Daniel P. Berrangé

David Hildenbrand

Kashyap Chamarthy

tags

participants (4)