On Wed, Mar 26, 2014 at 8:45 AM, Raphael Bauduin <rblists(a)gmail.com> wrote:
Hi,
we have regular crashes of a KVM host with the error "unable to handle
kernel paging request".
Can this be due to memory over-commitment, even though some memory is still
used by the kernel for caches and buffers? (The collectd graph shows no free
memory, with 15G used, very few buffers, and 1G of cache.) There are 32GB
of swap, of which only 150MB are used.
I suspect this might be the direction to search to find the cause, but would
be happy to learn from people versed in kernel behaviour who can confirm or
reject my hypothesis. Below is the full error.
Thanks!
Raph
745 Mar 23 14:27:37 sMaster01 kernel: [241450.355339] BUG: unable to
handle kernel paging request at ffff8804c001fade
746 Mar 23 14:27:37 sMaster01 kernel: [241450.355384] IP:
[<ffffffff8117e9e9>] bio_check_eod+0x29/0xcd
747 Mar 23 14:27:37 sMaster01 kernel: [241450.355433] PGD 1002063 PUD 0
748 Mar 23 14:27:37 sMaster01 kernel: [241450.355464] Oops: 0000 [#1] SMP
749 Mar 23 14:27:37 sMaster01 kernel: [241450.355496] last sysfs file:
/sys/devices/system/cpu/cpu15/topology/thread_siblings
750 Mar 23 14:27:37 sMaster01 kernel: [241450.355551] CPU 4
751 Mar 23 14:27:37 sMaster01 kernel: [241450.355577] Modules linked in:
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp kvm_amd kvm ip6table_filter
ip6_tables iptable_filter ip_tables x_tables tun nfsd exportfs nfs
lockd fscache nfs_acl auth_rpcgss sunrpc bridge stp bonding dm_round_robin
dm_multipath scsi_dh loop snd_pcm snd_timer snd soundcore snd_page_alloc
serio_raw evdev tpm_tis tpm tpm_bios psmouse pcspkr amd64_edac_mod
edac_core button edac_mce_amd shpchp i2c_piix4 container pci_hotplug
i2c_core processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
dm_snapshot dm_mod sd_mod crc_t10dif mptsas mptscsih mptbase lpfc
ehci_hcd scsi_transport_fc tg3 scsi_tgt scsi_transport_sas ohci_hcd libphy
scsi_mod usbcore nls_base thermal fan thermal_sys [last unloaded:
scsi_wait_scan]
752 Mar 23 14:27:37 sMaster01 kernel: [241450.356084] Pid: 3557, comm:
kjournald Not tainted 2.6.32.61vanilla #1 PRIMERGY BX630 S2
753 Mar 23 14:27:37 sMaster01 kernel: [241450.356141] RIP:
0010:[<ffffffff8117e9e9>] [<ffffffff8117e9e9>] bio_check_eod+0x29/0xcd
754 Mar 23 14:27:37 sMaster01 kernel: [241450.356196] RSP:
0018:ffff8804229abba0 EFLAGS: 00010202
755 Mar 23 14:27:37 sMaster01 kernel: [241450.356228] RAX:
ffff8804c001fad6 RBX: ffff8802e7235080 RCX: 00011200061e5110
756 Mar 23 14:27:37 sMaster01 kernel: [241450.356279] RDX:
0000000000000008 RSI: 0000000000000008 RDI: ffff8802e7235080
757 Mar 23 14:27:37 sMaster01 kernel: [241450.356331] RBP:
ffff8802e7235080 R08: 0000000000000000 R09: ffff880425c54c00
758 Mar 23 14:27:37 sMaster01 kernel: [241450.356383] R10:
0000000000000003 R11: 00000000022e539e R12: ffff8802e7235080
759 Mar 23 14:27:37 sMaster01 kernel: [241450.356434] R13:
ffff8802e7235080 R14: ffff880425c54c00 R15: ffff8802e6281850
760 Mar 23 14:27:37 sMaster01 kernel: [241450.356486] FS:
00007faa6a757820(0000) GS:ffff88000fc80000(0000) knlGS:0000000000000000
761 Mar 23 14:27:37 sMaster01 kernel: [241450.356540] CS: 0010 DS: 0018
ES: 0018 CR0: 000000008005003b
762 Mar 23 14:27:37 sMaster01 kernel: [241450.356573] CR2:
ffff8804c001fade CR3: 00000000cc11f000 CR4: 00000000000006e0
763 Mar 23 14:27:37 sMaster01 kernel: [241450.356628] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
764 Mar 23 14:27:37 sMaster01 kernel: [241450.356681] DR3:
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
765 Mar 23 14:27:37 sMaster01 kernel: [241450.356733] Process kjournald
(pid: 3557, threadinfo ffff8804229aa000, task ffff88041490a300)
766 Mar 23 14:27:37 sMaster01 kernel: [241450.356788] Stack:
767 Mar 23 14:27:37 sMaster01 kernel: [241450.356812] ffff880415382c00
0000000100000285 ffff8804229abfd8 0000000000005186
768 Mar 23 14:27:37 sMaster01 kernel: [241450.356852] <0>
0000000000000000 000000000f1c2776 ffff8804128efa38 ffff8802e7235080
769 Mar 23 14:27:37 sMaster01 kernel: [241450.356913] <0>
ffff8802e7235080 ffff8802e7235080 ffff8800cdacae40 ffffffff8117eb5a
770 Mar 23 14:27:37 sMaster01 kernel: [241450.356993] Call Trace:
771 Mar 23 14:27:37 sMaster01 kernel: [241450.357021]
[<ffffffff8117eb5a>] ? generic_make_request+0xcd/0x2f9
772 Mar 23 14:27:37 sMaster01 kernel: [241450.357058]
[<ffffffff810b6034>] ? mempool_alloc+0x55/0x106
773 Mar 23 14:27:37 sMaster01 kernel: [241450.357091]
[<ffffffff8117ee5c>] ? submit_bio+0xd6/0xf2
774 Mar 23 14:27:37 sMaster01 kernel: [241450.357125]
[<ffffffff8110d83f>] ? submit_bh+0xf5/0x115
775 Mar 23 14:27:37 sMaster01 kernel: [241450.357158]
[<ffffffff8110edc0>] ? sync_dirty_buffer+0x51/0x93
776 Mar 23 14:27:37 sMaster01 kernel: [241450.357196]
[<ffffffffa01727c7>] ? journal_commit_transaction+0xaa6/0xe4f [jbd]
777 Mar 23 14:27:37 sMaster01 kernel: [241450.357252]
[<ffffffffa0175194>] ? kjournald+0xdf/0x226 [jbd]
778 Mar 23 14:27:37 sMaster01 kernel: [241450.357288]
[<ffffffff810651de>] ? autoremove_wake_function+0x0/0x2e
779 Mar 23 14:27:37 sMaster01 kernel: [241450.357324]
[<ffffffffa01750b5>] ? kjournald+0x0/0x226 [jbd]
780 Mar 23 14:27:37 sMaster01 kernel: [241450.357357]
[<ffffffff81064f11>] ? kthread+0x79/0x81
781 Mar 23 14:27:37 sMaster01 kernel: [241450.357391]
[<ffffffff81011baa>] ? child_rip+0xa/0x20
782 Mar 23 14:27:37 sMaster01 kernel: [241450.357425]
[<ffffffff81016568>] ? read_tsc+0xa/0x20
783 Mar 23 14:27:37 sMaster01 kernel: [241450.357456]
[<ffffffff81064e98>] ? kthread+0x0/0x81
784 Mar 23 14:27:37 sMaster01 kernel: [241450.357487]
[<ffffffff81011ba0>] ? child_rip+0x0/0x20
785 Mar 23 14:27:37 sMaster01 kernel: [241450.357517] Code: 5c c3 41 55
49 89 fd 41 54 55 53 48 83 ec 38 65 48 8b 04 25 28 00 00 00 48 89 44 24 28
31 c0 85 f6 0f 84 86 00 00 00 48 8b 47 10 <48> 8b 40 08 48 8b 40 68 48 c1
f8 09 74 74 89 f2 48 8b 0f 48 39
786 Mar 23 14:27:37 sMaster01 kernel: [241450.357738] RIP
[<ffffffff8117e9e9>] bio_check_eod+0x29/0xcd
787 Mar 23 14:27:37 sMaster01 kernel: [241450.357772] RSP
<ffff8804229abba0>
788 Mar 23 14:27:37 sMaster01 kernel: [241450.357799] CR2:
ffff8804c001fade
789 Mar 23 14:27:37 sMaster01 kernel: [241450.358183] ---[ end trace
608fcf1f5a482549 ]---
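
As a quick sanity check on the trace above, here is a minimal Python sketch that relates the faulting address (CR2) to the register dump. The register values are copied from the oops; the helper name parse_registers and the OOPS_EXCERPT string are only illustrative. The bytes marked <48> 8b 40 08 in the "Code:" line decode to mov 0x8(%rax),%rax, so the fault address should be RAX + 8. Reading the unaligned RAX as a damaged pointer is an assumption, not something the trace proves.

```python
import re

# Register lines copied from the oops above (log prefixes removed).
OOPS_EXCERPT = """\
RAX: ffff8804c001fad6 RBX: ffff8802e7235080 RCX: 00011200061e5110
RDX: 0000000000000008 RSI: 0000000000000008 RDI: ffff8802e7235080
CR2: ffff8804c001fade CR3: 00000000cc11f000 CR4: 00000000000006e0
"""


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in text."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(OOPS_EXCERPT)

# The faulting instruction reads 8 bytes past RAX, so CR2 should equal RAX + 8.
offset = regs["CR2"] - regs["RAX"]

# Kernel structure pointers are normally at least 8-byte aligned on x86_64;
# a non-zero remainder here is suspicious (assumption: RAX holds a pointer).
misalignment = regs["RAX"] % 8

print(f"CR2 - RAX      = {offset:#x}")
print(f"RAX mod 8      = {misalignment}")
print(f"RDI (bio arg)  = {regs['RDI']:#018x}")
```

If the difference is 8 and RAX is not 8-byte aligned, the value loaded just before the fault (from offset 0x10 of the structure in RDI) was probably already bad when bio_check_eod read it. That fits corrupted memory at least as well as it fits memory pressure.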
We had a guest crashing with the same error "unable to handle kernel paging
request", but in the function __destroy_inode this time.
Could faulty memory cause this problem on host and guest?
Raph
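
One possible first check for the faulty-memory question, sketched in Python: the module list above includes amd64_edac_mod and edac_core, and the EDAC subsystem normally exposes per-memory-controller error counters in sysfs. The sketch assumes the usual /sys/devices/system/edac/mc layout with ce_count (corrected errors) and ue_count (uncorrected errors) files; the function name read_edac_counts is hypothetical, and whether these counters are populated on this particular host is not known from the post.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mcN dir.

    Counters that are missing are simply left out of the inner dict.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    # Prints nothing if the EDAC directory does not exist on this machine.
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

Non-zero counts would point towards the DIMMs or the memory controller. Zero counts would not rule out bad memory, so an offline memory test would still be worth running.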