On Wed, Mar 26, 2014 at 8:45 AM, Raphael Bauduin <rblists@gmail.com> wrote:
Hi,

We have regular crashes of a KVM host with the error "unable to handle kernel paging request".
Can this be due to memory over-commitment, even though some memory is still used by the kernel for caches and buffers? (The collectd graph shows no free memory, with 15 GB used, very few buffers, and 1 GB of cache.) There are 32 GB of swap, of which only 150 MB are used.

I suspect this might be the direction to search in to find the cause, but I would be happy to hear from people versed in kernel behaviour who can confirm or reject my hypothesis. The full error is below.

Thanks!

Raph



745 Mar 23 14:27:37 sMaster01 kernel: [241450.355339] BUG: unable to handle kernel paging request at ffff8804c001fade
 746 Mar 23 14:27:37 sMaster01 kernel: [241450.355384] IP: [<ffffffff8117e9e9>] bio_check_eod+0x29/0xcd
 747 Mar 23 14:27:37 sMaster01 kernel: [241450.355433] PGD 1002063 PUD 0
 748 Mar 23 14:27:37 sMaster01 kernel: [241450.355464] Oops: 0000 [#1] SMP
 749 Mar 23 14:27:37 sMaster01 kernel: [241450.355496] last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings
 750 Mar 23 14:27:37 sMaster01 kernel: [241450.355551] CPU 4
 751 Mar 23 14:27:37 sMaster01 kernel: [241450.355577] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp kvm_amd kvm ip6table_filter ip6_tables iptable_filter ip_tables x_tables tun nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc bridge stp bonding dm_round_robin dm_multipath scsi_dh loop snd_pcm snd_timer snd soundcore snd_page_alloc serio_raw evdev tpm_tis tpm tpm_bios psmouse pcspkr amd64_edac_mod edac_core button edac_mce_amd shpchp i2c_piix4 container pci_hotplug i2c_core processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sd_mod crc_t10dif mptsas mptscsih mptbase lpfc ehci_hcd scsi_transport_fc tg3 scsi_tgt scsi_transport_sas ohci_hcd libphy scsi_mod usbcore nls_base thermal fan thermal_sys [last unloaded: scsi_wait_scan]
 752 Mar 23 14:27:37 sMaster01 kernel: [241450.356084] Pid: 3557, comm: kjournald Not tainted 2.6.32.61vanilla #1 PRIMERGY BX630 S2
 753 Mar 23 14:27:37 sMaster01 kernel: [241450.356141] RIP: 0010:[<ffffffff8117e9e9>]  [<ffffffff8117e9e9>] bio_check_eod+0x29/0xcd
 754 Mar 23 14:27:37 sMaster01 kernel: [241450.356196] RSP: 0018:ffff8804229abba0  EFLAGS: 00010202
 755 Mar 23 14:27:37 sMaster01 kernel: [241450.356228] RAX: ffff8804c001fad6 RBX: ffff8802e7235080 RCX: 00011200061e5110
 756 Mar 23 14:27:37 sMaster01 kernel: [241450.356279] RDX: 0000000000000008 RSI: 0000000000000008 RDI: ffff8802e7235080
 757 Mar 23 14:27:37 sMaster01 kernel: [241450.356331] RBP: ffff8802e7235080 R08: 0000000000000000 R09: ffff880425c54c00
 758 Mar 23 14:27:37 sMaster01 kernel: [241450.356383] R10: 0000000000000003 R11: 00000000022e539e R12: ffff8802e7235080
 759 Mar 23 14:27:37 sMaster01 kernel: [241450.356434] R13: ffff8802e7235080 R14: ffff880425c54c00 R15: ffff8802e6281850
 760 Mar 23 14:27:37 sMaster01 kernel: [241450.356486] FS:  00007faa6a757820(0000) GS:ffff88000fc80000(0000) knlGS:0000000000000000
 761 Mar 23 14:27:37 sMaster01 kernel: [241450.356540] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
 762 Mar 23 14:27:37 sMaster01 kernel: [241450.356573] CR2: ffff8804c001fade CR3: 00000000cc11f000 CR4: 00000000000006e0
 763 Mar 23 14:27:37 sMaster01 kernel: [241450.356628] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 764 Mar 23 14:27:37 sMaster01 kernel: [241450.356681] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 765 Mar 23 14:27:37 sMaster01 kernel: [241450.356733] Process kjournald (pid: 3557, threadinfo ffff8804229aa000, task ffff88041490a300)
 766 Mar 23 14:27:37 sMaster01 kernel: [241450.356788] Stack:
 767 Mar 23 14:27:37 sMaster01 kernel: [241450.356812]  ffff880415382c00 0000000100000285 ffff8804229abfd8 0000000000005186
 768 Mar 23 14:27:37 sMaster01 kernel: [241450.356852] <0> 0000000000000000 000000000f1c2776 ffff8804128efa38 ffff8802e7235080
 769 Mar 23 14:27:37 sMaster01 kernel: [241450.356913] <0> ffff8802e7235080 ffff8802e7235080 ffff8800cdacae40 ffffffff8117eb5a
 770 Mar 23 14:27:37 sMaster01 kernel: [241450.356993] Call Trace:
 771 Mar 23 14:27:37 sMaster01 kernel: [241450.357021]  [<ffffffff8117eb5a>] ? generic_make_request+0xcd/0x2f9
 772 Mar 23 14:27:37 sMaster01 kernel: [241450.357058]  [<ffffffff810b6034>] ? mempool_alloc+0x55/0x106
 773 Mar 23 14:27:37 sMaster01 kernel: [241450.357091]  [<ffffffff8117ee5c>] ? submit_bio+0xd6/0xf2
 774 Mar 23 14:27:37 sMaster01 kernel: [241450.357125]  [<ffffffff8110d83f>] ? submit_bh+0xf5/0x115
 775 Mar 23 14:27:37 sMaster01 kernel: [241450.357158]  [<ffffffff8110edc0>] ? sync_dirty_buffer+0x51/0x93
 776 Mar 23 14:27:37 sMaster01 kernel: [241450.357196]  [<ffffffffa01727c7>] ? journal_commit_transaction+0xaa6/0xe4f [jbd]
 777 Mar 23 14:27:37 sMaster01 kernel: [241450.357252]  [<ffffffffa0175194>] ? kjournald+0xdf/0x226 [jbd]
 778 Mar 23 14:27:37 sMaster01 kernel: [241450.357288]  [<ffffffff810651de>] ? autoremove_wake_function+0x0/0x2e
 779 Mar 23 14:27:37 sMaster01 kernel: [241450.357324]  [<ffffffffa01750b5>] ? kjournald+0x0/0x226 [jbd]
 780 Mar 23 14:27:37 sMaster01 kernel: [241450.357357]  [<ffffffff81064f11>] ? kthread+0x79/0x81
 781 Mar 23 14:27:37 sMaster01 kernel: [241450.357391]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
 782 Mar 23 14:27:37 sMaster01 kernel: [241450.357425]  [<ffffffff81016568>] ? read_tsc+0xa/0x20
 783 Mar 23 14:27:37 sMaster01 kernel: [241450.357456]  [<ffffffff81064e98>] ? kthread+0x0/0x81
 784 Mar 23 14:27:37 sMaster01 kernel: [241450.357487]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
 785 Mar 23 14:27:37 sMaster01 kernel: [241450.357517] Code: 5c c3 41 55 49 89 fd 41 54 55 53 48 83 ec 38 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 85 f6 0f 84 86 00 00 00 48 8b 47 10 <48> 8b 40 08 48 8b 40 68 48 c1 f8 09 74 74 89 f2 48 8b 0f 48 39
 786 Mar 23 14:27:37 sMaster01 kernel: [241450.357738] RIP  [<ffffffff8117e9e9>] bio_check_eod+0x29/0xcd
 787 Mar 23 14:27:37 sMaster01 kernel: [241450.357772]  RSP <ffff8804229abba0>
 788 Mar 23 14:27:37 sMaster01 kernel: [241450.357799] CR2: ffff8804c001fade
 789 Mar 23 14:27:37 sMaster01 kernel: [241450.358183] ---[ end trace 608fcf1f5a482549 ]---
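
As a reading aid for the trace above, here is a minimal Python sketch that pulls the fault address, RAX, and the faulting instruction bytes out of the oops text and compares them. The helper names (`fault_address`, `register`, `faulting_bytes`) are hypothetical, and `OOPS` is a shortened excerpt of the log, not the full output. The bytes `48 8b 40 08` encode `mov 0x8(%rax),%rax` on x86-64, so the check simply confirms that the faulting address is RAX + 8. Reading RAX as a pointer loaded from the bio (for example `bio->bi_bdev`) is an assumption based on the `bio_check_eod` frame, not something the log states.

```python
import re

# Shortened excerpt of the oops above; only the lines needed here are kept.
OOPS = """\
BUG: unable to handle kernel paging request at ffff8804c001fade
RAX: ffff8804c001fad6 RBX: ffff8802e7235080 RCX: 00011200061e5110
Code: 48 8b 47 10 <48> 8b 40 08 48 8b 40 68
"""


def fault_address(text):
    """Return the faulting virtual address reported on the BUG line."""
    m = re.search(r"paging request at ([0-9a-f]{16})", text)
    return int(m.group(1), 16)


def register(text, name):
    """Return the value of a 64-bit register such as RAX from the dump."""
    m = re.search(rf"\b{name}: ([0-9a-f]{{16}})", text)
    return int(m.group(1), 16)


def faulting_bytes(text, count=4):
    """Return the first `count` bytes of the faulting instruction.

    In the Code: line, the byte wrapped in <..> marks where RIP pointed.
    """
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return [b.strip("<>") for b in code[start:start + count]]


addr = fault_address(OOPS)
rax = register(OOPS, "RAX")

print("fault address - RAX =", hex(addr - rax))   # displacement of the load
print("faulting bytes      =", " ".join(faulting_bytes(OOPS)))
print("RAX % 8             =", rax % 8)           # 0 would mean 8-byte aligned
```

On this excerpt the difference is 0x8 and RAX is not 8-byte aligned, which suggests the pointer in RAX was already bad before the load. That is compatible with both memory corruption by software and faulty RAM, so it does not settle the question by itself.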


We also had a guest crash with the same error, "unable to handle kernel paging request", but this time in the function __destroy_inode.
Could faulty memory cause this problem on both host and guest?

Raph