On Thu, Jan 31, 2013 at 12:11 AM, Marcelo Tosatti <mtosatti(a)redhat.com> wrote:
On Wed, Jan 30, 2013 at 11:21:08AM +0300, Andrey Korolyov wrote:
> On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti <mtosatti(a)redhat.com> wrote:
> > On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
> >> On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov <andrey(a)xdel.ru> wrote:
> >> > On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti <mtosatti(a)redhat.com> wrote:
> >> >> On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
> >> >>> > On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti <mtosatti(a)redhat.com> wrote:
> >> >>> > On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
> >> >>> >> On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti <mtosatti(a)redhat.com> wrote:
> >> >>> >> > On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
> >> >>> >> >> Thank you Marcelo,
> >> >>> >> >>
> >> >>> >> >> The host node is locking up somewhat later than yesterday, but the
> >> >>> >> >> problem is still here, please see the attached dmesg. The stuck
> >> >>> >> >> process looks like
> >> >>> >> >> root 19251 0.0 0.0 228476 12488 ? D 14:42 0:00
> >> >>> >> >> /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device
> >> >>> >> >> virtio-blk-pci,? -device
> >> >>> >> >>
> >> >>> >> >> on the fourth VM by count.
> >> >>> >> >>
> >> >>> >> >> Should I try an upstream kernel instead of applying the patch to the
> >> >>> >> >> latest 3.4, or is that useless?
> >> >>> >> >
> >> >>> >> > If you can upgrade to an upstream kernel, please do that.
> >> >>> >> >
> >> >>> >>
> >> >>> >> With vanilla 3.7.4 there is almost no change, and the NMIs started
> >> >>> >> firing again. The external symptoms look like the following: starting
> >> >>> >> from some count, maybe the third or sixth VM, the qemu-kvm process
> >> >>> >> allocates its memory very slowly and in jumps, 20M-200M-700M-1.6G over
> >> >>> >> minutes. The patch helps, of course - on both patched 3.4 and vanilla
> >> >>> >> 3.7 I`m able to kill the stuck kvm processes and the node returns back
> >> >>> >> to normal, whereas on 3.2 sending SIGKILL to the process produces
> >> >>> >> zombies and a hung ``ps'' output (the problem, and a workaround when
> >> >>> >> no scheduler is involved, are described here:
> >> >>> >> http://www.spinics.net/lists/kvm/msg84799.html).
> >> >>> >
> >> >>> > Try disabling pause loop exiting with the ple_gap=0 kvm-intel.ko
> >> >>> > module parameter.
> >> >>> >
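A minimal sketch of applying this, assuming the module is loaded as
kvm_intel and that all guests are stopped first (the modprobe.d file name
below is just an example):

    # reload the module with pause-loop exiting disabled
    modprobe -r kvm_intel
    modprobe kvm_intel ple_gap=0

    # or make the setting persistent across reboots
    echo "options kvm_intel ple_gap=0" > /etc/modprobe.d/kvm-intel.conf

The active value can be checked in /sys/module/kvm_intel/parameters/ple_gap.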
> >> >>>
> >> >>> Hi Marcelo,
> >> >>>
> >> >>> thanks, this parameter helped to increase the number of working VMs by
> >> >>> about half an order of magnitude, from 3-4 to 10-15. A very high SY
> >> >>> load, 10 to 15 percent, persists at such numbers for a long time,
> >> >>> whereas Linux guests in the same configuration do not jump over one
> >> >>> percent even under a stress benchmark. After I disabled HT, the crash
> >> >>> happens only in long runs, and now it is a kernel panic :)
> >> >>> The stair-like memory allocation behaviour disappeared, but another
> >> >>> symptom leading to the crash, which I had not noticed previously,
> >> >>> persists: if the VM count is ``enough'' for a crash, some qemu
> >> >>> processes start to eat one core, and they`ll panic the system after
> >> >>> tens of minutes running in such a state, or if I try to attach a
> >> >>> debugger to one of them. If needed, I can log the entire crash output
> >> >>> via netconsole; for now I have some tail, almost the same every time:
> >> >>>
> >> >>> http://xdel.ru/downloads/btwin.png
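A rough sketch of one way to capture such output over netconsole; the
ports, addresses, interface and MAC below are placeholders, not the real
ones from this setup:

    # on the crashing host: stream kernel messages to a remote listener
    modprobe netconsole \
        netconsole=6665@192.168.0.10/eth0,6666@192.168.0.20/aa:bb:cc:dd:ee:ff

    # on the receiving machine (flag spelling differs between netcat variants)
    nc -u -l -p 6666 | tee crash.log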
> >> >>
> >> >> Yes, please log the entire crash output, thanks.
> >> >>
> >> >
> >> > Here please, 3.7.4-vanilla, 16 VMs, ple_gap=0:
> >> >
> >> > http://xdel.ru/downloads/oops-default-kvmintel.txt
> >>
> >> Just an update: I was able to reproduce that on pure Linux VMs using
> >> qemu-1.3.0 and the ``stress'' benchmark running on them - the panic
> >> occurs at the start of a VM (with about ten machines working at that
> >> moment). Qemu-1.1.2 generally is not able to reproduce it, but a host
> >> node with the older version crashes with fewer Windows VMs (three to
> >> six instead of ten to fifteen) than with 1.3, please see the trace
> >> below:
> >>
> >> http://xdel.ru/downloads/oops-old-qemu.txt
> >
> > Single-bit memory error, apparently. Try:
> >
> > 1. memtest86.
> > 2. Boot with the slub_debug=ZFPU kernel parameter.
> > 3. Reproduce on a different machine.
> >
> >
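A minimal sketch of item 2, assuming a Debian-style GRUB 2 setup (the file
name and update command differ between distributions):

    # append the flag in /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet slub_debug=ZFPU"

    update-grub         # or: grub2-mkconfig -o /boot/grub2/grub.cfg
    reboot
    cat /proc/cmdline   # confirm the parameter took effect

Z adds red zones around slab objects, F turns on sanity checks, P poisons
freed objects and U tracks alloc/free owners, so slab corruption such as a
single-bit flip should be reported closer to the code responsible.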
>
> Hi Marcelo,
>
> I always follow the rule - if some weird bug exists, check it on an
> ECC-enabled machine and check the IPMI logs too before starting to
> complain :) I have finally managed to ``fix'' the problem, but my
> solution seems a bit strange:
> - I noticed that if the virtual machines are started without any cgroup
> settings, they do not trigger this bug under any conditions,
> - I had thought, quite wrongly, that CONFIG_SCHED_AUTOGROUP would only
> regroup tasks that are not in any cgroup and would not touch tasks
> already inside an existing cpu cgroup. A first look at the 200-line
> patch shows that the autogrouping always applies to all tasks, so I
> tried to disable it,
> - wild magic appears - the VMs didn`t crash the host any more; even at
> a count of 30+ they work fine.
> I still don`t know what exactly triggered that, or whether I will face
> it again under different conditions, so my solution is more likely a
> patch of mud in the wall of the dam than a proper fix.
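For reference, a minimal sketch of the knobs involved, assuming a kernel
built with CONFIG_SCHED_AUTOGROUP=y:

    # disable autogrouping at runtime
    echo 0 > /proc/sys/kernel/sched_autogroup_enabled
    # equivalent sysctl form
    sysctl -w kernel.sched_autogroup_enabled=0

    # or disable it for good by booting with the kernel parameter
    #   noautogroup

    # check whether it is currently active
    cat /proc/sys/kernel/sched_autogroup_enabled

Rebuilding with CONFIG_SCHED_AUTOGROUP=n removes the code path entirely.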
>
> There seem to be two possible origins of such an error - a very, very
> hideous race condition involving cgroups and processes like qemu-kvm
> that cause frequent context switches, or a simple incompatibility
> between NUMA, the logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already
> doing work inside a cgroup - since I have not observed these errors on
> a single NUMA node (i.e. a desktop) under relatively heavier conditions.

Yes, it would be important to track it down though. Enabling the
slub_debug=ZFPU kernel parameter should help.

Hi Marcelo,

I have finally beaten that one. As I mentioned before in the off-list
message, the nested cgroups that libvirt creates for the vcpu/emulator
threads were the root cause of this problem. Today we disabled creation
of cgroups deeper than the qemu/vm/ level, and the trace did not show up
under different workloads. So for libvirt itself, it may be a feature
request to create the per-thread cgroups only if some element of the
VM`s config actually requires them.

As for cgroups, it seems fatal to have a very large number of nested
elements inside the cpu controller with qemu-kvm, or a very large number
of threads - since I have a limited core count on each node, I can`t
prove which exactly, the complicated cgroup hierarchy or some side effect
of putting threads into dedicated cgroups, caused all this pain. And, of
course, without Windows(tm) the bug is very hard to observe in the wild,
since almost no synthetic test I have run on the Linux VMs is able to
show it.
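For reference, a rough sketch of the two layouts being compared; the
mount point and the VM name are placeholders, and the exact paths depend
on the libvirt version and on how the cgroup controllers are mounted:

    # per-thread layout libvirt was creating (one cgroup per vcpu plus emulator)
    /sys/fs/cgroup/cpu/libvirt/qemu/winguest1/emulator
    /sys/fs/cgroup/cpu/libvirt/qemu/winguest1/vcpu0
    /sys/fs/cgroup/cpu/libvirt/qemu/winguest1/vcpu1

    # flattened layout we kept, stopping at the per-VM level
    /sys/fs/cgroup/cpu/libvirt/qemu/winguest1

    # quick check of how deep the cpu hierarchy actually goes on a host
    find /sys/fs/cgroup/cpu/libvirt -mindepth 1 -type d |
        awk -F/ '{ print NF }' | sort -n | uniq -c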