On Tue, Jan 08, 2013 at 04:42:00PM +0100, Michal Privoznik wrote:
> On 08.01.2013 16:24, Daniel P. Berrange wrote:
>> On Tue, Jan 08, 2013 at 10:37:19AM +0100, Michal Privoznik wrote:
>>> Currently, if there's no hard memory limit defined for a domain,
>>> libvirt tries to calculate one, based on the domain definition and a
>>> magic equation, and sets it upon domain startup. The rationale was
>>> that if there's a memory leak or exploit in qemu, we should prevent
>>> the host system from thrashing. However, the equation was too tight,
>>> as it didn't reflect what the kernel counts into the memory used by a
>>> process. Since many hosts do have swap, nobody has noticed anything,
>>> because if the hard memory limit is reached, the process can continue
>>> allocating memory on swap. However, if there is no swap on the host,
>>> the process gets killed by the OOM killer. In our case, that is the
>>> qemu process.
>>>
>>> To prevent this, we need to relax the hard RSS limit. Moreover, we
>>> should reflect more precisely how the kernel accounts memory to a
>>> process. That is, even the kernel caches are counted within the
>>> memory used by a process (within cgroups at least). Hence the magic
>>> equation has to be changed:
>>>
>>> limit = 1.5 * (domain memory + total video memory) + (32MB for cache
>>> per each disk) + 200MB
>>> ---
>>>
>>> There is a bit more that should be taken into account, e.g. shared
>>> pages, where accounting is even more complicated:
>>>
>>> "Shared pages are accounted on the basis of the first touch approach.
>>> The cgroup that first touches a page is accounted for the page." [1]
>>>
>>> I don't think we even want to try to reflect this in our code. That's
>>> why the coefficient of domain memory has been lifted from 1.02 to
>>> 1.5, in the hope that it will be enough.
>>>
>>> 1: http://www.kernel.org/doc/Documentation/cgroups/memory.txt
>>>
>>> src/qemu/qemu_cgroup.c | 15 +++++++++------
>>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
>>> index 7faf025..16a9d7c 100644
>>> --- a/src/qemu/qemu_cgroup.c
>>> +++ b/src/qemu/qemu_cgroup.c
>>> @@ -343,15 +343,18 @@ int qemuSetupCgroup(virQEMUDriverPtr driver,
>>> unsigned long long hard_limit = vm->def->mem.hard_limit;
>>>
>>> if (!hard_limit) {
>>> - /* If there is no hard_limit set, set a reasonable
>>> - * one to avoid system trashing caused by exploited qemu.
>>> - * As 'reasonable limit' has been chosen:
>>> - * (1 + k) * (domain memory + total video memory) + F
>>> - * where k = 0.02 and F = 200MB. */
>>> + /* If there is no hard_limit set, set a reasonable one to avoid
>>> + * system trashing caused by exploited qemu. As 'reasonable limit'
>>> + * has been chosen:
>>> + * (1 + k) * (domain memory + total video memory) + (32MB for
>>> + * cache per each disk) + F
>>> + * where k = 0.5 and F = 200MB. The cache for disks is important as
>>> + * kernel cache on the host side counts into the RSS limit. */
>>> hard_limit = vm->def->mem.max_balloon;
>>> for (i = 0; i < vm->def->nvideos; i++)
>>> hard_limit += vm->def->videos[i]->vram;
>>> - hard_limit = hard_limit * 1.02 + 204800;
>>> + hard_limit = hard_limit * 1.5 + 204800;
>>> + hard_limit += vm->def->ndisks * 32768;
>>> }
>>>
>>> rc = virCgroupSetMemoryHardLimit(cgroup, hard_limit);
>>
>> ACK,
>>
>> can't say I'm a fan of our heuristics but I don't see a better way
>> yet. Let's see how this new limit copes.
>>
>> Daniel
>>
>
> Yeah, it's sort of magic. Pushed now. Thanks.

How does one turn off the limits?

Dave

Either disable the memory cgroup (e.g. by unmounting it), or set your own
limit in the domain XML (libvirt won't even try to calculate a new one then).

Michal
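
For anyone who wants to check what default limit a given guest would get,
the heuristic from the patch can be sketched as a standalone C function.
This is only an illustration under stated assumptions: the helper name
default_hard_limit_kib() and its parameters are hypothetical and not part
of libvirt, and all sizes are assumed to be in KiB (which is what the
constants 204800 and 32768 in the patch correspond to for 200MB and 32MB).

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a libvirt function. It mirrors the arithmetic
 * of the patched qemuSetupCgroup() for the case where no hard_limit is
 * configured. All sizes are in KiB.
 */
static unsigned long long
default_hard_limit_kib(unsigned long long max_balloon_kib,
                       const unsigned long long *vram_kib,
                       size_t nvideos,
                       size_t ndisks)
{
    unsigned long long limit = max_balloon_kib;
    size_t i;

    /* domain memory + total video memory */
    for (i = 0; i < nvideos; i++)
        limit += vram_kib[i];

    /* (1 + k) * (...) + F, with k = 0.5 and F = 200MB (204800 KiB) */
    limit = limit * 1.5 + 204800;

    /* 32MB (32768 KiB) of host-side cache per disk */
    limit += (unsigned long long)ndisks * 32768;

    return limit;
}
```

As a worked example with made-up numbers: a guest with 1048576 KiB of
memory, one video device with 9216 KiB of vram, and two disks gets
1.5 * (1048576 + 9216) + 204800 + 2 * 32768 = 1857024 KiB. Setting an
explicit hard limit in the domain XML bypasses this calculation entirely,
as Michal notes above.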