On 08.01.2013 16:24, Daniel P. Berrange wrote:
> On Tue, Jan 08, 2013 at 10:37:19AM +0100, Michal Privoznik wrote:
>> Currently, if no hard memory limit is defined for a domain, libvirt
>> calculates one from the domain definition using a magic equation and
>> sets it at domain startup. The rationale was that if there is a
>> memory leak or an exploit in qemu, we should prevent the host system
>> from thrashing. However, the equation was too tight, as it did not
>> reflect what the kernel counts as memory used by a process. Since
>> many hosts have swap, nobody noticed anything, because when the hard
>> memory limit is reached, the process can continue allocating memory
>> from swap. However, if there is no swap on the host, the process gets
>> killed by the OOM killer. In our case, that process is qemu.
>>
>> To prevent this, we need to relax the hard RSS limit. Moreover, we
>> should reflect more precisely how the kernel accounts memory for a
>> process. That is, even kernel caches are counted as memory used by
>> a process (within cgroups at least). Hence the magic equation has
>> to be changed:
>>
>> limit = 1.5 * (domain memory + total video memory) + (32MB for cache
>> per each disk) + 200MB
>> ---
>>
>> There is a bit more that should be taken into account, e.g. shared
>> pages, where accounting is even more complicated:
>>
>> "Shared pages are accounted on the basis of the first touch approach.
>> The cgroup that first touches a page is accounted for the page." [1]
>>
>> I don't think we even want to try to reflect this in our code. That's
>> why the coefficient of domain memory has been raised from 1.02 to
>> 1.5, in the hope that it will be enough.
>>
>> 1: http://www.kernel.org/doc/Documentation/cgroups/memory.txt
>>
>> src/qemu/qemu_cgroup.c | 15 +++++++++------
>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
>> index 7faf025..16a9d7c 100644
>> --- a/src/qemu/qemu_cgroup.c
>> +++ b/src/qemu/qemu_cgroup.c
>> @@ -343,15 +343,18 @@ int qemuSetupCgroup(virQEMUDriverPtr driver,
>> unsigned long long hard_limit = vm->def->mem.hard_limit;
>>
>> if (!hard_limit) {
>> - /* If there is no hard_limit set, set a reasonable
>> - * one to avoid system trashing caused by exploited qemu.
>> - * As 'reasonable limit' has been chosen:
>> - * (1 + k) * (domain memory + total video memory) + F
>> - * where k = 0.02 and F = 200MB. */
>> + /* If there is no hard_limit set, set a reasonable one to avoid
>> + * system trashing caused by exploited qemu. As 'reasonable limit'
>> + * has been chosen:
>> + * (1 + k) * (domain memory + total video memory) + (32MB for
>> + * cache per each disk) + F
>> + * where k = 0.5 and F = 200MB. The cache for disks is important as
>> + * kernel cache on the host side counts into the RSS limit. */
>> hard_limit = vm->def->mem.max_balloon;
>> for (i = 0; i < vm->def->nvideos; i++)
>> hard_limit += vm->def->videos[i]->vram;
>> - hard_limit = hard_limit * 1.02 + 204800;
>> + hard_limit = hard_limit * 1.5 + 204800;
>> + hard_limit += vm->def->ndisks * 32768;
>> }
>>
>> rc = virCgroupSetMemoryHardLimit(cgroup, hard_limit);
>
> ACK,
>
> can't say I'm a fan of our heuristics but I don't see a better way
> yet. Let's see how this new limit copes.
>
> Daniel
>
Yeah, it's sort of magic. Pushed now. Thanks.
Michal
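
For reference, the new equation from the patch can be worked through as a
standalone C sketch. This is not libvirt code: the helper name
default_hard_limit_kib and its parameters are hypothetical, and it assumes
all sizes are in KiB, which is what the constants 204800 (200MB) and 32768
(32MB) in the diff suggest.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the arithmetic in the patch above.
 * Assumption: every size is expressed in KiB. */
static unsigned long long
default_hard_limit_kib(unsigned long long max_balloon_kib,
                       const unsigned long long *vram_kib,
                       size_t nvideos,
                       size_t ndisks)
{
    unsigned long long limit = max_balloon_kib;
    size_t i;

    /* domain memory + total video memory */
    for (i = 0; i < nvideos; i++)
        limit += vram_kib[i];

    /* (1 + k) * (...) + F, with k = 0.5 and F = 200MB (204800 KiB) */
    limit = (unsigned long long)(limit * 1.5) + 204800ULL;

    /* 32MB (32768 KiB) of host-side cache per disk */
    limit += (unsigned long long)ndisks * 32768ULL;

    return limit;
}
```

With a 1 GiB guest (1048576 KiB), one 16 MiB video device (16384 KiB) and
two disks, this gives 1867776 KiB, i.e. 1824 MiB.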