[libvirt] [RFC] Memory controller exploitation in libvirt

Subject: [RFC] Memory controller exploitation in libvirt

The memory cgroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that.

At present, QEmu uses the memory ballooning feature, where memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism by which the host can exercise more control over the guest's memory usage. The memory cgroup provides features such as hard and soft limits for memory, and a hard limit for swap.

Design 1: Provide new API and XML changes for resource management
=================================================================

Not all of the memory controller tunables are supported by the current abstractions in the libvirt API. libvirt works on various OSes; this new API would support GNU/Linux initially and, as and when other platforms start supporting memory tunables, the interface could be enabled for them. Add the following two function pointers to the virDriver interface:

1) domainSetMemoryParameters: takes one or more name-value pairs. This makes the API extensible and agnostic to the kind of parameters supported by various hypervisors.
2) domainGetMemoryParameters: gets the current memory parameters.

Corresponding libvirt public API:

int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);

Parameter list supported:

MemoryHardLimits (memory.limit_in_bytes) - Maximum memory
MemorySoftLimits (memory.soft_limit_in_bytes) - Desired memory
MemoryMinimumGaurantee - Minimum memory required (without this amount of memory, the VM should not be started)
SwapHardLimits (memory.memsw.limit_in_bytes) - Maximum swap
SwapSoftLimits (currently not supported by the kernel) - Desired swap space

The tunables memory.limit_in_bytes, memory.soft_limit_in_bytes and memory.memsw.limit_in_bytes are provided by the memory controller in the Linux kernel.

I am not an expert here, so just listing what new elements need to be added to the XML schema:

<define name="resource">
  <element memory>
    <element memoryHardLimit/>
    <element memorySoftLimit/>
    <element memoryMinGaurantee/>
    <element swapHardLimit/>
    <element swapSoftLimit/>
  </element>
</define>

Pros:
* Supports all the tunables exported by the kernel
* More tunables can be added as and when required

Cons:
* Code changes would touch various levels
* Might need to redefine (change the scope of) the existing memory APIs. Currently, domainSetMemory is used to set limit_in_bytes in LXC and memory ballooning in QEmu, while domainSetMaxMemory is not defined in QEmu and in the case of LXC only sets the internal object's maxmem variable.

Future:
* Later on, CPU/IO/network controller tunables can be added/enhanced along with the APIs/XML elements:
  CPUHardLimit, CPUSoftLimit, CPUShare, CPUPercentage, IO_BW_Softlimit, IO_BW_Hardlimit, IO_BW_percentage
* libvirt-cim support for resource management

Design 2: Reuse the current memory APIs in libvirt
==================================================

Use memory.limit_in_bytes to tweak memory hard limits.

Init - Set memory.limit_in_bytes to the maximum memory.

Claiming memory from the guest:
a) Reduce the balloon size
b) If the guest does not co-operate (how do we know?), reduce memory.limit_in_bytes.

Allocating more memory than the maximum memory: how do we solve this? As we have already set the maximum balloon size, we can only play within that!

Pros:
* Few changes
* Not intrusive

Cons:
* SetMemory and SetMaxMemory usage is confusing.
* SetMemory is too generic a name; it does not cover all the tunables.
* Does not support a memory soft limit
* Does not have support to reserve the memory swap region
* This solution is not extensible

IMO, "Design 1" is more generic and extensible for the various memory tunables.

Nikunj
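[Editorial illustration] A minimal sketch of how a management application might call the proposed API from Design 1. The virMemoryParamter struct layout is an assumption modeled on libvirt's existing virSchedParameter (a fixed-length field name plus a typed value), the parameter names reuse the list above, and the entry point is the one proposed in the RFC, not an existing libvirt function; values and units are illustrative only.

```c
#include <string.h>
#include <libvirt/libvirt.h>

/* Hypothetical parameter struct for the proposed API, assumed to mirror the
 * existing virSchedParameter: a fixed-length field name plus a typed value. */
typedef struct _virMemoryParamter {
    char field[80];                      /* tunable name, e.g. "MemoryHardLimits" */
    int type;                            /* type tag; an unsigned long long here  */
    union { unsigned long long ul; } value;
} virMemoryParamter, *virMemoryParamterPtr;

/* Proposed entry point from the RFC; not part of libvirt at this time. */
int virDomainSetMemoryParamters(virDomainPtr domain,
                                virMemoryParamterPtr params,
                                unsigned int nparams);

/* Apply a hard and a soft memory limit (illustrative values, in KiB). */
static int set_memory_limits(virDomainPtr dom)
{
    virMemoryParamter params[2];

    memset(params, 0, sizeof(params));
    strcpy(params[0].field, "MemoryHardLimits");
    params[0].value.ul = 1048576;        /* 1 GiB hard limit  */
    strcpy(params[1].field, "MemorySoftLimits");
    params[1].value.ul = 786432;         /* 768 MiB soft limit */

    /* Both tunables are applied in one call - the bulk-setting behaviour
     * discussed later in the thread. */
    return virDomainSetMemoryParamters(dom, params, 2);
}
```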

* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 11:53:27]:
Subject: [RFC] Memory controller exploitation in libvirt
Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that.
At present, QEmu uses memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guests memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.
Design 1: Provide new API and XML changes for resource management =================================================================
All the memory controller tunables are not supported with the current abstractions provided by the libvirt API. libvirt works on various OS. This new API will support GNU/Linux initially and as and when other platforms starts supporting memory tunables, the interface could be enabled for them. Adding following two function pointer to the virDriver interface.
1) domainSetMemoryParameters: which would take one or more name-value pairs. This makes the API extensible, and agnostic to the kind of parameters supported by various Hypervisors. 2) domainGetMemoryParameters: For getting current memory parameters
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into:

virCgroupSetMemory() - already present in src/util/cgroup.c
virCgroupGetMemory() - already present in src/util/cgroup.c
virCgroupSetMemorySoftLimit()
virCgroupSetMemoryHardLimit()
virCgroupSetMemorySwapHardLimit()
virCgroupGetStats()
Parameter list supported:
MemoryHardLimits (memory.limits_in_bytes) - Maximum memory MemorySoftLimits (memory.softlimit_in_bytes) - Desired memory
Soft limits allow you to set a memory limit that applies under contention.
MemoryMinimumGaurantee - Minimum memory required (without this amount of memory, VM should not be started)
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap SwapSoftLimits (Currently not supported by kernel) - Desired swap space
We *don't* support SwapSoftLimits in the memory cgroup controller, and have no plans to support it in the future either at this point. The semantics are just too hard to get right at the moment.
Tunables memory.limit_in_bytes, memory.softlimit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.
I am not an expert here, so just listing what new elements need to be added to the XML schema:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
I'd prefer a syntax that integrates well with what we currently have:

<cgroup>
  <path>...</path>
  <controller>
    <name>..</name>
    <soft limit>...</>
    <hard limit>...</>
  </controller>
  ...
</cgroup>

But I am not an XML expert or an expert in designing XML configurations.
Pros: * Support all the tunables exported by the kernel * More tunables can be added as and when required
Cons: * Code changes would touch various levels * Might need to redefine(changing the scope) of existing memory API. Currently, domainSetMemory is used to set limit_in_bytes in LXC and memory ballooning in QEmu. While the domainSetMaxMemory is not defined in QEmu and in case of LXC it is setting the internal object's maxmem variable.
Future:
* Later on, CPU/IO/Network controllers related tunables can be added/enhanced along with the APIs/XML elements:
CPUHardLimit CPUSoftLimit CPUShare CPUPercentage IO_BW_Softlimit IO_BW_Hardlimit IO_BW_percentage
* libvirt-cim support for resource management
Design 2: Reuse the current memory APIs in libvirt ==================================================
Use memory.limit_in_bytes to tweak memory hard limits Init - Set the memory.limit_in_bytes to maximum mem.
Claiming memory from guest: a) Reduce balloon size b) If the guest does not co-operate(How do we know?), reduce memory.limit_in_bytes.
This is a policy, and I am not sure an API should be hiding policy so succinctly.
Allocating memory more than max memory: How to solve this? As we have already set the max balloon size. We can only play within this!
Pros: * Few changes * Is not intrusive
Cons: * SetMemory and SetMaxMemory usage is confusing. * SetMemory is too generic a name, it does not cover all the tunables. * Does not support memory softlimit * Does not have support to reserve the memory swap region * This solution is not extensible
IMO, "Design 1" is more generic and extensible for various memory tuneables.
Agreed, 1 is better.
Nikunj
-- Three Cheers, Balbir

* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 11:53:27]:
Subject: [RFC] Memory controller exploitation in libvirt
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
On Tue, 24 Aug 2010 13:05:26 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into

Yes it helps, when parsing the parameters from the domain xml file, we can call this API and set them at once. BTW, it can also be called with one parameter if desired.

virCgroupSetMemory() - already present in src/util/cgroup.c
virCgroupGetMemory() - already present in src/util/cgroup.c
virCgroupSetMemorySoftLimit()
virCgroupSetMemoryHardLimit()
virCgroupSetMemorySwapHardLimit()
virCgroupGetStats()
This is at the cgroup level (internal API) and will be implemented in the way that is suggested. The RFC should not be specific to cgroups: libvirt is supported on multiple OSes, and the APIs described above in the RFC are public APIs.
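[Editorial illustration] To show the split being discussed - a public name/value API on top, cgroup helpers underneath - here is a rough sketch of how the QEMU driver backend might dispatch the proposed parameters to the internal helpers named above. virCgroupSetMemory()/virCgroupGetMemory() exist in src/util/cgroup.c; the soft/hard/swap variants are the ones proposed in this thread, and all signatures and the memparam struct are assumptions for the sketch.

```c
#include <string.h>

/* Hypothetical name/value pair as proposed for the public API. */
struct memparam {
    const char *field;
    unsigned long long value;            /* assume values in KiB */
};

/* Existing internal helper (src/util/cgroup.c) plus the ones proposed in
 * this thread; signatures are assumed for the sketch. */
int virCgroupSetMemory(void *group, unsigned long kb);
int virCgroupSetMemorySoftLimit(void *group, unsigned long kb);      /* proposed */
int virCgroupSetMemoryHardLimit(void *group, unsigned long kb);      /* proposed */
int virCgroupSetMemorySwapHardLimit(void *group, unsigned long kb);  /* proposed */

/* Dispatch each public parameter to the matching cgroup helper. */
static int qemuSetMemoryParams(void *group, struct memparam *params, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++) {
        if (strcmp(params[i].field, "MemoryHardLimits") == 0) {
            if (virCgroupSetMemoryHardLimit(group, params[i].value) < 0)
                return -1;
        } else if (strcmp(params[i].field, "MemorySoftLimits") == 0) {
            if (virCgroupSetMemorySoftLimit(group, params[i].value) < 0)
                return -1;
        } else if (strcmp(params[i].field, "SwapHardLimits") == 0) {
            if (virCgroupSetMemorySwapHardLimit(group, params[i].value) < 0)
                return -1;
        } else {
            return -1;  /* unknown tunable for this driver */
        }
    }
    return 0;
}
```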
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap SwapSoftLimits (Currently not supported by kernel) - Desired swap space
We *dont* support SwapSoftLimits in the memory cgroup controller with no plans to support it in the future either at this point. The Ok.
Tunables memory.limit_in_bytes, memory.softlimit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.
I am not an expert here, so just listing what new elements need to be added to the XML schema:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
I'd prefer a syntax that integrates well with what we currently have
<cgroup> <path>...</path> <controller> <name>..</name> <soft limit>...</> <hard limit>...</> </controller> ... </cgroup>
Again, this is the libvirt domain XML file; IMO, it should not be cgroup-specific. Nikunj

* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 13:35:10]:
* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 11:53:27]:
Subject: [RFC] Memory controller exploitation in libvirt
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
On Tue, 24 Aug 2010 13:05:26 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into

Yes it helps, when parsing the parameters from the domain xml file, we can call this API and set them at once. BTW, it can also be called with one parameter if desired.
virCgroupSetMemory() - already present in src/util/cgroup.c virCgroupGetMemory() - already present in src/util/cgroup.c virCgroupSetMemorySoftLimit() virCgroupSetMemoryHardLimit() virCgroupSetMemorySwapHardLimit() virCgroupGetStats()
This is at the cgroup level(internal API) and will be implemented in the way that is suggested. The RFC should not be specific to cgroups. libvirt is supported on multiple OS and the above described APIs in the RFC are public API.
I thought we were talking about cgroups in the QEMU driver for Linux. IMHO the generalization is too big. ESX, for example, already abstracts its WLM/RM needs in its driver.
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap SwapSoftLimits (Currently not supported by kernel) - Desired swap space
We *dont* support SwapSoftLimits in the memory cgroup controller with no plans to support it in the future either at this point. The Ok.
Tunables memory.limit_in_bytes, memory.softlimit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.
I am not an expert here, so just listing what new elements need to be added to the XML schema:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
I'd prefer a syntax that integrates well with what we currently have
<cgroup> <path>...</path> <controller> <name>..</name> <soft limit>...</> <hard limit>...</> </controller> ... </cgroup>
Again this is a libvirt domain xml file, IMO, it should not be cgroup specific.
See the comment above. -- Three Cheers, Balbir

2010/8/24 Balbir Singh <balbir@linux.vnet.ibm.com>:
* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 13:35:10]:
* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 11:53:27]:
Subject: [RFC] Memory controller exploitation in libvirt
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
On Tue, 24 Aug 2010 13:05:26 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into

Yes it helps, when parsing the parameters from the domain xml file, we can call this API and set them at once. BTW, it can also be called with one parameter if desired.
virCgroupSetMemory() - already present in src/util/cgroup.c virCgroupGetMemory() - already present in src/util/cgroup.c virCgroupSetMemorySoftLimit() virCgroupSetMemoryHardLimit() virCgroupSetMemorySwapHardLimit() virCgroupGetStats()
This is at the cgroup level(internal API) and will be implemented in the way that is suggested. The RFC should not be specific to cgroups. libvirt is supported on multiple OS and the above described APIs in the RFC are public API.
I thought we were talking of cgroups in the QEMU driver for Linux. IMHO the generalization is too big. ESX for example, already abstracts their WLM/RM needs in their driver.
Yes, the ESX driver allows control of ballooning through virDomainSetMemory and virDomainSetMaxMemory.

ESX itself also allows setting what's called memoryMinGaurantee in this thread, but this is not exposed in libvirt. So you can control how much virtual memory a guest has (virDomainSetMaxMemory) and define an upper (virDomainSetMemory) and a lower (not exposed via libvirt) bound for the physical memory that the hypervisor should use to satisfy the virtual memory of a guest. ESX also allows defining shares, a relative value that defines a priority between guests in case there is not enough physical memory to satisfy all guests; the remaining virtual memory is then satisfied by swapping at the hypervisor level.

The same pattern applies to the virtual CPUs. There is an upper and a lower limit for the CPU allocation of a guest and a shares value to define priority in case of contention. All three are exposed using the virDomainSetSchedulerParameters API for ESX.

Regarding the new elements proposed here:

<define name="resource">
  <element memory>
    <element memoryHardLimit/>
    <element memorySoftLimit/>
    <element memoryMinGaurantee/>
    <element swapHardLimit/>
    <element swapSoftLimit/>
  </element>
</define>

memoryHardLimit is already there and called memory, memorySoftLimit is also there and called currentMemory, memoryMinGaurantee is new. I'm not sure where swapHardLimit and swapSoftLimit apply; is that for swapping at the hypervisor level?

Also keep in mind that there was a recent discussion about how to express ballooning and memory configuration in the domain XML config:

https://www.redhat.com/archives/libvir-list/2010-August/msg00118.html

Regarding future additions:

CPUHardLimit CPUSoftLimit CPUShare CPUPercentage IO_BW_Softlimit IO_BW_Hardlimit IO_BW_percentage

The CPU part of this is already possible via the virDomainSetSchedulerParameters API, but it isn't expressed in the domain XML config; maybe you're suggesting to do this? The I/O part is in fact new, I think.

In general, when you want to extend the domain XML config, make sure that you don't model it too closely on a specific implementation like cgroups.

Matthias
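[Editorial illustration] For reference, a minimal sketch of setting a CPU shares value through the existing virDomainSetSchedulerParameters API that Matthias mentions. virSchedParameter and the API are real libvirt interfaces of this era; the "shares" field name is assumed from the ESX driver's scheduler parameters, and other drivers expose different parameter names.

```c
#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

/* Give a domain a higher CPU priority under contention by raising its
 * shares value.  The "shares" field name is an assumption based on the
 * ESX driver; check the driver's supported scheduler parameters first. */
static int raise_cpu_shares(virDomainPtr dom, int shares)
{
    virSchedParameter param;

    memset(&param, 0, sizeof(param));
    strncpy(param.field, "shares", VIR_DOMAIN_SCHED_FIELD_LENGTH - 1);
    param.type = VIR_DOMAIN_SCHED_FIELD_INT;
    param.value.i = shares;

    if (virDomainSetSchedulerParameters(dom, &param, 1) < 0) {
        fprintf(stderr, "failed to set scheduler parameters\n");
        return -1;
    }
    return 0;
}
```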

On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte <matthias.bolte@googlemail.com> wrote: <snip>
Yes the ESX driver allows to control ballooning through virDomainSetMemory and virDomainSetMaxMemory.
ESX itself also allows to set what's called memoryMinGaurantee in the thread, but this is not exposed in libvirt.

The LXC driver uses virDomainSetMemory to set the memory hard limit, while QEmu/ESX use it to change the ballooning. And as you said, ESX does support memoryMinGaurantee; we can get this exported in libvirt using this new API.
Here I am trying to group all the memory-related parameters into one single public API, as we have with virDomainSetSchedulerParameters. Currently, the names do not convey what they modify in the layer below and are confusing.
So you can control how much virtual memory a guest has (virDomainSetMaxMemory) and define and upper (virDomainSetMemory) and a lower (not exposed via libvirt) bound for the physical memory that the hypervisor should use to satisfy the virtual memory of a guest. ESX also allows to defines shares, a relative value that defines a priority between guests in case there is not enough physical memory to satisfy all guests, the remaining virtual memory is then satisfied by swapping at the hypervisor level.
The same pattern applies to the virtual CPUs. There is an upper and a lower limit for the CPU allocation of a guest and a shares value to define priority in case of contention. All three are exposed using the virDomainSetSchedulerParameters API for ESX.
Regarding the new elements proposed here:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
memoryHardLimit is already there and called memory, memorySoftLimit is also there and called currentMemory, memoryMinGaurantee is new.

That's correct, I am aware of this. The names are misleading. Also, we can have all of these under the memory element.
Later we can add something like this:

<define name="resource">
  <element memory>
    <!-- All memory related tunables -->
  </element>
  <element cpu>
    <!-- All CPU related tunables -->
  </element>
  <element blkio>
    <!-- All Block IO related tunables -->
  </element>
  <element network>
    <!-- All network related tunables -->
  </element>
</define>
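[Editorial illustration] To make the proposed grouping concrete, a hypothetical domain XML fragment might look like the following; the element names follow the sketch above, and the values and units are purely illustrative, not an agreed schema.

<domain type='kvm'>
  <name>guest1</name>
  <memory>1048576</memory>
  <!-- proposed resource tunables, grouped per controller -->
  <resource>
    <memory>
      <memoryHardLimit>1048576</memoryHardLimit>   <!-- KiB -->
      <memorySoftLimit>786432</memorySoftLimit>
      <swapHardLimit>2097152</swapHardLimit>
    </memory>
  </resource>
  ...
</domain>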
I'm not sure where swapHardLimit and swapSoftLimit apply, is that for swapping at the hypervisor level?

The memory cgroup provides the maximum swap a group of tasks can use. swapSoftLimit is not supported, as Balbir said, and they do not have plans to support it; we can drop this.
Also keep in mind that there was a recent discussion about how to express ballooning and memory configuration in the domain XML config:
https://www.redhat.com/archives/libvir-list/2010-August/msg00118.html

I will have a look at this.
Regarding future additions:
CPUHardLimit CPUSoftLimit CPUShare CPUPercentage IO_BW_Softlimit IO_BW_Hardlimit IO_BW_percentage
The CPU part of this is already possible via the virDomainSetSchedulerParameters API. But they aren't expressed in the domain XML config, maybe you're suggesting to do this?

Yes, that's correct for CPU.
IO would need API as well as XML config changes. Does ESX also support Block IO bandwidth control?
The I/O part is in fact new, I think.
In general when you want to extend the domain XML config make sure that you don't model it too closely on a specific implementation like cgroups.

Sure.
Nikunj

On Tue, Aug 24, 2010 at 03:17:44PM +0530, Nikunj A. Dadhania wrote:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte <matthias.bolte@googlemail.com> wrote:
<snip>
Yes the ESX driver allows to control ballooning through virDomainSetMemory and virDomainSetMaxMemory.
ESX itself also allows to set what's called memoryMinGaurantee in the thread, but this is not exposed in libvirt. LXC driver is using virDomainSetMemory to set the memory hard limit while QEmu/ESX use them to change the ballooning. And as you said, ESX does support memoryMinGaurantee, we can get this exported in libvirt using this new API.
Here I am trying to group all the memory related parameters into one single public API as we have in virDomainSetSchedulerParameters. Currently, the names are not conveying what they modify in the below layer and are confusing.
For the historical design record, I think it would be good to write a short description of what memory tunables are available for each hypervisor, covering VMWare, OpenVZ, Xen, KVM and LXC (the latter two both cgroups-based). I do recall that OpenVZ in particular had a huge number of memory tunables.

Daniel

On Tue, 24 Aug 2010 11:07:29 +0100, "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Aug 24, 2010 at 03:17:44PM +0530, Nikunj A. Dadhania wrote:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte <matthias.bolte@googlemail.com> wrote:
<snip>
Yes the ESX driver allows to control ballooning through virDomainSetMemory and virDomainSetMaxMemory.
ESX itself also allows to set what's called memoryMinGaurantee in the thread, but this is not exposed in libvirt. LXC driver is using virDomainSetMemory to set the memory hard limit while QEmu/ESX use them to change the ballooning. And as you said, ESX does support memoryMinGaurantee, we can get this exported in libvirt using this new API.
Here I am trying to group all the memory related parameters into one single public API as we have in virDomainSetSchedulerParameters. Currently, the names are not conveying what they modify in the below layer and are confusing.
For historical design record, I think it would be good to write a short description of what memory tunables are available for each hypervisor, covering VMWare, OpenVZ, Xen, KVM and LXC (the latter both cgroups based). I do recall that OpenVZ in particular had a huge number of memory tunables.
I will collect the info and update. Nikunj

On Tue, 24 Aug 2010 11:07:29 +0100, "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Aug 24, 2010 at 03:17:44PM +0530, Nikunj A. Dadhania wrote:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte <matthias.bolte@googlemail.com> wrote:
<snip>
Yes the ESX driver allows to control ballooning through virDomainSetMemory and virDomainSetMaxMemory.
ESX itself also allows to set what's called memoryMinGaurantee in the thread, but this is not exposed in libvirt. LXC driver is using virDomainSetMemory to set the memory hard limit while QEmu/ESX use them to change the ballooning. And as you said, ESX does support memoryMinGaurantee, we can get this exported in libvirt using this new API.
Here I am trying to group all the memory related parameters into one single public API as we have in virDomainSetSchedulerParameters. Currently, the names are not conveying what they modify in the below layer and are confusing.
For historical design record, I think it would be good to write a short description of what memory tunables are available for each hypervisor, covering VMWare, OpenVZ, Xen, KVM and LXC (the latter both cgroups based). I do recall that OpenVZ in particular had a huge number of memory tunables.
This is an attempt at covering the memory tunables supported by the various hypervisors in libvirt. Let me know if I have missed any memory tunable. Moreover, input from the maintainers/key contributors of each HV on these parameters would be appreciated; it would help in getting complete coverage of the memory tunables that libvirt can support.

1) OpenVZ
=========
vmguarpages: Memory allocation guarantee, in pages.
kmemsize: Size of unswappable kernel memory (in bytes) allocated for processes in this container.
oomguarpages: The guaranteed amount of memory for the case where memory is "over-booked" (out-of-memory kill guarantee), in pages.
privvmpages: Memory allocation limit, in pages.

The OpenVZ driver does not implement any of these functions: domainSetMemory, domainSetMaxMemory, domainGetMaxMemory.

The driver does have an internal implementation for setting memory, openvzDomainSetMemoryInternal, which is driven by the domain xml file.

2) VMWare
=========
ConfiguredSize: Virtual memory the guest can have.
Shares: Priority of the VM, in case there is not enough memory or in case there is more memory. It has symbolic values like Low, Normal, High and Custom.
Reservation: Guaranteed lower bound on the amount of physical memory that the host reserves for the VM, even in the case of overcommit. The VM is allowed to allocate up to this level, and after it has hit the reservation, those pages are not reclaimed. If the guest is not using memory up to the reservation, the host can use that portion of memory.
Limit: The upper bound on the amount of physical memory that the host can allocate for the VM.
Memory Balloon

The ESX driver uses the following:
* domainSetMaxMemory to set the max virtual memory for the VM.
* domainSetMemory to inflate/deflate the balloon.
* ESX provides a lower bound (Reservation), but it is not being exploited currently.

3) Xen
======
maxmem_set: Maximum amount of memory reservation of the domain
mem_target_set: Set current memory usage of the domain

4) KVM & LXC
============
memory.limit_in_bytes: Memory hard limit
memory.soft_limit_in_bytes: Memory limit held during contention
memory.memsw.limit_in_bytes: Memory+Swap hard limit
memory.swappiness: Controls the tendency of moving the VM processes to swap. The value range is 0-100, where 0 means avoid swapping as long as possible and 100 means aggressively swap processes.

Statistics:
memory.usage_in_bytes: Current memory usage
memory.memsw.usage_in_bytes: Current memory+swap usage
memory.max_usage_in_bytes: Maximum memory usage recorded
memory.memsw.max_usage_in_bytes: Maximum memory+swap usage

Nikunj
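[Editorial illustration] Purely to show what the KVM/LXC tunables above look like at the cgroup filesystem level, a small sketch that writes limits into the memory controller files for a group. The /cgroup/memory mount point and the per-domain group path are assumptions (libvirt resolves these internally via src/util/cgroup.c), and error handling is minimal.

```c
#include <stdio.h>

/* Write a value into one file of the memory cgroup controller.
 * The mount point and the group name are assumptions for the sketch;
 * real code would discover them at runtime. */
static int cgroup_set_ull(const char *group, const char *file,
                          unsigned long long value)
{
    char path[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/cgroup/memory/%s/%s", group, file);
    if (!(fp = fopen(path, "w")))
        return -1;
    fprintf(fp, "%llu", value);
    return fclose(fp) == 0 ? 0 : -1;
}

int main(void)
{
    /* 1 GiB memory hard limit and a 768 MiB soft limit for "guest1". */
    cgroup_set_ull("libvirt/qemu/guest1", "memory.limit_in_bytes",
                   1024ULL * 1024 * 1024);
    cgroup_set_ull("libvirt/qemu/guest1", "memory.soft_limit_in_bytes",
                   768ULL * 1024 * 1024);
    return 0;
}
```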

On Mon, Aug 30, 2010 at 11:56 AM, Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote:
On Tue, 24 Aug 2010 11:07:29 +0100, "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Aug 24, 2010 at 03:17:44PM +0530, Nikunj A. Dadhania wrote:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte <matthias.bolte@googlemail.com> wrote:
<snip>
Yes the ESX driver allows to control ballooning through virDomainSetMemory and virDomainSetMaxMemory.
ESX itself also allows to set what's called memoryMinGaurantee in the thread, but this is not exposed in libvirt. LXC driver is using virDomainSetMemory to set the memory hard limit while QEmu/ESX use them to change the ballooning. And as you said, ESX does support memoryMinGaurantee, we can get this exported in libvirt using this new API.
Here I am trying to group all the memory related parameters into one single public API as we have in virDomainSetSchedulerParameters. Currently, the names are not conveying what they modify in the below layer and are confusing.
For historical design record, I think it would be good to write a short description of what memory tunables are available for each hypervisor, covering VMWare, OpenVZ, Xen, KVM and LXC (the latter both cgroups based). I do recall that OpenVZ in particular had a huge number of memory tunables.
This is an attempt at covering the memory tunables supported by various hypervisors in libvirt. Let me know if I have missed any memory tunable. Moreover, inputs from the maintainers/key contributes of each HVs on these parameters is appreciable. This would help in getting a complete coverage of the memory tunables that libvirt can support.
1) OpenVZ ========= vmguarpages: Memory allocation guarantee, in pages. kmemsize: Size of unswappable kernel memory(in bytes), allocated for processes in this container. oomguarpages: The guaranteed amount of memory for the case the memory is “over-booked” (out-of-memory kill guarantee), in pages. privvmpages: Memory allocation limit, in pages.
OpenVZ driver does not implement any of these functions: domainSetMemory domainSetMaxMemory domainGetMaxMemory
Although, the driver has an internal implementation for setting memory: openvzDomainSetMemoryInternal that is read from the domain xml file.
2) VMWare ========= ConfiuredSize: Virtual memory the guest can have.
Shares: Priority of the VM, in case there is not enough memory or in case when there is more memory. It has symbolic values like Low, Normal, High and Custom
Reservation: Gauranteed lower bound on the amount of the physical memory that the host reserves for the VM even in case of the overcommit. The VM is allowed to allocate till this level and after it has hit the reservation, those pages are not reclaimed. In case, if guest is not using till the reservation, the host can use that portion of memory.
Limit: This is the upper bound for the amount of physical memory that the host can allocate for the VM.
Memory Balloon
ESX driver uses following: * domainSetMaxMemory to set the max virtual memory for the VM. * domainSetMemory to inflate/deflate the balloon. * ESX provides lower bound(Reservation), but is not being exploited currently.
3) Xen ====== maxmem_set: Maximum amount of memory reservation of the domain mem_target_set: Set current memory usage of the domain
4) KVM & LXC ============ memory.limit_in_bytes: Memory hard limit memory.soft_limit_in_bytes: Memory limit held during contention
"held" might not be the right word for soft limit.
memory.memsw_limit_in_bytes: Memory+Swap hard limit memory.swapiness: Controls the tendency of moving the VM processes to the swap. Value range is 0-100, where 0 means, avoid swapping as long as possible and 100 means aggressively swap processes.
Statistics: memory.usage_in_bytes: Current memory usage memory.memsw_usage_in_bytes: Current memory+swap usage memory.max_usage_in_bytes: Maximum memory usage recorded memory.memsw_max_usage_in_bytes: Maximum memory+swap usage
We also have memory.stat and memory.use_hierarchy - the question is, do we care about hierarchical control? We also have controls to decide whether to move memory when moving from one cgroup to another; that might not apply to the LXC/QEMU case. There is also memory.failcnt, which I am not sure makes sense to export. Balbir
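[Editorial illustration] If memory.stat is exported through a statistics call, the per-cgroup file is simply a sequence of "name value" lines; a minimal sketch of reading one counter from it follows. The /cgroup/memory mount point and the group path are assumptions, as in the earlier sketch.

```c
#include <stdio.h>
#include <string.h>

/* Read a named counter (e.g. "rss" or "cache") out of memory.stat for a
 * cgroup; returns -1 if the file or key is not found. */
static long long memstat_read(const char *group, const char *key)
{
    char path[256], name[64];
    long long value;
    FILE *fp;

    snprintf(path, sizeof(path), "/cgroup/memory/%s/memory.stat", group);
    if (!(fp = fopen(path, "r")))
        return -1;
    while (fscanf(fp, "%63s %lld", name, &value) == 2) {
        if (strcmp(name, key) == 0) {
            fclose(fp);
            return value;
        }
    }
    fclose(fp);
    return -1;
}

int main(void)
{
    printf("rss = %lld bytes\n", memstat_read("libvirt/qemu/guest1", "rss"));
    return 0;
}
```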

On Mon, 30 Aug 2010 18:40:34 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
On Mon, Aug 30, 2010 at 11:56 AM, Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote: <snip>
4) KVM & LXC ============ memory.limit_in_bytes: Memory hard limit memory.soft_limit_in_bytes: Memory limit held during contention
"held" might not be the right word for soft limit. How about - Memory limit ensured during contention
memory.memsw_limit_in_bytes: Memory+Swap hard limit memory.swapiness: Controls the tendency of moving the VM processes to the swap. Value range is 0-100, where 0 means, avoid swapping as long as possible and 100 means aggressively swap processes.
Statistics: memory.usage_in_bytes: Current memory usage memory.memsw_usage_in_bytes: Current memory+swap usage memory.max_usage_in_bytes: Maximum memory usage recorded memory.memsw_max_usage_in_bytes: Maximum memory+swap usage
We also have memory.stat and memory.use_hierarchy - the question is, do we care about hierarchical control? We also have controls to decide whether to move memory when moving from one cgroup to another; that might not apply to the LXC/QEMU case. There is also memory.failcnt, which I am not sure makes sense to export.

memory.use_hierarchy is being used in libvirt; there is an internal API (virCgroupSetMemoryUseHierarchy) for enabling it. I am not sure if this should be exported.
memory.stat will be a good one for getting the stats. I will add it to the statistics section. Nikunj

2010/8/24 Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte <matthias.bolte@googlemail.com> wrote:
<snip>
Regarding future additions:
CPUHardLimit CPUSoftLimit CPUShare CPUPercentage IO_BW_Softlimit IO_BW_Hardlimit IO_BW_percentage
The CPU part of this is already possible via the virDomainSetSchedulerParameters API. But they aren't expressed in the domain XML config, maybe your suggesting to do this? Yes, thats correct for CPU.
IO would need API as well as XML config changes. Does ESX also support Block IO bandwidth control?
ESX 4.1 added StorageIOAllocation for VirtualDisks, which allows setting a limit (upper bound) and a shares (priority) value for storage I/O per disk. Matthias

On 08/24/2010 07:47 PM, Nikunj A. Dadhania wrote:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte<matthias.bolte@googlemail.com> wrote: <snip>
Regarding future additions:
IO_BW_Softlimit IO_BW_Hardlimit IO_BW_percentage
IO would need API as well as XML config changes. Does ESX also support Block IO bandwidth control?
Does anyone know if it's possible to control/manage/limit/(etc) the IO in terms other than bandwidth? Just thinking that for hard drive technology, the # of IO's per second can be just as important as the overall bandwidth. If it's not something that's controllable by anything we support though, then it probably doesn't matter. ;)

On Tue, Aug 24, 2010 at 01:05:26PM +0530, Balbir Singh wrote:
* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 11:53:27]:
Subject: [RFC] Memory controller exploitation in libvirt
Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that.
At present, QEmu uses memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guests memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.
Design 1: Provide new API and XML changes for resource management =================================================================
All the memory controller tunables are not supported with the current abstractions provided by the libvirt API. libvirt works on various OS. This new API will support GNU/Linux initially and as and when other platforms starts supporting memory tunables, the interface could be enabled for them. Adding following two function pointer to the virDriver interface.
1) domainSetMemoryParameters: which would take one or more name-value pairs. This makes the API extensible, and agnostic to the kind of parameters supported by various Hypervisors. 2) domainGetMemoryParameters: For getting current memory parameters
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into
virCgroupSetMemory() - already present in src/util/cgroup.c virCgroupGetMemory() - already present in src/util/cgroup.c virCgroupSetMemorySoftLimit() virCgroupSetMemoryHardLimit() virCgroupSetMemorySwapHardLimit() virCgroupGetStats()
Nope, we don't want cgroups exposed in the public API, since this has to be applicable to the VMWare and OpenVZ drivers too.
Parameter list supported:
MemoryHardLimits (memory.limits_in_bytes) - Maximum memory MemorySoftLimits (memory.softlimit_in_bytes) - Desired memory
Soft limits allows you to set memory limit on contention.
MemoryMinimumGaurantee - Minimum memory required (without this amount of memory, VM should not be started)
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap SwapSoftLimits (Currently not supported by kernel) - Desired swap space
We *dont* support SwapSoftLimits in the memory cgroup controller with no plans to support it in the future either at this point. The semantics are just too hard to get right at the moment.
That's not a huge problem. Since we have many hypervisors to support in libvirt, I expect the set of tunables will expand over time, and not every hypervisor driver in libvirt will support every tunable. They'll just pick the tunables that apply to them. We can leave SwapSoftLimits out of the public API until we find a HV that needs it.
Tunables memory.limit_in_bytes, memory.softlimit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.
I am not an expert here, so just listing what new elements need to be added to the XML schema:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
I'd prefer a syntax that integrates well with what we currently have
<cgroup> <path>...</path> <controller> <name>..</name> <soft limit>...</> <hard limit>...</> </controller> ... </cgroup>
That is exposing far too much info about the cgroups implementation details. The XML representation needs to be decoupled from the implementation.

Regards, Daniel

* Daniel P. Berrange <berrange@redhat.com> [2010-08-24 11:02:44]:
On Tue, Aug 24, 2010 at 01:05:26PM +0530, Balbir Singh wrote:
* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> [2010-08-24 11:53:27]:
Subject: [RFC] Memory controller exploitation in libvirt
Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that.
At present, QEmu uses memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guests memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.
Design 1: Provide new API and XML changes for resource management =================================================================
All the memory controller tunables are not supported with the current abstractions provided by the libvirt API. libvirt works on various OS. This new API will support GNU/Linux initially and as and when other platforms starts supporting memory tunables, the interface could be enabled for them. Adding following two function pointer to the virDriver interface.
1) domainSetMemoryParameters: which would take one or more name-value pairs. This makes the API extensible, and agnostic to the kind of parameters supported by various Hypervisors. 2) domainGetMemoryParameters: For getting current memory parameters
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into
virCgroupSetMemory() - already present in src/util/cgroup.c virCgroupGetMemory() - already present in src/util/cgroup.c virCgroupSetMemorySoftLimit() virCgroupSetMemoryHardLimit() virCgroupSetMemorySwapHardLimit() virCgroupGetStats()
Nope, we don't want cgroups exposed in the public API, since this has to be applicable to the VMWare and OpenVZ drivers too.
I am not talking about exposing these as public APIs; rather, they would be part of src/util/cgroup.c and utilized by the qemu driver. It is good to abstract out the OS-independent parts, but my concern was double exposure through APIs like the driver->setMemory() that is currently used and the newer API.
Parameter list supported:
MemoryHardLimits (memory.limits_in_bytes) - Maximum memory MemorySoftLimits (memory.softlimit_in_bytes) - Desired memory
Soft limits allows you to set memory limit on contention.
MemoryMinimumGaurantee - Minimum memory required (without this amount of memory, VM should not be started)
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap SwapSoftLimits (Currently not supported by kernel) - Desired swap space
We *dont* support SwapSoftLimits in the memory cgroup controller with no plans to support it in the future either at this point. The semantics are just too hard to get right at the moment.
That's not a huge problem. Since we have many hypervisors to support in libvirt, I expect the set of tunables will expand over time, and not every hypervisor driver in libvirt will support every tunable. They'll just pick the tunables that apply to them. We can leave SwapSoftLimits out of the public API until we find a HV that needs it
Tunables memory.limit_in_bytes, memory.softlimit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.
I am not an expert here, so just listing what new elements need to be added to the XML schema:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
I'd prefer a syntax that integrates well with what we currently have
<cgroup> <path>...</path> <controller> <name>..</name> <soft limit>...</> <hard limit>...</> </controller> ... </cgroup>
That is exposing far too much info about the cgroups implementation details. The XML representation needs to be decouple from the implementation.
Don't we already expose a lot of information about qemu, for example about vhost-net or cmdline/virtio etc., in the qemu configuration of a guest? I am not opposed to having a higher-level abstraction, but I am concerned that some of the nitty-gritty details like swappiness (yes, that is a tunable) or the interpretation of stats might vary widely across operating systems. Hence, I felt it is better to expose it as a part of the qemu-cgroup-linux driver combo.

-- Three Cheers, Balbir

On Tue, Aug 24, 2010 at 11:53:27AM +0530, Nikunj A. Dadhania wrote:
Subject: [RFC] Memory controller exploitation in libvirt
Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that.
At present, QEmu uses memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guests memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.
Exposing the tunables is nice, but there is another related problem: we don't provide apps enough information to use them effectively. E.g., they configure a guest with 500 MB of RAM. How much RAM does QEMU actually use? 500 MB + X MB more. We need to give apps an indication of what the 'X' overhead is. Some of it comes from the video RAM; some is pure QEMU emulation overhead.
Design 1: Provide new API and XML changes for resource management =================================================================
All the memory controller tunables are not supported with the current abstractions provided by the libvirt API. libvirt works on various OS. This new API will support GNU/Linux initially and as and when other platforms starts supporting memory tunables, the interface could be enabled for them. Adding following two function pointer to the virDriver interface.
1) domainSetMemoryParameters: which would take one or more name-value pairs. This makes the API extensible, and agnostic to the kind of parameters supported by various Hypervisors. 2) domainGetMemoryParameters: For getting current memory parameters
Corresponding libvirt public API: int virDomainSetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams); int virDomainGetMemoryParamters (virDomainPtr domain, virMemoryParamterPtr params, unsigned int nparams);
Parameter list supported:
MemoryHardLimits (memory.limits_in_bytes) - Maximum memory MemorySoftLimits (memory.softlimit_in_bytes) - Desired memory MemoryMinimumGaurantee - Minimum memory required (without this amount of memory, VM should not be started)
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap SwapSoftLimits (Currently not supported by kernel) - Desired swap space
Tunables memory.limit_in_bytes, memory.softlimit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.
I am not an expert here, so just listing what new elements need to be added to the XML schema:
<define name="resource"> <element memory> <element memoryHardLimit/> <element memorySoftLimit/> <element memoryMinGaurantee/> <element swapHardLimit/> <element swapSoftLimit/> </element> </define>
Pros: * Support all the tunables exported by the kernel * More tunables can be added as and when required
Cons: * Code changes would touch various levels
Not a problem.
* Might need to redefine(changing the scope) of existing memory API. Currently, domainSetMemory is used to set limit_in_bytes in LXC and memory ballooning in QEmu. While the domainSetMaxMemory is not defined in QEmu and in case of LXC it is setting the internal object's maxmem variable.
Yep, might need to clarify LXC a little bit.
Future:
* Later on, CPU/IO/Network controllers related tunables can be added/enhanced along with the APIs/XML elements:
CPUHardLimit CPUSoftLimit CPUShare CPUPercentage IO_BW_Softlimit IO_BW_Hardlimit IO_BW_percentage
We have APIs to cope with CPU tunables, but no persistent XML representation. We have nothing for IO
* libvirt-cim support for resource management
Design 2: Reuse the current memory APIs in libvirt ==================================================
Use memory.limit_in_bytes to tweak memory hard limits Init - Set the memory.limit_in_bytes to maximum mem.
Claiming memory from guest: a) Reduce balloon size b) If the guest does not co-operate(How do we know?), reduce memory.limit_in_bytes.
Allocating memory more than max memory: How to solve this? As we have already set the max balloon size. We can only play within this!
Pros: * Few changes * Is not intrusive
Cons: * SetMemory and SetMaxMemory usage is confusing. * SetMemory is too generic a name, it does not cover all the tunables. * Does not support memory softlimit * Does not have support to reserve the memory swap region * This solution is not extensible
IMO, "Design 1" is more generic and extensible for various memory tuneables.
Agreed, the current approach to memory is not flexible enough. It only really fits control of the overall memory allocation + balloon level. In things like LXC we've rather twisted the meaning. Design 1 will clear up a lot of this mess.

Daniel

On Tue, Aug 24, 2010 at 11:53:27AM +0530, Nikunj A. Dadhania wrote:
Subject: [RFC] Memory controller exploitation in libvirt
Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that.
At present, QEmu uses memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guests memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.
On Tue, 24 Aug 2010 10:59:52 +0100, "Daniel P. Berrange" <berrange@redhat.com> wrote:

Exposing the tunables is nice, but there is another related problem. We don't provide apps enough information to effectively use them. eg, they configure a guest with 500 MB of RAM. How much RAM does QEMU actually use? 500 MB + X MB more. We need to give apps an indication of what the 'X' overhead is. Some of it comes from the video RAM. Some is pure QEMU emulation overhead.

Can we cover these in the memory statistics API, i.e. domainGetMemoryParameters, with parameter types MemoryUsage and MemoryOverhead?
* Might need to redefine(changing the scope) of existing memory API. Currently, domainSetMemory is used to set limit_in_bytes in LXC and memory ballooning in QEmu. While the domainSetMaxMemory is not defined in QEmu and in case of LXC it is setting the internal object's maxmem variable.
Yep, might need to clarify LXC a little bit.

Sure. When domainSetMaxMemory is called in the case of LXC, it sets vm->def->maxmem (vm is of type virDomainObjPtr and def is of type virDomainDefPtr) if the new maxmem is greater than the current memory.

domainSetMemory for LXC: sets the memory cgroup file memory.limit_in_bytes if the new memory being set is less than the maximum permissible VM memory (vm->def->maxmem).

Nikunj
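[Editorial illustration] A rough sketch of the LXC behaviour just described, purely to make the two code paths explicit. The struct, function and variable names are simplified stand-ins for the real lxc driver code; virCgroupSetMemory() is the existing helper in src/util/cgroup.c, with its signature assumed here.

```c
/* Simplified sketch of the LXC driver behaviour described above.
 * "struct dom" stands in for the driver's virDomainObjPtr. */

struct dom {
    unsigned long maxmem;   /* KiB, vm->def->maxmem */
    unsigned long curmem;   /* KiB, current memory   */
    void *cgroup;           /* handle used by virCgroupSetMemory() */
};

int virCgroupSetMemory(void *group, unsigned long kb);  /* src/util/cgroup.c */

/* domainSetMaxMemory: only records the new ceiling in the object. */
static int lxcSetMaxMemory(struct dom *dom, unsigned long newmax)
{
    if (newmax < dom->curmem)
        return -1;              /* ceiling must stay above current usage */
    dom->maxmem = newmax;
    return 0;
}

/* domainSetMemory: writes memory.limit_in_bytes through the cgroup helper. */
static int lxcSetMemory(struct dom *dom, unsigned long newmem)
{
    if (newmem > dom->maxmem)
        return -1;              /* cannot exceed the permitted maximum */
    if (virCgroupSetMemory(dom->cgroup, newmem) < 0)
        return -1;
    dom->curmem = newmem;
    return 0;
}
```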
participants (5): Balbir Singh, Daniel P. Berrange, Justin Clift, Matthias Bolte, Nikunj A. Dadhania