[libvirt] [PATCH] qemu: Drop qemuDomainMemoryLimit

This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index dc949db..9673e8e 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -428,8 +428,7 @@ qemuSetupMemoryCgroup(virDomainObjPtr vm) } } - if (virCgroupSetMemoryHardLimit(priv->cgroup, - qemuDomainMemoryLimit(vm->def)) < 0) + if (virCgroupSetMemoryHardLimit(priv->cgroup, vm->def->mem.hard_limit) < 0) return -1; if (vm->def->mem.soft_limit != 0 && diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index b811e1d..a0a1773 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -9220,7 +9220,7 @@ qemuBuildCommandLine(virConnectPtr conn, } if (mlock) - virCommandSetMaxMemLock(cmd, qemuDomainMemoryLimit(def) * 1024); + virCommandSetMaxMemLock(cmd, def->mem.hard_limit * 1024); virObjectUnref(cfg); return cmd; diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 393af6b..7f4d17d 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2306,55 +2306,6 @@ cleanup: return ret; } - -unsigned long long -qemuDomainMemoryLimit(virDomainDefPtr def) -{ - unsigned long long mem; - size_t i; - - if (def->mem.hard_limit) { - mem = def->mem.hard_limit; - } else { - /* If there is no hard_limit set, compute a reasonable one to avoid - * system thrashing caused by exploited qemu. A 'reasonable - * limit' has been chosen: - * (1 + k) * (domain memory + total video memory) + (32MB for - * cache per each disk) + F - * where k = 0.5 and F = 400MB. The cache for disks is important as - * kernel cache on the host side counts into the RSS limit. - * Moreover, VFIO requires some amount for IO space. Alex Williamson - * suggested adding 1GiB for IO space just to be safe (some finer - * tuning might be nice, though). - * - * Technically, the disk cache does not have to be included in - * RLIMIT_MEMLOCK but it doesn't hurt as it's just an upper limit and - * it makes this function and its usage simpler. - */ - mem = def->mem.max_balloon; - for (i = 0; i < def->nvideos; i++) - mem += def->videos[i]->vram; - mem *= 1.5; - mem += def->ndisks * 32768; - mem += 409600; - - for (i = 0; i < def->nhostdevs; i++) { - virDomainHostdevDefPtr hostdev = def->hostdevs[i]; - if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && - hostdev->source.subsys.type == - VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - hostdev->source.subsys.u.pci.backend == - VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - mem += 1024 * 1024; - break; - } - } - } - - return mem; -} - - int qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, virDomainObjPtr vm) diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 0a4a51e..21f116c 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -365,8 +365,6 @@ extern virDomainXMLPrivateDataCallbacks virQEMUDriverPrivateDataCallbacks; extern virDomainXMLNamespace virQEMUDriverDomainXMLNamespace; extern virDomainDefParserConfig virQEMUDriverDomainDefParserConfig; -unsigned long long qemuDomainMemoryLimit(virDomainDefPtr def); - int qemuDomainUpdateDeviceList(virQEMUDriverPtr driver, virDomainObjPtr vm); diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index c9748d9..fa64dd7 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -1030,7 +1030,7 @@ int qemuDomainAttachHostPciDevice(virQEMUDriverPtr driver, */ vm->def->hostdevs[vm->def->nhostdevs++] = hostdev; virProcessSetMaxMemLock(vm->pid, - qemuDomainMemoryLimit(vm->def) * 1024); + vm->def->mem.hard_limit * 1024); vm->def->hostdevs[vm->def->nhostdevs--] = NULL; } -- 1.8.1.5

On 08/09/2013 06:56 AM, Michal Privoznik wrote:
This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-)
ACK. Users that put an explicit limit in their XML are taking on their own risk at guessing correctly; all other users should not be forced to suffer from a bad guess on our part killing their domain. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote:
On 08/09/2013 06:56 AM, Michal Privoznik wrote:
This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-)
ACK. Users that put an explicit limit in their XML are taking on their own risk at guessing correctly; all other users should not be forced to suffer from a bad guess on our part killing their domain.
If we don't understand how to calculate a default limit that works, how are users with even less knowledge than us, suppose to calculate an explicit level of their own ? This limit was designed so that the hosts are not vulnerable to DOS attack from a compromised QEMU, so removing this is arguably introducing a security weakness in our default deployment. I think I'd like to see some feedback / agreement from QEMU developers that this problem really can't be solved, before we remove it. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

[CC'ing qemu-devel list] On 09.08.2013 15:17, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote:
On 08/09/2013 06:56 AM, Michal Privoznik wrote:
This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-)
ACK. Users that put an explicit limit in their XML are taking on their own risk at guessing correctly; all other users should not be forced to suffer from a bad guess on our part killing their domain.
If we don't understand how to calculate a default limit that works, how are users with even less knowledge than us, suppose to calculate an explicit level of their own ?
This limit was designed so that the hosts are not vulnerable to DOS attack from a compromised QEMU, so removing this is arguably introducing a security weakness in our default deployment.
I think I'd like to see some feedback / agreement from QEMU developers that this problem really can't be solved, before we remove it.
Daniel
In libvirt I've introduced a heuristic to guess the maximum limit for a memory for a given VM definition. The rationale was "it's better to be safe by default" and not let leaking qemu trash the host. The heuristic is only used if user has not configured any limit himself. However, over the time the number of users reporting OOM kills due to my heuristic has grown. Finally, I've full nose of this problem so I've made a patch [1] that removes this 'functionality' completely (I'd say it's bug after all). In the patch you can see the heuristic we've converged to. But Dan has his point. If libvirt & qemu devels aren't able to come up with proper heuristic, how can an ordinary user (who doesn't have any knowledge of code) do so? So before I apply my patch, I want to ask you guys, what do you think about it. Michal 1: https://www.redhat.com/archives/libvir-list/2013-August/msg00437.html

Michal Privoznik <mprivozn@redhat.com> writes:
[CC'ing qemu-devel list] On 09.08.2013 15:17, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote:
On 08/09/2013 06:56 AM, Michal Privoznik wrote:
This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-)
ACK. Users that put an explicit limit in their XML are taking on their own risk at guessing correctly; all other users should not be forced to suffer from a bad guess on our part killing their domain.
If we don't understand how to calculate a default limit that works, how are users with even less knowledge than us, suppose to calculate an explicit level of their own ?
This limit was designed so that the hosts are not vulnerable to DOS attack from a compromised QEMU, so removing this is arguably introducing a security weakness in our default deployment.
I think I'd like to see some feedback / agreement from QEMU developers that this problem really can't be solved, before we remove it.
Daniel
In libvirt I've introduced a heuristic to guess the maximum limit for a memory for a given VM definition. The rationale was "it's better to be safe by default" and not let leaking qemu trash the host. The heuristic is only used if user has not configured any limit himself. However, over the time the number of users reporting OOM kills due to my heuristic has grown. Finally, I've full nose of this problem so I've made a patch [1] that removes this 'functionality' completely (I'd say it's bug after all). In the patch you can see the heuristic we've converged to. But Dan has his point. If libvirt & qemu devels aren't able to come up with proper heuristic, how can an ordinary user (who doesn't have any knowledge of code) do so? So before I apply my patch, I want to ask you guys, what do you think about it.
Even if we had an algorithm for calculating memory overhead (we don't), glibc will still introduce uncertainty since malloc(size) doesn't translate to allocating size bytes from the kernel. When you throw in fragmentation too it becomes extremely hard to predict. The only practical way of doing this would be to have QEMU gracefully handle malloc() == NULL so that you could set a limit and gracefully degrade. We don't though so setting a limit is likely to get you in trouble. Regards, Anthony Liguori
Michal
1: https://www.redhat.com/archives/libvir-list/2013-August/msg00437.html

On Fri, Aug 09, 2013 at 10:58:55AM -0500, Anthony Liguori wrote:
Michal Privoznik <mprivozn@redhat.com> writes:
[CC'ing qemu-devel list] On 09.08.2013 15:17, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote:
On 08/09/2013 06:56 AM, Michal Privoznik wrote:
This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-)
ACK. Users that put an explicit limit in their XML are taking on their own risk at guessing correctly; all other users should not be forced to suffer from a bad guess on our part killing their domain.
If we don't understand how to calculate a default limit that works, how are users with even less knowledge than us, suppose to calculate an explicit level of their own ?
This limit was designed so that the hosts are not vulnerable to DOS attack from a compromised QEMU, so removing this is arguably introducing a security weakness in our default deployment.
I think I'd like to see some feedback / agreement from QEMU developers that this problem really can't be solved, before we remove it.
Daniel
In libvirt I've introduced a heuristic to guess the maximum limit for a memory for a given VM definition. The rationale was "it's better to be safe by default" and not let leaking qemu trash the host. The heuristic is only used if user has not configured any limit himself. However, over the time the number of users reporting OOM kills due to my heuristic has grown. Finally, I've full nose of this problem so I've made a patch [1] that removes this 'functionality' completely (I'd say it's bug after all). In the patch you can see the heuristic we've converged to. But Dan has his point. If libvirt & qemu devels aren't able to come up with proper heuristic, how can an ordinary user (who doesn't have any knowledge of code) do so? So before I apply my patch, I want to ask you guys, what do you think about it.
Even if we had an algorithm for calculating memory overhead (we don't), glibc will still introduce uncertainty since malloc(size) doesn't translate to allocating size bytes from the kernel. When you throw in fragmentation too it becomes extremely hard to predict.
The only practical way of doing this would be to have QEMU gracefully handle malloc() == NULL so that you could set a limit and gracefully degrade. We don't though so setting a limit is likely to get you in trouble.
So you're saying there's no way we can define a reasonable limit on a QEMU guest to prevent a compomised QEMU exhausting all host memory ? It rather sucks if that's the position we're in :-( Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 09.08.2013 18:29, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 10:58:55AM -0500, Anthony Liguori wrote:
Michal Privoznik <mprivozn@redhat.com> writes:
[CC'ing qemu-devel list] On 09.08.2013 15:17, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote:
On 08/09/2013 06:56 AM, Michal Privoznik wrote:
So what's the conclusion? Should I push the patch until qemu gets stabilized in mem consumption (which can take a while - no offense to qemu devels, I can imagine it's nontrivial). Michal

On Mon, Aug 19, 2013 at 10:29:10AM +0200, Michal Privoznik wrote:
On 09.08.2013 18:29, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 10:58:55AM -0500, Anthony Liguori wrote:
Michal Privoznik <mprivozn@redhat.com> writes:
[CC'ing qemu-devel list] On 09.08.2013 15:17, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote:
On 08/09/2013 06:56 AM, Michal Privoznik wrote:
So what's the conclusion? Should I push the patch until qemu gets stabilized in mem consumption (which can take a while - no offense to qemu devels, I can imagine it's nontrivial).
Given the lack of any useful info to make memory limits work reliably, I guess we don't have any choice but to remove this default memory limit, and also recommend against people setting explicit memory limits too. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 19.08.2013 11:06, Daniel P. Berrange wrote:
On Mon, Aug 19, 2013 at 10:29:10AM +0200, Michal Privoznik wrote:
On 09.08.2013 18:29, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 10:58:55AM -0500, Anthony Liguori wrote:
Michal Privoznik <mprivozn@redhat.com> writes:
[CC'ing qemu-devel list] On 09.08.2013 15:17, Daniel P. Berrange wrote:
On Fri, Aug 09, 2013 at 07:13:58AM -0600, Eric Blake wrote: > On 08/09/2013 06:56 AM, Michal Privoznik wrote:
So what's the conclusion? Should I push the patch until qemu gets stabilized in mem consumption (which can take a while - no offense to qemu devels, I can imagine it's nontrivial).
Given the lack of any useful info to make memory limits work reliably, I guess we don't have any choice but to remove this default memory limit, and also recommend against people setting explicit memory limits too.
Regards, Daniel
Okay, I've pushed the patch. And I've just proposed patch to discourage users setting this limit: https://www.redhat.com/archives/libvir-list/2013-August/msg00820.html Michal

Am 09.08.2013 17:58, schrieb Anthony Liguori:
Even if we had an algorithm for calculating memory overhead (we don't), glibc will still introduce uncertainty since malloc(size) doesn't translate to allocating size bytes from the kernel. When you throw in fragmentation too it becomes extremely hard to predict.
The only practical way of doing this would be to have QEMU gracefully handle malloc() == NULL so that you could set a limit and gracefully degrade. We don't though so setting a limit is likely to get you in trouble.
FWIW my QOM realize work is targetted at reducing likelihood that device_add blows up QEMU due to OOM in object_new(). But before I can change qdev-monitor.c I still need to tweak core QOM to either get at TypeImpl::instance_size or to introduce an object_try_new() function using g_try_malloc0() rather than g_malloc0(). That's where proper child struct composition comes into play. The major variance in runtime memory consumption was so far attributed to block and network I/O, without ever getting exact proof points... Regards, Andreas -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

On 08/09/2013 02:56 PM, Michal Privoznik wrote:
This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again. --- src/qemu/qemu_cgroup.c | 3 +-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 49 ------------------------------------------------- src/qemu/qemu_domain.h | 2 -- src/qemu/qemu_hotplug.c | 2 +- 5 files changed, 3 insertions(+), 55 deletions(-)
I can't start any domains that don't have the hard limit set with this patch. It seems we should only set the limit if its non-zero. Jan
participants (6)
-
Andreas Färber
-
Anthony Liguori
-
Daniel P. Berrange
-
Eric Blake
-
Ján Tomko
-
Michal Privoznik