[libvirt] [BUG] mlock support breakage

[libvirt] [PATCH v3 00/15]...

[libvirt] [PATCH 0/3] Debug:...

Luiz Capitulino

Monday, 13 March 2017 Mon, 13 Mar '17

11:58 a.m.

Libvirt commit c2e60ad0e51 added a new check to the XML validation logic where XMLs containing <memoryBacking><mlocked/> must also contain <memtune><hard_limit>. This causes two breakages where working guests won't start anymore: 1. Systems where mlock limit was set in /etc/security/limits.conf 2. Guests using hugeTLB pages. In this case, guests were relying on the fact that libvirt automagically increases mlock limit to 1GB While it's true that <memoryBacking><mlocked/> documentation says that <memtune><hard_limit> is required, this is actually an extremely bad request because: A. <memtune><hard_limit> own documention strongly recommends NOT using it B. <memtune><hard_limit> does more than setting memory locking limit C. <memtune><hard_limit> does not support infinity, so you have to guess a limit D. If <memtune><hard_limit> is less than 1GB, it will lower VFIO's automatic limit of "guest memory + 1GB" Here's two possible solutions to fix this all: 1. Drop change c2e60ad0e51 and drop automatic increases. Let users configure limits in /etc/security/limits.conf pros: this is the most correct way to do it, and how it should be done originally IMHO cons: will break working VFIO setups, so probably undoable 2. Drop change c2e60ad0e51 and automtically increase memory locking limit to infinity when seeing <memoryBacking><locked/> pros: make all cases work, no more <hard_limit> requirement cons: allows guests with <locked/> to lock all memory assigned to them plus QEMU allocations. While this seems undesirable or even a security issue, using <hard_limit> will have the same effect Lastly, <locked/> doesn't belong to <memoryBacking>, it should be in <memtune>. I recommend deprecating it from <memoryBacking> and adding it where it belongs.

Show replies by date

Daniel P. Berrange

Monday, 13 March Mon, 13 Mar

12:08 p.m.

On Mon, Mar 13, 2017 at 11:58:24AM -0400, Luiz Capitulino wrote:

...

I'm surprised if that has any effect, unless you were setting it against the root user. The limits.conf file is loaded by PAM, and when libvirtd spawns QEMU, PAM is not involved, so limits.conf will never be activated. This is why libvirt provides max_processes/max_files/max_core settings in /etc/libvirt/qemu.conf - you can't set those in limits.conf and have them work - unless you set them against root, so libvirtd itself got the higher limits which are then inherited by QEMU.

...

2. Guests using hugeTLB pages. In this case, guests were relying on the fact that libvirt automagically increases mlock limit to 1GB

Yep, that's bad - we mustn't break previously working scenarios like this, even if there were not following documented practice.

...

While it's true that <memoryBacking><mlocked/> documentation says that <memtune><hard_limit> is required, this is actually an extremely bad request because: A. <memtune><hard_limit> own documention strongly recommends NOT using it

Yep, hard limit is impossible to calculate reliably since no one has been able to provide an accurate way to predict QEMU's peak memory usage. When libvirt previously set hard_limit by default, we got many bug reports about guest's killed by the OOM killer, no matter what algorithm we tried.

...

B. <memtune><hard_limit> does more than setting memory locking limit C. <memtune><hard_limit> does not support infinity, so you have to guess a limit D. If <memtune><hard_limit> is less than 1GB, it will lower VFIO's automatic limit of "guest memory + 1GB" Here's two possible solutions to fix this all: 1. Drop change c2e60ad0e51 and drop automatic increases. Let users configure limits in /etc/security/limits.conf pros: this is the most correct way to do it, and how it should be done originally IMHO cons: will break working VFIO setups, so probably undoable

limits.conf is useless - see above.

...

2. Drop change c2e60ad0e51 and automtically increase memory locking limit to infinity when seeing <memoryBacking><locked/> pros: make all cases work, no more <hard_limit> requirement cons: allows guests with <locked/> to lock all memory assigned to them plus QEMU allocations. While this seems undesirable or even a security issue, using <hard_limit> will have the same effect

I think this is the only viable approach, given that no one can provide a way to reliably calculate QEMU peak memory usage. Unless we want to take guest RAM + $LARGE NUMBER - eg just blindly assume that 2 GB is enough for QEMU working set, so for an 8 GB guest, just set 10 GB as the limit.

...

Lastly, <locked/> doesn't belong to <memoryBacking>, it should be in <memtune>. I recommend deprecating it from <memoryBacking> and adding it where it belongs.

We never make these kind of changes in libvirt XML. It is sub-optimal location, but it has no functional problem, so there's no functional benefit to moving it and clear backcompat downsides. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

Luiz Capitulino

12:35 p.m.

On Mon, 13 Mar 2017 16:08:58 +0000 "Daniel P. Berrange" <berrange(a)redhat.com> wrote:

...

On Mon, Mar 13, 2017 at 11:58:24AM -0400, Luiz Capitulino wrote: > > Libvirt commit c2e60ad0e51 added a new check to the XML validation > logic where XMLs containing <memoryBacking><mlocked/> must also > contain <memtune><hard_limit>. This causes two breakages where > working guests won't start anymore: > > 1. Systems where mlock limit was set in /etc/security/limits.conf I'm surprised if that has any effect, unless you were setting it against the root user. The limits.conf file is loaded by PAM, and when libvirtd spawns QEMU, PAM is not involved, so limits.conf will never be activated. This is why libvirt provides max_processes/max_files/max_core settings in /etc/libvirt/qemu.conf - you can't set those in limits.conf and have them work - unless you set them against root, so libvirtd itself got the higher limits which are then inherited by QEMU.

My quick test seemed to work, but I may have declared success before the guest had time to crash. I'll double check this, but we agree about the best way to fix it.

...

> 2. Drop change c2e60ad0e51 and automtically increase memory > locking limit to infinity when seeing <memoryBacking><locked/> > > pros: make all cases work, no more <hard_limit> requirement > > cons: allows guests with <locked/> to lock all memory > assigned to them plus QEMU allocations. While this seems > undesirable or even a security issue, using <hard_limit> > will have the same effect I think this is the only viable approach, given that no one can provide a way to reliably calculate QEMU peak memory usage. Unless we want to take guest RAM + $LARGE NUMBER - eg just blindly assume that 2 GB is enough for QEMU working set, so for an 8 GB guest, just set 10 GB as the limit.

Better to set it to infinity and be done with it.

...

> Lastly, <locked/> doesn't belong to <memoryBacking>, it should > be in <memtune>. I recommend deprecating it from <memoryBacking> > and adding it where it belongs. We never make these kind of changes in libvirt XML. It is sub-optimal location, but it has no functional problem, so there's no functional benefit to moving it and clear backcompat downsides.

Fine with me.

Daniel P. Berrange

12:43 p.m.

On Mon, Mar 13, 2017 at 12:35:42PM -0400, Luiz Capitulino wrote:

...

On Mon, 13 Mar 2017 16:08:58 +0000 "Daniel P. Berrange" <berrange(a)redhat.com> wrote: > > 2. Drop change c2e60ad0e51 and automtically increase memory > > locking limit to infinity when seeing <memoryBacking><locked/> > > > > pros: make all cases work, no more <hard_limit> requirement > > > > cons: allows guests with <locked/> to lock all memory > > assigned to them plus QEMU allocations. While this seems > > undesirable or even a security issue, using <hard_limit> > > will have the same effect > > I think this is the only viable approach, given that no one can > provide a way to reliably calculate QEMU peak memory usage. > Unless we want to take guest RAM + $LARGE NUMBER - eg just blindly > assume that 2 GB is enough for QEMU working set, so for an 8 GB > guest, just set 10 GB as the limit. Better to set it to infinity and be done with it.

Not neccessarily, no. If we set $RAM + $LARGE_NUMBER, we are still likely to be well below the overall physical RAM of a host. IOW, a single compromised QEMU would still be restricted in how much it can pin. If we set it to infinity then a compromised QEMU can lock all of physical RAM doing a very effective DOS on the host, as it can't even swap the guest out to recover. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

Luiz Capitulino

1:53 p.m.

On Mon, 13 Mar 2017 16:43:46 +0000 "Daniel P. Berrange" <berrange(a)redhat.com> wrote:

...

On Mon, Mar 13, 2017 at 12:35:42PM -0400, Luiz Capitulino wrote: > On Mon, 13 Mar 2017 16:08:58 +0000 > "Daniel P. Berrange" <berrange(a)redhat.com> wrote: > > > > 2. Drop change c2e60ad0e51 and automtically increase memory > > > locking limit to infinity when seeing <memoryBacking><locked/> > > > > > > pros: make all cases work, no more <hard_limit> requirement > > > > > > cons: allows guests with <locked/> to lock all memory > > > assigned to them plus QEMU allocations. While this seems > > > undesirable or even a security issue, using <hard_limit> > > > will have the same effect > > > > I think this is the only viable approach, given that no one can > > provide a way to reliably calculate QEMU peak memory usage. > > Unless we want to take guest RAM + $LARGE NUMBER - eg just blindly > > assume that 2 GB is enough for QEMU working set, so for an 8 GB > > guest, just set 10 GB as the limit.

I forgot to say that I'm fine with this solution, provided that we drop the requirement on using <hard_limit> with <locked/> and revert commit c2e60ad0e51.

...

> Better to set it to infinity and be done with it. Not neccessarily, no. If we set $RAM + $LARGE_NUMBER, we are still likely to be well below the overall physical RAM of a host. IOW, a single compromised QEMU would still be restricted in how much it can pin. If we set it to infinity then a compromised QEMU can lock all of physical RAM doing a very effective DOS on the host, as it can't even swap the guest out to recover.

OK, you're right. I personally don't like we're putting a random cap on QEMU memory allocations, but if it's large enough it shouldn't be a problem (I hope).

Luiz Capitulino

2:08 p.m.

On Mon, 13 Mar 2017 13:53:33 -0400 Luiz Capitulino <lcapitulino(a)redhat.com> wrote:

...

OK, you're right. I personally don't like we're putting a random cap on QEMU memory allocations, but if it's large enough it shouldn't be a problem (I hope).

The I hope part meaning, if we do find legitimate reasons for QEMU's address space to go beyond $LARGE_NUMBER, it will be means of guests randomly crashing when using <locked/>.

Daniel P. Berrange

2:16 p.m.

On Mon, Mar 13, 2017 at 02:08:30PM -0400, Luiz Capitulino wrote:

...

On Mon, 13 Mar 2017 13:53:33 -0400 Luiz Capitulino <lcapitulino(a)redhat.com> wrote: > OK, you're right. I personally don't like we're putting a random cap > on QEMU memory allocations, but if it's large enough it shouldn't be > a problem (I hope). The I hope part meaning, if we do find legitimate reasons for QEMU's address space to go beyond $LARGE_NUMBER, it will be means of guests randomly crashing when using <locked/>.

NB if we did enforce $RAM + $LARGE_NUMBER, then I'd suggest we did set a default hard_limit universally once more, not only set a mlock limit when using <locked/>. This would at least ensure we see consistent (bad) behaviour rather than have edge cases that only appeared when <locked/> was present. We would need $LARGE_NUMBER to be much more conservative than what we used in the past though, to avoid hitting the same problems once again. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

Luiz Capitulino

2:24 p.m.

On Mon, 13 Mar 2017 18:16:49 +0000 "Daniel P. Berrange" <berrange(a)redhat.com> wrote:

...

On Mon, Mar 13, 2017 at 02:08:30PM -0400, Luiz Capitulino wrote: > On Mon, 13 Mar 2017 13:53:33 -0400 > Luiz Capitulino <lcapitulino(a)redhat.com> wrote: > > > OK, you're right. I personally don't like we're putting a random cap > > on QEMU memory allocations, but if it's large enough it shouldn't be > > a problem (I hope). > > The I hope part meaning, if we do find legitimate reasons for QEMU's > address space to go beyond $LARGE_NUMBER, it will be means of guests > randomly crashing when using <locked/>. NB if we did enforce $RAM + $LARGE_NUMBER, then I'd suggest we did set a default hard_limit universally once more, not only set a mlock limit when using <locked/>. This would at least ensure we see consistent (bad) behaviour rather than have edge cases that only appeared when <locked/> was present.

Makes me even more nervous, but I agree with your reasoning. Btw, do we have a volunteer to do this work? Andrea?

...

We would need $LARGE_NUMBER to be much more conservative than what we used in the past though, to avoid hitting the same problems once again. Regards, Daniel

Andrea Bolognani

Tuesday, 14 March Tue, 14 Mar

8:50 a.m.

On Mon, 2017-03-13 at 14:24 -0400, Luiz Capitulino wrote:

...

> NB if we did enforce $RAM + $LARGE_NUMBER, then I'd suggest we did > set a default hard_limit universally once more, not only set a mlock > limit when using <locked/>. This would at least ensure we see consistent > (bad) behaviour rather than have edge cases that only appeared when > <locked/> was present.

Setting <hard_limit> doesn't limit just the amount of memory the QEMU process is allowed to lock, but also the amount of memory it's allowed to allocate at all (through cgroups). Is it fair to assume that QEMU will never *allocate* more than 2 GiB (or whatever $LARGE_NUMBER we end up picking) in addition to what's needed to fit the guest memory? I would certainly hope so, but maybe there are use cases that require more than that. Were you thinking about adding <hard_limit> automatically to all guests where it's not already present, or just setting the limits silently? The latter sounds like it would be very opaque to users, so I'm guessing the former. If we added <hard_limit> to the XML automatically, we would have the problem of keeping it updated with changes to the amount of guest memory... Or maybe we could expect users to remove the <hard_limit> element, to get it regenerated, every time they change the amount of guest memory? One nice side-effect of doing this unconditionally is that we could get rid of some of the special-casing we are doing for VFIO-using guest, especially saving and restoring the memory locking limit when host devices are assigned and unassigned.

...

Makes me even more nervous, but I agree with your reasoning. Btw, do we have a volunteer to do this work? Andrea?

Since I've already spent a significant amount of time researching the issue, and wrote the patch that caused it to rear its ugly head in the first place, I guess it makes sense for me to be volunteered to fix it ;) -- Andrea Bolognani / Red Hat / Virtualization

Peter Krempa

10:55 a.m.

On Mon, Mar 13, 2017 at 18:16:49 +0000, Daniel Berrange wrote:

...

We were already attempting to do this and we reverted it since there's no deterministic $LARGE_NUMBER which would work for all cases. I can imagine this working if $LARGE_NUMBER is infinity or host memory+swap size. commit 16bcb3b61675a88bff00317336b9610080c31000 Author: Michal Privoznik <mprivozn(a)redhat.com> Date: Fri Aug 9 14:46:54 2013 +0200 qemu: Drop qemuDomainMemoryLimit This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again.

Andrea Bolognani

2:40 p.m.

On Tue, 2017-03-14 at 15:55 +0100, Peter Krempa wrote:

...

> NB if we did enforce $RAM + $LARGE_NUMBER, then I'd suggest we did > set a default hard_limit universally once more, not only set a mlock We were already attempting to do this and we reverted it since there's no deterministic $LARGE_NUMBER which would work for all cases. I can imagine this working if $LARGE_NUMBER is infinity or host memory+swap size. commit 16bcb3b61675a88bff00317336b9610080c31000 Author: Michal Privoznik <mprivozn(a)redhat.com> Date: Fri Aug 9 14:46:54 2013 +0200 qemu: Drop qemuDomainMemoryLimit This function is to guess the correct limit for maximal memory usage by qemu for given domain. This can never be guessed correctly, not to mention all the pains and sleepless nights this code has caused. Once somebody discovers algorithm to solve the Halting Problem, we can compute the limit algorithmically. But till then, this code should never see the light of the release again.

Right. Even the current accidental limit, guest memory + 1 GiB, doesn't give any real guarantee about the guests working in the future: a simple QEMU upgrade could be enough to push memory usage past the current allowance and cause them to crash. Peter mentioned that just adding (a lot) more disks could actually be enough. Even for the host admin, setting the memory limits appropriately will always be a balancing act between making sure the host can swap out guests when pressured for memory and making sure the guests themselves can allocate the memory they need. Different use cases will call for different balances, and since there is no one-size-fits-all solution we shouldn't pretend that this complexity doesn't exist and hide it from the administrator. It's unfortunate that the current, buggy behavior made it look like you didn't necessarily have to worry about this. If we fix it, existing guests will fail to start right away instead of possibly crashing in the future: while that's going to be very annoying in the short run, it's arguably better than illuding people their guests will be good in the long run while in reality we can't provide such guarantee. Luiz mentioned the fact that you can't set the memory locking limit to "unlimited" with the current <hard_limit> element: that's something we can, and should, address. With that implemented, the administrator will have full control on the memory limit and will be able to implement the policy that best suits the use case at hand. -- Andrea Bolognani / Red Hat / Virtualization

Luiz Capitulino

2:54 p.m.

On Tue, 14 Mar 2017 19:40:38 +0100 Andrea Bolognani <abologna(a)redhat.com> wrote:

...

It's unfortunate that the current, buggy behavior made it look like you didn't necessarily have to worry about this. If we fix it, existing guests will fail to start right away instead of possibly crashing in the future: while that's going to be very annoying in the short run,

It breaks existing guests, so it's beyond annoying.

...

it's arguably better than illuding people their guests will be good in the long run while in reality we can't provide such guarantee. Luiz mentioned the fact that you can't set the memory locking limit to "unlimited" with the current <hard_limit> element: that's something we can, and should, address. With that implemented, the administrator will have full control on the memory limit and will be able to implement the policy that best suits the use case at hand.

Asking <locked/> users to set <hard_limit> to "unlimited" is a much worse solution than automatically setting the memory lock limit to infinity in libvirt, for the reasons I outlined in my first email. PS: Still, we should have "unlimited" support for <hard_limit>

Andrea Bolognani

3:28 p.m.

On Tue, 2017-03-14 at 14:54 -0400, Luiz Capitulino wrote:

...

> It's unfortunate that the current, buggy behavior made > it look like you didn't necessarily have to worry about > this. If we fix it, existing guests will fail to start > right away instead of possibly crashing in the future: > while that's going to be very annoying in the short run, It breaks existing guests, so it's beyond annoying.

Existing guests are already broken, it's just that the breakage has not hit them yet ;)

...

> Luiz mentioned the fact that you can't set the memory > locking limit to "unlimited" with the current <hard_limit> > element: that's something we can, and should, address. > With that implemented, the administrator will have full > control on the memory limit and will be able to implement > the policy that best suits the use case at hand. Asking <locked/> users to set <hard_limit> to "unlimited" is a much worse solution than automatically setting the memory lock limit to infinity in libvirt, for the reasons I outlined in my first email.

Removing all memory locking limits should be something that admins very carefully opt-in into, because of the potential host DoS consequences. Certainly not the default. That's the same with /etc/security/limits.conf, where the default memory locking limit is extremely low (64 KiB) and the admin can decide to raise it or even remove it entirely if needed.

...

PS: Still, we should have "unlimited" support for <hard_limit>

Definitely. -- Andrea Bolognani / Red Hat / Virtualization

Luiz Capitulino

3:54 p.m.

On Tue, 14 Mar 2017 20:28:12 +0100 Andrea Bolognani <abologna(a)redhat.com> wrote:

...

On Tue, 2017-03-14 at 14:54 -0400, Luiz Capitulino wrote: > > It's unfortunate that the current, buggy behavior made > > it look like you didn't necessarily have to worry about > > this. If we fix it, existing guests will fail to start > > right away instead of possibly crashing in the future: > > while that's going to be very annoying in the short run, > > It breaks existing guests, so it's beyond annoying. Existing guests are already broken, it's just that the breakage has not hit them yet ;)

We should prevent that from happening.

...

> > Luiz mentioned the fact that you can't set the memory > > locking limit to "unlimited" with the current <hard_limit> > > element: that's something we can, and should, address. > > With that implemented, the administrator will have full > > control on the memory limit and will be able to implement > > the policy that best suits the use case at hand. > > Asking <locked/> users to set <hard_limit> to "unlimited" > is a much worse solution than automatically setting the > memory lock limit to infinity in libvirt, for the reasons > I outlined in my first email. Removing all memory locking limits should be something that admins very carefully opt-in into, because of the potential host DoS consequences. Certainly not the default.

There's no opt-in with <locked/>, it is mandatory to increase the mlock limit. Asking users to do this themselves is only adding an extra step that's causing breakage right now.

...

That's the same with /etc/security/limits.conf, where the default memory locking limit is extremely low (64 KiB) and the admin can decide to raise it or even remove it entirely if needed.

But that's a bad example, we have to help our users not contribute to make their life miserable. Users want to use <locked/> without having to guess limits that we can't figure out ourselves.

Jiri Denemark

Wednesday, 15 March Wed, 15 Mar

3:59 a.m.

On Tue, Mar 14, 2017 at 15:54:25 -0400, Luiz Capitulino wrote:

...

On Tue, 14 Mar 2017 20:28:12 +0100 Andrea Bolognani <abologna(a)redhat.com> wrote: > On Tue, 2017-03-14 at 14:54 -0400, Luiz Capitulino wrote: > > > It's unfortunate that the current, buggy behavior made > > > it look like you didn't necessarily have to worry about > > > this. If we fix it, existing guests will fail to start > > > right away instead of possibly crashing in the future: > > > while that's going to be very annoying in the short run, > > > > It breaks existing guests, so it's beyond annoying.

Because they were relying on a bug in our code rather than following the documentation.

...

> Existing guests are already broken, it's just that the > breakage has not hit them yet ;) We should prevent that from happening.

The only way we could achieve this is to set the memory locking limit to "unlimited" by default, computing the limit is impossible unless the result will be bigger than host's memory which is effectively equivalent to "unlimited".

...

> > > Luiz mentioned the fact that you can't set the memory > > > locking limit to "unlimited" with the current <hard_limit> > > > element: that's something we can, and should, address. > > > With that implemented, the administrator will have full > > > control on the memory limit and will be able to implement > > > the policy that best suits the use case at hand. > > > > Asking <locked/> users to set <hard_limit> to "unlimited" > > is a much worse solution than automatically setting the > > memory lock limit to infinity in libvirt, for the reasons > > I outlined in my first email.

Memory locking limit and memory hard limit are separate things even though they can be set by a single value in domain XML. Hard limit is by default unlimited and a domain will just be killed if it consumes too much memory. However setting memory locking limit to unlimited may cause an OOM condition inside the host kernel, which is why it is not and shouldn't be the default value. However, ...

...

> Removing all memory locking limits should be something that > admins very carefully opt-in into, because of the potential > host DoS consequences. Certainly not the default. There's no opt-in with <locked/>, it is mandatory to increase the mlock limit. Asking users to do this themselves is only adding an extra step that's causing breakage right now.

... we could consider <locked/> to be the explicit request for setting an infinite memory locking limit and letting users set a lower limit with hard_limit if they want. There are several other cases in which memory is locked implicitly and we should definitely not set the unlimited default for them. Jirka

Andrea Bolognani

5:11 a.m.

On Wed, 2017-03-15 at 08:59 +0100, Jiri Denemark wrote:

...

> > Removing all memory locking limits should be something that > > admins very carefully opt-in into, because of the potential > > host DoS consequences. Certainly not the default. > > There's no opt-in with <locked/>, it is mandatory to increase > the mlock limit. Asking users to do this themselves is only > adding an extra step that's causing breakage right now. ... we could consider <locked/> to be the explicit request for setting an infinite memory locking limit and letting users set a lower limit with hard_limit if they want. There are several other cases in which memory is locked implicitly and we should definitely not set the unlimited default for them.

That would still be suboptimal because the risk involved in allowing QEMU to allocate unlimited amounts of locked memory might not be immediately apparent to the user, but at least we would have an explicit opt-in (the presence of the <locked/> element) and we would not lose the ability to set the limit explicitly to a lower value, so it sounds like a decent enough balance. Anyone opposed to implementing it this way? PS: Luiz, please check your MUA. It's messing with the recipients in a way that makes me surprised messages managed to get to the list at all. -- Andrea Bolognani / Red Hat / Virtualization

Luiz Capitulino

Wednesday, 22 March Wed, 22 Mar

8:54 a.m.

On Wed, 15 Mar 2017 10:11:50 +0100 Andrea Bolognani <abologna(a)redhat.com> wrote:

...

On Wed, 2017-03-15 at 08:59 +0100, Jiri Denemark wrote: > > > Removing all memory locking limits should be something that > > > admins very carefully opt-in into, because of the potential > > > host DoS consequences. Certainly not the default. > > > > There's no opt-in with <locked/>, it is mandatory to increase > > the mlock limit. Asking users to do this themselves is only > > adding an extra step that's causing breakage right now. > > ... we could consider <locked/> to be the explicit request for > setting an infinite memory locking limit and letting users set a lower > limit with hard_limit if they want. > > There are several other cases in which memory is locked implicitly and > we should definitely not set the unlimited default for them. That would still be suboptimal because the risk involved in allowing QEMU to allocate unlimited amounts of locked memory might not be immediately apparent to the user, but at least we would have an explicit opt-in (the presence of the <locked/> element) and we would not lose the ability to set the limit explicitly to a lower value, so it sounds like a decent enough balance. Anyone opposed to implementing it this way?

Ping?

Luiz Capitulino

Wednesday, 15 March Wed, 15 Mar

2:29 p.m.

On Wed, 15 Mar 2017 08:59:34 +0100 Jiri Denemark <jdenemar(a)redhat.com> wrote:

...

On Tue, Mar 14, 2017 at 15:54:25 -0400, Luiz Capitulino wrote: > On Tue, 14 Mar 2017 20:28:12 +0100 > Andrea Bolognani <abologna(a)redhat.com> wrote: > > > On Tue, 2017-03-14 at 14:54 -0400, Luiz Capitulino wrote: > > > > It's unfortunate that the current, buggy behavior made > > > > it look like you didn't necessarily have to worry about > > > > this. If we fix it, existing guests will fail to start > > > > right away instead of possibly crashing in the future: > > > > while that's going to be very annoying in the short run, > > > > > > It breaks existing guests, so it's beyond annoying. Because they were relying on a bug in our code rather than following the documentation.

The documentation was strictly followed. The documentation strongly advises against using <hard_limit>. We did that, everything worked as expected. I don't think it's a bug that things just worked. That's how it's supposed to be and that's how it's for VFIO support. The bug is that it's not working now.

...

> > Existing guests are already broken, it's just that the > > breakage has not hit them yet ;) > > We should prevent that from happening. The only way we could achieve this is to set the memory locking limit to "unlimited" by default, computing the limit is impossible unless the result will be bigger than host's memory which is effectively equivalent to "unlimited".

I agree. I also suggested setting mlock limit to infinity when libvirt sees <locked/> in the XML.

...

> > > > Luiz mentioned the fact that you can't set the memory > > > > locking limit to "unlimited" with the current <hard_limit> > > > > element: that's something we can, and should, address. > > > > With that implemented, the administrator will have full > > > > control on the memory limit and will be able to implement > > > > the policy that best suits the use case at hand. > > > > > > Asking <locked/> users to set <hard_limit> to "unlimited" > > > is a much worse solution than automatically setting the > > > memory lock limit to infinity in libvirt, for the reasons > > > I outlined in my first email. Memory locking limit and memory hard limit are separate things even though they can be set by a single value in domain XML. Hard limit is by default unlimited and a domain will just be killed if it consumes too much memory. However setting memory locking limit to unlimited may cause an OOM condition inside the host kernel, which is why it is not and shouldn't be the default value. However, ...

Well, I was wondering if OOM killer (and Linux's memory reclaim logic) would work the same way (except for swapping pages, of course) for processes with infinity mlock limit. If the answer is yes, then a malicious guest trying a DDoS attack would only be partially successful.

...

That's exactly how I see it! It seems we're total agreement. Now, something has just occurred to me: shouldn't VFIO have the same problem? It's the same hard limit that's set.

Luiz Capitulino

Thursday, 16 March Thu, 16 Mar

5:20 p.m.

On Wed, 15 Mar 2017 14:29:13 -0400 Luiz Capitulino <lcapitulino(a)redhat.com> wrote:

...

> ... we could consider <locked/> to be the explicit request for > setting an infinite memory locking limit and letting users set a lower > limit with hard_limit if they want. That's exactly how I see it! It seems we're total agreement. Now, something has just occurred to me: shouldn't VFIO have the same problem? It's the same hard limit that's set.

I took a look at this today. While it's the same mlock limit that's set and while QEMU's allocations can surpass that limit, I didn't get a crash when using VFIO. The most probable obvious reason for this is that VFIO is probably mlocking a small region, although I could not find where this is done in QEMU. In that case, VFIO is not affected. This issue is specific to <locked/>, where the 1GB limit set by libvirt conflicts with QEMU memory needs.

2999

days inactive

3008

days old

devel@lists.libvirt.org

Manage subscription

18 comments

5 participants

tags (0)

participants (5)

Andrea Bolognani
Daniel P. Berrange
Jiri Denemark
Luiz Capitulino
Peter Krempa

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

[libvirt] [BUG] mlock support breakage