Re: [libvirt-users] RLIMIT_MEMLOCK in container environment

22 Aug 2019

      (Adding Alex Williamson to Cc so he can correct any mistakes)

On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
...
On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine@redhat.com> wrote:
...
On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
...
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
...
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
...
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
API resources. In this case, libvirtd is running inside an
unprivileged pod, with some host mounts / capabilities added to the
pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup
inside a pod is SYS_RESOURCE. This capability is used to adjust
RLIMIT_MEMLOCK ulimit value depending on devices attached to the
managed guest, both on startup and during hotplug. AFAIU the need to
lock the memory is to avoid pages being pushed out from RAM into swap.
I recall successfully testing GPU assignment from an unprivileged
libvirtd several years ago by setting a high enough ulimit for the uid
used to run libvirtd in advance. I think we check if the current
setting is high enough, and don't try to set it unless we think we need to.
The PR I linked to in the original email does just that: it starts
libvirtd; then, if the domain is going to use VFIO, sets the memlock
ulimit of the libvirtd process to VM memory size + 1 GB (mimicking
libvirt code) + 256 MB (to stay conservative) using the prlimit()
syscall; then defines the domain.
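For illustration only, the calculation and the prlimit() call described
above could be sketched roughly as follows. This is not the code from the
PR; the function names, the binary (GiB/MiB) interpretation of the sizes,
and the way the guest memory size is obtained are all assumptions.

```python
import resource

GIB = 1024 ** 3
MIB = 1024 ** 2


def vfio_memlock_limit(guest_memory_bytes, headroom_bytes=256 * MIB):
    """Guess a memlock limit: guest memory + 1 GiB + extra headroom.

    Hypothetical helper. The 1 GiB term mimics what the thread says
    libvirt adds; the 256 MiB default is the conservative margin.
    """
    return guest_memory_bytes + 1 * GIB + headroom_bytes


def raise_memlock(pid, limit_bytes):
    """Set soft and hard RLIMIT_MEMLOCK on another process (Linux only).

    Raises PermissionError if the caller lacks the privileges
    (e.g. CAP_SYS_RESOURCE) needed to change that process's limits.
    Returns the previous (soft, hard) pair.
    """
    return resource.prlimit(
        pid, resource.RLIMIT_MEMLOCK, (limit_bytes, limit_bytes)
    )
```

A privileged helper would call something like
raise_memlock(libvirtd_pid, vfio_memlock_limit(guest_bytes)) before the
domain is defined, where libvirtd_pid and guest_bytes come from the caller.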
So you're making an educated guess, which is essentially what libvirt is 
doing (based on advice from other people with better information than 
us, but still a guess).
...
If I understand you correctly, you're saying that in your case it's okay
for the memlock limit to be lower than we try to set it to, because swap
is disabled anyway, is that correct?
I'm honestly not exactly sure about the reason why we need to set the
limit, but I assume it's because of swap. I can be totally confused on
that part though.
What I understand from an IRC conversation with Alex just now is that 
increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages 
being swapped out. It's done because "all GPAs (Guest Physical 
Addresses) that could potentially be DMA targets need to have fixed 
mappings through the iommu, therefore all need to be allocated and 
mappings fixed [...] setting rlimit allows us to perform all the 
necessary pins within the user's locked memory limit".

So even if swap is disabled, it still needs to be done (either by 
libvirt, or by someone else who has the necessary privileges and control 
over the libvirtd process).
...
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
...
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough
(If you want to make Alex happy, use the term "VFIO device assignment"
rather than passthrough :-).)
Not sure who Alex is but I'll try to make everyone happy! :)
The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO 
maintainer.
...
to pass SR-IOV NIC VFs into guests. We also
plan to do the same for GPUs in the near future.
...
I believe we would benefit from one of the following features on
libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through
libvirt ABI so that we can use it when calling prlimit() on libvirtd
process;
b) allow disabling setrlimit() calls via a libvirtd config file knob
or the domain definition.
(b) sounds much more reasonable, as long as qemu doesn't complain (I
don't know whether or not it checks)
Slightly related to this - I'm currently working on patches to avoid
making any ioctl calls that would fail in an unprivileged libvirtd when
using tap/macvtap devices. ATM, I'm doing this by adding an attribute
"unmanaged='yes'" to the interface <target> element. The idea is that if
someone sets unmanaged='yes', they're stating that the caller (i.e.
kubevirt) is responsible for all device setup, and that libvirt should
just use it without further setup. A similar approach could be applied
to hostdev devices - if unmanaged is set, we assume that the caller has
done everything to make the associated device usable.
(Of course this all makes me realize the inanity of adding a <target
dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have
<hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So
to prevent setting the locklimit for hostdev, would we make a new
setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I
*hate* trying to make config consistent :-/)
(alternately, we could just automatically fail the attempt to set the
lock limit in a graceful manner and allow the guest to continue)
If that's something maintainers feel good about, I am all for it since
it simplifies the implementation.
Well, after talking to Alex, I think that since a) libvirt only attempts
to increase the limit after determining that it isn't already high
enough, and b) if it isn't high enough and we can't increase it, then
qemu is going to fail anyway, it follows that c) we can't just fail
gracefully and continue.

So *somebody* needs to increase the limit, and if you want libvirt to be 
unprivileged, that means it needs to be you doing the increase. And 
since the amount that libvirt increases it is just some number based on 
oral folklore (and not on a specific value we learn by querying 
somewhere), I don't think it's worthwhile figuring out some way for 
libvirt to report it via an official API - that would end up just being 
this:

"Hey, you know that number that you guys are just making a guess about 
based on some advice someone gave you once? Yeah, send me *that* number 
so I can claim to be basing my actions on real science instead of 
slightly educated voodoo! K THX BYE!" :-)
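As a rough sketch of the "only raise it if it isn't already high enough"
check described above, an external manager could inspect the libvirtd
process before deciding whether to act. This is an assumption-laden
illustration, not libvirt's implementation; both function names are
hypothetical.

```python
import resource


def memlock_needs_raise(current_soft, required_bytes):
    """Return True if the current soft limit is below what is required.

    RLIM_INFINITY must be checked explicitly: Python exposes it as a
    sentinel value that would otherwise compare as "too low".
    """
    if current_soft == resource.RLIM_INFINITY:
        return False
    return current_soft < required_bytes


def current_memlock(pid):
    """Read the soft RLIMIT_MEMLOCK of another process (Linux only)."""
    soft, _hard = resource.prlimit(pid, resource.RLIMIT_MEMLOCK)
    return soft
```

If memlock_needs_raise(current_memlock(pid), required) is True and nobody
with sufficient privileges raises the limit, qemu is expected to fail, as
noted above.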
...
BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather
than <interface type='hostdev'>, correct? The latter would require that
you have enough capabilities to set MAC addresses on the VFs (that's the
entire point of using <interface type='hostdev'> instead of plain <hostdev>)
Yes, we use <hostdev> exactly because <interface type='hostdev'> sets
the MAC address: in the kubevirt scenario, the container that is running
libvirtd has its own network namespace and doesn't have access to the PF
to set the VF MAC address on. Instead, we rely on a CNI plugin that is
running in the root
namespace context to configure the VF interface as needed. (I've
contributed custom MAC support to SR-IOV CNI plugin very recently.)
Ihar