[libvirt-users] RLIMIT_MEMLOCK in container environment

Hi all,

KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.

One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust the RLIMIT_MEMLOCK ulimit value depending on the devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.

In the KubeVirt world, several of libvirtd's assumptions do not apply:

1. In Kubernetes environments, swap is usually disabled. (The kubeadm official deployment tool, for example, won't even initialize a cluster until you disable it.) This is documented in lots of places, e.g.: https://docs.platform9.com/kubernetes/disabling-swap-kubernetes-node/ (note: while these are vendor docs, it's nevertheless a well-known community recommendation).

2. Hotplug is not supported. The domain definition is stable through its whole lifetime.

We are working on a series of patches that would remove the need for the SYS_RESOURCE capability from the pod running libvirtd: https://github.com/kubevirt/kubevirt/pull/2584

We achieve this by making another, *privileged*, component set RLIMIT_MEMLOCK for the libvirtd process using the prlimit() syscall, with a value that is higher than the final value libvirtd uses with setrlimit() (the Linux kernel allows lowering the value without the capability). Since the formula to calculate the actual MEMLOCK value is embedded in libvirt and is not simple to reproduce outside, we pick the upper limit value set for the libvirtd process quite conservatively, even though ideally we would use the exact same value libvirtd does. The estimation code is here: https://github.com/kubevirt/kubevirt/pull/2584/files#diff-6edccf5f0d11c09e70...

While the solution works, there are some drawbacks:

1. the value we use for prlimit() is not exactly equal to the final value used by libvirtd;

2. we are doing all this work in an environment that is not prone to swap-related issues in the first place, because swap space is disabled.

I believe we would benefit from one of the following features on the libvirt side (or both):

a) expose the memory lock value calculated by libvirtd through the libvirt ABI so that we can use it when calling prlimit() on the libvirtd process;

b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.

Do you think it would be acceptable to have one of these enhancements in libvirtd, or perhaps both, for degenerate cases like KubeVirt?

Thanks for your attention,
Ihar
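(For illustration, a minimal sketch of the privileged-component side of this approach, assuming it already knows the libvirtd PID; the PID and limit values are made up, and this is not the actual KubeVirt code:)

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// raiseMemlock bumps RLIMIT_MEMLOCK on the (already running) libvirtd
// process via the prlimit(2) syscall. The caller needs CAP_SYS_RESOURCE;
// libvirtd itself can later *lower* the limit with setrlimit() without
// any capability.
func raiseMemlock(libvirtdPid int, limitBytes uint64) error {
	newLimit := unix.Rlimit{Cur: limitBytes, Max: limitBytes}
	var oldLimit unix.Rlimit
	return unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, &newLimit, &oldLimit)
}

func main() {
	const libvirtdPid = 4242   // hypothetical PID of the libvirtd process
	const limitBytes = 3 << 30 // conservative upper bound, illustrative only
	if err := raiseMemlock(libvirtdPid, limitBytes); err != nil {
		log.Fatalf("prlimit(RLIMIT_MEMLOCK) on libvirtd failed: %v", err)
	}
}
```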

On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of:

- hard limit memory value is present
- host PCI device passthrough is requested
- memory is locked into RAM

Which of these are you actually using?

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
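(For reference, minimal domain XML fragments corresponding to each of the three conditions above; the PCI address is made up:)

```xml
<!-- hard limit memory value is present -->
<memtune>
  <hard_limit unit='KiB'>4194304</hard_limit>
</memtune>

<!-- host PCI device passthrough (VFIO device assignment) is requested -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </source>
</hostdev>

<!-- memory is locked into RAM -->
<memoryBacking>
  <locked/>
</memoryBacking>
```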

On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough to pass SR-IOV NIC VFs into guests. We also plan to do the same for GPUs in the near future.
- memory is locked into RAM
which of these are you actually using ?

On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
I recall successfully testing GPU assignment from an unprivileged libvirtd several years ago by setting a high enough ulimit for the uid used to run libvirtd in advance. I think we check if the current setting is high enough, and don't try to set it unless we think we need to.
If I understand you correctly, you're saying that in your case it's okay for the memlock limit to be lower than we try to set it to, because swap is disabled anyway, is that correct?
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough
(If you want to make Alex happy, use the term "VFIO device assignment" rather than passthrough :-).)
to pass SR-IOV NIC VFs into guests. We also plan to do the same for GPUs in the near future.
I believe we would benefit from one of the following features on libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through the libvirt ABI so that we can use it when calling prlimit() on the libvirtd process;
b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.
(b) sounds much more reasonable, as long as qemu doesn't complain (I don't know whether or not it checks).

Slightly related to this - I'm currently working on patches to avoid making any ioctl calls that would fail in an unprivileged libvirtd when using tap/macvtap devices. ATM, I'm doing this by adding an attribute "unmanaged='yes'" to the interface <target> element. The idea is that if someone sets unmanaged='yes', they're stating that the caller (i.e. kubevirt) is responsible for all device setup, and that libvirt should just use it without further setup. A similar approach could be applied to hostdev devices - if unmanaged is set, we assume that the caller has done everything to make the associated device usable.

(Of course this all makes me realize the inanity of adding a <target dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So to prevent setting the locklimit for hostdev, would we make a new setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I *hate* trying to make config consistent :-/)

(alternately, we could just automatically fail the attempt to set the lock limit in a graceful manner and allow the guest to continue)

BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather than <interface type='hostdev'>, correct? The latter would require that you have enough capabilities to set MAC addresses on the VFs (that's the entire point of using <interface type='hostdev'> instead of plain <hostdev>)
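(A sketch of what the proposed <target dev='...' unmanaged='yes'/> XML could look like, going by Laine's description above; the attribute spelling was still under discussion at this point, and the tap device name is made up:)

```xml
<interface type='ethernet'>
  <!-- 'mytap0' was created and fully configured by the caller (e.g.
       kubevirt); unmanaged='yes' would tell libvirt to use it as-is,
       with no further ioctl-based setup of its own. -->
  <target dev='mytap0' unmanaged='yes'/>
</interface>
```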

On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine@redhat.com> wrote:
On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
I recall successfully testing GPU assignment from an unprivileged libvirtd several years ago by setting a high enough ulimit for the uid used to run libvirtd in advance. I think we check if the current setting is high enough, and don't try to set it unless we think we need to.
The PR I linked to in the original email does just that: it starts libvirtd; then, if the domain is going to use VFIO, sets the ulimit of the libvirtd process to VM memory size + 1Gb (mimicking libvirt code) + 256Mb (to stay conservative) using the prlimit() syscall; then defines the domain.
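(A sketch of that estimation; the constants mirror the description above — VM memory plus libvirt's 1 GiB plus a 256 MiB safety margin — but the helper name and arithmetic are illustrative, not the actual PR code:)

```go
package main

import "fmt"

// memlockUpperBound mirrors the estimation described above: the VM
// memory size, plus the 1 GiB libvirt itself adds for VFIO domains,
// plus 256 MiB of extra headroom to stay conservative. Illustrative
// sketch only — see the linked PR for the real estimation code.
func memlockUpperBound(vmMemoryBytes uint64) uint64 {
	const gib = uint64(1) << 30
	const mib = uint64(1) << 20
	return vmMemoryBytes + 1*gib + 256*mib
}

func main() {
	// e.g. a 4 GiB guest:
	fmt.Println(memlockUpperBound(4 << 30))
}
```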
If I understand you correctly, you're saying that in your case it's okay for the memlock limit to be lower than we try to set it to, because swap is disabled anyway, is that correct?
I'm honestly not exactly sure about the reason why we need to set the limit, but I assume it's because of swap. I can be totally confused on that part though.
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough
(If you want to make Alex happy, use the term "VFIO device assignment" rather than passthrough :-).)
Not sure who Alex is but I'll try to make everyone happy! :)
to pass SR-IOV NIC VFs into guests. We also plan to do the same for GPUs in the near future.
I believe we would benefit from one of the following features on libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through the libvirt ABI so that we can use it when calling prlimit() on the libvirtd process;
b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.
(b) sounds much more reasonable, as long as qemu doesn't complain (I don't know whether or not it checks)
Slightly related to this - I'm currently working on patches to avoid making any ioctl calls that would fail in an unprivileged libvirtd when using tap/macvtap devices. ATM, I'm doing this by adding an attribute "unmanaged='yes'" to the interface <target> element. The idea is that if someone sets unmanaged='yes', they're stating that the caller (i.e. kubevirt) is responsible for all device setup, and that libvirt should just use it without further setup. A similar approach could be applied to hostdev devices - if unmanaged is set, we assume that the caller has done everything to make the associated device usable.
(Of course this all makes me realize the inanity of adding a <target dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So to prevent setting the locklimit for hostdev, would we make a new setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I *hate* trying to make config consistent :-/)
(alternately, we could just automatically fail the attempt to set the lock limit in a graceful manner and allow the guest to continue)
If that's something maintainers feel good about, I am all for it since it simplifies the implementation.
BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather than <interface type='hostdev'>, correct? The latter would require that you have enough capabilities to set MAC addresses on the VFs (that's the entire point of using <interface type='hostdev'> instead of plain <hostdev>)
Yes, we use <hostdev> exactly because <interface> sets the MAC address: in the kubevirt scenario, the container that is running libvirtd has its own network namespace and doesn't have access to the PF to set the VF MAC address on. Instead, we rely on a CNI plugin that is running in the root namespace context to configure the VF interface as needed. (I've contributed custom MAC support to the SR-IOV CNI plugin very recently.)
Ihar
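(For illustration, the two variants being contrasted here — a plain <hostdev>, which KubeVirt uses, versus <interface type='hostdev'>, which additionally asks libvirt to program the VF MAC through the PF; the PCI address and MAC are made up:)

```xml
<!-- plain <hostdev>: libvirt does no network-side configuration -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x05' slot='0x10' function='0x2'/>
  </source>
</hostdev>

<!-- <interface type='hostdev'>: libvirt also sets the VF MAC, which
     requires access to the PF in the host network namespace -->
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x2'/>
  </source>
  <mac address='52:54:00:6d:90:02'/>
</interface>
```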

(Adding Alex Williamson to Cc so he can correct any mistakes)
On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine@redhat.com> wrote:
On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
I recall successfully testing GPU assignment from an unprivileged libvirtd several years ago by setting a high enough ulimit for the uid used to run libvirtd in advance. I think we check if the current setting is high enough, and don't try to set it unless we think we need to.
The PR I linked to in the original email does just that: it starts libvirtd; then, if the domain is going to use VFIO, sets the ulimit of the libvirtd process to VM memory size + 1Gb (mimicking libvirt code) + 256Mb (to stay conservative) using the prlimit() syscall; then defines the domain.
So you're making an educated guess, which is essentially what libvirt is doing (based on advice from other people with better information than us, but still a guess).
If I understand you correctly, you're saying that in your case it's okay for the memlock limit to be lower than we try to set it to, because swap is disabled anyway, is that correct?
I'm honestly not exactly sure about the reason why we need to set the limit, but I assume it's because of swap. I can be totally confused on that part though.
What I understand from an IRC conversation with Alex just now is that increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages being swapped out. It's done because "all GPAs (Guest Physical Addresses) that could potentially be DMA targets need to have fixed mappings through the iommu, therefore all need to be allocated and mappings fixed [...] setting rlimit allows us to perform all the necessary pins within the user's locked memory limit". So even if swap is disabled, it still needs to be done (either by libvirt, or by someone else who has the necessary privileges and control over the libvirtd process).
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough
(If you want to make Alex happy, use the term "VFIO device assignment" rather than passthrough :-).)
Not sure who Alex is but I'll try to make everyone happy! :)
The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO maintainer.
to pass SR-IOV NIC VFs into guests. We also plan to do the same for GPUs in the near future.
I believe we would benefit from one of the following features on libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through the libvirt ABI so that we can use it when calling prlimit() on the libvirtd process;
b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.
(b) sounds much more reasonable, as long as qemu doesn't complain (I don't know whether or not it checks)
Slightly related to this - I'm currently working on patches to avoid making any ioctl calls that would fail in an unprivileged libvirtd when using tap/macvtap devices. ATM, I'm doing this by adding an attribute "unmanaged='yes'" to the interface <target> element. The idea is that if someone sets unmanaged='yes', they're stating that the caller (i.e. kubevirt) is responsible for all device setup, and that libvirt should just use it without further setup. A similar approach could be applied to hostdev devices - if unmanaged is set, we assume that the caller has done everything to make the associated device usable.
(Of course this all makes me realize the inanity of adding a <target dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So to prevent setting the locklimit for hostdev, would we make a new setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I *hate* trying to make config consistent :-/)
(alternately, we could just automatically fail the attempt to set the lock limit in a graceful manner and allow the guest to continue)
If that's something maintainers feel good about, I am all for it since it simplifies the implementation.
Well, after talking to Alex, I think that since a) libvirt only attempts to increase the limit after determining that it isn't already high enough, and b) if it isn't high enough and we can't increase it, then qemu is going to fail anyway, that c) we can't just fail gracefully and continue.

So *somebody* needs to increase the limit, and if you want libvirt to be unprivileged, that means it needs to be you doing the increase. And since the amount that libvirt increases it is just some number based on oral folklore (and not on a specific value we learn by querying somewhere), I don't think it's worthwhile figuring out some way for libvirt to report it via an official API - that would end up just being this:

"Hey, you know that number that you guys are just making a guess about based on some advice someone gave you once? Yeah, send me *that* number so I can claim to be basing my actions on real science instead of slightly educated voodoo! K THX BYE!" :-)
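(A sketch of the check-then-raise logic point (a) describes, as it would look inside the process doing the locking; illustrative, not libvirt's actual code:)

```go
package memlock

import "golang.org/x/sys/unix"

// ensureMemlock mirrors the behavior described in (a): check the current
// RLIMIT_MEMLOCK first, and only try to raise it when it's too low.
// Raising it requires CAP_SYS_RESOURCE, which is exactly what an
// unprivileged libvirtd lacks.
func ensureMemlock(needBytes uint64) error {
	var cur unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_MEMLOCK, &cur); err != nil {
		return err
	}
	if cur.Cur >= needBytes {
		return nil // (a): already high enough, nothing to change
	}
	newLimit := unix.Rlimit{Cur: needBytes, Max: needBytes}
	// (b): if this fails in an unprivileged process, qemu's own memory
	// pinning will fail later anyway, so there is no graceful fallback.
	return unix.Setrlimit(unix.RLIMIT_MEMLOCK, &newLimit)
}
```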
BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather than <interface type='hostdev'>, correct? The latter would require that you have enough capabilities to set MAC addresses on the VFs (that's the entire point of using <interface type='hostdev'> instead of plain <hostdev>)
Yes, we use <hostdev> exactly because <interface> sets the MAC address: in the kubevirt scenario, the container that is running libvirtd has its own network namespace and doesn't have access to the PF to set the VF MAC address on. Instead, we rely on a CNI plugin that is running in the root namespace context to configure the VF interface as needed. (I've contributed custom MAC support to the SR-IOV CNI plugin very recently.)
Ihar
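(To make that division of labor concrete: a sketch of the root-namespace side — a hypothetical helper in the spirit of the SR-IOV CNI plugin, using the vishvananda/netlink Go library — setting a VF MAC through the PF before the VF is handed to the guest. The PF name, VF index and MAC are made up:)

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// setVfMac programs the MAC of VF #vf on the given PF. This must run in
// the host (root) network namespace with access to the PF — which is
// exactly what the libvirtd pod lacks, hence delegating it to the CNI
// plugin. Names and values here are illustrative.
func setVfMac(pfName string, vf int, mac net.HardwareAddr) error {
	pf, err := netlink.LinkByName(pfName)
	if err != nil {
		return err
	}
	return netlink.LinkSetVfHardwareAddr(pf, vf, mac)
}

func main() {
	mac, _ := net.ParseMAC("52:54:00:6d:90:02") // made-up MAC
	if err := setVfMac("enp5s0f0", 3, mac); err != nil {
		log.Fatalf("setting VF MAC: %v", err)
	}
}
```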

On Fri, 23 Aug 2019, 0:27 Laine Stump, <laine@redhat.com> wrote:
(Adding Alex Williamson to Cc so he can correct any mistakes)
On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine@redhat.com> wrote:
On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
I recall successfully testing GPU assignment from an unprivileged libvirtd several years ago by setting a high enough ulimit for the uid used to run libvirtd in advance. I think we check if the current setting is high enough, and don't try to set it unless we think we need to.
The PR I linked to in the original email does just that: it starts libvirtd; then, if the domain is going to use VFIO, sets the ulimit of the libvirtd process to VM memory size + 1Gb (mimicking libvirt code) + 256Mb (to stay conservative) using the prlimit() syscall; then defines the domain.
So you're making an educated guess, which is essentially what libvirt is doing (based on advice from other people with better information than us, but still a guess).
If I understand you correctly, you're saying that in your case it's okay for the memlock limit to be lower than we try to set it to, because swap is disabled anyway, is that correct?
I'm honestly not exactly sure about the reason why we need to set the limit, but I assume it's because of swap. I can be totally confused on that part though.
What I understand from an IRC conversation with Alex just now is that increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages being swapped out. It's done because "all GPAs (Guest Physical Addresses) that could potentially be DMA targets need to have fixed mappings through the iommu, therefore all need to be allocated and mappings fixed [...] setting rlimit allows us to perform all the necessary pins within the user's locked memory limit".
So even if swap is disabled, it still needs to be done (either by libvirt, or by someone else who has the necessary privileges and control over the libvirtd process).
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough
(If you want to make Alex happy, use the term "VFIO device assignment" rather than passthrough :-).)
Not sure who Alex is but I'll try to make everyone happy! :)
The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO maintainer.
to pass SR-IOV NIC VFs into guests. We also plan to do the same for GPUs in the near future.
I believe we would benefit from one of the following features on libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through the libvirt ABI so that we can use it when calling prlimit() on the libvirtd process;
b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.
(b) sounds much more reasonable, as long as qemu doesn't complain (I don't know whether or not it checks)
Slightly related to this - I'm currently working on patches to avoid making any ioctl calls that would fail in an unprivileged libvirtd when using tap/macvtap devices.
This is music to my ears, great to hear.
ATM, I'm doing this by adding an attribute
"unmanaged='yes'" to the interface <target> element. The idea is that if someone sets unmanaged='yes', they're stating that the caller (i.e. kubevirt) is responsible for all device setup, and that libvirt should just use it without further setup. A similar approach could be applied to hostdev devices - if unmanaged is set, we assume that the caller has done everything to make the associated device usable.
(Of course this all makes me realize the inanity of adding a <target dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So to prevent setting the locklimit for hostdev, would we make a new setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I *hate* trying to make config consistent :-/)
Sounds tough indeed. I'd try to avoid negatively-named knobs. managed=no is simpler to perceive than unmanaged=yes. It may be just me, but I'd even assume managed=no if the target dev name is specified. If libvirt manages the tap device, it should create a fresh one, too. But all of this is a big digression.
(alternately, we could just automatically fail the attempt to set the lock limit in a graceful manner and allow the guest to continue)
If that's something maintainers feel good about, I am all for it since it simplifies the implementation.
Well, after talking to Alex, I think that since a) libvirt only attempts to increase the limit after determining that it isn't already high enough, and b) if it isn't high enough and we can't increase it, then qemu is going to fail anyway, that c) we can't just fail gracefully and continue.
So *somebody* needs to increase the limit, and if you want libvirt to be unprivileged, that means it needs to be you doing the increase. And since the amount that libvirt increases it is just some number based on oral folklore (and not on a specific value we learn by querying somewhere), I don't think it's worthwhile figuring out some way for libvirt to report it via an official API - that would end up just being this:
"Hey, you know that number that you guys are just making a guess about based on some advice someone gave you once? Yeah, send me *that* number so I can claim to be basing my actions on real science instead of slightly educated voodoo! K THX BYE!" :-)
Well, it's more like: "you know that voodoo you do to guess the number? If you ever educate yourself about it, e.g. by querying qemu, send me *that* number. I'd rather not think about it ever again, BYE."
BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather than <interface type='hostdev'>, correct? The latter would require that you have enough capabilities to set MAC addresses on the VFs (that's the entire point of using <interface type='hostdev'> instead of plain <hostdev>)
Yes, we use <hostdev> exactly because <interface> sets the MAC address: in the kubevirt scenario, the container that is running libvirtd has its own network namespace and doesn't have access to the PF to set the VF MAC address on. Instead, we rely on a CNI plugin that is running in the root namespace context to configure the VF interface as needed. (I've contributed custom MAC support to the SR-IOV CNI plugin very recently.)
Ihar

On 8/24/19 3:08 AM, Dan Kenigsberg wrote:
On Fri, 23 Aug 2019, 0:27 Laine Stump, <laine@redhat.com> wrote:
(Adding Alex Williamson to Cc so he can correct any mistakes)
On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine@redhat.com> wrote:
On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
Hi all,
KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes API resources. In this case, libvirtd is running inside an unprivileged pod, with some host mounts / capabilities added to the pod, needed by libvirtd and other services.
One of the capabilities libvirtd requires for successful startup inside a pod is SYS_RESOURCE. This capability is used to adjust RLIMIT_MEMLOCK ulimit value depending on devices attached to the managed guest, both on startup and during hotplug. AFAIU the need to lock the memory is to avoid pages being pushed out from RAM into swap.
I recall successfully testing GPU assignment from an unprivileged libvirtd several years ago by setting a high enough ulimit for the uid used to run libvirtd in advance. I think we check if the current setting is high enough, and don't try to set it unless we think we need to.
The PR I linked to in the original email does just that: it starts libvirtd; then, if the domain is going to use VFIO, sets the ulimit of the libvirtd process to VM memory size + 1Gb (mimicking libvirt code) + 256Mb (to stay conservative) using the prlimit() syscall; then defines the domain.
So you're making an educated guess, which is essentially what libvirt is doing (based on advice from other people with better information than us, but still a guess).
If I understand you correctly, you're saying that in your case it's okay for the memlock limit to be lower than we try to set it to, because swap is disabled anyway, is that correct?
I'm honestly not exactly sure about the reason why we need to set the limit, but I assume it's because of swap. I can be totally confused on that part though.
What I understand from an IRC conversation with Alex just now is that increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages being swapped out. It's done because "all GPAs (Guest Physical Addresses) that could potentially be DMA targets need to have fixed mappings through the iommu, therefore all need to be allocated and mappings fixed [...] setting rlimit allows us to perform all the necessary pins within the user's locked memory limit".
So even if swap is disabled, it still needs to be done (either by libvirt, or by someone else who has the necessary privileges and control over the libvirtd process).
Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's something in the XML that requires it - one of
You are right, sorry. We add SYS_RESOURCE only for particular domains.
- hard limit memory value is present
- host PCI device passthrough is requested
We are using passthrough
(If you want to make Alex happy, use the term "VFIO device assignment" rather than passthrough :-).)
Not sure who Alex is but I'll try to make everyone happy! :)
The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO maintainer.
to pass SR-IOV NIC VFs into guests. We also plan to do the same for GPUs in the near future.
I believe we would benefit from one of the following features on libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through the libvirt ABI so that we can use it when calling prlimit() on the libvirtd process;
b) allow disabling setrlimit() calls via a libvirtd config file knob or the domain definition.
(b) sounds much more reasonable, as long as qemu doesn't complain (I don't know whether or not it checks)
Slightly related to this - I'm currently working on patches to avoid making any ioctl calls that would fail in an unprivileged libvirtd when using tap/macvtap devices.
This is music to my ears, great to hear.
ATM, I'm doing this by adding an attribute "unmanaged='yes'" to the interface <target> element. The idea is that if someone sets unmanaged='yes', they're stating that the caller (i.e. kubevirt) is responsible for all device setup, and that libvirt should just use it without further setup. A similar approach could be applied to hostdev devices - if unmanaged is set, we assume that the caller has done everything to make the associated device usable.
(Of course this all makes me realize the inanity of adding a <target dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>. So to prevent setting the locklimit for hostdev, would we make a new setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I *hate* trying to make config consistent :-/)
Sounds tough indeed. I'd try to avoid negatively-named knobs. managed=no is simpler to perceive than unmanaged=yes.
Yeah, I don't like double negatives either, but since the default needs to preserve existing behavior, and it's easier for a default setting to be "no" rather than "yes"... Still, I'm not married to this name, just using it so that I can get *something* going.
It may be just me, but I'd even assume managed=no if the target dev name is specified. If libvirt manages the tap device, it should create a fresh one, too.
If we were starting from scratch, that's what I would prefer too. The only problem is that we have to maintain existing behavior for current users. The way that it works currently is that if you specify the tap device name and it exists, then libvirt will still set the MAC address and some IFF_* flags. If we suddenly stop doing that, then the existing users' configs will be broken :-/
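(Under the unmanaged proposal, that setup would become the caller's job. A rough caller-side equivalent using the vishvananda/netlink Go library — this approximates, but does not exactly reproduce, libvirt's ioctl-based handling, and the names are made up:)

```go
package tapsetup

import (
	"net"

	"github.com/vishvananda/netlink"
)

// prepareTap does, on the caller's side, what libvirt would otherwise do
// to a pre-created tap device: set its MAC address and bring it up
// (IFF_UP). Sketch only; the exact flags libvirt sets are not reproduced.
func prepareTap(tapName string, mac net.HardwareAddr) error {
	tap, err := netlink.LinkByName(tapName)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetHardwareAddr(tap, mac); err != nil {
		return err
	}
	return netlink.LinkSetUp(tap)
}
```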
But all of this is a big digression.
(alternately, we could just automatically fail the attempt to set the lock limit in a graceful manner and allow the guest to continue)
If that's something maintainers feel good about, I am all for it since it simplifies the implementation.
Well, after talking to Alex, I think that since a) libvirt only attempts to increase the limit after determining that it isn't already high enough, and b) if it isn't high enough and we can't increase it, then qemu is going to fail anyway, that c) we can't just fail gracefully and continue.
So *somebody* needs to increase the limit, and if you want libvirt to be unprivileged, that means it needs to be you doing the increase. And since the amount that libvirt increases it is just some number based on oral folklore (and not on a specific value we learn by querying somewhere), I don't think it's worthwhile figuring out some way for libvirt to report it via an official API - that would end up just being this:
"Hey, you know that number that you guys are just making a guess about based on some advice someone gave you once? Yeah, send me *that* number so I can claim to be basing my actions on real science instead of slightly educated voodoo! K THX BYE!" :-)
Well, it's more like: "you know that voodoo you do to guess the number? If you ever educate yourself about it, e.g. by querying qemu, send me *that* number. I'd rather not think about it ever again, BYE."
I can see the motivation. But that assumes that qemu knows the right answer, and has a way to query it. If they ever do that, then maybe we could think about supporting it, but until then reporting any value is tantamount to lying.
BTW, I'm guessing that you use <hostdev> to assign the SRIOV VFs rather than <interface type='hostdev'>, correct? The latter would require that you have enough capabilities to set MAC addresses on the VFs (that's the entire point of using <interface type='hostdev'> instead of plain <hostdev>)
Yes, we use <hostdev> exactly because <interface> sets the MAC address: in the kubevirt scenario, the container that is running libvirtd has its own network namespace and doesn't have access to the PF to set the VF MAC address on. Instead, we rely on a CNI plugin that is running in the root namespace context to configure the VF interface as needed. (I've contributed custom MAC support to the SR-IOV CNI plugin very recently.)
Ihar
Participants (4): Dan Kenigsberg, Daniel P. Berrangé, Ihar Hrachyshka, Laine Stump