[libvirt] [PATCH] qemu: process: Fix automatic setting of locked memory limit for VFIO

For VFIO passthrough the guest memory including the device memory to be resident in memory. Previously we used a magic constant of 1GiB that we added to the current memory size of the VM to accomodate all this. It is possible though that the device that is passed through exposes more than that. To avoid guessing wrong again in the future set the memory lock limit to unlimited at the point when VFIO will be used.

This problem is similar to the issue where we tried to infer the hard limit for a VM according to it's configuration. This proved to be really problematic so we moved this burden to the user.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480

Additionally this patch also fixes a similar bug, where the mlock limit was not increased prior to a memory hotplug operation. https://bugzilla.redhat.com/show_bug.cgi?id=1273491
---
 src/qemu/qemu_command.c | 17 +++++------------
 src/qemu/qemu_hotplug.c | 20 +++++---------------
 src/util/virprocess.c   |  3 +++
 3 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 8824541..183aa85 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -11448,20 +11448,13 @@ qemuBuildCommandLine(virConnectPtr conn,
     }

     if (mlock) {
-        unsigned long long memKB;
-
-        /* VFIO requires all of the guest's memory to be
-         * locked resident, plus some amount for IO
-         * space. Alex Williamson suggested adding 1GiB for IO
-         * space just to be safe (some finer tuning might be
-         * nice, though).
-         */
+        /* VFIO requires all of the guest's memory and the IO space of the
+         * device to be locked resident. Since we can't really know what the
+         * users decide to plug in we need to set the limit to unlimited. */
         if (virMemoryLimitIsSet(def->mem.hard_limit))
-            memKB = def->mem.hard_limit;
+            virCommandSetMaxMemLock(cmd, def->mem.hard_limit << 10);
         else
-            memKB = virDomainDefGetMemoryActual(def) + 1024 * 1024;
-
-        virCommandSetMaxMemLock(cmd, memKB * 1024);
+            virCommandSetMaxMemLock(cmd, ULLONG_MAX);
     }

     if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MSG_TIMESTAMP) &&

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 8f2fda9..1cf61ac 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -1254,7 +1254,6 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
     bool teardowncgroup = false;
     bool teardownlabel = false;
     int backend;
-    unsigned long long memKB;
     virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
     unsigned int flags = 0;

@@ -1279,20 +1278,11 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
             goto error;
         }

-        /* VFIO requires all of the guest's memory to be locked
-         * resident (plus an additional 1GiB to cover IO space). During
-         * hotplug, the guest's memory may already be locked, but it
-         * doesn't hurt to "change" the limit to the same value.
-         * NB: the domain's memory tuning parameters are stored as
-         * Kibibytes, but virProcessSetMaxMemLock expects the value in
-         * bytes.
-         */
-        if (virMemoryLimitIsSet(vm->def->mem.hard_limit))
-            memKB = vm->def->mem.hard_limit;
-        else
-            memKB = virDomainDefGetMemoryActual(vm->def) + (1024 * 1024);
-
-        virProcessSetMaxMemLock(vm->pid, memKB * 1024);
+        /* VFIO requires all of the guest's memory and the IO space of the
+         * device to be locked resident. Since we can't really know what the
+         * users decide to plug in we need to set the limit to unlimited. */
+        if (!virMemoryLimitIsSet(vm->def->mem.hard_limit))
+            virProcessSetMaxMemLock(vm->pid, ULLONG_MAX);
         break;

     default:

diff --git a/src/util/virprocess.c b/src/util/virprocess.c
index 103c114..92195df 100644
--- a/src/util/virprocess.c
+++ b/src/util/virprocess.c
@@ -749,6 +749,9 @@ virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes)
     if (bytes == 0)
         return 0;

+    if (bytes == ULLONG_MAX)
+        bytes = RLIM_INFINITY;
+
     rlim.rlim_cur = rlim.rlim_max = bytes;
     if (pid == 0) {
         if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
-- 
2.6.2
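The virprocess.c hunk above is what makes the "unlimited" request expressible: callers pass ULLONG_MAX and the helper turns it into RLIM_INFINITY before calling setrlimit(). For illustration, a minimal self-contained C sketch of that same pattern, limited to the current process (the function name, main() and error handling are simplified stand-ins, not libvirt's actual virProcessSetMaxMemLock):

#include <sys/resource.h>
#include <limits.h>
#include <stdio.h>

/* Sketch of the limit-setting convention used by the patch: 0 means
 * "leave the limit alone", ULLONG_MAX means "unlimited".  Raising
 * rlim_max typically requires CAP_SYS_RESOURCE. */
static int
set_memlock_limit_self(unsigned long long bytes)
{
    struct rlimit rlim;

    if (bytes == 0)
        return 0;

    rlim.rlim_cur = rlim.rlim_max =
        (bytes == ULLONG_MAX) ? RLIM_INFINITY : (rlim_t) bytes;

    if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
        perror("setrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    return 0;
}

int
main(void)
{
    /* Request an unlimited RLIMIT_MEMLOCK, as the patch does for QEMU
     * when a VFIO device is present and no <hard_limit> is configured. */
    return set_memlock_limit_self(ULLONG_MAX) == 0 ? 0 : 1;
}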

On Wed, 2015-11-04 at 16:14 +0100, Peter Krempa wrote:
For VFIO passthrough the guest memory including the device memory to be resident in memory. Previously we used a magic constant of 1GiB that we added to the current memory size of the VM to accomodate all this. It is possible though that the device that is passed through exposes more than that. To avoid guessing wrong again in the future set the memory lock limit to unlimited at the point when VFIO will be used.
This problem is similar to the issue where we tried to infer the hard limit for a VM according to it's configuration. This proved to be really problematic so we moved this burden to the user.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
Additionally this patch also fixes a similar bug, where the mlock limit was not increased prior to a memory hotplug operation. https://bugzilla.redhat.com/show_bug.cgi?id=1273491 ---
IOW, tracking how much memory a VM should be able to lock is too hard, let's circumvent the security that the kernel is trying to add here and let assigned device VMs again lock as much memory as they want. This may solve some bugs, but it does so by ignoring the security limits we're trying to impose. Thanks, Alex

On Wed, Nov 04, 2015 at 08:43:34 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:14 +0100, Peter Krempa wrote:
For VFIO passthrough the guest memory including the device memory to be resident in memory. Previously we used a magic constant of 1GiB that we added to the current memory size of the VM to accomodate all this. It is possible though that the device that is passed through exposes more than that. To avoid guessing wrong again in the future set the memory lock limit to unlimited at the point when VFIO will be used.
This problem is similar to the issue where we tried to infer the hard limit for a VM according to it's configuration. This proved to be really problematic so we moved this burden to the user.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
Additionally this patch also fixes a similar bug, where the mlock limit was not increased prior to a memory hotplug operation. https://bugzilla.redhat.com/show_bug.cgi?id=1273491 ---
IOW, tracking how much memory a VM should be able to lock is too hard, let's circumvent the security that the kernel is trying to add here and let assigned device VMs again lock as much memory as they want. This may solve some bugs, but it does so by ignoring the security limits we're trying to impose. Thanks,
Well, the default here is 64KiB, which obviously is not enough in this case and we are trying to set something reasonable so that it will not break. Another option would be to force the users to specify <hard_limit> so that they are held responsible for the value.

So the actual question here is: Is there a 100% working algorithm/formula that will allow us to calculate the locked memory size for any point/configuration? I'll happily do something like that.

We tried to do a similar thing to automatically infer the hard limit size. There were many bugs resulting from this since we weren't able to accurately guess qemu's memory usage.

Additionally if users wish to impose a limit on this they still might want to use the <hard_limit> setting.

Peter

On 11/04/2015 10:54 AM, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 08:43:34 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:14 +0100, Peter Krempa wrote:
For VFIO passthrough the guest memory including the device memory to be resident in memory. Previously we used a magic constant of 1GiB that we added to the current memory size of the VM to accomodate all this. It is possible though that the device that is passed through exposes more than that. To avoid guessing wrong again in the future set the memory lock limit to unlimited at the point when VFIO will be used.
This problem is similar to the issue where we tried to infer the hard limit for a VM according to it's configuration. This proved to be really problematic so we moved this burden to the user.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
Additionally this patch also fixes a similar bug, where the mlock limit was not increased prior to a memory hotplug operation. https://bugzilla.redhat.com/show_bug.cgi?id=1273491 ---

IOW, tracking how much memory a VM should be able to lock is too hard, let's circumvent the security that the kernel is trying to add here and let assigned device VMs again lock as much memory as they want. This may solve some bugs, but it does so by ignoring the security limits we're trying to impose. Thanks,

Well, the default here is 64KiB, which obviously is not enough in this case and we are trying to set something reasonable so that it will not break. Another option would be to force the users to specify <hard_limit> so that they are held responsible for the value.
So the actual question here is: Is there a 100% working algorithm/formula that will allow us to calculate the locked memory size for any point/configuration? I'll happily do something like that.
We tried to do a similar thing to automatically infer the hard limit size. There were many bugs resulting from this since we weren't able to accurately guess qemu's memory usage.
Additionally if users wish to impose a limit on this they still might want to use the <hard_limit> setting.
I also don't like the idea of having libvirt automatically give the ability to lock all memory in RAM to every process that needs an assigned device, so I would rather see "some kind of reasonable limit" by default, along with the ability to increase it when necessary. (seriously, though, "the amount of IO space required by this device" sounds like something that should be discoverable. Of course if it is, then I should have been ambitious enough to figure that out in the first place so that we wouldn't have this problem now :-P)

On Wed, Nov 04, 2015 at 12:36:58 -0500, Laine Stump wrote:
On 11/04/2015 10:54 AM, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 08:43:34 -0700, Alex Williamson wrote:
IOW, tracking how much memory a VM should be able to lock is too hard, let's circumvent the security that the kernel is trying to add here and let assigned device VMs again lock as much memory as they want. This may solve some bugs, but it does so by ignoring the security limits we're trying to impose. Thanks,

Well, the default here is 64KiB, which obviously is not enough in this case and we are trying to set something reasonable so that it will not break. Another option would be to force the users to specify <hard_limit> so that they are held responsible for the value.
So the actual question here is: Is there a 100% working algorithm/formula that will allow us to calculate the locked memory size for any point/configuration? I'll happily do something like that.
We tried to do a similar thing to automatically infer the hard limit size. There were many bugs resulting from this since we weren't able to accurately guess qemu's memory usage.
Additionally if users wish to impose a limit on this they still might want to use the <hard_limit> setting.
I also don't like the idea of having libvirt automatically give the ability to lock all memory in RAM to every process that needs an assigned device, so I would rather see "some kind of reasonable limit" by default, along with the ability to increase it when necessary.
(seriously, though, "the amount of IO space required by this device" sounds like something that should be discoverable. Of course if it is, then I should have been ambitious enough to figure that out in the first place so that we wouldn't have this problem now :-P)
But even if you somehow discover what IO space a device needs, you still need to have an idea about the memory required to run a domain without this device, and that is impossible to compute. So unless QEMU is clever enough to only lock memory used as the IO space, there's not much we can do, I'm afraid. As Peter said, anyone who doesn't want this unlimited behavior can always set memory hard limits for their domains.

Jirka
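For reference, the <hard_limit> that Peter and Jirka refer to is configured under <memtune> in the domain XML; the value below is purely illustrative (8 GiB expressed in KiB):

  <memtune>
    <!-- illustrative example value: 8 GiB, expressed in KiB -->
    <hard_limit unit='KiB'>8388608</hard_limit>
  </memtune>

With the proposed patch, a domain that has this element set keeps it as the memlock limit (converted to bytes) instead of being switched to the unlimited value.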

On Wed, 2015-11-04 at 16:54 +0100, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 08:43:34 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:14 +0100, Peter Krempa wrote:
For VFIO passthrough the guest memory including the device memory to be resident in memory. Previously we used a magic constant of 1GiB that we added to the current memory size of the VM to accomodate all this. It is possible though that the device that is passed through exposes more than that. To avoid guessing wrong again in the future set the memory lock limit to unlimited at the point when VFIO will be used.
This problem is similar to the issue where we tried to infer the hard limit for a VM according to it's configuration. This proved to be really problematic so we moved this burden to the user.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
Additionally this patch also fixes a similar bug, where the mlock limit was not increased prior to a memory hotplug operation. https://bugzilla.redhat.com/show_bug.cgi?id=1273491 ---
IOW, tracking how much memory a VM should be able to lock is too hard, let's circumvent the security that the kernel is trying to add here and let assigned device VMs again lock as much memory as they want. This may solve some bugs, but it does so by ignoring the security limits we're trying to impose. Thanks,
Well, the default here is 64KiB, which obviously is not enough in this case and we are trying to set something reasonable so that it will not break. Another option would be to force the users to specify <hard_limit> so that they are held responsible for the value.
So the actual question here is: Is there a 100% working algorithm/formula that will allow us to calculate the locked memory size for any point/configuration? I'll happily do something like that.
We tried to do a similar thing to automatically infer the hard limit size. There were many bugs resulting from this since we weren't able to accurately guess qemu's memory usage.
Additionally if users wish to impose a limit on this they still might want to use the <hard_limit> setting.
What's wrong with the current algorithm? The 1G fudge factor certainly isn't ideal, but the 2nd bug you reference above is clearly a result that more memory is being added to the VM but the locked memory limit is not adjusted to account for it. That's just an implementation oversight. I'm not sure what's going on in the first bug, but why does using hard_limit to override the locked limit to something smaller than we think it should be set to automatically solve the problem? Is it not getting set as we expect on power? Do we simply need to set the limit using max memory rather than current memory? It seems like there's a whole lot of things we could do that are better than allowing the VM unlimited locked memory. Thanks, Alex

On Wed, Nov 04, 2015 at 17:16:53 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:54 +0100, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 08:43:34 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:14 +0100, Peter Krempa wrote:
[...]
Additionally if users wish to impose a limit on this they still might want to use the <hard_limit> setting.
What's wrong with the current algorithm? The 1G fudge factor certainly
The wrong thing is that it doesn't work all the time. See below...
isn't ideal, but the 2nd bug you reference above is clearly a result that more memory is being added to the VM but the locked memory limit is not adjusted to account for it. That's just an implementation
Indeed, that is a separate bug, and if we figure out how to make this work all the time I'll fix that separately.
oversight. I'm not sure what's going on in the first bug, but why does using hard_limit to override the locked limit to something smaller than we think it should be set to automatically solve the problem? Is it not getting set as we expect on power? Do we simply need to set the limit using max memory rather than current memory? It seems like there's a
Setting it to max memory won't actually fix the first referenced bug. The bug can be reproduced even if you don't use max memory at all on power pc (I didn't manage to update the BZ yet though with this information). Using max memory for that would basically add yet another workaround for setting the mlock size large enough.

The bug happens if you set up a guest with 1GiB of ram and pass an AMD FirePro 2270 graphics card into it. Libvirt sets the memory limit to 1+1GiB and starts qemu. qemu then aborts as the VFIO code cannot lock the memory.

This does not happen on larger guests though (2GiB of ram or more) which leads to the suspicion that the limit doesn't take into account some kind of overhead. As the original comment in the code was hinting that this is just a guess and it proved to be unreliable we shouldn't special case such configurations.

Setting it to max memory + 1G would actually work around the second mentioned bug to the current state.
whole lot of things we could do that are better than allowing the VM unlimited locked memory. Thanks,
I'm happy to set something large enough, but the value we set must be the absolute upper bound of anything that might be necessary so that it will 'just work'. We decided to be nice enough to the users to set the limit to something that works so we shouldn't special case any configuration.

Peter

On Thu, 2015-11-05 at 16:27 +0100, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 17:16:53 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:54 +0100, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 08:43:34 -0700, Alex Williamson wrote:
On Wed, 2015-11-04 at 16:14 +0100, Peter Krempa wrote:
[...]
Additionally if users wish to impose a limit on this they still might want to use the <hard_limit> setting.
What's wrong with the current algorithm? The 1G fudge factor certainly
The wrong thing is that it doesn't work all the time. See below...
isn't ideal, but the 2nd bug you reference above is clearly a result that more memory is being added to the VM but the locked memory limit is not adjusted to account for it. That's just an implementation
Indeed, that is a separate bug, and if we figure out how to make this work all the time I'll fix that separately.
oversight. I'm not sure what's going on in the first bug, but why does using hard_limit to override the locked limit to something smaller than we think it should be set to automatically solve the problem? Is it not getting set as we expect on power? Do we simply need to set the limit using max memory rather than current memory? It seems like there's a
Setting it to max memory won't actually fix the first referenced bug. The bug can be reproduced even if you don't use max memory at all on power pc (I didn't manage to update the BZ yet though with this information). Using max memory for that would basically add yet another workaround for setting the mlock size large enough.
The bug happens if you set up a guest with 1GiB of ram and pass an AMD FirePro 2270 graphics card into it. Libvirt sets the memory limit to 1+1GiB and starts qemu. qemu then aborts as the VFIO code cannot lock the memory.
This seems like a power bug; on x86, mmap'd memory, such as the MMIO space of an assigned device, doesn't count against locked memory limits.
This does not happen on larger guests though (2GiB of ram or more) which leads to the suspicion that the limit doesn't take into account some kind of overhead. As the original comment in the code was hinting that this is just a guess and it proved to be unreliable we shouldn't special case such configurations.
Power has a different IOMMU model than x86; there may be some lower bound at which trying to approximate it using the x86 limits doesn't work.
Setting it to max memory + 1G would actually work around the second mentioned bug to the current state.
whole lot of things we could do that are better than allowing the VM unlimited locked memory. Thanks,
I'm happy to set something large enough, but the value we set must be the absolute upper bound of anything that might be necessary so that it will 'just work'. We decided to be nice enough to the users to set the limit to something that works so we shouldn't special case any configuration.
The power devs will need to speak to what their locked memory requirements are and maybe we can come up with a combined algorithm that works good enough for both, or maybe libvirt needs to apply different algorithms based on the machine type. The x86 limit has been working well for us, so I see no reason to abandon it simply because we tried to apply it to a different platform and it didn't work. Thanks, Alex

On Thu, Nov 05, 2015 at 08:41:46 -0700, Alex Williamson wrote:
On Thu, 2015-11-05 at 16:27 +0100, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 17:16:53 -0700, Alex Williamson wrote:
[...]
The power devs will need to speak to what their locked memory requirements are and maybe we can come up with a combined algorithm that works good enough for both, or maybe libvirt needs to apply different algorithms based on the machine type. The x86 limit has been working
Okay, that seems reasonable. I'll extract the code that sets the limit into a helper, so that there's a central point where this stuff can be tweaked and made machine dependent.
well for us, so I see no reason to abandon it simply because we tried to apply it to a different platform and it didn't work. Thanks,
As for the x86 lock limit. Currently the code states the following:

/* VFIO requires all of the guest's memory to be
 * locked resident, plus some amount for IO
 * space. Alex Williamson suggested adding 1GiB for IO
 * space just to be safe (some finer tuning might be
 * nice, though).
 */

Disregarding the empirical evidence that this worked until now on x86, the comment's message here is rather unsure about the actual size used at this point and hints that it might need tweaking. If I'm going to move this I'd really like to remove the uncertainty from the message.

So will I be able to justify:
- dropping "some" from "some amount for IO space"
- replacing "Alex Williamson suggested adding" by something more scientific

I think the mlock size will be justified far more with a comment like:

/* VFIO requires all of the guest's memory and memory for the IO space to
 * be locked persistent.
 *
 * On x86 the IO space size is 1GiB.
 * [some lines for possibly other platforms]
 */

Is it valid to make such a change or do we need to keep the ambiguity and just reference the empirical fact that it didn't break yet?

Peter
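As a rough sketch of the helper shape Peter describes here, a single place that computes the limit and can later grow machine-dependent branches, consider the following. The function name, the arch enum and the power branch are hypothetical placeholders for illustration, not the code that was eventually merged:

#include <stdio.h>

/* Domain memory values are tracked in KiB; the rlimit wants bytes. */
#define KIB 1024ULL
#define GIB (1024ULL * 1024ULL * KIB)

typedef enum { ARCH_X86_64, ARCH_PPC64 } guest_arch;

/* Hypothetical central helper: compute RLIMIT_MEMLOCK for a domain that
 * uses VFIO device assignment. */
static unsigned long long
vfio_memlock_limit_bytes(guest_arch arch,
                         unsigned long long hard_limit_kib,
                         unsigned long long current_mem_kib)
{
    /* An explicit <hard_limit> always wins. */
    if (hard_limit_kib > 0)
        return hard_limit_kib * KIB;

    switch (arch) {
    case ARCH_X86_64:
        /* x86 rule from this thread: all of guest RAM plus 1GiB for IO
         * space; assigned-device MMIO is mmap'd and is not charged. */
        return current_mem_kib * KIB + 1 * GIB;
    case ARCH_PPC64:
        /* Power has a different IOMMU model; its formula still needs to
         * come from the power developers, so this branch is just a stub
         * reusing the x86 guess. */
        return current_mem_kib * KIB + 1 * GIB;
    }
    return current_mem_kib * KIB + 1 * GIB;
}

int
main(void)
{
    /* 4 GiB guest, no <hard_limit>: expect 5 GiB under the x86 rule. */
    printf("%llu\n", vfio_memlock_limit_bytes(ARCH_X86_64, 0, 4 * 1024 * 1024));
    return 0;
}

At domain startup the result would feed virCommandSetMaxMemLock(), and the hostdev/memory hotplug paths would pass the same value to virProcessSetMaxMemLock(), keeping the two call sites in sync.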

On Fri, 2015-11-06 at 08:30 +0100, Peter Krempa wrote:
On Thu, Nov 05, 2015 at 08:41:46 -0700, Alex Williamson wrote:
On Thu, 2015-11-05 at 16:27 +0100, Peter Krempa wrote:
On Wed, Nov 04, 2015 at 17:16:53 -0700, Alex Williamson wrote:
[...]
The power devs will need to speak to what their locked memory requirements are and maybe we can come up with a combined algorithm that works good enough for both, or maybe libvirt needs to apply different algorithms based on the machine type. The x86 limit has been working
Okay, that seems reasonable. I'll extract the code that sets the limit into a helper, so that there's a central point where this stuff can be tweaked and made machine dependent.
well for us, so I see no reason to abandon it simply because we tried to apply it to a different platform and it didn't work. Thanks,
As for the x86 lock limit. Currently the code states the following:

/* VFIO requires all of the guest's memory to be
 * locked resident, plus some amount for IO
 * space. Alex Williamson suggested adding 1GiB for IO
 * space just to be safe (some finer tuning might be
 * nice, though).
 */

Disregarding the empirical evidence that this worked until now on x86, the comment's message here is rather unsure about the actual size used at this point and hints that it might need tweaking. If I'm going to move this I'd really like to remove the uncertainty from the message.

So will I be able to justify:
- dropping "some" from "some amount for IO space"
- replacing "Alex Williamson suggested adding" by something more scientific

I think the mlock size will be justified far more with a comment like:

/* VFIO requires all of the guest's memory and memory for the IO space to
 * be locked persistent.
 *
 * On x86 the IO space size is 1GiB.
 * [some lines for possibly other platforms]
 */

Is it valid to make such a change or do we need to keep the ambiguity and just reference the empirical fact that it didn't break yet?
The reason we use the 1G "fudge factor" is that VFIO maps MMIO for other devices through the IOMMU. This theoretically allows peer-to-peer DMA between devices in the VM, for instance, an assigned GPU could DMA directly into the emulated VGA framebuffer. The reason we picked 1G is that it's often the size of reserved MMIO space below 4G on x86 systems, and so far we've never hit it. We do have 64bit MMIO space, but an important detail is that MMIO space for assigned devices doesn't count against the limit because it maps mmap'd MMIO space through the IOMMU, so there's not actually any RAM locked, it's MMIO space on the device that the user already owns. So things like a Tesla card with 4G PCI BARs is still going to work just fine w/ that 1G fudge factor. The 1G would need to come entirely from emulated/virtio devices.

Now the next question is whether this peer-to-peer DMA is useful, or even used. I have heard of users assigning multiple AMD cards to a system and making use of AMD's XDMA crossfire support, which certainly seems like it's using this capability. NVIDIA may follow along, but they seem to limit SLI in their driver. But as I said, these aren't counting against the locked memory limit and there's probably very little potential use of peer-to-peer between assigned and emulated devices.

At the time I first implemented VFIO, I couldn't see a way to tell the difference between mappings for emulated and assigned devices, they're all just opaque mappings. However, I think we could now be a bit smarter to figure out the device associated with the MemorySection and skip emulated device regions. If we were to do that, we could expose it as a new option to vfio-pci, so libvirt could probe for it and maybe remove the fudge factor on the lock limit. Thanks, Alex
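To attach rough, purely illustrative numbers to that: an x86 guest with 4GiB of RAM and an assigned card with a 4GiB BAR gets a limit of 4GiB + 1GiB = 5GiB of lockable memory; the 4GiB BAR is mmap'd device MMIO and is not charged against the limit, so the full 1GiB of headroom remains available for IOMMU mappings of emulated/virtio device regions.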

On Tue, Nov 10, 2015 at 09:40:34 -0700, Alex Williamson wrote:
On Fri, 2015-11-06 at 08:30 +0100, Peter Krempa wrote:
[...]
Now the next question is whether this peer-to-peer DMA is useful, or even used. I have heard of users assigning multiple AMD cards to a system and making use of AMD's XDMA crossfire support, which certainly seems like it's using this capability. NVIDIA may follow along, but they seem to limit SLI in their driver. But as I said, these aren't counting against the locked memory limit and there's probably very little potential use of peer-to-peer between assigned and emulated devices. At the time I first implemented VFIO, I couldn't see a way to tell the difference between mappings for emulated and assigned devices, they're all just opaque mappings. However, I think we could now be a bit smarter to figure out the device associated with the MemorySection and skip emulated device regions. If we were to do that, we could expose it as a new option to vfio-pci, so libvirt could probe for it and
Hmm, this might be interesting in the future. We'll probably have to work out the case when the VM is started fresh, though, since we really can't ask qemu before starting it.
maybe remove the fudge factor on the lock limit. Thanks,
Thank you very much for the explanation. I've tried to condense the key points into an improved comment in our code: http://www.redhat.com/archives/libvir-list/2015-November/msg00341.html

Peter

On Wed, Nov 04, 2015 at 16:14:29 +0100, Peter Krempa wrote:
For VFIO passthrough the guest memory including the device memory to be
a verb is missing here --------------------------------------------^ :-)
resident in memory. Previously we used a magic constant of 1GiB that we added to the current memory size of the VM to accomodate all this. It is possible though that the device that is passed through exposes more than that. To avoid guessing wrong again in the future set the memory lock limit to unlimited at the point when VFIO will be used.
To avoid any confusion in the future, I'd explicitly mention that if there is a need to limit the memory which a QEMU process can lock, memory hard_limit in domain XML needs to be set.
This problem is similar to the issue where we tried to infer the hard limit for a VM according to it's configuration. This proved to be really problematic so we moved this burden to the user.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
Additionally this patch also fixes a similar bug, where the mlock limit was not increased prior to a memory hotplug operation. https://bugzilla.redhat.com/show_bug.cgi?id=1273491
---
 src/qemu/qemu_command.c | 17 +++++------------
 src/qemu/qemu_hotplug.c | 20 +++++---------------
 src/util/virprocess.c   |  3 +++
 3 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 8824541..183aa85 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -11448,20 +11448,13 @@ qemuBuildCommandLine(virConnectPtr conn,
     }

     if (mlock) {
-        unsigned long long memKB;
-
-        /* VFIO requires all of the guest's memory to be
-         * locked resident, plus some amount for IO
-         * space. Alex Williamson suggested adding 1GiB for IO
-         * space just to be safe (some finer tuning might be
-         * nice, though).
-         */
+        /* VFIO requires all of the guest's memory and the IO space of the
+         * device to be locked resident. Since we can't really know what the
+         * users decide to plug in we need to set the limit to unlimited. */
         if (virMemoryLimitIsSet(def->mem.hard_limit))
-            memKB = def->mem.hard_limit;
+            virCommandSetMaxMemLock(cmd, def->mem.hard_limit << 10);
         else
-            memKB = virDomainDefGetMemoryActual(def) + 1024 * 1024;
-
-        virCommandSetMaxMemLock(cmd, memKB * 1024);
+            virCommandSetMaxMemLock(cmd, ULLONG_MAX);
     }

     if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MSG_TIMESTAMP) &&

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 8f2fda9..1cf61ac 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -1254,7 +1254,6 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
     bool teardowncgroup = false;
     bool teardownlabel = false;
    int backend;
-    unsigned long long memKB;
     virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
     unsigned int flags = 0;

@@ -1279,20 +1278,11 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
             goto error;
         }

-        /* VFIO requires all of the guest's memory to be locked
-         * resident (plus an additional 1GiB to cover IO space). During
-         * hotplug, the guest's memory may already be locked, but it
-         * doesn't hurt to "change" the limit to the same value.
-         * NB: the domain's memory tuning parameters are stored as
-         * Kibibytes, but virProcessSetMaxMemLock expects the value in
-         * bytes.
-         */
-        if (virMemoryLimitIsSet(vm->def->mem.hard_limit))
-            memKB = vm->def->mem.hard_limit;
-        else
-            memKB = virDomainDefGetMemoryActual(vm->def) + (1024 * 1024);
-
-        virProcessSetMaxMemLock(vm->pid, memKB * 1024);
+        /* VFIO requires all of the guest's memory and the IO space of the
+         * device to be locked resident. Since we can't really know what the
+         * users decide to plug in we need to set the limit to unlimited. */
+        if (!virMemoryLimitIsSet(vm->def->mem.hard_limit))
+            virProcessSetMaxMemLock(vm->pid, ULLONG_MAX);
I think this needs to set the limit in the same way qemuBuildCommandLine does it. If there was no need to set the limit (no VFIO device, not explicitly requested, no RDMA migration) in qemuBuildCommandLine, we have to set it here, otherwise hotplugging the first VFIO device into such a domain would kill the domain.
break;
    default:

diff --git a/src/util/virprocess.c b/src/util/virprocess.c
index 103c114..92195df 100644
--- a/src/util/virprocess.c
+++ b/src/util/virprocess.c
@@ -749,6 +749,9 @@ virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes)
     if (bytes == 0)
         return 0;

+    if (bytes == ULLONG_MAX)
+        bytes = RLIM_INFINITY;
+
     rlim.rlim_cur = rlim.rlim_max = bytes;
     if (pid == 0) {
         if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
Jirka
participants (4)
- Alex Williamson
- Jiri Denemark
- Laine Stump
- Peter Krempa