[libvirt-users] HugePages - can't start guest that requires them

Hello All,

I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.

I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.

Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty). HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.

Any help is much appreciated.

HOST: http://sprunge.us/SEdc
GUEST: http://sprunge.us/VCYB

Regards,
Richard
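For reference, the pieces described above typically look something like the following. This is a minimal sketch, not the poster's actual configuration; the sysctl.d file name and the page count (8192 x 2 MiB pages = 16 GiB) are illustrative.

  # /etc/sysctl.d/40-hugepages.conf (illustrative file name) - persistent reservation
  vm.nr_hugepages = 8192

  # or apply at runtime without a reboot
  sysctl vm.nr_hugepages=8192

  <!-- in the guest's domain XML: back guest RAM with hugepages -->
  <memoryBacking>
    <hugepages/>
  </memoryBacking>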

On Fri, Jan 30, 2015 at 03:33:43PM -0800, G. Richard Bellamy wrote:
Hello All,
I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.
I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.
Looking at the XML:

  <memoryBacking>
    <hugepages>
      <page size='16777216' unit='KiB' nodeset='0'/>

This means you want the guest's memory to be allocated from 16GiB hugepages. You probably wanted to put this there:

  <page size='2' unit='MiB' ...
Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty).
HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.
Any help is much appreciated.
HOST: http://sprunge.us/SEdc GUEST: http://sprunge.us/VCYB
Regards, Richard
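A sketch of the element Martin is pointing at, using the standard x86_64 2 MiB hugepage size expressed in KiB (the form the guest XML later in this thread actually uses); the surrounding elements are unchanged:

  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
  </memoryBacking>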

ack. Yeah, I had seen that and thought I corrected it. Thank you very much, the instances start.

Now I just have to figure out why I seem to be using 2x the number of hugepages I think I should be. numastat seems to think that, now that I've started up the two VMs, I am using twice as many hugepages as the amount I had allocated via sysctl [1].

[1] http://sprunge.us/LLNM

On Sat, Jan 31, 2015 at 9:55 AM, Martin Kletzander <mkletzan@redhat.com> wrote:
On Fri, Jan 30, 2015 at 03:33:43PM -0800, G. Richard Bellamy wrote:
Hello All,
I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.
I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.
Looking at the XML:
  <memoryBacking>
    <hugepages>
      <page size='16777216' unit='KiB' nodeset='0'/>
This means you want the guest's memory to be allocated from 16GiB hugepages. You probably wanted to put this there:
<page size='2' unit='MiB' ...
Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty).
HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.
Any help is much appreciated.
HOST: http://sprunge.us/SEdc GUEST: http://sprunge.us/VCYB
Regards, Richard

Did you create a mount for the hugepages? If you did, that's maybe the problem. I did that also at first but with libvirt it isn't necessary and in my case, it broke hugepages... If I'm not mistaken, libvirt takes care of the hugepages mount.

A while ago I wrote a wiki page on using hugepages with libvirt on Ubuntu: https://help.ubuntu.com/community/KVM%20-%20Using%20Hugepages

Maybe this helps?

________________________________________
From: G. Richard Bellamy [rbellamy@pteradigm.com]
Sent: Saturday, 31 January 2015 0:33
To: libvirt-users@redhat.com
Subject: [libvirt-users] HugePages - can't start guest that requires them

Hello All,

I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.

I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.

Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty). HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.

Any help is much appreciated.

HOST: http://sprunge.us/SEdc
GUEST: http://sprunge.us/VCYB

Regards,
Richard

Yeah, Dominique, your wiki was one of the many docs I read through before/during/after starting down this primrose path... thanks for writing it. I'm an Arch user, and I couldn't find anything to indicate qemu, as it's compiled for Arch, will look in /etc/default/qemu-kvm. And now that I've got the right page size, the instances are starting...

The reason I want to use the page element to the hugepages directive is that I want to target a numa node directly - in other words, I like the idea of one VM running on Node 0, and the other running on Node 2.

Your comment about libvirt taking care of the hugepages mount isn't consistent with my reading or experience - on a systemd-based system, systemd takes care of the hugetlbfs mount to /dev/hugepages, and libvirt builds the /dev/hugepages/qemu... directory structure. At least that's what I've seen.

-rb

On Sat, Jan 31, 2015 at 11:43 AM, Dominique Ramaekers <dominique.ramaekers@cometal.be> wrote:
Did you create a mount for the hugepages? If you did, that's maybe the problem. I did that also at first but with libvirt it isn't necessary and in my case, it broke hugepages...
If I'm not mistaken, libvirt takes care of the hugepages mount.
A while ago I wrote a wiki page on using hugepages with libvirt on Ubuntu: https://help.ubuntu.com/community/KVM%20-%20Using%20Hugepages
Maybe this helps?
________________________________________
From: G. Richard Bellamy [rbellamy@pteradigm.com]
Sent: Saturday, 31 January 2015 0:33
To: libvirt-users@redhat.com
Subject: [libvirt-users] HugePages - can't start guest that requires them
Hello All,
I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.
I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.
Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty).
HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.
Any help is much appreciated.
HOST: http://sprunge.us/SEdc GUEST: http://sprunge.us/VCYB
Regards, Richard
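The mount arrangement Richard describes can be checked directly on a systemd host; a quick sanity check might look like this (commands only, nothing here is specific to his setup):

  # systemd mounts hugetlbfs at /dev/hugepages via the dev-hugepages.mount unit
  systemctl status dev-hugepages.mount
  mount -t hugetlbfs

  # libvirt then creates its own directory underneath the mount
  ls /dev/hugepages/libvirt/qemu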

Regarding fine-tuning my explanation about which system does the actual mounting of Hugepages, you're probably right... Thanks for the correction. On upstart systems (like Ubuntu) the mounting of Hugepages is done by the init script qemu-kvm.conf.

From: G. Richard Bellamy [mailto:rbellamy@pteradigm.com]
Sent: Sunday, 1 February 2015 0:02
To: Dominique Ramaekers
CC: libvirt-users@redhat.com
Subject: Re: [libvirt-users] HugePages - can't start guest that requires them

Yeah, Dominique, your wiki was one of the many docs I read through before/during/after starting down this primrose path... thanks for writing it. I'm an Arch user, and I couldn't find anything to indicate qemu, as it's compiled for Arch, will look in /etc/default/qemu-kvm. And now that I've got the right page size, the instances are starting...

The reason I want to use the page element to the hugepages directive is that I want to target a numa node directly - in other words, I like the idea of one VM running on Node 0, and the other running on Node 2.

Your comment about libvirt taking care of the hugepages mount isn't consistent with my reading or experience - on a systemd-based system, systemd takes care of the hugetlbfs mount to /dev/hugepages, and libvirt builds the /dev/hugepages/qemu... directory structure. At least that's what I've seen.

-rb

On Sat, Jan 31, 2015 at 11:43 AM, Dominique Ramaekers <dominique.ramaekers@cometal.be> wrote:

Did you create a mount for the hugepages? If you did, that's maybe the problem. I did that also at first but with libvirt it isn't necessary and in my case, it broke hugepages... If I'm not mistaken, libvirt takes care of the hugepages mount.

A while ago I wrote a wiki page on using hugepages with libvirt on Ubuntu: https://help.ubuntu.com/community/KVM%20-%20Using%20Hugepages

Maybe this helps?

________________________________________
From: G. Richard Bellamy [rbellamy@pteradigm.com]
Sent: Saturday, 31 January 2015 0:33
To: libvirt-users@redhat.com
Subject: [libvirt-users] HugePages - can't start guest that requires them

Hello All,

I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.

I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.

Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty). HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.

Any help is much appreciated.

HOST: http://sprunge.us/SEdc
GUEST: http://sprunge.us/VCYB

Regards,
Richard

As I mentioned, I got the instances to launch... but they're only taking HugePages from "Node 0", when I believe my setup should pull from both nodes.

[atlas] http://sprunge.us/FSEf
[prometheus] http://sprunge.us/PJcR

2015-02-03 16:51:48 root@eanna i ~ # virsh start atlas
Domain atlas started

2015-02-03 16:51:58 root@eanna i ~ # virsh start prometheus
Domain prometheus started

2015-02-03 16:52:53 root@eanna i ~ # numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 2           Total
                 --------------- --------------- ---------------
MemTotal                32113.93        32238.27        64352.20
MemFree                  7030.30         7175.87        14206.17
MemUsed                 25083.63        25062.40        50146.04
Active                   3737.86         4089.77         7827.63
Inactive                 3423.46         2832.16         6255.61
Active(anon)             1658.46         2830.59         4489.05
Inactive(anon)             54.35           64.71          119.05
Active(file)             2079.39         1259.18         3338.57
Inactive(file)           3369.11         2767.45         6136.56
Unevictable                15.68           64.39           80.07
Mlocked                    15.68           64.39           80.07
Dirty                      11.45            6.98           18.43
Writeback                   0.00            0.00            0.00
FilePages                5515.35         4078.96         9594.32
Mapped                    396.62          336.48          733.10
AnonPages                1661.74         2906.95         4568.69
Shmem                      62.90           50.22          113.12
KernelStack                12.89            9.81           22.70
PageTables                 46.08           36.50           82.58
NFS_Unstable                0.00            0.00            0.00
Bounce                      0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00
Slab                      192.31          160.72          353.03
SReclaimable              137.67          118.47          256.14
SUnreclaim                 54.64           42.25           96.89
AnonHugePages               0.00            0.00            0.00
HugePages_Total         17408.00        17408.00        34816.00
HugePages_Free           2048.00            0.00         2048.00
HugePages_Surp              0.00            0.00            0.00

2015-02-03 16:53:47 root@eanna i ~ # numastat -p qemu

Per-node process memory usage (in MBs)
PID                            Node 0          Node 2           Total
-----------------------  --------------- --------------- ---------------
10315 (qemu-system-x86)           589.76            0.00        32391.84
10346 (qemu-system-x86)         14839.83            0.00        18128.85
-----------------------  --------------- --------------- ---------------
Total                           15429.59            0.00        50520.68

On Sun, Feb 1, 2015 at 10:43 PM, Dominique Ramaekers <dominique.ramaekers@cometal.be> wrote:
Regarding fine-tuning my explanation about which system does the actual mounting of Hugepages, you're probably right... Thanks for the correction.
On upstart systems (like Ubuntu) the mounting of Hugepages is done by the init script qemu-kvm.conf
From: G. Richard Bellamy [mailto:rbellamy@pteradigm.com]
Sent: Sunday, 1 February 2015 0:02
To: Dominique Ramaekers
CC: libvirt-users@redhat.com
Subject: Re: [libvirt-users] HugePages - can't start guest that requires them
Yeah, Dominique, your wiki was one of the many docs I read through before/during/after starting down this primrose path... thanks for writing it. I'm an Arch user, and I couldn't find anything to indicate qemu, as it's compiled for Arch, will look in /etc/default/qemu-kvm. And now that I've got the right page size, the instances are starting...
The reason I want to use the page element to the hugepages directive is that I want to target a numa node directly - in other words, I like the idea of one VM running on Node 0, and the other running on Node 2.
Your comment about libvirt taking care of the hugepages mount isn't consistent with my reading or experience - on a systemd-based system, systemd takes care of the hugetlbfs mount to /dev/hugepages, and libvirt builds the /dev/hugepages/qemu... directory structure. At least that's what I've seen.
-rb
On Sat, Jan 31, 2015 at 11:43 AM, Dominique Ramaekers <dominique.ramaekers@cometal.be> wrote:
Did you create a mount for the hugepages? If you did, that's maybe the problem. I did that also at first but with libvirt it isn't necessary and in my case, it broke hugepages...
If I'm not mistaken, libvirt takes care of the hugepages mount.
A while ago I wrote a wiki page on using hugepages with libvirt on Ubuntu: https://help.ubuntu.com/community/KVM%20-%20Using%20Hugepages
Maybe this helps?
________________________________________
From: G. Richard Bellamy [rbellamy@pteradigm.com]
Sent: Saturday, 31 January 2015 0:33
To: libvirt-users@redhat.com
Subject: [libvirt-users] HugePages - can't start guest that requires them
Hello All,
I'm trying to enable hugepages, I've turned off THP (Transparent Huge Pages), and enabled hugepages in memoryBacking, and set my 2MB hugepages count via sysctl.
I'm getting "libvirtd[5788]: Failed to autostart VM 'atlas': internal error: Unable to find any usable hugetlbfs mount for 16777216 KiB" where atlas is one of my guests and 16777216 KiB is the amount of memory I'm trying to give to the guest.
Yes, I can see the hugepages via numastat -m and hugetlbfs is mounted via /dev/hugepages and there is a dir structure /dev/hugepages/libvirt/qemu (it's empty).
HugePages is big enough to accommodate the 16G I'm allocating... and changing the perms on that directory structure to 777 doesn't work either.
Any help is much appreciated.
HOST: http://sprunge.us/SEdc GUEST: http://sprunge.us/VCYB
Regards, Richard
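The per-node hugepage split shown above with numastat -m can also be read straight from sysfs, which is sometimes easier to interpret than numastat's MB-scaled view. A sketch (the node numbers follow this host's layout, node0 and node2):

  # 2 MiB hugepages reserved / still free on each populated node
  cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
  cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
  cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/free_hugepages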

On 04.02.2015 01:59, G. Richard Bellamy wrote:
As I mentioned, I got the instances to launch... but they're only taking HugePages from "Node 0", when I believe my setup should pull from both nodes.
[atlas] http://sprunge.us/FSEf [prometheus] http://sprunge.us/PJcR
[pasting interesting nits from both XMLs]

  <domain type='kvm' id='2'>
    <name>atlas</name>
    <uuid>d9991b1c-2f2d-498a-9d21-51f3cf8e6cd9</uuid>
    <memory unit='KiB'>16777216</memory>
    <currentMemory unit='KiB'>16777216</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='0'/>
      </hugepages>
      <nosharepages/>
    </memoryBacking>
    <!-- no numa pinning -->
  </domain>

  <domain type='kvm' id='3'>
    <name>prometheus</name>
    <uuid>dda7d085-701b-4d0a-96d4-584678104fb3</uuid>
    <memory unit='KiB'>16777216</memory>
    <currentMemory unit='KiB'>16777216</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='2'/>
      </hugepages>
      <nosharepages/>
    </memoryBacking>
    <!-- again no numa pinning -->
  </domain>

So, at start, the @nodeset attribute to the <page/> element refers to guest numa nodes, not host ones. And since you don't define any numa nodes for your guests, it's useless. Side note - I wonder if we should make libvirt fail explicitly in this case.

Moreover, you haven't pinned your guests onto any host numa nodes. This means it's up to the host kernel and its scheduler where the guest will take memory from, and subsequently hugepages as well. I think you want to add:

  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

to the guest XMLs, where @nodeset refers to host numa nodes and tells where the guest should be placed. There are other modes too so please see documentation to tune the XML to match your use case perfectly.

Michal
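Putting Michal's suggestion next to the existing memoryBacking stanzas, each guest would gain a <numatune> element pointing at a different host node - roughly like this (a sketch of the relevant fragments only, not complete domain XML):

  <!-- atlas: allocate guest memory from host node 0 -->
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

  <!-- prometheus: allocate guest memory from host node 2 -->
  <numatune>
    <memory mode='strict' nodeset='2'/>
  </numatune>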

*facepalm*

Now that I'm re-reading the documentation it's obvious that <page/> and @nodeset are for the guest: "This tells the hypervisor that the guest should have its memory allocated using hugepages instead of the normal native page size." Pretty clear there.

Thank you SO much for the guidance, I'll return to my tweaking. I'll report back here with my results.

On Wed, Feb 4, 2015 at 12:17 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
On 04.02.2015 01:59, G. Richard Bellamy wrote:
As I mentioned, I got the instances to launch... but they're only taking HugePages from "Node 0", when I believe my setup should pull from both nodes.
[atlas] http://sprunge.us/FSEf [prometheus] http://sprunge.us/PJcR
[pasting interesting nits from both XMLs]
  <domain type='kvm' id='2'>
    <name>atlas</name>
    <uuid>d9991b1c-2f2d-498a-9d21-51f3cf8e6cd9</uuid>
    <memory unit='KiB'>16777216</memory>
    <currentMemory unit='KiB'>16777216</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='0'/>
      </hugepages>
      <nosharepages/>
    </memoryBacking>
    <!-- no numa pinning -->
  </domain>

  <domain type='kvm' id='3'>
    <name>prometheus</name>
    <uuid>dda7d085-701b-4d0a-96d4-584678104fb3</uuid>
    <memory unit='KiB'>16777216</memory>
    <currentMemory unit='KiB'>16777216</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='2'/>
      </hugepages>
      <nosharepages/>
    </memoryBacking>
    <!-- again no numa pinning -->
  </domain>
So, at start, the @nodeset attribute to <page/> element refers to guest numa nodes, not host ones. And since you don't define any numa nodes for your guests, it's useless. Side note - I wonder if we should make libvirt fail explicitly in this case.
Moreover, you haven't pinned your guests onto any host numa nodes. This means it's up to the host kernel and its scheduler where guest will take memory from. And subsequently hugepages as well. I think you want to add:
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
to guest XMLs, where @nodeset refers to host numa nodes and tells where the guest should be placed. There are other modes too so please see documentation to tune the XML to match your use case perfectly.
Michal

First I'll quickly summarize my understanding of how to configure numa...

In "//memoryBacking/hugepages/page[@nodeset]" I am telling libvirt to use hugepages for the guest, and to get those hugepages from a particular host NUMA node.

In "//numatune/memory[@nodeset]" I am telling libvirt to pin the memory allocation to the guest from a particular host numa node.

In "//numatune/memnode[@nodeset]" I am telling libvirt which guest NUMA node (cellid) should come from which host NUMA node (nodeset).

In "//cpu/numa/cell[@id]" I am telling libvirt how much memory to allocate to each guest NUMA node (cell).

Basically, I thought "nodeset", regardless of where it existed in the domain xml, referred to the host's NUMA node, and "cell" (<cell id=/> or @cellid) refers to the guest's NUMA node.

However....

Atlas [1] starts without issue, prometheus [2] fails with "libvirtd[]: hugepages: node 2 not found". I found a patch that contains the code responsible for throwing this error [3]:

+    if (def->cpu && def->cpu->ncells) {
+        /* Fortunately, we allow only guest NUMA nodes to be continuous
+         * starting from zero. */
+        pos = def->cpu->ncells - 1;
+    }
+
+    next_bit = virBitmapNextSetBit(page->nodemask, pos);
+    if (next_bit >= 0) {
+        virReportError(VIR_ERR_XML_DETAIL,
+                       _("hugepages: node %zd not found"),
+                       next_bit);
+        return -1;
+    }

Without digging too deeply into the actual code, and just inferring from the above, it looks like we are reading the number of cells set in "//cpu/numa" with def->cpu->ncells, and comparing it to the number of nodesets in "//memoryBacking//hugepages". I think this means that I misunderstand what the nodeset is for in that element...

Of note is the fact that my host has non-contiguous NUMA node numbers:

2015-02-09 08:53:06 root@eanna i ~ # numastat
                           node0           node2
numa_hit               216225024       440311113
numa_miss                      0          795018
numa_foreign              795018               0
interleave_hit             15835           15783
local_node             214029815       221903122
other_node               2195209       219203009

Thanks again for any help.

[1]: http://sprunge.us/jZgS
[2]: http://sprunge.us/iETF
[3]: https://www.redhat.com/archives/libvir-list/2014-September/msg00090.html

On Wed, Feb 4, 2015 at 12:03 PM, G. Richard Bellamy <rbellamy@pteradigm.com> wrote:
*facepalm*
Now that I'm re-reading the documentation it's obvious that <page/> and @nodeset are for the guest, "This tells the hypervisor that the guest should have its memory allocated using hugepages instead of the normal native page size." Pretty clear there.
Thank you SO much for the guidance, I'll return to my tweaking. I'll report back here with my results.
On Wed, Feb 4, 2015 at 12:17 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
On 04.02.2015 01:59, G. Richard Bellamy wrote:
As I mentioned, I got the instances to launch... but they're only taking HugePages from "Node 0", when I believe my setup should pull from both nodes.
[atlas] http://sprunge.us/FSEf [prometheus] http://sprunge.us/PJcR
[pasting interesting nits from both XMLs]
  <domain type='kvm' id='2'>
    <name>atlas</name>
    <uuid>d9991b1c-2f2d-498a-9d21-51f3cf8e6cd9</uuid>
    <memory unit='KiB'>16777216</memory>
    <currentMemory unit='KiB'>16777216</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='0'/>
      </hugepages>
      <nosharepages/>
    </memoryBacking>
    <!-- no numa pinning -->
  </domain>

  <domain type='kvm' id='3'>
    <name>prometheus</name>
    <uuid>dda7d085-701b-4d0a-96d4-584678104fb3</uuid>
    <memory unit='KiB'>16777216</memory>
    <currentMemory unit='KiB'>16777216</currentMemory>
    <memoryBacking>
      <hugepages>
        <page size='2048' unit='KiB' nodeset='2'/>
      </hugepages>
      <nosharepages/>
    </memoryBacking>
    <!-- again no numa pinning -->
  </domain>
So, at start, the @nodeset attribute to <page/> element refers to guest numa nodes, not host ones. And since you don't define any numa nodes for your guests, it's useless. Side note - I wonder if we should make libvirt fail explicitly in this case.
Moreover, you haven't pinned your guests onto any host numa nodes. This means it's up to the host kernel and its scheduler where guest will take memory from. And subsequently hugepages as well. I think you want to add:
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
to guest XMLs, where @nodeset refers to host numa nodes and tells where the guest should be placed. There are other modes too so please see documentation to tune the XML to match your use case perfectly.
Michal

On 09.02.2015 18:19, G. Richard Bellamy wrote:
First I'll quickly summarize my understanding of how to configure numa...
In "//memoryBacking/hugepages/page[@nodeset]" I am telling libvirt to use hugepages for the guest, and to get those hugepages from a particular host NUMA node.
No, @nodeset refers to guest NUMA nodes.
In "//numatune/memory[@nodeset]" I am telling libvirt to pin the memory allocation to the guest from a particular host numa node.
The <memory/> element tells libvirt what to do with guest NUMA nodes that are not explicitly pinned.
In "//numatune/memnode[@nodeset]" I am telling libvirt which guest NUMA node (cellid) should come from which host NUMA node (nodeset).
Correct. This way you can explicitly pin guest NUMA nodes onto host NUMA nodes.
In "//cpu/numa/cell[@id]" I am telling libvirt how much memory to allocate to each guest NUMA node (cell).
Yes. Each <cell/> creates guest NUMA node. It interconnects vCPUs and guest memory - which vCPUs should lie in which guest NUMA node, and how much memory should be available for that particular guest NUMA node.
Basically, I thought "nodeset", regardless of where it existed in the domain xml, referred to the host's NUMA node, and "cell" (<cell id=/> or @cellid) refers to the guest's NUMA node.
However....
Atlas [1] starts without issue, prometheus [2] fails with "libvirtd[]: hugepages: node 2 not found". I found a patch that contains the code responsible for throwing this error [3],
+    if (def->cpu && def->cpu->ncells) {
+        /* Fortunately, we allow only guest NUMA nodes to be continuous
+         * starting from zero. */
+        pos = def->cpu->ncells - 1;
+    }
+
+    next_bit = virBitmapNextSetBit(page->nodemask, pos);
+    if (next_bit >= 0) {
+        virReportError(VIR_ERR_XML_DETAIL,
+                       _("hugepages: node %zd not found"),
+                       next_bit);
+        return -1;
+    }
Without digging too deeply into the actual code, and just inferring from the above, it looks like we are reading the number of cells set in "//cpu/numa" with def->cpu->ncells, and comparing it to the number of nodesets in "//memoryBacking//hugepages". I think this means that I misunderstand what the nodeset is for in that element...
Of note is the fact that my host has non-contiguous NUMA node numbers:

2015-02-09 08:53:06 root@eanna i ~ # numastat
                           node0           node2
numa_hit               216225024       440311113
numa_miss                      0          795018
numa_foreign              795018               0
interleave_hit             15835           15783
local_node             214029815       221903122
other_node               2195209       219203009
Thanks again for any help.
Libvirt should be perfectly able to cope with non-contiguous host NUMA nodes. However, non-contiguous guest NUMA nodes are not supported yet - but it shouldn't matter, since users have full control over creating guest NUMA nodes.

Anyway, if you find the documentation incomplete in any sense, any part, or you feel that rewording some paragraphs may help, feel free to propose a patch and I'll review it.

Michal
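As a concrete illustration of the four elements discussed in this exchange, a single-cell guest pinned to host node 2 might carry fragments like the following. This is a sketch only; the vCPU range and sizes are illustrative, the nodeset on <page/> names the guest cell, and the nodesets under <numatune> name host nodes:

  <memoryBacking>
    <hugepages>
      <!-- back guest cell 0 with 2 MiB hugepages -->
      <page size='2048' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <numatune>
    <!-- default placement for anything not covered by a memnode -->
    <memory mode='strict' nodeset='2'/>
    <!-- guest cell 0 is allocated from host node 2 -->
    <memnode cellid='0' mode='strict' nodeset='2'/>
  </numatune>
  <cpu>
    <numa>
      <!-- define guest cell 0: which vCPUs it holds and how much memory it gets -->
      <cell id='0' cpus='0-7' memory='16777216' unit='KiB'/>
    </numa>
  </cpu>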

On Tue, Feb 10, 2015 at 1:14 AM, Michal Privoznik <mprivozn@redhat.com> wrote:
On 09.02.2015 18:19, G. Richard Bellamy wrote:
First I'll quickly summarize my understanding of how to configure numa...
In "//memoryBacking/hugepages/page[@nodeset]" I am telling libvirt to use hugepages for the guest, and to get those hugepages from a particular host NUMA node.
No, @nodeset refers to guest NUMA nodes.
In "//numatune/memory[@nodeset]" I am telling libvirt to pin the memory allocation to the guest from a particular host numa node.
The <memory/> element tells libvirt what to do with guest NUMA nodes that are not explicitly pinned.
In "//numatune/memnode[@nodeset]" I am telling libvirt which guest NUMA node (cellid) should come from which host NUMA node (nodeset).
Correct. This way you can explicitly pin guest NUMA nodes onto host NUMA nodes.
In "//cpu/numa/cell[@id]" I am telling libvirt how much memory to allocate to each guest NUMA node (cell).
Yes. Each <cell/> creates guest NUMA node. It interconnects vCPUs and guest memory - which vCPUs should lie in which guest NUMA node, and how much memory should be available for that particular guest NUMA node.
Basically, I thought "nodeset", regardless of where it existed in the domain xml, referred to the host's NUMA node, and "cell" (<cell id=/> or @cellid) refers to the guest's NUMA node.
However....
Atlas [1] starts without issue, prometheus [2] fails with "libvirtd[]: hugepages: node 2 not found". I found a patch that contains the code responsible for throwing this error [3],
+    if (def->cpu && def->cpu->ncells) {
+        /* Fortunately, we allow only guest NUMA nodes to be continuous
+         * starting from zero. */
+        pos = def->cpu->ncells - 1;
+    }
+
+    next_bit = virBitmapNextSetBit(page->nodemask, pos);
+    if (next_bit >= 0) {
+        virReportError(VIR_ERR_XML_DETAIL,
+                       _("hugepages: node %zd not found"),
+                       next_bit);
+        return -1;
+    }
Without digging too deeply into the actual code, and just inferring from the above, it looks like we are reading the number of cells set in "//cpu/numa" with def->cpu->ncells, and comparing it to the number of nodesets in "//memoryBacking//hugepages". I think this means that I misunderstand what the nodeset is for in that element...
Of note is the fact that my host has non-contiguous NUMA node numbers:

2015-02-09 08:53:06 root@eanna i ~ # numastat
                           node0           node2
numa_hit               216225024       440311113
numa_miss                      0          795018
numa_foreign              795018               0
interleave_hit             15835           15783
local_node             214029815       221903122
other_node               2195209       219203009
Thanks again for any help.
Libvirt should be perfectly able to cope with non-contiguous host NUMA nodes. However, non-contiguous guest NUMA nodes are not supported yet - but it shouldn't matter, since users have full control over creating guest NUMA nodes.
Anyway, if you find the documentation incomplete in any sense, any part, or you feel that rewording some paragraphs may help, feel free to propose a patch and I'll review it.
Thanks again Michal, I'm slowly zeroing in on a good resolution here. I think the documentation is clear enough - it's the fact that a guest NUMA node can be referred to as either cell(id) or nodeset, depending on element context - that's what threw me.

I've modified my config [1] based on my understanding, and am running into a new error. Basically I'm hitting the oom-killer [2] even though the hard_limit [3] of memtune is below the total number of hugepages set for that NUMA nodeset.

[1] http://sprunge.us/BadI
[2] http://sprunge.us/eELZ
[3] http://sprunge.us/GYXM

On 20.02.2015 21:32, G. Richard Bellamy wrote:
<snip/>
I've modified my config [1] based on my understanding, and am running into a new error. Basically I'm hitting the oom-killer [2] even though the hard_limit [3] of memtune is below the total number of hugepages set for that NUMA nodeset.
Just drop the hard_limit. It's a black box we should never have introduced. In Linux, from the kernel's POV, there's no difference between guest RAM and the memory the hypervisor uses to store its internal state. It's all one big chunk of memory. And even if you know the first part (how much memory you're letting the guest have), you don't know anything about the other part - how much memory the hypervisor needs for its internal state (which may even change over time) - so you can't tell the sum of both parts.

Also, in the config of your VM, you're not using hugepages. Or have you just posted the wrong XML?

Then again, the kernel's approach to hugepages is not as awesome as for regular system pages. Either at boot (1GB) or at runtime (2MB), one must cut a slice of memory off to be used by hugepages and nothing else. So even if you have ~17GB RAM free on both nodes, those pages are reserved for hugepages, hence the OOM.

Michal
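As an aside, 2 MiB pages can also be reserved per host node at runtime through sysfs, rather than only via the global vm.nr_hugepages knob, which makes it easier to keep the reservation in step with where the guests are pinned. A sketch (the counts are illustrative, 8192 x 2 MiB = 16 GiB per node):

  echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  echo 8192 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages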

On Sun, Feb 22, 2015 at 11:01 PM, Michal Privoznik <mprivozn@redhat.com> wrote:
Just drop the hard_limit. It's a black box we should never have introduced. In Linux, from the kernel's POV, there's no difference between guest RAM and the memory the hypervisor uses to store its internal state. It's all one big chunk of memory. And even if you know the first part (how much memory you're letting the guest have), you don't know anything about the other part - how much memory the hypervisor needs for its internal state (which may even change over time) - so you can't tell the sum of both parts.
Also, in the config of your VM, you're not using hugepages. Or have you just posted the wrong XML?
Then again, kernel's approach to hugepages is not as awesome as to regular system pages. Either on boot (1GB) or at runtime (2MB) one must cut a slice of memory off to be used by hugepages and nothing else. So even if you have ~17GB RAM free on both nodes, they are reserved for hugepages, hence the OOM.
Yeah, I dropped the hard limit and have set <hugepages/>. I had the hard limit set since I had also tried locking the pages in memory... found out that was no bueno too.

I also ported numad to Arch Linux [1] and am using placement='auto', which seems to be working reasonably well. I'll keep you posted. Of course, this required a custom build of libvirt [2] to get the NUMAD defines set...

[1.0] https://aur.archlinux.org/packages/numad-git/
[1.1] https://aur.archlinux.org/packages/numad/
[2] https://github.com/rbellamy/pkgbuilds/tree/master/libvirt-git
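For completeness, the automatic-placement setup described above maps to attributes like these in the domain XML (a sketch; the vCPU count is illustrative, and placement='auto' makes libvirt query numad for placement advice):

  <vcpu placement='auto'>8</vcpu>
  <numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>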
participants (4)
- Dominique Ramaekers
- G. Richard Bellamy
- Martin Kletzander
- Michal Privoznik