[libvirt] adding a new libvirt xml element for File Descriptor backed memory for use with vhost-user

Hi,

I work mostly in OpenStack on enabling OVS with DPDK. When deploying VMs on a host running OVS with DPDK, the VMs are booted using vhost-user interfaces. QEMU has supported creating VMs with vhost-user network interfaces since v2.1. Libvirt currently supports requesting vhost-user interfaces by adding the following XML fragment:

  <interface type='vhostuser'>
    <mac address='fa:16:3e:ea:2a:08'/>
    <source type='unix' path='/var/run/openvswitch/vhuf1204e0c-98' mode='client'/>
    <model type='virtio'/>
    <alias name='net0'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  </interface>

To send traffic via a vhost-user interface, the vhost-user frontend (provided by QEMU) must share the virtio rings with the vhost backend by passing a file descriptor for the memory backing object of the QEMU instance as part of port creation.

Today the only native way to ask libvirt to create a VM whose memory is backed by a memdev that can be accessed via a file descriptor is to request hugepage-backed memory. This requires the operator to manage and configure hugepages on each compute host, and to take special care that VMs which do not request hugepages are not placed on hosts where vhost-user interfaces are used.

Today it is possible to use libvirt to spawn a VM without hugepage memory but with a file-descriptor-backed memdev via the qemu:commandline element:

  <qemu:commandline>
    <qemu:arg value='-object'/>
    <qemu:arg value='memory-backend-file,id=mem,size=1024M,mem-path=/var/lib/libvirt/qemu,share=on'/>
    <qemu:arg value='-numa'/>
    <qemu:arg value='node,memdev=mem'/>
    <qemu:arg value='-mem-prealloc'/>
  </qemu:commandline>

I created a proof-of-concept patch to Nova to demonstrate that this works; however, to support this use case in Nova a new XML element is required.

https://review.openstack.org/#/c/309565/1

I would like to propose introducing a new subelement of the memoryBacking element to request file-descriptor-backed memory:

  <memoryBacking>
    <filedescriptor size_mb="1024" path="/var/lib/libvirt/qemu" prealloc="true" shared="on"/>
  </memoryBacking>

The filedescriptor XML fragment above would then be parsed to generate the same QEMU arguments as the qemu:commandline fragment, thereby allowing the creation of a VM with vhost-user interfaces without hugepage memory backing.

Before I start looking at the libvirt code base, I wanted to ask whether the libvirt community would be open to this change, and what the best way to approach enabling this feature would be.

Regards,
Sean.
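For anyone who wants to reproduce the qemu:commandline workaround programmatically, the following is a minimal sketch of how a management layer might build that fragment. The function name `build_qemu_commandline` and its parameters are hypothetical and are not taken from the Nova patch. Note that libvirt only accepts qemu:commandline when the enclosing <domain> element declares the QEMU namespace used below.

```python
import xml.etree.ElementTree as ET

# Namespace that must be declared on <domain> for <qemu:commandline> to be accepted.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def build_qemu_commandline(size_mb, mem_path="/var/lib/libvirt/qemu"):
    """Build a <qemu:commandline> element mirroring the fragment in the message.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("qemu", QEMU_NS)
    cmdline = ET.Element(f"{{{QEMU_NS}}}commandline")
    backend = (
        f"memory-backend-file,id=mem,size={size_mb}M,"
        f"mem-path={mem_path},share=on"
    )
    # Same five arguments, in the same order, as the hand-written fragment.
    for value in ("-object", backend, "-numa", "node,memdev=mem", "-mem-prealloc"):
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": value})
    return cmdline


if __name__ == "__main__":
    print(ET.tostring(build_qemu_commandline(1024), encoding="unicode"))
```

The size has to be kept in sync with the guest's configured memory by hand, which is one of the reasons a first-class element is preferable to this passthrough.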

On Tue, May 10, 2016 at 09:01:20PM +0000, Mooney, Sean K wrote:
> I would like to propose introducing a new subelement of the memoryBacking element to request file-descriptor-backed memory:
>
>   <memoryBacking>
>     <filedescriptor size_mb="1024" path="/var/lib/libvirt/qemu" prealloc="true" shared="on"/>
>   </memoryBacking>
Specifying a size is not required - we already know how big memory must be for the guest.

We already have a memAccess='shared' attribute against the <numa> element that is used to determine if the underlying memory should be set up as shared. We could define a further element that lets us control memory access mode for guests without NUMA topology specified:

  <memoryBacking>
    <access mode="shared"/>
  </memoryBacking>

For huge pages it seems we unconditionally pass -mem-prealloc. I'm thinking we could perhaps make that configurable via an element

  <memoryBacking>
    <allocation mode="immediate|ondemand"/>
  </memoryBacking>

to control use of -mem-prealloc or not.

So all that remains is a way to request file-based backing of RAM. As with huge pages, I think we should hide the actual path from the user. We should just use /dev/shm as the backing for non-hugepage RAM. For this we could define something like

  <memoryBacking>
    <source type="file|anonymous"/>
  </memoryBacking>

Putting that all together, to get what you want we'd have

  <memoryBacking>
    <source type="file"/>
    <access mode="shared"/>
    <allocation mode="immediate"/>
  </memoryBacking>

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
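To make the combined <memoryBacking> proposal concrete, here is a rough sketch of how the three elements could map onto the QEMU arguments from the qemu:commandline workaround. The mapping, the function name `memory_backing_to_args`, and the default `mem_path` are assumptions for illustration; the real conversion would live in libvirt's C code and may differ.

```python
import xml.etree.ElementTree as ET


def memory_backing_to_args(xml_text, size_mb, mem_path="/dev/shm"):
    """Map the proposed <memoryBacking> children to QEMU arguments.

    Hypothetical mapping for illustration; not libvirt's implementation.
    size_mb comes from the guest's configured memory, not from the XML.
    """
    root = ET.fromstring(xml_text)
    source = root.find("source")
    access = root.find("access")
    allocation = root.find("allocation")

    args = []
    if source is not None and source.get("type") == "file":
        shared = access is not None and access.get("mode") == "shared"
        share = "on" if shared else "off"
        args += [
            "-object",
            f"memory-backend-file,id=mem,size={size_mb}M,"
            f"mem-path={mem_path},share={share}",
            "-numa",
            "node,memdev=mem",
        ]
    if allocation is not None and allocation.get("mode") == "immediate":
        # "immediate" corresponds to preallocating guest memory up front.
        args.append("-mem-prealloc")
    return args


if __name__ == "__main__":
    xml_text = (
        "<memoryBacking>"
        '<source type="file"/><access mode="shared"/>'
        '<allocation mode="immediate"/>'
        "</memoryBacking>"
    )
    print(memory_backing_to_args(xml_text, 1024))
```

Keeping the path and size out of the XML, as suggested above, means the caller only states intent (file-backed, shared, preallocated) and the driver fills in the rest.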

Hi Daniel,

Thanks for your response. Comments inline.

Regards,
Sean.
-----Original Message-----
From: Daniel P. Berrange [mailto:berrange@redhat.com]
Sent: Wednesday, May 11, 2016 10:44 AM
To: Mooney, Sean K <sean.k.mooney@intel.com>
Cc: libvir-list@redhat.com
Subject: Re: [libvirt] adding a new libvirt xml element for File Descriptor backed memory for use with vhost-user
> > I would like to propose introducing a new subelement of the memoryBacking element to request file-descriptor-backed memory:
> >
> >   <memoryBacking>
> >     <filedescriptor size_mb="1024" path="/var/lib/libvirt/qemu" prealloc="true" shared="on"/>
> >   </memoryBacking>
> Specifying a size is not required - we already know how big memory must be for the guest.
>
> We already have a memAccess='shared' attribute against the <numa> element that is used to determine if the underlying memory should be set up as shared. We could define a further element that lets us control memory access mode for guests without NUMA topology specified.

[Mooney, Sean K] Hi, yes, the reason I added the shared attribute was to cater for the case of guests without a NUMA topology. For guests with a NUMA topology I agree that using memAccess='shared' on the cell is better for consistency with hugepage memory.

>   <memoryBacking>
>     <access mode="shared"/>
>   </memoryBacking>
>
> For huge pages it seems we unconditionally pass -mem-prealloc. I'm thinking we could perhaps make that configurable via an element
>
>   <memoryBacking>
>     <allocation mode="immediate|ondemand"/>
>   </memoryBacking>
>
> to control use of -mem-prealloc or not.

[Mooney, Sean K] For the vhost-user case -mem-prealloc is required, because you are basically doing DMA, so you really want the memory to be allocated. More generally, though, from a libvirt point of view I do think it makes sense for this to be configurable, to allow oversubscription of memory for higher density.

> So all that remains is a way to request file-based backing of RAM. As with huge pages, I think we should hide the actual path from the user. We should just use /dev/shm as the backing for non-hugepage RAM. For this we could define something like
>
>   <memoryBacking>
>     <source type="file|anonymous"/>
>   </memoryBacking>

[Mooney, Sean K] For some reason, when I used /dev/shm I could only boot one instance at a time. That was my first choice, but maybe we would have to create a file per instance under /dev/shm to make it work.

> Putting that all together, to get what you want we'd have
>
>   <memoryBacking>
>     <source type="file"/>
>     <access mode="shared"/>
>     <allocation mode="immediate"/>
>   </memoryBacking>

[Mooney, Sean K] Yes, this seems like it would be a clean way to address this use case. Can you gauge how small or large a change this would be? It's been a while since I worked with C directly, but if you could point me in the right direction in the libvirt codebase I would be happy to look at creating an RFC patch.

From the Nova side, assuming libvirt is extended with this feature, should I open a blueprint to extend the existing guest memory backing support in parallel with the libvirt implementation, or wait until it is supported in libvirt before starting the Nova discussion? In either case I think we agree that any support in Nova would depend on libvirt support in order to be accepted upstream.

On Thu, May 12, 2016 at 04:00:29PM +0000, Mooney, Sean K wrote:
> [Mooney, Sean K] For some reason, when I used /dev/shm I could only boot one instance at a time. That was my first choice, but maybe we would have to create a file per instance under /dev/shm to make it work.

QEMU should create the file itself - it's no different to our use of hugetlbfs, in fact. Possibly you hit a limit on the amount of memory allowed to be used via /dev/shm - IIRC the mount point is limited to 50% of RAM by default.

If you use /var/lib/libvirt/ as the location you get a real file backed by disk, so akin to putting the VM on swap, IIUC!
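A quick way to test the /dev/shm size hypothesis is to compare the space available on the mount with the memory each guest preallocates. The sketch below assumes a Unix host; the helper names `shm_available_bytes` and `guests_that_fit` are made up for illustration, and the calculation ignores any other users of the tmpfs.

```python
import os

MIB = 1024 * 1024


def shm_available_bytes(path="/dev/shm"):
    """Bytes currently available to unprivileged users on the filesystem at path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def guests_that_fit(available_bytes, guest_mib):
    """How many fully preallocated guests of guest_mib MiB fit in available_bytes."""
    return available_bytes // (guest_mib * MIB)


if __name__ == "__main__":
    free = shm_available_bytes()
    print(f"/dev/shm has {free // MIB} MiB available")
    print(f"1024 MiB guests that fit: {guests_that_fit(free, 1024)}")
```

For example, a tmpfs with 1536 MiB available has room for only one preallocated 1024 MiB guest, which would match the "one instance at a time" symptom on a small test host.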
> [Mooney, Sean K] Yes, this seems like it would be a clean way to address this use case. Can you gauge how small or large a change this would be? It's been a while since I worked with C directly, but if you could point me in the right direction in the libvirt codebase I would be happy to look at creating an RFC patch.

First there's defining the XML extensions - this needs docs/schemas/domaincommon.rng and src/conf/domain_conf.{c,h} to be changed.

Then there's wiring up the QEMU XML -> ARGV conversion in src/qemu/qemu_command.c, and adding test cases in tests/qemuxml2argvtest.c.
> From the Nova side, assuming libvirt is extended with this feature, should I open a blueprint to extend the existing guest memory backing support in parallel with the libvirt implementation, or wait until it is supported in libvirt before starting the Nova discussion? In either case I think we agree that any support in Nova would depend on libvirt support in order to be accepted upstream.

You're going to hit the deadline for approval of Newton specs in Nova fairly soon, and unless the libvirt implementation is done before then, I think it is unlikely you'd get a spec approved. So by all means work on this in parallel, but be realistic about the chances of approval in Nova for this cycle.

Regards,
Daniel

-----Original Message-----
From: Daniel P. Berrange [mailto:berrange@redhat.com]
Sent: Thursday, May 12, 2016 5:28 PM
To: Mooney, Sean K <sean.k.mooney@intel.com>
Cc: libvir-list@redhat.com
Subject: Re: [libvirt] adding a new libvirt xml element for File Descriptor backed memory for use with vhost-user
On Thu, May 12, 2016 at 04:00:29PM +0000, Mooney, Sean K wrote:
> QEMU should create the file itself - it's no different to our use of hugetlbfs, in fact. Possibly you hit a limit on the amount of memory allowed to be used via /dev/shm - IIRC the mount point is limited to 50% of RAM by default.
>
> If you use /var/lib/libvirt/ as the location you get a real file backed by disk, so akin to putting the VM on swap, IIUC!

[Mooney, Sean K] That was my initial assumption too. However, when you use /var/lib/libvirt/ or /dev/shm, QEMU does not create a file in the directory. What I think is happening is that it does not actually create a file, just a file descriptor that is mapped to a memory region. I believe it merely uses the path to determine what the default page size should be when allocating the file backing in memory. This is something that we can look into, though.
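One possible explanation for seeing no file in the directory is the common create-then-unlink pattern: a temporary file is created under the given path, its name is removed immediately, and only the open file descriptor is kept and mapped. Whether QEMU does exactly this when mem-path is a directory is an assumption here and would need confirming in the QEMU source. The sketch below (Unix only, with the hypothetical helper `alloc_unlinked_backing`) just shows that the pattern yields shared, file-backed memory with nothing visible in the directory.

```python
import mmap
import os
import tempfile


def alloc_unlinked_backing(dir_path, size):
    """Create a temp file in dir_path, unlink it, and map it MAP_SHARED.

    Illustrative only; not QEMU's code.
    """
    fd, name = tempfile.mkstemp(prefix="backing_mem.", dir=dir_path)
    os.unlink(name)          # the name disappears; the open fd keeps the data alive
    os.ftruncate(fd, size)   # give the now-nameless file its size
    mem = mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
    return fd, mem


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        fd, mem = alloc_unlinked_backing(d, 4096)
        mem[:5] = b"hello"
        print("directory entries:", os.listdir(d))  # no entry remains for the backing file
        mem.close()
        os.close(fd)
```

If this is what happens, the backing store is still whatever filesystem the path lives on, so the choice between /dev/shm and /var/lib/libvirt/ matters even though no file can be listed.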
> You're going to hit the deadline for approval of Newton specs in Nova fairly soon, and unless the libvirt implementation is done before then, I think it is unlikely you'd get a spec approved. So by all means work on this in parallel, but be realistic about the chances of approval in Nova for this cycle.

[Mooney, Sean K] Actually, I was assuming that this would be completed early in Ocata, as it requires changes in libvirt first.