[libvirt] QEMU capabilities vs machine types

Dear list,

I've run into the following bug [1]. The problem is that even though we check whether memory-backend-{ram,file} devices are supported, qemu fails to start. As I see it, on one hand qemu lies about the supported devices (I found this when I imitated libvirt's capabilities check by hand on the command line). But on the other hand, libvirt is using '-M none', so even if qemu were fixed, the domain would still be unable to start.

So I believe my question is: does anybody have a bright idea how to fix this? I don't think we want to extend our capabilities from a list to a matrix (where a machine type would select a list). Moreover, querying and creating the matrix would take ages. But then again, the machine type is important. Maybe not so much for -x86_64, but for -arm definitely! Or?

Michal

1: https://bugzilla.redhat.com/show_bug.cgi?id=1191567
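[For context, a by-hand probe along the lines of what libvirt does might look roughly like the following. This is a minimal sketch: the binary name is illustrative, and libvirt's real probe issues more QMP commands than shown, but qom-list-types in a '-M none' session is the kind of check being described.]

  # start qemu with no machine type, just to talk QMP
  $ qemu-system-x86_64 -M none -nodefaults -nographic -qmp stdio
  { "execute": "qmp_capabilities" }
  { "execute": "qom-list-types", "arguments": { "abstract": true } }
  # memory-backend-ram / memory-backend-file showing up in the reply is
  # what the capability check keys off, even though the object may still
  # be unusable with the machine type the guest will actually run with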

On Wed, Feb 11, 2015 at 04:31:53PM +0100, Michal Privoznik wrote:
Dear list,
I've run into the following bug [1]. The problem is that even though we check whether memory-backend-{ram,file} devices are supported, qemu fails to start. As I see it, on one hand qemu lies about the supported devices (I found this when I imitated libvirt's capabilities check by hand on the command line). But on the other hand, libvirt is using '-M none', so even if qemu were fixed, the domain would still be unable to start.
So I believe my question is: does anybody have a bright idea how to fix this? I don't think we want to extend our capabilities from a list to a matrix (where a machine type would select a list). Moreover, querying and creating the matrix would take ages. But then again, the machine type is important. Maybe not so much for -x86_64, but for -arm definitely! Or?
Historically we've tried to treat the machine type as a black box, or as a simple "tag" which just affects some set of default settings that the guest sees. That assumption has already broken down a little bit on x86, where we need to distinguish PIIX vs Q35 in order to do correct PCI device address assignment and to take account of the different default PCI host bridge and PATA/SATA controller layout. We'll get more of that as we take non-x86 arches more seriously, since their machine types show even more variance in the default base board layout/setup.

While we might try to get QEMU to provide introspection for the machine types so we could query the default bus topology, I'm not convinced that it will be a tractable problem to solve with an entirely metadata driven approach. IOW, I wouldn't be surprised if we end up with more machine type specific code branches in libvirt for arm, etc.
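[As an aside: QEMU does already expose the list of machine types themselves; what is missing is the per-machine board/bus detail discussed above. A rough sketch, with the binary name as an assumption:]

  # list the machine types a given binary knows about
  $ qemu-system-aarch64 -machine help
  # or, over QMP (in a session started as in the earlier sketch):
  { "execute": "query-machines" }
  # each entry reports the machine name, alias and cpu-max, but nothing
  # about the default bus/board topology, which is the part libvirt
  # would need for metadata-driven address assignment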
There are two reasons why we query & check the supported capabilities from QEMU:

1. There are multiple possible CLI args for the same feature and we need to choose the "best" one to use.
2. The feature is not supported and we want to give the caller a better error message than they'd get from QEMU.

I'm unclear from the bug which scenario applies here.

If it is scenario 2 though, I'd just mark it as CANTFIX or WONTFIX, as no matter what we do the user would get an error. It is not worth making our capability matrix a factor of 10+ bigger just to get a better error message.

If it is scenario 1, I think the burden is on QEMU to solve. The memory-backend-{file,ram} CLI flags shouldn't be tied to guest machine types, as they are backend config setup options that should not impact guest ABI.

Regards,
Daniel

On 11.02.2015 16:47, Daniel P. Berrange wrote:
On Wed, Feb 11, 2015 at 04:31:53PM +0100, Michal Privoznik wrote:
There are two reasons why we query & check the supported capabilities from QEMU
1. There are multiple possible CLI args for the same feature and we need to choose the "best" one to use
2. The feature is not supported and we want to give the caller a better error message than they'd get from QEMU
I'm unclear from the bug which scenario applies here.
If it is scenario 2 though, I'd just mark it as CANTFIX or WONTFIX, as no matter what we do the user would get an error. It is not worth making our capability matrix a factor of 10+ bigger just to get a better error message.
If it is scenario 1, I think the burden is on QEMU to solve. The memory-backend-{file,ram} CLI flags shouldn't be tied to guest machine types, as they are backend config setup options that should not impact guest ABI.
It's somewhere in between 1 and 2. Back in RHEL-7.0 days libvirt would have created a guest with:

  -numa node,...,mem=1337

But if qemu reports that it supports memory-backend-ram, libvirt tries to use it:

  -object memory-backend-ram,id=ram-node0,size=1337M,... \
  -numa node,...,memdev=ram-node0

This breaks migration to the newer qemu which is in RHEL-7.1. If qemu reported the correct value, we could generate the correct command line and migration would succeed. However, our fault is that we are not asking the correct question anyway.

Michal
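[For illustration, complete versions of the two forms might look like the following; the node id, cpu list and size are made up, and the rest of the command line is omitted. Both describe the same guest-visible NUMA topology, but the guest RAM is represented differently in the two cases, which is what makes a mixed source/destination migration pair fail.]

  # RHEL-7.0 style: plain -numa, memory comes from the default backend
  -numa node,nodeid=0,cpus=0,mem=1337

  # RHEL-7.1 style: explicit backend object referenced via memdev=
  -object memory-backend-ram,id=ram-node0,size=1337M \
  -numa node,nodeid=0,cpus=0,memdev=ram-node0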

On Wed, Feb 11, 2015 at 05:09:01PM +0100, Michal Privoznik wrote:
On 11.02.2015 16:47, Daniel P. Berrange wrote:
On Wed, Feb 11, 2015 at 04:31:53PM +0100, Michal Privoznik wrote:
There are two reasons why we query & check the supported capabilities from QEMU
1. There are multiple possible CLI args for the same feature and we need to choose the "best" one to use
2. The feature is not supported and we want to give the caller a better error message than they'd get from QEMU
I'm unclear from the bug which scenario applies here.
If it is scenario 2 though, I'd just mark it as CANTFIX or WONTFIX, as no matter what we do the user would get an error. It is not worth making our capability matrix a factor of 10+ bigger just to get a better error message.
If it is scenario 1, I think the burden is on QEMU to solve. The memory-backend-{file,ram} CLI flags shouldn't be tied to guest machine types, as they are backend config setup options that should not impact guest ABI.
It's somewhere in between 1 and 2. Back in RHEL-7.0 days libvirt would have created a guest with:
-numa node,...,mem=1337
But if qemu reports that it supports memory-backend-ram, libvirt tries to use it:
-object memory-backend-ram,id=ram-node0,size=1337M,... \
-numa node,...,memdev=ram-node0
This breaks migration to the newer qemu which is in RHEL-7.1. If qemu reported the correct value, we could generate the correct command line and migration would succeed. However, our fault is that we are not asking the correct question anyway.
Ah, so the problem is rather that QEMU's migration data stream was not compatible between these two possible NUMA setups, despite it not affecting guest ABI. In general I'd like to think we'd not get into this situation in the first place. We'd probably have to just special-case the code based on machine type in this case though.

Regards,
Daniel

On Wed, Feb 11, 2015 at 05:09:01PM +0100, Michal Privoznik wrote:
On 11.02.2015 16:47, Daniel P. Berrange wrote:
On Wed, Feb 11, 2015 at 04:31:53PM +0100, Michal Privoznik wrote:
There are two reasons why we query & check the supported capabilities from QEMU
1. There are multiple possible CLI args for the same feature and we need to choose the "best" one to use
2. The feature is not supported and we want to give the caller a better error message than they'd get from QEMU
I'm unclear from the bug which scenario applies here.
If it is scenario 2 though, I'd just mark it as CANTFIX or WONTFIX, as no matter what we do the user would get an error. It is not worth making our capability matrix a factor of 10+ bigger just to get a better error message.
If it is scenario 1, I think the burden is on QEMU to solve. The memory-backend-{file,ram} CLI flags shouldn't be tied to guest machine types, as they are backend config setup options that should not impact guest ABI.
It's somewhere in between 1 and 2. Back in RHEL-7.0 days libvirt would have created a guest with:
-numa node,...,mem=1337
But if qemu reports that it supports memory-backend-ram, libvirt tries to use it:
-object memory-backend-ram,id=ram-node0,size=1337M,... \
-numa node,...,memdev=ram-node0
This breaks migration to the newer qemu which is in RHEL-7.1. If qemu reported the correct value, we could generate the correct command line and migration would succeed. However, our fault is that we are not asking the correct question anyway.
I understand that RHEL-7.1 QEMU is not providing enough data for libvirt to detect this before it is too late. What I am missing here is: why wasn't commit f309db1f4d51009bad0d32e12efc75530b66836b enough to fix this specific case?

For reference:

commit f309db1f4d51009bad0d32e12efc75530b66836b
Author: Michal Privoznik <mprivozn@redhat.com>
Date:   Thu Dec 18 12:36:48 2014 +0100

    qemu: Create memory-backend-{ram,file} iff needed

    Libvirt BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1175397
    QEMU BZ:    https://bugzilla.redhat.com/show_bug.cgi?id=1170093

    In qemu there are two interesting arguments:

    1) -numa to create a guest NUMA node
    2) -object memory-backend-{ram,file} to tell qemu which memory region
       on which host NUMA node it should allocate the guest memory from.

    Combining these two together we can instruct qemu to create a guest
    NUMA node that is tied to a host NUMA node. And it works just fine.
    However, depending on the machine type used, there might be some
    issues during migration when OVMF is enabled (see QEMU BZ). While
    this truly is a QEMU bug, we can help avoid it. The problem lies
    somewhere within the memory backend objects. Having said that, the
    fix on our side consists of putting those objects on the command
    line if and only if needed. For instance, while previously we would
    construct this (in all ways correct) command line:

        -object memory-backend-ram,size=256M,id=ram-node0 \
        -numa node,nodeid=0,cpus=0,memdev=ram-node0

    now we create just:

        -numa node,nodeid=0,cpus=0,mem=256

    because the backend object is obviously not tied to any specific
    host NUMA node.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

--
Eduardo

On 12.02.2015 20:25, Eduardo Habkost wrote:
On Wed, Feb 11, 2015 at 05:09:01PM +0100, Michal Privoznik wrote:
On 11.02.2015 16:47, Daniel P. Berrange wrote:
On Wed, Feb 11, 2015 at 04:31:53PM +0100, Michal Privoznik wrote:
There are two reasons why we query & check the supported capabilities from QEMU
1. There are multiple possible CLI args for the same feature and we need to choose the "best" one to use
2. The feature is not supported and we want to give the caller a better error message than they'd get from QEMU
I'm unclear from the bug which scenario applies here.
If it is scenario 2 though, I'd just mark it as CANTFIX or WONTFIX, as no matter what we do the user would get an error. It is not worth making our capability matrix a factor of 10+ bigger just to get a better error message.
If it is scenario 1, I think the burden is on QEMU to solve. The memory-backend-{file,ram} CLI flags shouldn't be tied to guest machine types, as they are backend config setup options that should not impact guest ABI.
It's somewhere in between 1 and 2. Back in RHEL-7.0 days libvirt would have created a guest with:
-numa node,...,mem=1337
But if qemu reports that it supports memory-backend-ram, libvirt tries to use it:
-object memory-backend-ram,id=ram-node0,size=1337M,... \
-numa node,...,memdev=ram-node0
This breaks migration to the newer qemu which is in RHEL-7.1. If qemu reported the correct value, we could generate the correct command line and migration would succeed. However, our fault is that we are not asking the correct question anyway.
I understand that RHEL-7.1 QEMU is not providing enough data for libvirt to detect this before it is too late. What I am missing here is: why wasn't commit f309db1f4d51009bad0d32e12efc75530b66836b enough to fix this specific case?
The numa pinning can be expressed in libvirt in this way:

  <numatune>
    <memory mode='strict' nodeset='0-7'/>
    <memnode cellid='0' mode='preferred' nodeset='3'/>
    <memnode cellid='2' mode='strict' nodeset='1-2,5,7'/>
  </numatune>

This says: pin guest node #0 onto host node #3, and guest node #2 onto host nodes #1-2, 5, or 7. The rest of the guest numa nodes are placed onto host nodes #0-7.

As long as there is explicit pinning of a guest numa node onto host nodes (the <memnode/> element), memory-backend-ram is required. However, if <numatune/> has only the one <memory/> child, we can still guarantee the requested configuration via CGroups and don't necessarily need memory-backend-ram. The patch you've referred to was incomplete in this respect. Moreover, it was buggy: it allowed combining use of bare -numa and memory-backend-ram at the same time (which is not allowed).

Michal
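[To make the consequence concrete, a rough, hypothetical sketch of the qemu arguments the <memnode/> entries above would call for. The sizes and cpu lists are made up, and the exact host-nodes list syntax is an assumption; the point is that once any guest node uses memdev=, every node has to, so the unpinned nodes need backend objects too.]

  # guest node 0: preferred allocation from host node 3
  -object memory-backend-ram,id=ram-node0,size=1024M,host-nodes=3,policy=preferred \
  -numa node,nodeid=0,cpus=0,memdev=ram-node0

  # guest node 2: strictly bound to host nodes 1-2, 5 and 7
  -object memory-backend-ram,id=ram-node2,size=1024M,host-nodes=1-2,host-nodes=5,host-nodes=7,policy=bind \
  -numa node,nodeid=2,cpus=2,memdev=ram-node2

  # ...and each remaining guest node would need its own memory-backend-ram
  # bound to host nodes 0-7, since mixing mem= and memdev= nodes on one
  # command line is rejected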