(+Jiri, +libvir-list)
On Fri, Nov 22, 2019 at 04:58:25PM +0000, Dr. David Alan Gilbert wrote:
> * Laszlo Ersek (lersek(a)redhat.com) wrote:
> (+Dave, +Eduardo)
>
> On 11/22/19 00:00, dann frazier wrote:
> > On Tue, Nov 19, 2019 at 06:06:15AM +0100, Laszlo Ersek wrote:
> >> On 11/19/19 01:54, dann frazier wrote:
> >>> On Fri, Nov 15, 2019 at 11:51:18PM +0100, Laszlo Ersek wrote:
> >>>> On 11/15/19 19:56, dann frazier wrote:
> >>>>> Hi,
> >>>>> I'm trying to passthrough an Nvidia GPU to a q35 KVM guest, but UEFI
> >>>>> is failing to allocate resources for it. I have no issues if I boot w/
> >>>>> a legacy BIOS, and it works fine if I tell the linux guest to do the
> >>>>> allocation itself - but I'm looking for a way to make this work w/
> >>>>> OVMF by default.
> >>>>>
> >>>>> I posted a debug log here:
> >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563/+attachment/5...
> >>>>>
> >>>>> Linux guest lspci output is also available for both seabios/OVMF boots here:
> >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563
> >>>>
> >>>> By default, OVMF exposes such a 64-bit MMIO aperture for PCI MMIO BAR
> >>>> allocation that is 32GB in size. The generic PciBusDxe driver collects,
> >>>> orders, and assigns / allocates the MMIO BARs, but it can work only out
> >>>> of the aperture that platform code advertizes.
> >>>>
> >>>> Your GPU's region 1 is itself 32GB in size. Given that there are further
> >>>> PCI devices in the system with further 64-bit MMIO BARs, the default
> >>>> aperture cannot accommodate everything. In such an event, PciBusDxe
> >>>> avoids assigning the largest BARs (to my knowledge), in order to
> >>>> conserve the most aperture possible, for other devices -- hence break
> >>>> the fewest possible PCI devices.
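
(A side note from me: if you want to confirm from inside a Linux guest which
BARs were left unassigned, something along these lines should show it; the
01:00.0 address below is only a placeholder for the GPU's slot:

  # list the GPU's BARs as the kernel sees them
  lspci -vv -s 01:00.0 | grep -i region

  # kernel complaints about BARs it could not place; the exact wording
  # differs between kernel versions
  dmesg | grep -iE 'BAR.*(no space|failed to assign)'
)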
> >>>>
> >>>> You can control the aperture size from the QEMU command line. You can
> >>>> also do it from the libvirt domain XML, technically speaking. The knob
> >>>> is experimental, so no stability or compatibility guarantees are made.
> >>>> (That's also the reason why it's a bit of a hack in the libvirt domain XML.)
> >>>>
> >>>> The QEMU cmdline option is described in the following edk2 commit message:
> >>>>
> >>>> https://github.com/tianocore/edk2/commit/7e5b1b670c38
> >>>
> >>> Hi Laszlo,
> >>>
> >>> Thanks for taking the time to describe this in detail! The -fw_cfg
> >>> option did avoid the problem for me.
> >>
> >> Good to hear, thanks.
> >>
> >>> I also noticed that the above
> >>> commit message mentions the existence of a 24GB card as a reasoning
> >>> behind choosing the 32GB default aperture. From what you say below, I
> >>> understand that bumping this above 64GB could break hosts w/ <= 37
> >>> physical address bits.
> >>
> >> Right.
> >>
> >>> What would be the downside of bumping the
> >>> default aperture to, say, 48GB?
> >>
> >> The placement of the aperture is not trivial (please see the code
> >> comments in the linked commit). The base address of the aperture is
> >> chosen so that the largest BAR that can fit in the aperture may be
> >> naturally aligned. (BARs are whole powers of two.)
> >>
> >> The largest BAR that can fit in a 48 GB aperture is 32 GB. Therefore
> >> such an aperture would be aligned at 32 GB -- the lowest base address
> >> (dependent on guest RAM size) would be 32 GB. Meaning that the aperture
> >> would end at 32 + 48 = 80 GB. That still breaches the 36-bit phys
> >> address width.
> >>
> >> 32 GB is the largest aperture size that can work with 36-bit phys
> >> address width; that's the aperture that ends at 64 GB exactly.
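
(Spelling that arithmetic out, as a quick back-of-the-envelope in shell:

  aperture_gb=48
  align_gb=32                         # largest power of two that fits in 48 GB
  base_gb=$align_gb                   # lowest possible base, ignoring guest RAM
  end_gb=$(( base_gb + aperture_gb ))
  echo "aperture ends at ${end_gb} GB"
  # prints 80 GB, which is above the 64 GB reachable with 36 phys address bits
)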
> >
> > Thanks, yeah - now that I read the code comments that is clear (as
> > clear as it can be w/ my low level of base knowledge). In the commit you
> > mention Gerd (CC'd) had suggested a heuristic-based approach for
> > sizing the aperture. When you say "PCPU address width" - is that a
> > function of the available physical bits?
>
> "PCPU address width" is not a "function" of the available
physical bits
> -- it *is* the available physical bits. "PCPU" simply stands for
> "physical CPU".
>
> > IOW, would that approach
> > allow OVMF to automatically grow the aperture to the max ^2 supported
> > by the host CPU?
>
> Maybe.
>
> The current logic in OVMF works from the guest-physical address space
> size -- as deduced from multiple factors, such as the 64-bit MMIO
> aperture size, and others -- towards the guest-CPU (aka VCPU) address
> width. The VCPU address width is important for a bunch of other purposes
> in the firmware, so OVMF has to calculate it no matter what.
>
> Again, the current logic is to calculate the highest guest-physical
> address, and then deduce the VCPU address width from that (and then
> expose it to the rest of the firmware).
>
> Your suggestion would require passing the PCPU (physical CPU) address
> width from QEMU/KVM into the guest, and reversing the direction of the
> calculation. The PCPU address width would determine the VCPU address
> width directly, and then the 64-bit PCI MMIO aperture would be
> calculated from that.
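
(If it helps to visualize the direction, the current calculation goes roughly
like this, sketched as shell arithmetic with a made-up example value; the
highest guest-physical address goes in, the VCPU address width comes out:

  highest_gpa=$(( 64 * 1024 * 1024 * 1024 - 1 ))   # e.g. an aperture ending at 64 GB
  width=0; addr=$highest_gpa
  while [ "$addr" -gt 0 ]; do addr=$(( addr >> 1 )); width=$(( width + 1 )); done
  echo "VCPU address width: $width bits"           # 36

Your suggestion would run it the other way around: start from the host's
width and derive the aperture from that.)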
>
> However, there are two caveats.
>
> (1) The larger your guest-phys address space (as exposed through the
> VCPU address width to the rest of the firmware), the more guest RAM you
> need for page tables. Because, just before entering the DXE phase, the
> firmware builds 1:1 mapping page tables for the entire guest-phys
> address space. This is necessary e.g. so you can access any PCI MMIO BAR.
>
> Now consider that you have a huge beefy virtualization host with say 46
> phys address bits, and a wimpy guest with say 1.5GB of guest RAM. Do you
> absolutely want tens of *terabytes* for your 64-bit PCI MMIO aperture?
> Do you really want to pay for the necessary page tables with that meager
> guest RAM?
>
> (Such machines do exist BTW, for example:
>
> http://mid.mail-archive.com/9BD73EA91F8E404F851CF3F519B14AA8036C67B5@DGGE...
> )
>
> In other words, you'd need some kind of knob anyway, because otherwise
> your aperture could grow too *large*.
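
(Rough numbers, assuming the 1:1 mapping is built from 2 MiB pages; the
firmware can use 1 GiB pages where the CPU supports them, which shrinks this
considerably, but as a ballpark:

  phys_bits=46
  space=$(( 1 << phys_bits ))                   # 64 TiB of guest-phys space
  entries=$(( space / (2 * 1024 * 1024) ))      # one page-table entry per 2 MiB
  echo "$(( entries * 8 / 1024 / 1024 )) MiB"   # 256 MiB of page tables

which is a lot to carve out of a 1.5GB guest.)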
>
>
> (2) Exposing the PCPU address width to the guest may have nasty
> consequences at the QEMU/KVM level, regardless of guest firmware. For
> example, that kind of "guest enlightenment" could interfere with migration.
>
> If you boot a guest let's say with 16GB of RAM, and tell it "hey friend,
> have 40 bits of phys address width!", then you'll have a difficult time
> migrating that guest to a host with a CPU that only has 36-bits wide
> physical addresses -- even if the destination host has plenty of RAM
> otherwise, such as a full 64GB.
>
> There could be other QEMU/KVM / libvirt issues that I'm unaware of
> (hence the CC to Dave and Eduardo).
> host physical address width gets messy. There are differences as well
> between upstream qemu behaviour, and some downstreams.
> I think the story is that:
>   a) Qemu default: 40 bits on any host
>   b) -cpu blah,host-phys-bits=true to follow the host.
>   c) RHEL has host-phys-bits=true by default
>
> As you say, the only real problem with host-phys-bits is migration -
> between say an E3 and an E5 xeon with different widths. The magic 40's
> is generally wrong as well - I think it came from some ancient AMD,
> but it's the default on QEMU TCG as well.
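
(For reference, on the QEMU command line those knobs look like this; these
are upstream property names, exact availability depends on the QEMU version:

  # follow the host's physical address width
  -cpu host,host-phys-bits=true

  # or pin an explicit width, e.g. 40 bits
  -cpu host,phys-bits=40
)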
Yes, and because it affects live migration ability, we have two
constraints:

1) It needs to be exposed in the libvirt domain XML;

2) QEMU and libvirt can't choose a value that works for everybody
   (because neither QEMU or libvirt know where the VM might be
   migrated later).

Which is why the BZ below is important:

> I don't think there's a way to set it in libvirt;
> https://bugzilla.redhat.com/show_bug.cgi?id=1578278 is a bz asking for
> that.
>
> IMHO host-phys-bits is actually pretty safe; and makes most sense in a
> lot of cases.

Yeah, it is mostly safe and makes sense, but messy if you try to
migrate to a host with a different size.

> Dave
> Thanks,
> Laszlo
>
> >
> > -dann
> >
> >>>> For example, to set a 64GB aperture, pass:
> >>>>
> >>>> -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
> >>>>
> >>>> The libvirt domain XML syntax is a bit tricky (and it might "taint" your
> >>>> domain, as it goes outside of the QEMU features that libvirt directly
> >>>> maps to):
> >>>>
> >>>> <domain
> >>>>  type='kvm'
> >>>>  xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> >>>>   <qemu:commandline>
> >>>>     <qemu:arg value='-fw_cfg'/>
> >>>>     <qemu:arg value='opt/ovmf/X-PciMmio64Mb,string=65536'/>
> >>>>   </qemu:commandline>
> >>>> </domain>
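
(One extra tip: after editing, you can check that libvirt kept the wrapper
elements with something like

  virsh dumpxml GUESTNAME | grep -A3 'qemu:commandline'

where GUESTNAME stands for your domain name; if they were silently dropped,
the missing namespace attribute from note (1) below is the usual culprit.)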
> >>>>
> >>>> Some notes:
> >>>>
> >>>> (1) The "xmlns:qemu" namespace definition attribute in the <domain> root
> >>>> element is important. You have to add it manually when you add
> >>>> <qemu:commandline> and <qemu:arg> too. Without the namespace
> >>>> definition, the latter elements will make no sense, and libvirt will
> >>>> delete them immediately.
> >>>>
> >>>> (2) The above change will grow your guest's physical address space to
> >>>> more than 64GB. As a consequence, on your *host*, *if* your physical CPU
> >>>> supports nested paging (called "ept" on Intel and "npt" on AMD), *then*
> >>>> the CPU will have to support at least 37 physical address bits too, for
> >>>> the guest to work. Otherwise, the guest will break, hard.
> >>>>
> >>>> Here's how to verify (on the host):
> >>>>
> >>>> (2a) run "egrep -w 'npt|ept' /proc/cpuinfo" --> if this does not produce
> >>>> output, then stop reading here; things should work. Your CPU does not
> >>>> support nested paging, so KVM will use shadow paging, which is slower,
> >>>> but at least you don't have to care about the CPU's phys address width.
> >>>>
> >>>> (2b) otherwise (i.e. when you do have nested paging), run "grep 'bits
> >>>> physical' /proc/cpuinfo" --> if the physical address width is >=37,
> >>>> you're good.
> >>>>
> >>>> (2c) if you have nested paging but exactly 36 phys address bits, then
> >>>> you'll have to forcibly disable nested paging (assuming you want to run
> >>>> a guest with larger than 64GB guest-phys address space, that is). On
> >>>> Intel, issue:
> >>>>
> >>>> rmmod kvm_intel
> >>>> modprobe kvm_intel ept=N
> >>>>
> >>>> On AMD, go with:
> >>>>
> >>>> rmmod kvm_amd
> >>>> modprobe kvm_amd npt=N
> >>>>
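
(Putting 2a-2c together, a small script along these lines captures the
decision tree; untested, just to illustrate:

  #!/bin/sh
  # Decide whether a >64GB guest-phys address space is safe on this host.
  if ! egrep -qw 'npt|ept' /proc/cpuinfo; then
      echo "no nested paging: KVM falls back to shadow paging, width does not matter"
  elif [ "$(grep -m1 -o '[0-9]\+ bits physical' /proc/cpuinfo | cut -d' ' -f1)" -ge 37 ]; then
      echo "nested paging and >=37 physical address bits: good to go"
  else
      echo "nested paging with only 36 physical address bits: disable ept/npt first"
  fi
)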
> >>>> Hope this helps,
> >>>> Laszlo
> >>>>
> >>>
> >>
> >
>
> --
> Dr. David Alan Gilbert / dgilbert(a)redhat.com / Manchester, UK
--
Eduardo