(+Jiri, +libvir-list)
On Fri, Nov 22, 2019 at 04:58:25PM +0000, Dr. David Alan Gilbert wrote:
> * Laszlo Ersek (lersek(a)redhat.com) wrote:
> (+Dave, +Eduardo)
>
> On 11/22/19 00:00, dann frazier wrote:
> > On Tue, Nov 19, 2019 at 06:06:15AM +0100, Laszlo Ersek wrote:
> >> On 11/19/19 01:54, dann frazier wrote:
> >>> On Fri, Nov 15, 2019 at 11:51:18PM +0100, Laszlo Ersek wrote:
> >>>> On 11/15/19 19:56, dann frazier wrote:
> >>>>> Hi,
> >>>>> I'm trying to passthrough an Nvidia GPU to a q35 KVM guest, but UEFI
> >>>>> is failing to allocate resources for it. I have no issues if I boot w/
> >>>>> a legacy BIOS, and it works fine if I tell the linux guest to do the
> >>>>> allocation itself - but I'm looking for a way to make this work w/
> >>>>> OVMF by default.
> >>>>>
> >>>>> I posted a debug log here:
> >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563/+attachment/5...
> >>>>>
> >>>>> Linux guest lspci output is also available for both seabios/OVMF boots here:
> >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563
> >>>>
> >>>> By default, OVMF exposes such a 64-bit MMIO aperture for PCI MMIO BAR
> >>>> allocation that is 32GB in size. The generic PciBusDxe driver collects,
> >>>> orders, and assigns / allocates the MMIO BARs, but it can work only out
> >>>> of the aperture that platform code advertizes.
> >>>>
> >>>> Your GPU's region 1 is itself 32GB in size. Given that there are further
> >>>> PCI devices in the system with further 64-bit MMIO BARs, the default
> >>>> aperture cannot accommodate everything. In such an event, PciBusDxe
> >>>> avoids assigning the largest BARs (to my knowledge), in order to
> >>>> conserve the most aperture possible, for other devices -- hence break
> >>>> the fewest possible PCI devices.
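
(A side note from me: if you want to confirm from inside a Linux guest which
BARs were left unassigned, something along these lines should show it; the
01:00.0 address below is only a placeholder for the GPU's slot:

  # list the GPU's BARs as the kernel sees them
  lspci -vv -s 01:00.0 | grep -i region

  # kernel complaints about BARs it could not place; the exact wording
  # differs between kernel versions
  dmesg | grep -iE 'BAR.*(no space|failed to assign)'
)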
> >>>>
> >>>> You can control the aperture size from the QEMU command line. You can
> >>>> also do it from the libvirt domain XML, technically speaking. The knob
> >>>> is experimental, so no stability or compatibility guarantees are made.
> >>>> (That's also the reason why it's a bit of a hack in the libvirt domain XML.)
> >>>>
> >>>> The QEMU cmdline option is described in the following edk2 commit message:
> >>>>
> >>>> https://github.com/tianocore/edk2/commit/7e5b1b670c38
> >>>
> >>> Hi Laszlo,
> >>>
> >>> Thanks for taking the time to describe this in detail! The -fw_cfg
> >>> option did avoid the problem for me.
> >>
> >> Good to hear, thanks.
> >>
> >>> I also noticed that the above
> >>> commit message mentions the existence of a 24GB card as a reasoning
> >>> behind choosing the 32GB default aperture. From what you say below, I
> >>> understand that bumping this above 64GB could break hosts w/ <= 37
> >>> physical address bits.
> >>
> >> Right.
> >>
> >>> What would be the downside of bumping the
> >>> default aperture to, say, 48GB?
> >>
> >> The placement of the aperture is not trivial (please see the code
> >> comments in the linked commit). The base address of the aperture is
> >> chosen so that the largest BAR that can fit in the aperture may be
> >> naturally aligned. (BARs are whole powers of two.)
> >>
> >> The largest BAR that can fit in a 48 GB aperture is 32 GB. Therefore
> >> such an aperture would be aligned at 32 GB -- the lowest base address
> >> (dependent on guest RAM size) would be 32 GB. Meaning that the aperture
> >> would end at 32 + 48 = 80 GB. That still breaches the 36-bit phys
> >> address width.
> >>
> >> 32 GB is the largest aperture size that can work with 36-bit phys
> >> address width; that's the aperture that ends at 64 GB exactly.
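
(Spelling that arithmetic out, as a quick back-of-the-envelope in shell:

  aperture_gb=48
  align_gb=32                         # largest power of two that fits in 48 GB
  base_gb=$align_gb                   # lowest possible base, ignoring guest RAM
  end_gb=$(( base_gb + aperture_gb ))
  echo "aperture ends at ${end_gb} GB"
  # prints 80 GB, which is above the 64 GB reachable with 36 phys address bits
)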
> >
> > Thanks, yeah - now that I read the code comments that is clear (as
> > clear as it can be w/ my low level of base knowledge). In the commit you
> > mention Gerd (CC'd) had suggested a heuristic-based approach for
> > sizing the aperture. When you say "PCPU address width" - is that a
> > function of the available physical bits?
>
> "PCPU address width" is not a "function" of the available
physical bits
> -- it *is* the available physical bits. "PCPU" simply stands for
> "physical CPU".
>
> > IOW, would that approach
> > allow OVMF to automatically grow the aperture to the max ^2 supported
> > by the host CPU?
>
> Maybe.
>
> The current logic in OVMF works from the guest-physical address space
> size -- as deduced from multiple factors, such as the 64-bit MMIO
> aperture size, and others -- towards the guest-CPU (aka VCPU) address
> width. The VCPU address width is important for a bunch of other purposes
> in the firmware, so OVMF has to calculate it no matter what.
>
> Again, the current logic is to calculate the highest guest-physical
> address, and then deduce the VCPU address width from that (and then
> expose it to the rest of the firmware).
>
> Your suggestion would require passing the PCPU (physical CPU) address
> width from QEMU/KVM into the guest, and reversing the direction of the
> calculation. The PCPU address width would determine the VCPU address
> width directly, and then the 64-bit PCI MMIO aperture would be
> calculated from that.
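
(If it helps to visualize the direction, the current calculation goes roughly
like this, sketched as shell arithmetic with a made-up example value; the
highest guest-physical address goes in, the VCPU address width comes out:

  highest_gpa=$(( 64 * 1024 * 1024 * 1024 - 1 ))   # e.g. an aperture ending at 64 GB
  width=0; addr=$highest_gpa
  while [ "$addr" -gt 0 ]; do addr=$(( addr >> 1 )); width=$(( width + 1 )); done
  echo "VCPU address width: $width bits"           # 36

Your suggestion would run it the other way around: start from the host's
width and derive the aperture from that.)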
>
> However, there are two caveats.
>
> (1) The larger your guest-phys address space (as exposed through the
> VCPU address width to the rest of the firmware), the more guest RAM you
> need for page tables. Because, just before entering the DXE phase, the
> firmware builds 1:1 mapping page tables for the entire guest-phys
> address space. This is necessary e.g. so you can access any PCI MMIO BAR.
>
> Now consider that you have a huge beefy virtualization host with say 46
> phys address bits, and a wimpy guest with say 1.5GB of guest RAM. Do you
> absolutely want tens of *terabytes* for your 64-bit PCI MMIO aperture?
> Do you really want to pay for the necessary page tables with that meager
> guest RAM?
>
> (Such machines do exist BTW, for example:
>
> http://mid.mail-archive.com/9BD73EA91F8E404F851CF3F519B14AA8036C67B5@DGGE...
> )
>
> In other words, you'd need some kind of knob anyway, because otherwise
> your aperture could grow too *large*.
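
(Rough numbers, assuming the 1:1 mapping is built from 2 MiB pages; the
firmware can use 1 GiB pages where the CPU supports them, which shrinks this
considerably, but as a ballpark:

  phys_bits=46
  space=$(( 1 << phys_bits ))                   # 64 TiB of guest-phys space
  entries=$(( space / (2 * 1024 * 1024) ))      # one page-table entry per 2 MiB
  echo "$(( entries * 8 / 1024 / 1024 )) MiB"   # 256 MiB of page tables

which is a lot to carve out of a 1.5GB guest.)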
>
>
> (2) Exposing the PCPU address width to the guest may have nasty
> consequences at the QEMU/KVM level, regardless of guest firmware. For
> example, that kind of "guest enlightenment" could interfere with migration.
>
> If you boot a guest let's say with 16GB of RAM, and tell it "hey friend,
> have 40 bits of phys address width!", then you'll have a difficult time
> migrating that guest to a host with a CPU that only has 36-bits wide
> physical addresses -- even if the destination host has plenty of RAM
> otherwise, such as a full 64GB.
>
> There could be other QEMU/KVM / libvirt issues that I'm unaware of
> (hence the CC to Dave and Eduardo).
> host physical address width gets messy. There are differences as well
> between upstream qemu behaviour, and some downstreams.
> I think the story is that:
>   a) Qemu default: 40 bits on any host
>   b) -cpu blah,host-phys-bits=true to follow the host.
>   c) RHEL has host-phys-bits=true by default
>
> As you say, the only real problem with host-phys-bits is migration -
> between say an E3 and an E5 xeon with different widths. The magic 40's
> is generally wrong as well - I think it came from some ancient AMD,
> but it's the default on QEMU TCG as well.
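
(For reference, on the QEMU command line those knobs look like this; these
are upstream property names, exact availability depends on the QEMU version:

  # follow the host's physical address width
  -cpu host,host-phys-bits=true

  # or pin an explicit width, e.g. 40 bits
  -cpu host,phys-bits=40
)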
Yes, and because it affects live migration ability, we have two
constraints:

1) It needs to be exposed in the libvirt domain XML;

2) QEMU and libvirt can't choose a value that works for everybody
   (because neither QEMU or libvirt know where the VM might be
   migrated later).

Which is why the BZ below is important:

> I don't think there's a way to set it in libvirt;
> https://bugzilla.redhat.com/show_bug.cgi?id=1578278 is a bz asking for
> that.
>
> IMHO host-phys-bits is actually pretty safe; and makes most sense in a
> lot of cases.

Yeah, it is mostly safe and makes sense, but messy if you try to
migrate to a host with a different size.

> Dave
> Thanks,
> Laszlo
>
> >
> > -dann
> >
> >>>> For example, to set a 64GB aperture, pass:
> >>>>
> >>>> -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
> >>>>
> >>>> The libvirt domain XML syntax is a bit tricky (and it might "taint" your
> >>>> domain, as it goes outside of the QEMU features that libvirt directly
> >>>> maps to):
> >>>>
> >>>> <domain
> >>>>  type='kvm'
> >>>>  xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> >>>>   <qemu:commandline>
> >>>>     <qemu:arg value='-fw_cfg'/>
> >>>>     <qemu:arg value='opt/ovmf/X-PciMmio64Mb,string=65536'/>
> >>>>   </qemu:commandline>
> >>>> </domain>
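
(One extra tip: after editing, you can check that libvirt kept the wrapper
elements with something like

  virsh dumpxml GUESTNAME | grep -A3 'qemu:commandline'

where GUESTNAME stands for your domain name; if they were silently dropped,
the missing namespace attribute from note (1) below is the usual culprit.)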
> >>>>
> >>>> Some notes:
> >>>>
> >>>> (1) The "xmlns:qemu" namespace definition attribute in the <domain> root
> >>>> element is important. You have to add it manually when you add
> >>>> <qemu:commandline> and <qemu:arg> too. Without the namespace
> >>>> definition, the latter elements will make no sense, and libvirt will
> >>>> delete them immediately.
> >>>>
> >>>> (2) The above change will grow your guest's physical address space to
> >>>> more than 64GB. As a consequence, on your *host*, *if* your physical CPU
> >>>> supports nested paging (called "ept" on Intel and "npt" on AMD), *then*
> >>>> the CPU will have to support at least 37 physical address bits too, for
> >>>> the guest to work. Otherwise, the guest will break, hard.
> >>>>
> >>>> Here's how to verify (on the host):
> >>>>
> >>>> (2a) run "egrep -w 'npt|ept' /proc/cpuinfo" --> if this does not produce
> >>>> output, then stop reading here; things should work. Your CPU does not
> >>>> support nested paging, so KVM will use shadow paging, which is slower,
> >>>> but at least you don't have to care about the CPU's phys address width.
> >>>>
> >>>> (2b) otherwise (i.e. when you do have nested paging), run "grep 'bits
> >>>> physical' /proc/cpuinfo" --> if the physical address width is >=37,
> >>>> you're good.
> >>>>
> >>>> (2c) if you have nested paging but exactly 36 phys address bits, then
> >>>> you'll have to forcibly disable nested paging (assuming you want to run
> >>>> a guest with larger than 64GB guest-phys address space, that is). On
> >>>> Intel, issue:
> >>>>
> >>>> rmmod kvm_intel
> >>>> modprobe kvm_intel ept=N
> >>>>
> >>>> On AMD, go with:
> >>>>
> >>>> rmmod kvm_amd
> >>>> modprobe kvm_amd npt=N
> >>>>
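
(Putting 2a-2c together, a small script along these lines captures the
decision tree; untested, just to illustrate:

  #!/bin/sh
  # Decide whether a >64GB guest-phys address space is safe on this host.
  if ! egrep -qw 'npt|ept' /proc/cpuinfo; then
      echo "no nested paging: KVM falls back to shadow paging, width does not matter"
  elif [ "$(grep -m1 -o '[0-9]\+ bits physical' /proc/cpuinfo | cut -d' ' -f1)" -ge 37 ]; then
      echo "nested paging and >=37 physical address bits: good to go"
  else
      echo "nested paging with only 36 physical address bits: disable ept/npt first"
  fi
)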
> >>>> Hope this helps,
> >>>> Laszlo
> >>>>
> >>>
> >>
> >
>
> --
> Dr. David Alan Gilbert / dgilbert(a)redhat.com / Manchester, UK
--
Eduardo