On Wed, 2013-02-06 at 14:13 -0500, Laine Stump wrote:
Now that qemu is getting the q35 machine type, libvirt needs to
support it.
As far as I understand, from libvirt's point of view, q35 is just
another x86_64 system, but with a different set of implicit devices,
and possibly some extra rules limiting which devices can be plugged
into which bus/slot on the guest. That means that in order to support
it, we will need to recognize when a q35-based machine type is being
created, and auto-add all the implicit devices to libvirt's config
model for that domain, then pay attention to the extra rules when
assigning addresses for all the user-added devices.
We already add implicit controllers/devices for pc-based machine
types; as a matter of fact, currently, libvirt improperly assumes (for
the purposes of adding implicit devices) that *every* virtual machine
is based on the "pc" machine type (or rather it just doesn't pay
attention), so it always adds all the implicit devices for a pc
machine type for every domain. This of course is already incorrect for
many (probably all?) non-x86 machine types, even before we add q35
into the mix. To fix this, it might be reasonable (and arguably, it's
necessary to fix the problem in a backward-compatible manner) to just
set up a table of machinetype ==> implicit device lists, look up the
machine type in this table, and add the devices needed for that
machine type. This goes against libvirt's longstanding view of
machinetype as being an opaque value that it merely passes through to
qemu, but it's manageable for the existing machine types (even
including q35), since it's a finite set. But it starts to be a pain to
maintain when you think about future additions - yet another case
where new functionality in qemu will require an update to libvirt
before it can be fully used via libvirt.
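A minimal sketch of the table-driven lookup described above, written in Python for brevity rather than libvirt's C. The device kinds, model names, and the family-matching rule are illustrative assumptions, not libvirt's or qemu's actual lists:

```python
from typing import Dict, List, NamedTuple


class ImplicitDevice(NamedTuple):
    kind: str   # illustrative device class, e.g. "pci-root" or "sata"
    model: str  # illustrative model name, an assumption for this sketch


# Keyed by machine-type family. The entries are placeholders for
# illustration only; the real lists would have to come from qemu.
IMPLICIT_DEVICES: Dict[str, List[ImplicitDevice]] = {
    "pc": [
        ImplicitDevice("pci-root", "pci.0"),
        ImplicitDevice("ide", "piix3-ide"),
        ImplicitDevice("usb", "piix3-uhci"),
    ],
    "q35": [
        ImplicitDevice("pcie-root", "pcie.0"),
        ImplicitDevice("sata", "ich9-ahci"),
    ],
}


def machine_family(machine_type: str) -> str:
    """Reduce a versioned name to a family (assumed naming convention)."""
    if "q35" in machine_type:
        return "q35"
    if machine_type == "pc" or machine_type.startswith("pc-"):
        return "pc"
    return machine_type


def implicit_devices_for(machine_type: str) -> List[ImplicitDevice]:
    """Return the implicit devices for a machine type; unknown types get none."""
    return list(IMPLICIT_DEVICES.get(machine_family(machine_type), []))
```

The point of the sketch is that an unknown or non-x86 machine type falls through to an empty list instead of silently inheriting the "pc" devices.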
In the long term, it would be very useful / more easily maintainable
to have a qemu status command available via QMP which would return the
list of implicit devices (and their PCI addresses) for any requested
machine type. It would be necessary that this command be callable
multiple times within a single execution of qemu, giving it a
different machinetype each time. This way libvirt could first query
the list of available machinetypes in this particular qemu binary,
then request the list of implicit devices for each machine type
(libvirt runs each available qemu binary *once* the first time it's
requested, and caches all such capabilities information so that it
doesn't need to re-run qemu again and again). My limited understanding
of qemu's code is that qemu itself doesn't have a table of this
information as data, but instead has lines of code that are executed
to create it, thus making it impractical to provide the list of
devices for a machinetype without actually instantiating a machine of
that type. What's the feasibility of adding such a capability (and in
the process likely making the list of implicit devices in qemu itself
table/data driven rather than constructed with lines of code)?
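The query-once-and-cache flow described above could look roughly like the following Python sketch. Both callables are hypothetical stand-ins: `list_machines` for whatever enumerates machine types in a given binary, and `query_devices` for the proposed (not yet existing) implicit-device query. Neither is a real libvirt or qemu API:

```python
from typing import Callable, Dict, List

# Per-binary cache: {qemu binary path: {machine type: [device descriptions]}}
_CAPS_CACHE: Dict[str, Dict[str, List[str]]] = {}


def cached_implicit_devices(
    binary: str,
    list_machines: Callable[[], List[str]],
    query_devices: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Query every machine type once per binary, then serve from the cache.

    list_machines and query_devices are hypothetical callables supplied by
    the caller; they are only invoked the first time a binary is seen.
    """
    if binary not in _CAPS_CACHE:
        _CAPS_CACHE[binary] = {
            machine: query_devices(machine) for machine in list_machines()
        }
    return _CAPS_CACHE[binary]
```

This only works if the query can be issued repeatedly, with a different machine type each time, inside a single run of the binary, which is exactly the requirement stated above.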
More questions:
1) It seems that the exact list of devices for the basic q35 machine
type hasn't been settled on yet, is that correct?
I think what we have currently is just a stepping stone to a base
configuration. At a minimum, we're missing the PCI bridge attached to
the ICH, which is where I think libvirt should attach non-chipset
component devices. Next would be PCIe root ports where emulated and
assigned PCIe devices could be attached.
2) Are there other issues aside from implicit controller devices I
need to consider for q35? For example, are there any devices that (as
I recall is the case for some devices on "pc") may or may not be
present, but if they are present they are always at a particular PCI
address (meaning that address must be reserved)? I've also just
learned that certain types of PCIe devices must be plugged into
certain locations on the guest bus? ("root complex" devices - is there
a good source of background info to learn the meaning of terms like
that, and the rules of engagement? libvirt will need to know/follow
these rules.)
The GMCH (Graphics & Memory Controller Hub) defines:
00.0 - Host bridge
01.0 - x16 root port for external graphics
02.0,1 - integrated graphics device (IGD)
03.0,1,2,3 - management engine subsystem
And the ICH defines:
19.0 - Embedded ethernet (e1000e)
1a.* - UHCI/EHCI
1b.0 - HDA audio
1c.* - PCIe root ports
1d.* - UHCI/EHCI
1e.0 - PCI Bridge
1f.0 - ISA Bridge
1f.2,5 - SATA
1f.3 - SMBUS
Personally, I think these slots should be reserved for only the spec
defined devices, and I'm not all that keen on using the remaining slots
for anything else. Users should of course be allowed to put anything
anywhere, but libvirt auto-placement should follow some rules.
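To make the reservation rule concrete, here is a small Python sketch of how the slot list above might be encoded. The function name and the `user_specified` flag are assumptions for illustration; it models only the "reserved for spec defined devices" part, not the further preference to keep auto-placed devices off the remaining slots as well:

```python
# Slots on the root bus claimed by the chipset, copied from the list above.
RESERVED_SLOTS = {
    0x00: "host bridge",
    0x01: "x16 root port for external graphics",
    0x02: "integrated graphics device (IGD)",
    0x03: "management engine subsystem",
    0x19: "embedded ethernet (e1000e)",
    0x1A: "UHCI/EHCI",
    0x1B: "HDA audio",
    0x1C: "PCIe root ports",
    0x1D: "UHCI/EHCI",
    0x1E: "PCI bridge",
    0x1F: "ISA bridge, SATA, SMBUS",
}


def placement_allowed(slot: int, user_specified: bool) -> bool:
    """Decide whether a device may occupy `slot` on the root bus.

    Explicit user addresses are always honoured; auto-placement must
    stay out of the chipset-reserved slots.
    """
    if not 0 <= slot <= 0x1F:
        raise ValueError("PCI slot numbers range from 0x00 to 0x1f")
    if user_specified:
        return True
    return slot not in RESERVED_SLOTS
```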
All of the above sit on what we now call bus pcie.0. This is a root
complex, which implies that all of the endpoints are root complex integrated
endpoints. Being an integrated endpoint restricts aspects of the
device. I've already found out the hard way that Windows actually cares
about this and will ignore PCI assigned devices of type "Endpoint" when
attached to the root complex bus. (Endpoint, root complex, etc. are
defined in the PCIe spec; the above slot use is defined in the
respective chipset spec.)
What I'd like to see is to implement the PCI-bridge at 1e.0 to expose a
complete, virgin PCI bus. libvirt should use that as the default
location for any PCI device that's not a chipset component. We might be
able to get away with installing our e1000 at 19.0, but otherwise I'm
thinking that the list only includes uhci/ehci, hda, ahci, and the
chipset components themselves (smbus, isa, root ports, etc...). We
don't have "IGD", so our graphics should go on the PCI bus and the PCI
bridge should include functioning VGA enable bits. Maybe QXL wants to
make itself a PCIe device, in which case it should be attached behind a
PCIe root port at slot 01.0. Secondary PCIe graphics attach to root
ports behind 1c.*. This is the same framework within which real hardware
has to work.
Assigned devices get interesting due to the PCIe type. We've never had
any problems attaching PCIe devices to PCI buses on PIIX (but it may be
holding back our ability to support graphics passthrough), so assigned
devices can probably be attached to the PCI bus. More appropriate would
be to attach "Endpoints" behind root ports and "Integrated Endpoints" to
the root complex. I've got some code that will mangle the PCIe type to
match its location in the topology, but it needs more work. That should help
make things more flexible.
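As a rough Python sketch of the placement preference just described (the enum and the returned labels are invented for illustration; this is not the type-mangling code mentioned above):

```python
from enum import Enum


class PcieType(Enum):
    CONVENTIONAL_PCI = "pci"
    ENDPOINT = "endpoint"
    INTEGRATED_ENDPOINT = "integrated-endpoint"


def preferred_attachment(dev_type: PcieType) -> str:
    """Pick where a device would ideally be attached, by its PCIe type."""
    if dev_type is PcieType.ENDPOINT:
        # Endpoints belong behind a root port.
        return "root-port"
    if dev_type is PcieType.INTEGRATED_ENDPOINT:
        # Integrated endpoints sit directly on the root complex.
        return "root-complex"
    # Everything else goes on the PCI bus behind the PCI bridge.
    return "pci-bridge"
```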
3) What new types of devices/controllers must be supported for a
properly functioning q35 machine?
AHCI, bridges, root ports (we can skip these w/o PCIe devices, but for
hotplug we might want them fully populated - otherwise everything gets
hotplugged to the PCI bus). Thanks,
Alex