On 06/08/2016 02:04 PM, Tomasz Flendrich wrote:
Hello everyone,
Let me introduce myself - I'm Tomasz Flendrich and I'm a Google Summer
of Code student from University of Wrocław, Poland.
My goal is to create a generic address allocator (for PCI, virtio,
SCSI, etc.), because the current handling of addresses has its flaws:
sometimes addresses aren't properly assigned or checked for duplicates.
I know you took this sentence straight from the wiki page of GSoC ideas,
but I'd be interested to see a few specific examples of the problems
that you intend to solve (maybe Martin can help here, since I think he's
the one who wrote that).
I know of a few problems with PCI address allocation, but abstracting
the data structure and functions used for allocation isn't going to
solve them. For example, if you attach a device that uses PCI with
"virsh attach-device .... --live", then add another with "virsh
attach-device --live --config", the new device will be assigned a
different address in the persistent config than in the live domain (the
result will be that the next time you shut down and restart, the device
will appear at a different address, which could confuse the guest OS).
Another (which I need to fix *very* soon) is that even when a device is
classified as connecting to PCIe slots rather than PCI (and there aren't
enough of those right now, which is *yet another* problem I need to
fix), if there aren't any empty PCIe slots available to assign, we'll
fail rather than adding a new controller of the appropriate type,
attached to an appropriate existing controller. But, as I said, a new
data structure isn't going to solve any of those problems (and I'm not
sure it will make the solution any easier to get to either).
Additionally, a generic solution will be very useful when more
hypervisors add the feature of explicitly stating the addresses of
devices.
The key goal I want to achieve at this moment is defining a minimum
viable product: one thing that can be implemented quickly, is
reasonably complete on its own, and has potential for further
development in a direction that makes sense.
I came up with an initial design. The allocator's internal data
will be kept in a struct Allocator. There will be one structure
per address type, and all of the allocators will be kept in an array.
We will have the possibility of adding an address pool to the allocator.
For example, when we create a root PCI bus, we tell the allocator that
the following PCI addresses are possible: {0..0}:{0..31}.{0..7},
where {a..b} means a range from a to b, inclusive.
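A minimal sketch of what such a structure might look like (all names
and layout here are illustrative assumptions, not settled API):

    /* Illustrative sketch only - names and layout are assumptions. */
    typedef struct _AllocatorRange {
        unsigned int start;   /* inclusive lower bound */
        unsigned int end;     /* inclusive upper bound */
    } AllocatorRange;

    typedef struct _Allocator {
        size_t nDimensions;      /* e.g. 3 for PCI: bus, slot, function */
        AllocatorRange *pool;    /* one free range per dimension, to start */
        /* ... plus bookkeeping for addresses already reserved ... */
    } Allocator;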
PCI addresses are likely much more complicated than you've planned for
this. For starters, each PCI bus can have a different number of slots
available; the 2nd dimension, which you list as {0..31}, has different
start & end indexes depending on the type of controller that provides
that bus (e.g. pci-root, pcie-root, and pci-bridge have 31 slots from
1-31, dmi-to-pci-bridge and pcie-switch-upstream-port have 32 slots from
0-31, and pcie-root-port and pcie-switch-downstream-port each have a
single slot at 0). And each bus has different rules about what devices
can be plugged into it (e.g. a pcie-root-port can only be plugged into
pcie-root or a pcie-switch-downstream-port, some devices can only be
plugged into pci-root, pci-bridge, or dmi-to-pci-bridge, some can be
plugged into pcie-root directly if specifically requested but it's
otherwise avoided, etc.). So it's not just a case of a 4-dimensional
vector with the same size at each index on the same axis, and not every
address is interchangeable with every other address.
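Summarizing those slot rules as data, just to make the point concrete (a
hedged sketch; the struct and its name are illustrative, and the bounds
are simply the ones listed above):

    /* Illustrative only - per-controller slot ranges from the rules above. */
    typedef struct _BusProfile {
        const char *model;
        unsigned int minSlot;
        unsigned int maxSlot;
    } BusProfile;

    static const BusProfile busProfiles[] = {
        { "pci-root",                    1, 31 },
        { "pcie-root",                   1, 31 },
        { "pci-bridge",                  1, 31 },
        { "dmi-to-pci-bridge",           0, 31 },
        { "pcie-switch-upstream-port",   0, 31 },
        { "pcie-root-port",              0,  0 },
        { "pcie-switch-downstream-port", 0,  0 },
    };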
Are you intending for this allocator to have any sort of "Give me the
next free address" function? If so, doing this for PCI is likely to
require you to add lots of extra complexity that will be completely
unnecessary for the other address types (although... there are several
different kinds of SCSI controller (including virtio-scsi as well as
several emulations of real hardware) that all share the same address
space, and we need to know what model each controller is so that the
user can be sure devices are placed on the desired controller). Of
course maybe your intent is that your allocator will only implement an
"allocate this specific address" function, and leave auto-assignment up
to a higher layer; but then the higher layer would still need a way to
efficiently scan the entire list of free addresses, and each address
would need a bit of data describing the controllers at that address so
that they could be matched.
(Another thing - at any time a new bus of just about any type can be
added, sometimes in response to a request to allocate an address for
which no matching slot is available (e.g. adding a new pcie-root-port,
pcie-switch-upstream-port, and a set of pcie-switch-downstream-ports
when there are no more hotpluggable PCIe slots available)).
This function call could look like this:

    allocatorAddPool(
        allocator[PCI_ADDRESS_TYPE],
        &outputAddress,
        AllocatorRange(0, 0),
        AllocatorRange(0, 31),
        AllocatorRange(0, 7));
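For a call like that to work in C, the function would presumably be
variadic, and C varargs carry no count of their own, so an explicit
dimension count (or a sentinel) has to be supplied somewhere. A hedged
sketch of the shape such a function could take, reusing the
AllocatorRange type from the earlier sketch and dropping outputAddress
for brevity (none of this is settled API):

    #include <stdarg.h>

    /* Illustrative only: register an n-dimensional address pool, one
     * AllocatorRange argument per dimension. */
    int
    allocatorAddPool(Allocator *alloc, size_t nDims, ...)
    {
        va_list ap;
        size_t i;

        va_start(ap, nDims);
        for (i = 0; i < nDims; i++) {
            AllocatorRange r = va_arg(ap, AllocatorRange);
            /* ... store r as the allowed range for dimension i ... */
            (void)r;
        }
        va_end(ap);
        return 0;
    }

Note that this sketch takes an explicit count, which differs slightly
from the call above; that discrepancy is exactly the kind of detail
worth pinning down in the MVP.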
Again, this doesn't allow for some of the indexes on a particular
dimension having a different start/end (or for other differences for a
particular index, which possibly could be stored in some opaque pointer).
The outputAddress would be an array owned by the caller
(only filled in by the allocator).
If we were to allocate IP addresses for computers from the range
192.168.1.2 to 192.168.1.240, where the router has to have
the address 192.168.1.254:
First, we reserve the router's address to let the Allocator know
that it's in use, so that we can track collisions with manually
assigned addresses:
    allocatorReserveAddress(
        allocator[IP_ADDRESS_TYPE],
        &outputAddress,
        AllocatorValue(192),
        AllocatorValue(168),
        AllocatorValue(1),
        AllocatorValue(254));
Then, we assign the addresses for the computers from the allowed range:

    allocatorReserveAddress(
        allocator[IP_ADDRESS_TYPE],
        &outputAddress,
        AllocatorValue(192),
        AllocatorValue(168),
        AllocatorValue(1),
        AllocatorRange(2, 240));

Please note the mix here: AllocatorValue() pins the fixed octets, while
AllocatorRange() marks the part to be allocated from.
There could be a different function call to simply assign any free address:

    allocatorReserveAny(allocator, &outputAddress);
Ah, there it is! This won't work for PCI (or for scsi) without much more
information available for the ranges that are being allocated from.
Let's say that we want an "sda" disk. We could create a wrapper:

    allocatorReserveSCSIAddress(allocator, "sda");

All this wrapper does is represent the string "sda" as an integer
mapping to the letter (e.g. 'a' = 0, 'z' = 25):

    allocatorReserveAddress(allocator[SCSI_ADDRESS_TYPE], &outputAddress, 0);
Disk device names are incredibly ugly beasts. Apparently once you get to
sdz, you can start over at sdaa, sdab, etc. And the whole thing is
pointless anyway, because that name isn't passed directly to the guest
(there isn't a way to do that - the guest OS determines device
names on its own based on the bus and order in which the disks appear).
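For what it's worth, the sdz -> sdaa wraparound is just bijective
base-26, so the mapping is still mechanical. A sketch of a hypothetical
suffix-to-index helper (libvirt already has similar logic in
virDiskNameToIndex):

    /* Map a disk-name suffix ("a".."z", "aa", "ab", ...) to a 0-based
     * index: "a" = 0, "z" = 25, "aa" = 26, and so on. */
    static int
    diskSuffixToIndex(const char *suffix)
    {
        int idx = 0;

        for (; *suffix; suffix++) {
            if (*suffix < 'a' || *suffix > 'z')
                return -1;            /* not a valid suffix */
            idx = idx * 26 + (*suffix - 'a' + 1);
        }
        return idx - 1;
    }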
If an address is already determined, because it was specified in the XML
or it's some specific device that has to be at some specific address,
we still reserve it to let the Allocator know that it's in use.
How would this work internally?
One of the possible solutions is keeping a set of ranges of free
addresses. For example, suppose that we have two PCI buses to use.
Our address pool is stored as one range:

    {0..1} {0..31} {0..7}

Now someone reserves address 0:13.2.
Our free address pool is now stored as this set of ranges:

    {0..0} {0..12}  {0..7},
    {0..0} {13..13} {0..1},
    {0..0} {13..13} {3..7},
    {0..0} {14..31} {0..7},
    {1..1} {0..31}  {0..7}

If we kept every address separately, it would require 2*32*8=512 entries.
...which isn't all that much, since we aren't talking about
microcontrollers with tiny amounts of RAM.
The set data structure from gnulib's gl_oset.h is a candidate for keeping
the ranges in a sorted fashion. Another candidate is simply a sorted list
from gl_list.h.
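As an illustration of the splitting step above, here is a hedged sketch
of removing one value from a single dimension's range (reusing the
AllocatorRange type from the earlier sketch; names are assumptions):

    /* Illustrative only: split the free range [r.start, r.end] around a
     * reserved value v (assumes r.start <= v <= r.end), producing up to
     * two remaining free ranges. */
    static size_t
    rangeSplit(AllocatorRange r, unsigned int v, AllocatorRange out[2])
    {
        size_t n = 0;

        if (v > r.start)
            out[n++] = (AllocatorRange){ r.start, v - 1 };
        if (v < r.end)
            out[n++] = (AllocatorRange){ v + 1, r.end };
        return n;
    }

The multi-dimensional case in the PCI example applies this to one
dimension at a time while holding the earlier dimensions fixed at the
reserved value.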
Without looking at the implementation, I'm concerned that this may be
adding lots of extra code complexity in order to save just a few bytes
of RAM, which realistically is pointless - we only have one pool of PCI
addresses per active domain, so the amount of memory used to store a
bitmap of the availability of addresses is inconsequential in relation
to the total memory requirement for a domain.
This structure would be able to handle all types of addresses that are
convertible to a fixed-length list of integers. We don't mind how many
of these integers there are, because we can use variadic arguments.
It won't allow duplicates if we stick to using it in every place where
we have addresses.
It will also be very generic, with the possibility of writing wrappers
on top of it that might be more convenient for some particular address
type. This way we would keep the qemu_* files as they are for now, and
just replace the existing allocators (and manual assignment of
addresses) with calls to our new allocator.
My ideas:
- A possibility of reserving a range of addresses might be used
to reserve a whole PCI slot at once.
I do see the usefulness of this in general, but in this particular case
rather than reserving an entire slot at once, we may want to allocate a
single function where the device is going to be, but mark the rest of
that slot available for further functions to be allocated, then later
mark it unavailable for further allocations *without actually allocating
the rest of the functions in the slot*. The reason for this is that it
would make it simpler to use the same code for detaching a "lone" device
on a slot as we do for detaching the functions of a multifunction PCI
device. (You're likely thinking "what's the difference between
allocating a single function + marking the other functions on the slot
as unavailable vs. just marking the entire slot as allocated?"; it all
has to do with being able to easily notice when the slot is once again
completely available - if you mark all the functions in the slot as
used, then when you free the device you have to free all the functions;
if you're doing that, then you have to figure out whether the device
you're freeing was one function of a multifunction device, or if it was
a lone device plugged into a slot by itself. Of course that may not fit
in with your abstracted view of generic address allocation.) Right now
we "ham fist" this issue by just marking the entire slot as allocated
when someone asks for a single function, but really we should do
allocations (and frees) one function at a time, just with knowledge of
when it's appropriate to allocate a 2nd (and further) function on a slot
that already has some functions allocated, and when it isn't. See the
sketch below for one way to represent this.
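A hedged sketch of per-function state that could capture that
distinction (the enum values and struct are illustrative assumptions,
not anything that exists today):

    /* Illustrative only: per-function state for one PCI slot. */
    typedef enum {
        FUNCTION_FREE,      /* available for allocation */
        FUNCTION_IN_USE,    /* actually allocated to a device */
        FUNCTION_BLOCKED,   /* not allocated, but closed to further use */
    } FunctionState;

    typedef struct _SlotState {
        FunctionState functions[8];   /* PCI functions 0-7 */
    } SlotState;

Freeing a lone device would then reset its single IN_USE function and
clear any BLOCKED marks, making the whole slot visibly free again
without having to guess whether the other functions were real devices.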
- Flags could be added, changing the behavior of particular
addresses or the whole allocator.
My questions:
- Is this approach good long-term?
Again, my concern would be about making something more complex than
necessary for the individual uses and their requirements. Eliminating
nearly duplicate code is good, but you need to be careful that what you
create doesn't require more "glue" to make each of the cases fit into
the abstract model than would have been used for a separate simpler
"from-scratch" implementation for each individual case.
- Is address releasing expected to be required at any point?
(think device hotplug/unplug, etc.)
Yes. And we need to be aware that sometimes devices are added to the
live domain but not the persistent config, sometimes only to the
persistent config, and sometimes to both (and when they're added to
both, they should use the same address in both live domain and
persistent config, which isn't always the case today).
- What other features are needed?
You had said you wanted the qemu_* files to remain the same for now.
Exactly what are you figuring on replacing? For example, for PCI
addresses (since these are obviously the only ones I care about :-)) are
you just planning to replace virDomainPCIAddressBus with your new
virAllocator and a set of accessor functions? Even if you do nothing
higher level than this, each bus will need to have a "connectFlags" that
describes what type of devices can be connected to that bus (or range,
or whatever you call it).
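For reference, a hedged sketch of the kind of connect-flags bitmask
meant here (modeled loosely on libvirt's existing PCI address handling;
the exact names and the helper are assumptions for illustration):

    #include <stdbool.h>

    /* Illustrative only: what kinds of device/bus pairings are legal. */
    typedef enum {
        CONNECT_HOTPLUGGABLE = 1 << 0,  /* devices can be hotplugged here */
        CONNECT_TYPE_PCI     = 1 << 1,  /* accepts conventional PCI devices */
        CONNECT_TYPE_PCIE    = 1 << 2,  /* accepts PCI Express devices */
    } ConnectFlags;

    /* A device fits on a bus if their type flags intersect. */
    static bool
    deviceFitsBus(ConnectFlags busFlags, ConnectFlags devFlags)
    {
        return (busFlags & devFlags &
                (CONNECT_TYPE_PCI | CONNECT_TYPE_PCIE)) != 0;
    }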
Please speak your mind if you have remarks or thoughts about the idea.
I'd really appreciate it.
Hopefully my comments haven't just made the situation more muddy :-)