On Mon, Jun 13, 2016 at 04:44:53AM +0200, Tomasz Flendrich wrote:
Thank you for your input, I really appreciate it.
My goal was to define a data structure with a minimal set of features
that would make sense and could be further developed.
> I know of a few problems with PCI address allocation, but abstracting
> the data structure and functions used for allocation aren't going to
> solve them (for example, if you attach a device that uses PCI with
> "virsh attach-device .... --live", then then add another with "virsh
> attach-device --live --config", the new device will be assigned a
> different address in the persistent config than in the live domain
> (the result will be that the next time you shutdown and restart, the
> device will appear at a different address, which could confuse the
> guest OS). Another (which I need to fix *very* soon) is that even
> when a device is classified as connecting to PCIe slots rather than
> PCI (and there aren't enough of those right now, which is *yet
> another* problem I need to fix), if there aren't any empty PCIe slots
> available to assign, we'll fail rather than adding a new controller
> of the appropriate type and attached to the appropriate existing
> controller). But, as I said, a new data structure isn't going to
> solve any of those problems (and I'm not sure it will make the
> solution any easier to get to either).
Laine probably missed my mail since it was sent as a reply:
https://www.redhat.com/archives/libvir-list/2016-June/msg00677.html
Funnily enough, the first problem will not be dealt with at first,
but the new structure will make the solution clearer (though not
necessarily much easier at first sight).
We are now working with only one set of addresses and we don't
distinguish between live and config scenarios. Part of what Tomasz is
doing is moving the address sets from QEMU's private data for the
domain into a non-QEMU object (e.g. virDomainDef). That way each
definition will have its own address allocation data, which we don't
have now. Even though for now we could just allocate one address and,
if both _CONFIG and _LIVE flags are supplied, give it to both, in the
future we would be able to create a function that allocates an
address that is free in both sets.
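As a sketch (the function name is made up for illustration, not an
existing API), such a function could look like this:

  /* Hypothetical helper once each definition carries its own
   * address set: pick an address that is free in both the live and
   * persistent sets and reserve it in both, so that a device
   * attached with --live --config ends up at the same address in
   * both definitions. */
  int allocatorReserveInBoth(Allocator *liveSet,
                             Allocator *configSet,
                             AllocatorAddress *addr);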
But more importantly, Tomasz's mail only drafted one of the first
ideas. What I suggested aiming at from the start are the less complex
address types that we have. I'd like to leave PCI for the finish line.
To elaborate on what I wanted to achieve with the project; there are
two major problems in how we deal with address allocation. One of
them is that all the code is concentrated in the drivers. If multiple
drivers want to do some interesting things, it is coded in every
single one of them. Of course each one might want to work a bit
differently, but most of the code can be shared and the rest can
either be left in the driver itself or just changed a bit using
function arguments. The second problem is that we only deal with
addresses that someone remembers to add the allocation code for. We
have many types of addresses, yet we were properly allocating only
one of them two years ago. Now it's three and each one of them has
its own code. And that's only in QEMU. The idea here is that adding
the next address type to be handled would be just a matter of
assigning a function into some structure or array, as sketched below.
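As an illustration (all names here are hypothetical, not existing
libvirt API), that structure could be a table of per-type callbacks:

  /* Hypothetical table of assignment callbacks, one per address
   * type; handling a new type means filling in one more slot. */
  typedef int (*virAddressAssignFunc)(virDomainDefPtr def,
                                      virDomainDeviceDefPtr dev);

  static virAddressAssignFunc assignFuncs[ADDRESS_TYPE_LAST] = {
      [PCI_ADDRESS] = assignPCIAddress,       /* assumed names */
      [CCW_ADDRESS] = assignCCWAddress,
      [VIRTIO_SERIAL_ADDRESS] = assignVirtioSerialAddress,
  };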
That’s right, the structure itself won’t solve these. Using it might
make address assignment more convenient though, because you will be
using a uniform interface for dealing with all addresses:
1. There will be no need to use virDomainCCWAddressAsString(), which
only exists because the addresses are kept in a hash table keyed by
strings. You wouldn’t have to know whether there is a hash table at
all.
2. You also wouldn’t have to iterate through the controllers every
time you want to assign a single address, which is what happens in
virDomainVirtioSerialAddrNext(). You would just iterate through all
the controllers once and add the addresses they supply to the pool of
available addresses. Then, if you need an address for a device, you
just ask for one and the allocator deals with it.
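As a sketch (assuming the hypothetical API from the examples further
down, plus an assumed helper that extracts a controller’s addresses):

  /* Fill the pool once by walking the controllers... */
  for (i = 0; i < def->ncontrollers; i++)
      allocatorAddAddressesFromController(allocator,
                                          def->controllers[i]);

  /* ...then hand out addresses on demand. */
  if (allocatorReserveAny(allocator[VIRTIO_SERIAL_ADDRESS], &addr) < 0)
      goto error;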
> PCI addresses are likely much more complicated than you've planned
> for this.
I spent a lot of time reading about PCI addresses in libvirt and I
am somewhat familiar with their complexity. I was trying to define a
minimum viable product, so this data structure was stripped of
features. Let me elaborate below on how it could be expanded.
That's one of the reasons I wanted to start with something less than PCI
addresses.
> For starters, each PCI bus can have a different number of slots
> available: the 2nd dimension, which you list as {0..31}, has
> different start & end indexes depending on the type of controller
> that provides that bus (e.g. pci-root, pcie-root, and pci-bridge
> have 31 slots from 1-31, dmi-to-pci-bridge and
> pcie-switch-upstream-port have 32 slots from 0-31, pcie-root-port
> and pcie-switch-downstream-port each have a single slot at 0),
Thanks for the explanation. I believe there will be even more stuff to
come, right? ;-)
I might not have explained sufficiently how I see this data
structure being used. It would keep *all* the available addresses of
a given type, e.g. one allocator for all the PCI addresses.
Initially, the pool of possible addresses is empty. Then, you can
iterate through the buses, and whenever you encounter a PCI bus,
insert the addresses it can provide (slots and functions) into the
allocator. Let’s say we want to do it with a pcie-root-port whose
index is 0 and which has a single slot at 0 with 8 functions (0-7):
allocatorAddPool(allocator[PCIE_ADDRESS], AllocatorValue(0),
                 AllocatorValue(0), AllocatorRange(0, 7));
If you decide that your machine also has a pci-root with 31 slots
from 1 to 31 (each with 8 functions) and its index is 0, do this:
allocatorAddPool(allocator[PCI_ADDRESS], AllocatorValue(0),
                 AllocatorRange(1, 31), AllocatorRange(0, 7));
If you also add (on index 5) a new type of PCI bus that has 32 slots
(from 0 to 31), where the first 16 slots have 8 functions but the
remaining slots have only 3, you can use the allocatorAddPool()
function twice:
allocatorAddPool(allocator[PCI_ADDRESS], AllocatorValue(5),
                 AllocatorRange(0, 15), AllocatorRange(0, 7));
allocatorAddPool(allocator[PCI_ADDRESS], AllocatorValue(5),
                 AllocatorRange(16, 31), AllocatorRange(0, 2));
After you do all of these on a single struct Allocator, it knows the
addresses provided by all of the buses and you can start asking it
for addresses. You can iterate through all the devices and request a
PCI address for each of them.
> and each bus has different rules about what devices can be plugged
> into them (e.g. a pcie-root-port can only be plugged into pcie-root
> or a pcie-switch-downstream-port, some devices can only be plugged
> into pci-root, pci-bridge, or dmi-to-pci-bridge, some can be plugged
> into pcie-root directly if specifically requested, but it's otherwise
> avoided, etc). So it's not just a case of a 4-dimensional vector with
> the same size at each index on the same axis, and each address is
> not the same as any other address.
This could be configured by flags. Every address could have some
flags set for it, and when reserving an address, the caller could
specify that they’re looking for addresses with a specific flag, as
sketched below.
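A minimal sketch (the flag names and the *WithFlags variants are
assumptions, not an existing API):

  /* Hypothetical per-address flags... */
  enum {
      ALLOCATOR_FLAG_PCI          = 1 << 0,
      ALLOCATOR_FLAG_PCIE         = 1 << 1,
      ALLOCATOR_FLAG_HOTPLUGGABLE = 1 << 2,
  };

  /* ...attached when a pool is added... */
  allocatorAddPoolWithFlags(allocator[PCI_ADDRESS], AllocatorValue(0),
                            AllocatorRange(1, 31), AllocatorRange(0, 7),
                            ALLOCATOR_FLAG_PCI | ALLOCATOR_FLAG_HOTPLUGGABLE);

  /* ...and required when reserving an address. */
  allocatorReserveAnyWithFlags(allocator[PCI_ADDRESS], &addr,
                               ALLOCATOR_FLAG_HOTPLUGGABLE);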
> Are you intending for this allocator to have any sort of "Give me the
> next free address" function?
Yes, I intend to do that. If I do it for this generic structure, it
will be done for every other address type too. All the address types
have some common features, so it’s better to have the code in one
place.
This should be more like "give this device in this domain definition
an address and make sure these conditions are met", however it might
also end up being part of "make sure all addresses in this
virDomainDef are assigned", as sketched below.
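As a sketch (the names are made up for illustration):

  /* Hypothetical entry points: per-device assignment honouring the
   * device's constraints, and a whole-definition pass over all
   * devices that still lack an address. */
  int virDomainDeviceAddressEnsure(virDomainDefPtr def,
                                   virDomainDeviceDefPtr dev);
  int virDomainDefAddressesEnsure(virDomainDefPtr def);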
> If so, doing this for PCI is likely to require you to add lots of
> extra complexity that will be completely unnecessary for the other
> address types
That might not be a bad thing, provided that the interface functions
used for the other address types (other than PCI) won’t be cluttered
because of PCI’s complexity.
> Of course maybe your intent is that your allocator will only
> implement an "allocate this specific address" function, and leave
> auto-assignment up to a higher layer; but then the higher layer would
> still need a way to efficiently scan the entire list of free
> addresses, and each address would need a bit of data describing the
> controllers at that address so that they could be matched.
I want this data structure to be versatile enough so that you can use
it in all the required ways.
If you want some specific address, just ask for it explicitly and you
will get a response whether you succeeded or not (you might’ve failed
because the address was never added to the pool, or because it’s
already used by some other device).
If you want any address, it could be done with a function:
allocatorReserveAny(allocator[PCI_ADDRESS]), which could be the
following function in disguise:
allocatorReserveAddress(allocator[PCI_ADDRESS],
                        AllocatorRange(INT_MIN, INT_MAX),
                        AllocatorRange(INT_MIN, INT_MAX),
                        AllocatorRange(INT_MIN, INT_MAX));
Since any address is between INT_MIN and INT_MAX, it would simply
find the first one that is available.
> and each address would need a bit of data describing the controllers
> at that address so that they could be matched.
That’s one of my ideas for further development. Each address could
store some additional information, for example what kind of SCSI
controller provides it. Then, when reserving an address, you could
specify that you’re only interested in addresses that have that flag.
>> This function call could look like this:
>> allocatorAddPool(
>> allocator[PCI_ADDRESS_TYPE],
>> &outputAddress,
>> AllocatorRange(0, 0),
>> AllocatorRange(0, 31),
>> AllocatorRange(0, 7));
>
> Again, this doesn't allow for some of the indexes on a particular
> dimension having a different start/end (or for other differences for
> a particular index, which possibly could be stored in some opaque
> pointer).
It’s possible that I misunderstood you, but it does allow you to do
that. You can use this allocatorAddPool() function many times to add
many addresses to the big pool of possible addresses. Perhaps I
shouldn’t have used the word “pool”, because it suggests that there
is only one pool.
Think of it as a function that is a quicker and more convenient
alternative to simply specifying all the possible addresses
explicitly, one by one:
allocatorAddAddressThatIsPossibleToUse(0, 0, 0);
allocatorAddAddressThatIsPossibleToUse(0, 0, 1);
…
allocatorAddAddressThatIsPossibleToUse(0, 0, 7);
allocatorAddAddressThatIsPossibleToUse(0, 1, 0);
etc.
You can iterate through all of your PCI buses and execute
allocatorAddPool() for each of them (all on the same struct
Allocator). They can have various possible addresses, starting from 0
or 1 or any other value, and some of them might have only one slot;
that’s not a problem.
> Ah, there it is! This won't work for PCI (or for scsi) without much
> more information available for the ranges that are being allocated
> from.
I hope that I dealt with that somewhere above, but if you still see a
problem that arises, please let me know.
> Disk device names are incredibly ugly beasts. Apparently once you get
> to sdz, you can start over at sdaa, sdab, etc. And the whole thing is
> pointless anyway, because that name isn't passed directly to the
> guest anyway (there isn't a way to do that - the guest OS determines
> device names on its own based on the bus and order in which the disks
> appear).
*sigh of relief* That’s great, I prefer to deal with numbers :)
>> If we kept every address separately, it would require 2*32*8=512
>> addresses.
>
> ...which isn't all that much, since we aren't talking about
> microcontrollers with tiny amounts of RAM.
You are right. An array or a linked list will probably be enough.
Linked lists are nice to teach at school, but there are maybe five
real-world cases where they are used to store data that you can move
around freely, especially if the data are pointers. An array should
be fine for a long time =)
I don't know why we couldn't use virDomainPCIAddressSet for
starters. It should be more about adding abstract "wrappers" around
all these functions so that we are moving code from QEMU (and
possibly other drivers) into src/conf/.
Sorry for the confusion, I had it planned out a bit better, but that
was when I wrote up the idea, more than 3 years ago :-/
>> My ideas:
>> - A possibility of reserving a range of addresses might be used
>> to reserve a whole PCI slot at once.
>
> I do see the usefulness of this in general, but in this particular
> case rather than reserving an entire slot at once, we may want to
> allocate a single function where the device is going to be, but mark
> the rest of that slot available for further functions to be
> allocated, then later mark it unavailable for further allocations
> *without actually allocating the rest of the functions in the
> slot*. The reason for this is that it would make it simpler to use
> the same code for detaching a "lone" device on a slot as we do for
> detaching the functions of a multifunction PCI device. (You're likely
> thinking "what's the difference between allocating a single function
> + mark the other functions on the slot as unavailable vs. just
> marking the entire slot as allocated; it all has to do with being
> able to easily notice when the slot is once again completely
> available - if you mark all the functions in the slot as used, then
> when you free it you have to free all the slots; if you're doing
> that, then you have to figure out whether the device you're freeing
> was one function of a multifunction device, or if it was a lone
> device plugged into a slot by itself. Of course that may not fit in
> with your abstracted view of generic address allocation.) Right now
> we "ham fist" this issue by just marking the entire slot as allocated
> when someone asks for a single function, but really we should do
> allocations (and frees) one function at a time, but just with a
> knowledge of when it's appropriate to allocate a 2nd (and further)
> function on a slot that already has some functions allocated, and
> when it isn’t
It does fit in with my abstracted view :) We could add some
distinction between “used” addresses and “unavailable” addresses
(maybe with a reason for being unavailable too), or some flags that
specify how a particular address type should behave. I must learn
more about how PCI addresses behave to be sure, especially since it’s
not in our code yet.
In my idea of this data structure, the caller has to know whether
they want to free one function or a whole slot. You can always write
a wrapper that deals with it, though; see the sketch below.
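As an illustration (hypothetical names), the distinction could be as
simple as a per-address state:

  /* Hypothetical state of a single address in the pool. */
  typedef enum {
      ALLOCATOR_ADDR_FREE,        /* can be handed out */
      ALLOCATOR_ADDR_USED,        /* assigned to a device */
      ALLOCATOR_ADDR_UNAVAILABLE, /* blocked without being assigned,
                                   * e.g. the other functions of a
                                   * slot holding a lone device */
  } AllocatorAddrState;

Freeing a lone device would then clear its USED function together
with the UNAVAILABLE siblings, while freeing one function of a
multifunction device would clear only that function.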
>> - Is this approach good long-term?
>
> Again, my concern would be about making something more complex than
> necessary for the individual uses and their requirements. Eliminating
> nearly duplicate code is good, but you need to be careful that what
> you create doesn't require more "glue" to make each of the cases fit
> into the abstract model than would have been used for a separate
> simpler "from-scratch" implementation for each individual case.
That’s a valid concern. I am trying to make it as simple as possible
while still providing all the required functionality. That’s why I am
asking the experienced developers what is needed right now and what
might be needed in the future. If the requirements are too broad for
this data structure to be sensible, I will use a different approach.
>> - Is address releasing expected to be required at any point?
>> (think device hotplug/unplug, etc.)
>
> Yes. And we need to be aware that sometimes devices are added to the
> live domain but not the persistent config, sometimes only to the
> persistent config, and sometimes to both (and when they're added to
> both, they should use the same address in both live domain and
> persistent config, which isn't always the case today).
That’s really good to know. I thought the addresses were persisted,
though, because I looked at the comment that said:
/* if this is the live domain object, we persist the PCI addresses */
Could you elaborate on that? When/where are addresses handled
incorrectly?
>> - What other features are needed?
>
> You had said you wanted the qemu_* files to remain the same for
> now. Exactly what are you figuring on replacing? For example, for PCI
> addresses (since these are obviously the only ones I care about :-))
> are you just planning to replace virDomainPCIAddressBus with your new
> virAllocator and a set of accessor functions? Even if you do nothing
> higher level than this, each bus will need to have a "connectFlags"
> that describes what type of devices can be connected to that bus (or
> range, or whatever you call it).
The initial idea is to simply create a structure that provides an
interface (through functions) which you can use to allocate
addresses. That’s just an idea, though, and it can change depending
on the feedback and on what my mentor requires from me.
The allocator would remove these:
virDomainCCWAddressSetPtr ccwaddrs;
virDomainVirtioSerialAddrSetPtr vioserialaddrs;
virDomainPCIAddressSetPtr pciaddrs;
virDomainPCIAddressBus would be removed too, because it’s part of
virDomainPCIAddressSet.
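In their place, the domain definition could hold one generic
allocator per address type (a sketch; the names are assumptions):

  /* Hypothetical replacement for the per-type address sets. */
  Allocator *allocators[ADDRESS_TYPE_LAST];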
> Hopefully my comments haven't just made the situation more muddy :-)
Not at all! They were really useful :)
All ideas and comments are welcome!