I'll send a separate email (in a new thread, so it doesn't get lost! ;-)
with a new draft of what the network XML should look like, but wanted to
respond to Dan's comments inline...
On 05/24/2011 10:21 AM, Daniel P. Berrange wrote:
On Fri, Apr 29, 2011 at 04:12:55PM -0400, Laine Stump wrote:
> Okay, here's a brief description of what I *think* will work. I'll
> build up the RNG based on this pseudo-xml:
>
>
> For the <interface> definition in the guest XML, the main change
> will be that <source .. mode='something'> will be valid (but
> optional) when interface type='network' - in this case, it will just
> be used to match against the source mode of the network on the host.
> <virtualport> will also become valid for type='network', and will
> serve two purposes:
>
> 1) if there is a mismatch with the virtualport on the host network,
> the migrate/start will fail.
> 2) It will be ORed with <virtualport> on the host network to arrive
> at the virtualport settings actually used.
>
> For example:
>
> <interface type='network'>
> <source network='red-network' mode='vepa'/>
IMHO having a 'mode' here is throwing away the main reason for
using type=network in the first place - namely independence
from this host config element.
I agree, but was being accommodating :-) Since then, Dave has pointed
out that the same functionality can be achieved by having the management
application grab the XML for the network on the targeted host, and
check for matches of any important parameters before deciding to migrate
to that host. This has 2 advantages:
1) It is more flexible. The management application can check for more
than just mode='vepa', but also any number of other attributes of the
network on the target.
2) The result of a host's network not matching the desired mode will be
"management app looks elsewhere", rather than "migration fails".
The management application will need to do this anyway (even if just to
check that the given network is present at all) or, again, face the
prospect of the migration failing.
So I'll withdraw this piece from the next draft.
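A rough sketch of the kind of pre-migration check described above, in Python. It assumes the network's mode ends up on a <forward mode='...'/> element (one option discussed later in this thread, not a settled schema). The names network_matches and pick_target are hypothetical, and how the application obtains each host's network XML (for example via virNetworkGetXMLDesc) is left out.

```python
import xml.etree.ElementTree as ET


def network_matches(network_xml, wanted):
    """Return True if every attribute in `wanted` is set on <forward>.

    Assumption: the interesting parameters live on <forward>; adjust the
    lookup if the final schema puts them elsewhere.
    """
    forward = ET.fromstring(network_xml).find("forward")
    if forward is None:
        return False
    return all(forward.get(key) == value for key, value in wanted.items())


def pick_target(candidates, wanted):
    """Pick the first host whose network definition matches.

    `candidates` maps host name -> network XML string, or None when the
    network is not defined on that host at all.
    """
    for host, network_xml in candidates.items():
        if network_xml is not None and network_matches(network_xml, wanted):
            return host
    return None  # the management app looks elsewhere; nothing has failed
```

A mismatch here only means the application moves on to the next candidate host, which is the second advantage listed above.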
> <virtualport type='802.1Qbg'>
> <parameters instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
> </virtualport>
> <mac address='xx:xx:.....'/>
> </interface>
>
> (NB: if "mode" isn't specified, and the host network is actually a
> bridge or virtual network, the contents of virtualport will be
> ignored.)
>
>
> <network> will be expanded by giving it an optional "type" attribute
> (which will default to 'virtual'), <source> subelement, and
> <virtualport> subelement. When type='bridge', you can specify source
> exactly as you would in a domain <interface> definition:
>
> <network type='bridge'>
> <name>red-network</name>
> <source bridge='br0'/>
> </network>
>
> When type='direct', again you can specify source and virtualport
> pretty much as you would in an interface definition:
>
> <network type='direct'>
> <name>red-network</name>
> <source dev='eth0' mode='vepa'/>
> <virtualport type='802.1Qbg'>
> <parameters managerid="11" typeid="1193047" typeidversion="2"
> instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
> </virtualport>
> </network>
None of this really feels right to me. With this proposed
schema, there is basically nothing in common between the
existing functionality for <network> and this new functionality
except for the <name> and <uuid> elements.
Apps which know how to deal with existing <network> schema
will have no ability to interpret this new data at all.
Quite probably they will misinterpret it as providing an
isolated virtual network, with no IP addr set, since this
design isn't actually changing any attribute value that
they currently look for.
Either we need to make this align with the existing schema,
or we need to put this under a completely separate set of
APIs. I think we can likely do better with the schema design
and achieve the former.
So the problem is that the new uses are so orthogonal to the current
usage that existing management apps encountering this new XML will
mistakenly believe that it's "old" XML with a bit of extra stuff that
can be ignored (thus leading to mayhem).
I think the most important thing is to make sure that a config for one
of these new types will have at least one change to an *existing*
element/attribute (mine just added a *new* attribute specifying type)
that causes existing apps to realize this isn't just an old school
network definition that happens to have a few kinks on the side. Your
suggestion of using new values for <forward mode="..."> seems like as
good an idea as any (actually I can't think of anything else that works
as well :-)
> However, dev would be optional - if not specified, we would expect a
> pool of interfaces to be defined within source, eg:
>
> <network type='direct'>
> <name>red-network</name>
> <source mode='vepa'>
> <pool>
> <interface name='eth10' maxConnect='1'/>
> <interface name='eth11' maxConnect='1'/>
> <interface name='eth12' maxConnect='1'/>
> <interface name='eth13' maxConnect='1'/>
> <interface name='eth14' maxConnect='1'/>
> <interface name='eth25' maxConnect='5'/>
> </pool>
> </source>
> <virtualport ...... />
> </network>
I don't really like the fact that this design has special
cased the num(interfaces) == 1 case to have a completely
different XML schema. eg we have this:
<source dev='eth0' mode='vepa'/>
And this
<source mode='vepa'>
<pool>
<interface name='eth10' maxConnect='1'/>
</pool>
</source>
both meaning the same thing. There should only be one
representation in the schema for this kind of thing.
> BTW, for all the people asking about sectunnel, openvswitch, and vde
> - can you see how those would fit in with this? In particular, do
> you see any conflicts? (It's easy to add more stuff on later if
> something is just missing, but much more problematic if I put
> something in that is just plain wrong).
As mentioned above, I think this design is wrong, because it is not
taking any account of the current schema for <network> which defines
the various routed modes.
Currently <network> supports 3 connectivity modes
- Non-routed network, separate subnet (no <forward> element present)
- Routed network, separate subnet with NAT (<forward mode='nat'/>)
- Routed network, separate subnet (<forward mode='route'/>)
Following on from this, I can see another couple of routed modes
- Routed network, IP subnetting
- Routed network, separate subnet with VPN
And the core goal here is to replace type=bridge and type=direct from the
domain XML, which means we're adding several bridging modes
- Bridged network, eth + bridge + tap (akin to type=bridge)
- Bridged network, eth + macvtap (akin to type=direct)
- Bridged network, sriov eth + bridge + tap (akin to type=bridge)
- Bridged network, sriov eth + macvtap (akin to type=direct)
The macvtap can be in 4 modes, so it is probably better to
consider them separately
- Bridged network, eth + bridge + tap
- Bridged network, eth + macvtap + vepa
- Bridged network, eth + macvtap + private
- Bridged network, eth + macvtap + passthrough
- Bridged network, eth + macvtap + bridge
- Bridged network, sriov eth + bridge + tap
- Bridged network, sriov eth + macvtap + vepa
- Bridged network, sriov eth + macvtap + private
- Bridged network, sriov eth + macvtap + passthrough
- Bridged network, sriov eth + macvtap + bridge
I can also perhaps imagine another VPN mode:
- Bridged network, with VPN
The current routed modes can route to anywhere, or be restricted to
a particular network interface eg with <forward dev='eth0'/>. It
only allows for a single interface, though even for routed modes it
could be desirable to list multiple devs.
The other big distinction is that the <network> modes which do routing
include interface configuration data (ie the IP addrs & bridge name)
which is configured on the fly. It looks like with the bridged modes,
you're assuming the app has statically configured the interfaces via
the virInterface APIs already, and this just points to an existing
configured interface. This isn't necessarily a bad thing, just an
observation of a significant difference.
Right. Perhaps later it can be expanded (at least in some of the modes)
to set up these devices when the network is started, but right now the
network definition is just used to point to something that already
exists and is functioning.
So if we ignore the <ip> and <domain> elements from the current <network>
schema, then there are a handful of others which we need to have a plan
for
<forward mode='nat|route'/> (omitted completely for isolated networks)
<bridge name="virbr0" /> (auto-generated/filled if omitted)
<mac address='....'/> (auto-generated/filled if omitted)
The <forward> element can have an optional dev= attribute.
I think the key attribute is the<forward> mode= attribute. I think we
should be adding further values to that attribute for the new network
modes we want to support. We should also make use of the dev= attribute
on <forward> where practical, and/or extend it.
We could expand the list of <forward> mode values in a flat list
- route
- nat
- bridge (brctl)
- vepa
- private
- passthru
- bridge (macvtap)
NB: really need to avoid using 'bridge' in terminology, since all
5 of the last options are really 'bridge'.
Or we could introduce an extra attribute, and have a 2 level list
- <forward layer='link'/> (for all ethernet layer bridging)
Does that gain us anything, though? While it's correct information, it
seems redundant (the layer can always be implied from the mode).
- <forward layer='network'/> (for all IP layer bridging aka routing)
So the current modes would be
<forward layer='network' mode='route|nat'/>
And new bridging modes would be
<forward layer='link' mode='bridge-brctl|vepa|private|passthru|bridge-macvtap'/>
For the brctl/macvtap modes, the dev= attribute on <forward> could point to
the NIC being used, while with brctl modes, <bridge> would also be present.
Are you saying that in the case of a brctl mode, it would be required to
fill in both of these?
<forward mode="bridge-brctl" dev="br0" .../>
<bridge name="br0" .../>
I think I would prefer to only use the one in <forward>. Are you
suggesting putting it there to help older management apps cope with the
new modes? I don't really think it would help; it's really just an
accident of implementation that the device in "bridge-brctl" mode
happens to be a bridge device.
In the SRIOV case, we potentially need a list of interfaces. For this we
probably want to use
BTW, just to clarify, when you say "SRIOV", what you really mean is "any
situation where there are multiple network interface devices connected
to the same physical network, and identical connectivity to the guest
could be provided by any one of these devices". In other words, it
doesn't need to be an SRIOV ethernet card with multiple virtual
functions, it could also be an older style setup with multiple physical
cards, or multiple complete devices on a single card.
<forward dev='eth0'>
<interface dev='eth0'/>
<interface dev='eth1'/>
<interface dev='eth2'/>
...
</forward>
NB, the first interface is always to be listed both as a dev= attribute
(for compat with existing apps) *and* as a child <interface> element (for
apps knowing the new schema).
But since the pool of devices would only ever be used in one of the new
forward modes, which an existing app wouldn't understand anyway, would
that really buy us anything?
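For reference, here is a small Python sketch of how a consumer could read the device list under either convention. It is only an illustration under assumptions: the <forward dev='...'> plus child <interface dev='...'/> layout is the proposal quoted above, not an agreed schema, and forward_devices is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def forward_devices(network_xml):
    """Return the list of devices named under <forward>.

    Child <interface dev='...'/> elements are used when present; the
    dev= attribute on <forward> itself is added only if it is not
    already in that list, so a device named in both places appears once.
    """
    forward = ET.fromstring(network_xml).find("forward")
    if forward is None:
        return []
    devices = [
        child.get("dev")
        for child in forward.findall("interface")
        if child.get("dev")
    ]
    attr_dev = forward.get("dev")
    if attr_dev and attr_dev not in devices:
        devices.insert(0, attr_dev)
    return devices
```

An application that only knows the dev= attribute sees a single device; one that reads the children sees the whole pool.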
The maxConnect= attribute from your examples above is an interesting
thing. I'm not sure whether that is necessarily a good idea. It feels
similar to VMWare's "port group" idea, but I don't think having a
simple 'maxConnect=' attribute is sufficient to let us represent the
vmware port group idea. I think we might need a more explicit
element eg
<portgroup count='5'>
<interface dev='eth2'/>
</portgroup>
eg, so this associates a port group which allows 5 clients (VM NICs)
with the uplink provided by eth2 (which is assumed to be listed
under <forward>).
I've thought about this a bit, and I think portgroup is a good idea, but
I don't think the name of the device being used fits there. portgroup is
a good place to put information about the characteristics of a set of
connections, but which device to use is a backend implementation detail,
and there isn't necessarily a 1:1 correspondence between the two.
portgroup would be used, for example, to configure bandwidth (that's
pretty much all VMWare uses it for, plus a blob of "vendor-specific"
data), and the guest interface XML would specify which portgroup a guest
was going to belong to - if you also set which physical device to use
based on portgroup, that would leave the guest XML specifying which
physical device to use, which is what we're trying to get away from.
(And it would also mean that each physical device would need its own
portgroup, which I don't think we want.)
Thinking more about the maxConnect thing, it seems like it might be
overkill for now. The case where there must be a limitation of 1 guest
per NIC is macvtap passthrough mode, but that's already implied by the
fact that it's passthrough. Other than that, libvirt can just attempt to
load-balance as best as possible by keeping track of how many
connections there are on each device, but not force any artificial
limit. We may need to provide some method of reporting the number of
connections to any particular network, to be used by a management
application for load balancing decisions (although the amount of traffic
is probably more important, and that can already be learned).
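As a minimal sketch of that bookkeeping (Python, illustrative only): given a per-device connection count, pick the least-used device, and treat passthrough as "idle devices only". The function name pick_device and the exclusive flag are assumptions for this example, not anything libvirt provides.

```python
def pick_device(connections, exclusive=False):
    """Choose a device for a new guest connection.

    `connections` maps device name -> current number of guest connections.
    With exclusive=True (modelling macvtap passthrough) only devices with
    no connections are eligible. Ties are broken by device name so the
    result is deterministic. Returns None if nothing is eligible.
    """
    eligible = {
        dev: count
        for dev, count in connections.items()
        if not exclusive or count == 0
    }
    if not eligible:
        return None
    return min(eligible, key=lambda dev: (eligible[dev], dev))
```

No artificial per-device limit is involved; the only hard constraint is the one the mode itself implies.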
Conclusion on portgroup - a good idea, but not for this, probably for
configuration of bandwidth limiting.
So a complete SRIOV example might be
<network>
<name>Foo</name>
<forward dev='eth0' layer='link' mode='vepa'>
<interface dev='eth0'/>
<interface dev='eth1'/>
<interface dev='eth2'/>
...
</forward>
<portgroup count='10'>
<interface dev='eth0'/>
</portgroup>
<portgroup count='5'>
<interface dev='eth1'/>
</portgroup>
<portgroup count='5'>
<interface dev='eth2'/>
</portgroup>
</network>
The <virtualport> parameters for VEPA/VNLink could either be stored at
the top level under <network>, or inside <portgroup> or both.
Ah, now *there's* something that fits in portgroup (since that's likely
exactly what it's used for on the vepa/vnlink capable switch).
I think it's reasonable to put it in both places, at the top-level
(which would apply to all connections) and in portgroup (which would
override the global setting for connections using that portgroup). (I
think the bandwidth config could be done in the same way.)
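A small Python sketch of that override rule, assuming the override is applied per parameter (a portgroup that sets only some parameters inherits the rest from the top level). That granularity is my assumption, and effective_virtualport is a hypothetical name.

```python
def effective_virtualport(network_params, portgroup_params=None):
    """Combine top-level and per-portgroup <virtualport> parameters.

    Both arguments are dicts of attribute name -> value. Values from the
    portgroup win; anything it does not set falls back to the network's
    top-level setting. Neither input dict is modified.
    """
    merged = dict(network_params)
    merged.update(portgroup_params or {})
    return merged
```

A connection that names no portgroup simply gets the top-level parameters unchanged.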