qemu:///embed and isolation from global components

[PATCH] tests: switch away from...

[PATCH 0/5] qemu: Don't log...

Andrea Bolognani

Friday, 6 March 2020 Fri, 6 Mar '20

5:24 p.m.

Recently I've been working on integrating qemu:///embed into an application. It's been reasonably smooth so far :) There's one thing, however, that has caused a bit of confusion, and I would like to clarify whether my expectations are incorrect, there are genuine bugs in the implementation that need to be addressed, or maybe the experimental nature of embedded driver support results in not all scenarios having yet been considered and catered for. Using virt-qemu-run as an example, when I run it as root I see that * the domain doesn't show up in the output of 'virsh list' for qemu:///system; * it does, however, show up in the output of 'machinectl', with class=vm and service=libvirt-qemu; * virtlogd is automatically spawned, if not already active, to take care of the domain's log file. The first one is expected, but the other two were a surprise, at least to me. Basically, what I would expect is that qemu:///embed would work in a completely isolated way, only interacting with system-wide components when that's explicitly requested. In other words, I would expect virtlogd not to be spawned, and the domain not to be registered with machinectl; at the same time, if the domain configuration is such that it contains for example <interface type='network'> <source network='default'/> <model type='virtio'/> </interface> then I would expect to see a failure unless a connection to network:///system had been explicitly and pre-emptively opened, and not the current behavior which apparently is to automatically open that connection and spawning virtnetworkd as a result. As a corollary to this, I would expect that two qemu:///embed connections using different roots to be able to define domains with the same name without any problem, but as it stands this is not possible because machinectl will refuse to register the second one. There might be other components that freak out as well, that's just the first one that blocked me. Are my expectations unreasonable? Or should we work towards making sure uses of qemu:///embed don't touch global resources and don't clash with each other? -- Andrea Bolognani / Red Hat / Virtualization

Show replies by date

Daniel P. Berrangé

Friday, 6 March Fri, 6 Mar

5:49 p.m.

On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote:

...

This is good.

...

* it does, however, show up in the output of 'machinectl', with class=vm and service=libvirt-qemu;

This is bad. It is one of the gaps we need to deal with. A long while back I proposed a domain XML option for this: https://www.redhat.com/archives/libvir-list/2017-December/msg00729.html <resource register="none|direct|machined|systemd"/> The "none" case meant inherit the cgroups placement of the parent libvirtd (or now the app using the embedded driver), and would be a reasonable default for the embedded case. For the other cases, we certainly need to do something to ensure uniqueness. This is possibly as simple as including a fixed prefix like "qemu-$PID" where $PID is the app embedding the driver. That said, we support closing the app, and re-starting using the same embedded driver directory, which would change PID.

...

* virtlogd is automatically spawned, if not already active, to take care of the domain's log file.

This is trickier. The use of virtlogd was to fix a security DoS where malicious QEMU can write to serial console, or trigger QEMU to write to stdout/err, such that it exhausts the host filesystem. IIUC, virtlogd should still end up writing to the logfile path associated with the embedded driver root directory prefix, so there shouldn't be a filename clash unless I screwed up. Since introducing virtlogd, however, I did think of a different strategy, which would be to connect a FIFO to QEMU as the log file FD. The QEMU driver itself can own the other end of the FIFO and do rollover. Of course you can turn off virtlogd via qemu.conf

...

The first one is expected, but the other two were a surprise, at least to me. Basically, what I would expect is that qemu:///embed would work in a completely isolated way, only interacting with system-wide components when that's explicitly requested.

The goal wasn't explicitly to avoid system-wide components, but it certainly was intended to avoid clashing resources. IOW, machined, virtlogd are both valid to use, as long as the resource clashes are avoided. We should definitely have a way to disable these too.

...

In other words, I would expect virtlogd not to be spawned, and the domain not to be registered with machinectl; at the same time, if the domain configuration is such that it contains for example <interface type='network'> <source network='default'/> <model type='virtio'/> </interface> then I would expect to see a failure unless a connection to network:///system had been explicitly and pre-emptively opened, and not the current behavior which apparently is to automatically open that connection and spawning virtnetworkd as a result.

The QEMU embedded driver is explicitly intended to be able to interact with other global secondary drivers. If you don't want to use virtnetworkd, then you won't be creating such an <interface> config in the first place. The app will have the option to open an embedded seconary driver if desired. Some of the drivers won't really make sense as embedded things though, at least not without extra work. ie a embedded network or nwfilter driver has no value unless your app has moved into a new network namespace, as otherwise it will just fight with the global network driver.

...

As a corollary to this, I would expect that two qemu:///embed connections using different roots to be able to define domains with the same name without any problem, but as it stands this is not possible because machinectl will refuse to register the second one. There might be other components that freak out as well, that's just the first one that blocked me.

Yes, that is definitely needed to work for resources created by those drivers, so we must solve the name clashes for example. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Michal Privoznik

Monday, 9 March Mon, 9 Mar

1:02 p.m.

On 3/6/20 6:49 PM, Daniel P. Berrangé wrote:

...

On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote: > Recently I've been working on integrating qemu:///embed into an > application. It's been reasonably smooth so far :) > > There's one thing, however, that has caused a bit of confusion, and > I would like to clarify whether my expectations are incorrect, there > are genuine bugs in the implementation that need to be addressed, or > maybe the experimental nature of embedded driver support results in > not all scenarios having yet been considered and catered for. > > Using virt-qemu-run as an example, when I run it as root I see that > > * the domain doesn't show up in the output of 'virsh list' for > qemu:///system; This is good. > * it does, however, show up in the output of 'machinectl', with > class=vm and service=libvirt-qemu; This is bad. It is one of the gaps we need to deal with. A long while back I proposed a domain XML option for this: https://www.redhat.com/archives/libvir-list/2017-December/msg00729.html <resource register="none|direct|machined|systemd"/> The "none" case meant inherit the cgroups placement of the parent libvirtd (or now the app using the embedded driver), and would be a reasonable default for the embedded case. For the other cases, we certainly need to do something to ensure uniqueness. This is possibly as simple as including a fixed prefix like "qemu-$PID" where $PID is the app embedding the driver. That said, we support closing the app, and re-starting using the same embedded driver directory, which would change PID.

Instead of PID we can derive the value from the root path, e.g. i-node of the root dir or some substring of the path hash, say first 6 letters of md5($root_path). This would guarantee the stability of the prefix across app restarts (if we assume all root dirs to be on one FS so we don't have clashing i-node numbers). Michal

Christophe de Dinechin

Wednesday, 11 March Wed, 11 Mar

10:35 a.m.

Le 9 mars 2020 à 14:03, Michal Privoznik <mprivozn(a)redhat.com> a écrit :

...

On 3/6/20 6:49 PM, Daniel P. Berrangé wrote: >> On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote: >> Recently I've been working on integrating qemu:///embed into an >> application. It's been reasonably smooth so far :) >> >> There's one thing, however, that has caused a bit of confusion, and >> I would like to clarify whether my expectations are incorrect, there >> are genuine bugs in the implementation that need to be addressed, or >> maybe the experimental nature of embedded driver support results in >> not all scenarios having yet been considered and catered for. >> >> Using virt-qemu-run as an example, when I run it as root I see that >> >> * the domain doesn't show up in the output of 'virsh list' for >> qemu:///system; > This is good. >> * it does, however, show up in the output of 'machinectl', with >> class=vm and service=libvirt-qemu; > This is bad. It is one of the gaps we need to deal with. > A long while back I proposed a domain XML option for this: > https://www.redhat.com/archives/libvir-list/2017-December/msg00729.html > <resource register="none|direct|machined|systemd"/> > The "none" case meant inherit the cgroups placement of the parent > libvirtd (or now the app using the embedded driver), and would be > a reasonable default for the embedded case. > For the other cases, we certainly need to do something to ensure > uniqueness. This is possibly as simple as including a fixed > prefix like "qemu-$PID" where $PID is the app embedding the > driver. That said, we support closing the app, and re-starting > using the same embedded driver directory, which would change > PID. Instead of PID we can derive the value from the root path, e.g. i-node of the root dir or some substring of the path hash, say first 6 letters of md5($root_path). This would guarantee the stability of the prefix across app restarts (if we assume all root dirs to be on one FS so we don't have clashing i-node numbers).

Root path hash is more robust also in other cases, like file copy/rename often used for atomic updates.

...

Michal

Andrea Bolognani

Monday, 9 March Mon, 9 Mar

5:09 p.m.

On Fri, 2020-03-06 at 17:49 +0000, Daniel P. Berrangé wrote:

...

On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote: > * it does, however, show up in the output of 'machinectl', with > class=vm and service=libvirt-qemu; This is bad. It is one of the gaps we need to deal with. A long while back I proposed a domain XML option for this: https://www.redhat.com/archives/libvir-list/2017-December/msg00729.html <resource register="none|direct|machined|systemd"/> The "none" case meant inherit the cgroups placement of the parent libvirtd (or now the app using the embedded driver), and would be a reasonable default for the embedded case.

Agreed. In my case I'll want to use "direct" instead, but "none" would indeed be a sane default for the embedded driver. Aside: instead of a per-VM setting, would it make sense for this to be a connection-level setting? That way, even on traditional libvirt deployments, the hypervisor admin could eg. opt out of machinectl registration if they so desire; at the same time, you'd avoid the situation where most VMs are confined using CGroups but a few are not, which is probably not a desirable scenario.

...

For the other cases, we certainly need to do something to ensure uniqueness. This is possibly as simple as including a fixed prefix like "qemu-$PID" where $PID is the app embedding the driver. That said, we support closing the app, and re-starting using the same embedded driver directory, which would change PID.

Right now we're already doing qemu-$domid-$domname where I'm not entirely sure how much $domid actually buys us. I think machine ids need to be unique per host, not per service, which is kind of a bummer because the obvious choice would be to generate a service name based on the embedded root... Michal already suggested another solution, perhaps that one is more viable. Anyway, I think it's reasonable to expect that, when it comes to VMs created via the embedded driver, the same way you'd not be able to control them via virsh, you'd also not be able to do so via machinectl, so I'm not too concerned about this once we flip the default to "none" as discussed above.

...

> * virtlogd is automatically spawned, if not already active, to > take care of the domain's log file. This is trickier. The use of virtlogd was to fix a security DoS where malicious QEMU can write to serial console, or trigger QEMU to write to stdout/err, such that it exhausts the host filesystem. IIUC, virtlogd should still end up writing to the logfile path associated with the embedded driver root directory prefix, so there shouldn't be a filename clash unless I screwed up. Since introducing virtlogd, however, I did think of a different strategy, which would be to connect a FIFO to QEMU as the log file FD. The QEMU driver itself can own the other end of the FIFO and do rollover. Of course you can turn off virtlogd via qemu.conf

That's what I'm doing right now and it works well enough, but I'm afraid that requiring a bunch of setup will discourage developers from using the embedded driver. We should aim for a reasonable out of the box experience instead.

...

> The first one is expected, but the other two were a surprise, at > least to me. Basically, what I would expect is that qemu:///embed > would work in a completely isolated way, only interacting with > system-wide components when that's explicitly requested. The goal wasn't explicitly to avoid system-wide components, but it certainly was intended to avoid clashing resources. IOW, machined, virtlogd are both valid to use, as long as the resource clashes are avoided. We should definitely have a way to disable these too.

I'd argue that most users of the embedded driver would probably prefer it didn't interact with system-wide components: if that wasn't the case, they'd just use qemu:///system or qemu:///session instead. Having a way to turn off those behaviors would certainly be a step in the right direction, but I think ultimately we want to be in a situation where developers opt in rather than out of them.

...

> In other words, I would expect virtlogd not to be spawned, and the > domain not to be registered with machinectl; at the same time, if > the domain configuration is such that it contains for example > > <interface type='network'> > <source network='default'/> > <model type='virtio'/> > </interface> > > then I would expect to see a failure unless a connection to > network:///system had been explicitly and pre-emptively opened, and > not the current behavior which apparently is to automatically open > that connection and spawning virtnetworkd as a result. The QEMU embedded driver is explicitly intended to be able to interact with other global secondary drivers. If you don't want to use virtnetworkd, then you won't be creating such an <interface> config in the first place. The app will have the option to open an embedded seconary driver if desired. Some of the drivers won't really make sense as embedded things though, at least not without extra work. ie a embedded network or nwfilter driver has no value unless your app has moved into a new network namespace, as otherwise it will just fight with the global network driver.

Again, I think our defaults for qemu:///embed should be consistent and conservative: instead of having developers opt out of using network:///system, they should have to opt in before global resources are involved. If we don't do that, I'm afraid developers will lose trust in the whole qemu:///embed idea. Speaking from my own experience, I was certainly surprised when I accidentally realized my qemu:///embed VMs were showing up in the output of machinectl, and now I'm kinda wondering how many other ways the application I'm working on, for which the use of libvirt is just an implementation detail, is poking at the system without my knowledge... -- Andrea Bolognani / Red Hat / Virtualization

Daniel P. Berrangé

6:04 p.m.

On Mon, Mar 09, 2020 at 06:09:13PM +0100, Andrea Bolognani wrote:

...

On Fri, 2020-03-06 at 17:49 +0000, Daniel P. Berrangé wrote: > On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote: > > * it does, however, show up in the output of 'machinectl', with > > class=vm and service=libvirt-qemu; > > This is bad. It is one of the gaps we need to deal with. > > A long while back I proposed a domain XML option for this: > > https://www.redhat.com/archives/libvir-list/2017-December/msg00729.html > > <resource register="none|direct|machined|systemd"/> > > The "none" case meant inherit the cgroups placement of the parent > libvirtd (or now the app using the embedded driver), and would be > a reasonable default for the embedded case. Agreed. In my case I'll want to use "direct" instead, but "none" would indeed be a sane default for the embedded driver.

Yes, "none" makes sense for the emedded driver, as QEMU is intended to be able to inherit process execution context from the calling application. IOW, we must inherit cgroups placement, and not create things under /machine.slice by default.

...

Aside: instead of a per-VM setting, would it make sense for this to be a connection-level setting? That way, even on traditional libvirt deployments, the hypervisor admin could eg. opt out of machinectl registration if they so desire; at the same time, you'd avoid the situation where most VMs are confined using CGroups but a few are not, which is probably not a desirable scenario.

Yes, functionally it would work fine as a connection level setting too, though this hides the behaviour from the anyone looking at the guest config. We've previously punted quite a few things to the qemu.conf because we didn't want to go through process of representing them in the domain XML. This was OK when the qemu.conf settings were something done once at host deployment time. With the embedded driver, I think this is not so desirable, as means to get the configuration they want from a VM, they need to deal with two distinct config files. The ideal would be that everything that is commonly needed can be achieved solely in the domain XML, and I think resource control backend is one such common tunable.

...

> For the other cases, we certainly need to do something to ensure > uniqueness. This is possibly as simple as including a fixed > prefix like "qemu-$PID" where $PID is the app embedding the > driver. That said, we support closing the app, and re-starting > using the same embedded driver directory, which would change > PID. Right now we're already doing qemu-$domid-$domname where I'm not entirely sure how much $domid actually buys us.

IIRC $domid was a hack because at one time we had problems with systemd not cleaning up the transient scope when QEMU was killed. This would prevent libvirt starting the guest again thereafter. I can't remember now if this was a bug we fixed in systemd or libvirt or both or neither.

...

I think machine ids need to be unique per host, not per service, which is kind of a bummer because the obvious choice would be to generate a service name based on the embedded root... Michal already suggested another solution, perhaps that one is more viable.

Yes, they definitely need to be unique globally, but as agreed above, we should not create cgroups by default at all - we must inherit cgroups placement.

...

> > * virtlogd is automatically spawned, if not already active, to > > take care of the domain's log file. > > This is trickier. The use of virtlogd was to fix a security DoS > where malicious QEMU can write to serial console, or trigger > QEMU to write to stdout/err, such that it exhausts the host > filesystem. IIUC, virtlogd should still end up writing to > the logfile path associated with the embedded driver root > directory prefix, so there shouldn't be a filename clash > unless I screwed up. > > Since introducing virtlogd, however, I did think of a different > strategy, which would be to connect a FIFO to QEMU as the log > file FD. The QEMU driver itself can own the other end of the FIFO > and do rollover. > > Of course you can turn off virtlogd via qemu.conf That's what I'm doing right now and it works well enough, but I'm afraid that requiring a bunch of setup will discourage developers from using the embedded driver. We should aim for a reasonable out of the box experience instead.

Why do you need to turn it off though ? It should already "do the right thing" as the log files should appear under a different location and not have any clash. Turning it off immediately creates a denial of service CVE in your application. None the less, as above I think we need common things to be controllable via the domain XML. So either we need to make a tunable there for use of logd or not, or we need to implement the FIFO idea to avoid need for logd at all.

...

> > The first one is expected, but the other two were a surprise, at > > least to me. Basically, what I would expect is that qemu:///embed > > would work in a completely isolated way, only interacting with > > system-wide components when that's explicitly requested. > > The goal wasn't explicitly to avoid system-wide components, > but it certainly was intended to avoid clashing resources. > > IOW, machined, virtlogd are both valid to use, as long as > the resource clashes are avoided. We should definitely have > a way to disable these too. I'd argue that most users of the embedded driver would probably prefer it didn't interact with system-wide components: if that wasn't the case, they'd just use qemu:///system or qemu:///session instead. Having a way to turn off those behaviors would certainly be a step in the right direction, but I think ultimately we want to be in a situation where developers opt in rather than out of them. > > In other words, I would expect virtlogd not to be spawned, and the > > domain not to be registered with machinectl; at the same time, if > > the domain configuration is such that it contains for example > > > > <interface type='network'> > > <source network='default'/> > > <model type='virtio'/> > > </interface> > > > > then I would expect to see a failure unless a connection to > > network:///system had been explicitly and pre-emptively opened, and > > not the current behavior which apparently is to automatically open > > that connection and spawning virtnetworkd as a result. > > The QEMU embedded driver is explicitly intended to be able to > interact with other global secondary drivers. > > If you don't want to use virtnetworkd, then you won't be > creating such an <interface> config in the first place. > The app will have the option to open an embedded seconary > driver if desired. Some of the drivers won't really make > sense as embedded things though, at least not without > extra work. ie a embedded network or nwfilter driver has > no value unless your app has moved into a new network > namespace, as otherwise it will just fight with the > global network driver. Again, I think our defaults for qemu:///embed should be consistent and conservative: instead of having developers opt out of using network:///system, they should have to opt in before global resources are involved.

They already opt-in to using the network driver by virtue of configuring their guest to request its use. We don't need to opt-in twice.

...

If we don't do that, I'm afraid developers will lose trust in the whole qemu:///embed idea. Speaking from my own experience, I was certainly surprised when I accidentally realized my qemu:///embed VMs were showing up in the output of machinectl, and now I'm kinda wondering how many other ways the application I'm working on, for which the use of libvirt is just an implementation detail, is poking at the system without my knowledge...

First off, this mis-understanding highlights a need for better documentation to set out what the embedded driver is and is not expected to do. At a high level the embedded QEMU driver - Isolated from any other instance of the QEMU driver - Process context of app is inherited by QEMU (namespaces, cgroups, CPU pinning, security context, etc) - All other functionality is unchanged. Your comments above are equating two distinct scenarios, one which had a serious functional bug & violated the first two design goals, and one which does not cause any functional issues at all. There's no question that we must fix the machined problem. There is nothing that needs fixing in the virtual network case as that behaviour is intentional and is not causing any ill effects. The embedded driver is NOT intended to be completely isolated from everything on the host, whether it is systemd, or another host OS service, a libvirt secondary driver, or something else again. In fact a libvirt secondary driver should really just be considered the same as any other host OS service. We just happen to use a familiar RPC API to talk to that secondary driver, where as for a host OS service we'd use DBus or some 3rd party API/protocol. We don't require a special config in the QEMU driver to permit use of other host OS services, we would simply do the right thing based on whatever domain XML the application provides. The same is true of use of libvirt secondary drivers. We must NEVER connect to a secondary driver, unless something in the domain XML requests that behaviour. The <interface type=network> is a valid reason to connect to the virnetworkd daemon. Of course when we do connect to virnetworkd, we MUST ensure that anything we do preserves isolation from other QEMU driver instances. I would also note that the virtnetworkd daemon connected to, may or may not actually be the same as the host OS one. It is entirely possible that the application has opened the embedded QEMU driver from within a namesace, that results in a connection to the /var/run/libvirt/virnetworkd being serviced by a completely isolated virnetworkd running inside a different network namespace from the host OS virtnetworkd. This is of no concern to the QEMU driver though - it just needs to honour what is in the domain XML. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani

Tuesday, 10 March Tue, 10 Mar

3:42 p.m.

On Mon, 2020-03-09 at 18:04 +0000, Daniel P. Berrangé wrote:

...

On Mon, Mar 09, 2020 at 06:09:13PM +0100, Andrea Bolognani wrote: > On Fri, 2020-03-06 at 17:49 +0000, Daniel P. Berrangé wrote: > > On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote:

[...]

...

> Aside: instead of a per-VM setting, would it make sense for this to > be a connection-level setting? That way, even on traditional libvirt > deployments, the hypervisor admin could eg. opt out of machinectl > registration if they so desire; at the same time, you'd avoid the > situation where most VMs are confined using CGroups but a few are > not, which is probably not a desirable scenario. Yes, functionally it would work fine as a connection level setting too, though this hides the behaviour from the anyone looking at the guest config. We've previously punted quite a few things to the qemu.conf because we didn't want to go through process of representing them in the domain XML. This was OK when the qemu.conf settings were something done once at host deployment time. With the embedded driver, I think this is not so desirable, as means to get the configuration they want from a VM, they need to deal with two distinct config files. The ideal would be that everything that is commonly needed can be achieved solely in the domain XML, and I think resource control backend is one such common tunable.

I don't have a strong opinion either way, and as far as my current use is concerned it doesn't bother me to have to deal with a second configuration file. The reason why I thought a per-VM setting might not be desirable is that applications would then be able to override it, and so maybe VMs created with virt-manager would be registered against machinectl but VMs created using GNOME Boxes would not, and if the sysadmin likes to use machinectl to get a comprehensive view of the system they'd no longer be guaranteed that. But if that's not the kind of scenario we think we should prevent, then a per-VM setting is fine by me :) [...]

...

> Right now we're already doing > > qemu-$domid-$domname > > where I'm not entirely sure how much $domid actually buys us. IIRC $domid was a hack because at one time we had problems with systemd not cleaning up the transient scope when QEMU was killed. This would prevent libvirt starting the guest again thereafter. I can't remember now if this was a bug we fixed in systemd or libvirt or both or neither.

I see! It would be neat if we could get rid of it, assuming of course it's no longer needed on the platforms we target. [...]

...

> > Of course you can turn off virtlogd via qemu.conf > > That's what I'm doing right now and it works well enough, but I'm > afraid that requiring a bunch of setup will discourage developers > from using the embedded driver. We should aim for a reasonable out > of the box experience instead. Why do you need to turn it off though ? It should already "do the right thing" as the log files should appear under a different location and not have any clash. Turning it off immediately creates a denial of service CVE in your application.

I was getting the same SELinux denial that Michal reported a few days back: virtlogd wants to verify it's being connected to by a process running as root, but it's only allowed by the policy to look into libvirtd's /proc/$pid for this information. So, for the same reason virtqemud can't currently connect to virtlogd when SELinux is in enforcing mode, neither can my qemu:///embed-using application. Besides that, there is the fact that a lot of people, mainly those coming from a containers background, are not happy with having extra daemons running. I'm not saying they would prefer being hit by a DoS than having virtlogd running :) but they really, really don't like daemons :)

...

None the less, as above I think we need common things to be controllable via the domain XML. So either we need to make a tunable there for use of logd or not, or we need to implement the FIFO idea to avoid need for logd at all.

It seems like the FIFO idea (though I'll admit I don't fully understand it) would be the best of both worlds. [...]

...

> > If you don't want to use virtnetworkd, then you won't be > > creating such an <interface> config in the first place. > > The app will have the option to open an embedded seconary > > driver if desired. Some of the drivers won't really make > > sense as embedded things though, at least not without > > extra work. ie a embedded network or nwfilter driver has > > no value unless your app has moved into a new network > > namespace, as otherwise it will just fight with the > > global network driver. > > Again, I think our defaults for qemu:///embed should be consistent > and conservative: instead of having developers opt out of using > network:///system, they should have to opt in before global > resources are involved. They already opt-in to using the network driver by virtue of configuring their guest to request its use. We don't need to opt-in twice.

For some applications that's definitely an option, but others like virt-qemu-run accept a fairly arbitrary configuration and having to figure out whether that would result in eg. virtnetworkd being used, and whether that's acceptable, is extra work as well as potential duplication of logic that already exists in libvirt... Then again, something like virt-qemu-run is probably expected to expose basically the entire feature set of libvirt, as opposed to more targeted applications which will use qemu:///embed internally and only rely on a relatively small subset.

...

> If we don't do that, I'm afraid developers will lose trust in the > whole qemu:///embed idea. Speaking from my own experience, I was > certainly surprised when I accidentally realized my qemu:///embed > VMs were showing up in the output of machinectl, and now I'm kinda > wondering how many other ways the application I'm working on, for > which the use of libvirt is just an implementation detail, is poking > at the system without my knowledge... First off, this mis-understanding highlights a need for better documentation to set out what the embedded driver is and is not expected to do.

We definitely need to document this very clearly if we want qemu:///embed to gain traction.

...

At a high level the embedded QEMU driver - Isolated from any other instance of the QEMU driver

Yup.

...

- Process context of app is inherited by QEMU (namespaces, cgroups, CPU pinning, security context, etc)

Optionally! The fact that libvirt can deal with these is a big selling point in some scenarios.

...

- All other functionality is unchanged. Your comments above are equating two distinct scenarios, one which had a serious functional bug & violated the first two design goals, and one which does not cause any functional issues at all.

I'm not equating the two, just reporting a bunch of behaviors that I ran into while trying to use qemu:///embed in my application and that I found to be surprising, in an attempt to figure out which ones are intentional and whether even those are necessarily something that we want to keep around in their current form.

...

There's no question that we must fix the machined problem. There is nothing that needs fixing in the virtual network case as that behaviour is intentional and is not causing any ill effects. The embedded driver is NOT intended to be completely isolated from everything on the host, whether it is systemd, or another host OS service, a libvirt secondary driver, or something else again. In fact a libvirt secondary driver should really just be considered the same as any other host OS service. We just happen to use a familiar RPC API to talk to that secondary driver, where as for a host OS service we'd use DBus or some 3rd party API/protocol. We don't require a special config in the QEMU driver to permit use of other host OS services, we would simply do the right thing based on whatever domain XML the application provides. The same is true of use of libvirt secondary drivers. We must NEVER connect to a secondary driver, unless something in the domain XML requests that behaviour. The <interface type=network> is a valid reason to connect to the virnetworkd daemon.

I mean, one could argue that namespaces and CGroups are OS services, and whether or not those are used is decided not based on the per-VM configuration but the per-driver configuration :) But I'm not here to play devil's advocate, and I mostly get your point. Anyway, as I mentioned earlier part of the issue is having daemons running, which neither of the features mentioned above requires.

...

Of course when we do connect to virnetworkd, we MUST ensure that anything we do preserves isolation from other QEMU driver instances. I would also note that the virtnetworkd daemon connected to, may or may not actually be the same as the host OS one. It is entirely possible that the application has opened the embedded QEMU driver from within a namesace, that results in a connection to the /var/run/libvirt/virnetworkd being serviced by a completely isolated virnetworkd running inside a different network namespace from the host OS virtnetworkd. This is of no concern to the QEMU driver though - it just needs to honour what is in the domain XML.

This kind of setup sounds extremely confusing. Would you have multiple instances of virtnetworkd, one per network namespace, all running at the same time? How would the application pick the correct one to connect to? -- Andrea Bolognani / Red Hat / Virtualization

Daniel P. Berrangé

4 p.m.

On Tue, Mar 10, 2020 at 04:42:57PM +0100, Andrea Bolognani wrote:

...

On Mon, 2020-03-09 at 18:04 +0000, Daniel P. Berrangé wrote: > Of course when we do connect to virnetworkd, we MUST ensure that > anything we do preserves isolation from other QEMU driver instances. > > I would also note that the virtnetworkd daemon connected to, may > or may not actually be the same as the host OS one. It is entirely > possible that the application has opened the embedded QEMU driver > from within a namesace, that results in a connection to the > /var/run/libvirt/virnetworkd being serviced by a completely isolated > virnetworkd running inside a different network namespace from the > host OS virtnetworkd. This is of no concern to the QEMU driver > though - it just needs to honour what is in the domain XML. This kind of setup sounds extremely confusing. Would you have multiple instances of virtnetworkd, one per network namespace, all running at the same time? How would the application pick the correct one to connect to?

I forgot to mention that this is actually a scenario we'd like to support even ignoring namespaces. The big pain point for desktop virt apps like Boxes or Virt-manager is that the QEMU session driver lacked any sane networking connectivity. For this reason we invented the QEMU setuid network helper which runs as root and opens a TAP device connected to a bridge, passing it back to unprivileged libvirtd. This is really crude and not a general solution to the problem. The key design flaw in session libvirtd, was tieing together the privileges of the virt driver and the secondary drivers. So we had a 1-1 relation between them. What we really need to have is a N-N relation between them. The virtual network driver as it exists today, requires elevated privileges, but if we had a NetworkManager backed impl it could work unprivileged. The nwfilter driver requires privileges. The storage driver is a little different because some backends (directory backend) work when unprivileged, but other backends (SCSI, LVM, disk) only ever work when privileged. The split daemon model is intended to allow us to address this long standing design flaw, by allowing the QEMU session driver to optionally talk to a secondary driver running with different privileges, instead of the instance running with the same privs. So currently we have <interface type='network'> <source network='default'/> </interface> This refers to a secondary driver running at the same privilege level. I expected we'd extend it to allow <interface type='network'> <source scope='system' network='default'/> </interface> This explicitly requests the secondary driver running with elevated privileges. The same concept for disk storage <disk type='volume' device='disk'> <source pool='blk-pool0' volume='blk-pool0-vol0'/> </disk> Would be extended to allow <disk type='volume' device='disk'> <source scope='system' pool='blk-pool0' volume='blk-pool0-vol0'/> </disk> The same for host devices, and for nwfilter. With such functionality present, it logically then also extends to cover connections to daemons running in different namespaces. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani

6:25 p.m.

On Tue, 2020-03-10 at 16:00 +0000, Daniel P. Berrangé wrote:

...

The split daemon model is intended to allow us to address this long standing design flaw, by allowing the QEMU session driver to optionally talk to a secondary driver running with different privileges, instead of the instance running with the same privs. So currently we have <interface type='network'> <source network='default'/> </interface> This refers to a secondary driver running at the same privilege level. I expected we'd extend it to allow <interface type='network'> <source scope='system' network='default'/> </interface> This explicitly requests the secondary driver running with elevated privileges.

This makes perfect sense to me, and in fact I believe you're basically suggesting what I was arguing for earlier :) In your scenario, when you don't specify a scope you get the same one as the primary driver is using (this matches the current behavior): so if you are using qemu:///session, every <interface> will use network:///session unless you explicitly specify scope='system', at which point network:///system will be used; in the same fashion, if you're connected to qemu:///embed, the default for <interface>s should be network:///embed, with the possibility to use either network:///session (with scope='session') or, most likely, network:///system (with scope='system'). virtlogd would probably have to be treated a bit differently since it needs to be running even before the first guest is started, but other than that the same general principle should apply.

...

With such functionality present, it logically then also extends to cover connections to daemons running in different namespaces.

I'm still unclear on how this scenario, which would apparently have multiple eg. privileged virtnetworkd, running at the same time but in different network namespaces, would work. -- Andrea Bolognani / Red Hat / Virtualization

Daniel P. Berrangé

Wednesday, 11 March Wed, 11 Mar

9:53 a.m.

On Tue, Mar 10, 2020 at 07:25:46PM +0100, Andrea Bolognani wrote:

...

On Tue, 2020-03-10 at 16:00 +0000, Daniel P. Berrangé wrote: > The split daemon model is intended to allow us to address this > long standing design flaw, by allowing the QEMU session driver > to optionally talk to a secondary driver running with different > privileges, instead of the instance running with the same privs. > > So currently we have > > <interface type='network'> > <source network='default'/> > </interface> > > This refers to a secondary driver running at the same privilege > level. > > I expected we'd extend it to allow > > <interface type='network'> > <source scope='system' network='default'/> > </interface> > > This explicitly requests the secondary driver running with elevated > privileges. This makes perfect sense to me, and in fact I believe you're basically suggesting what I was arguing for earlier :) In your scenario, when you don't specify a scope you get the same one as the primary driver is using (this matches the current behavior): so if you are using qemu:///session, every <interface> will use network:///session unless you explicitly specify scope='system', at which point network:///system will be used; in the same fashion, if you're connected to qemu:///embed, the default for <interface>s should be network:///embed, with the possibility to use either network:///session (with scope='session') or, most likely, network:///system (with scope='system').

No, I'm not talking about using the same URI for the secondary drivers, I'm talking about using the same *privilege* level. ie, qemu:///system and a qemu:///embed running as root should both use the privileged network/storage driver. The qemu:///session and qemu:///embed running as non-root should both default to the unprivileged network/storage drivers.

...

> With such functionality present, it logically then also extends to > cover connections to daemons running in different namespaces. I'm still unclear on how this scenario, which would apparently have multiple eg. privileged virtnetworkd, running at the same time but in different network namespaces, would work.

There would need to be some selection method, most likely a way to explicitly specify the socket path, but this is a longish way into the future. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani

Thursday, 12 March Thu, 12 Mar

5:57 p.m.

On Wed, 2020-03-11 at 09:53 +0000, Daniel P. Berrangé wrote:

...

On Tue, Mar 10, 2020 at 07:25:46PM +0100, Andrea Bolognani wrote: > In your scenario, when you don't specify a scope you get the same > one as the primary driver is using (this matches the current > behavior): so if you are using qemu:///session, every <interface> > will use network:///session unless you explicitly specify > scope='system', at which point network:///system will be used; in > the same fashion, if you're connected to qemu:///embed, the default > for <interface>s should be network:///embed, with the possibility > to use either network:///session (with scope='session') or, most > likely, network:///system (with scope='system'). No, I'm not talking about using the same URI for the secondary drivers, I'm talking about using the same *privilege* level. ie, qemu:///system and a qemu:///embed running as root should both use the privileged network/storage driver. The qemu:///session and qemu:///embed running as non-root should both default to the unprivileged network/storage drivers.

What's the advantage of conflating URI and privilege level? It seems to me like it only makes things complicated when they could be absolutely straightforward instead. In fact, I believe libguestfs developers have long lamented the fact that it's not really possible to have qemu:///session for the root user, which is caused by the same kind of logic... It would be preferable, I think, not to introduce even more instances of it.

...

> > With such functionality present, it logically then also extends to > > cover connections to daemons running in different namespaces. > > I'm still unclear on how this scenario, which would apparently have > multiple eg. privileged virtnetworkd, running at the same time but in > different network namespaces, would work. There would need to be some selection method, most likely a way to explicitly specify the socket path, but this is a longish way into the future.

Got it! -- Andrea Bolognani / Red Hat / Virtualization

Michal Privoznik

Wednesday, 11 March Wed, 11 Mar

4:32 p.m.

On 3/10/20 4:42 PM, Andrea Bolognani wrote:

...

On Mon, 2020-03-09 at 18:04 +0000, Daniel P. Berrangé wrote: > On Mon, Mar 09, 2020 at 06:09:13PM +0100, Andrea Bolognani wrote: >> On Fri, 2020-03-06 at 17:49 +0000, Daniel P. Berrangé wrote: >>> On Fri, Mar 06, 2020 at 06:24:15PM +0100, Andrea Bolognani wrote: [...] >> Aside: instead of a per-VM setting, would it make sense for this to >> be a connection-level setting? That way, even on traditional libvirt >> deployments, the hypervisor admin could eg. opt out of machinectl >> registration if they so desire; at the same time, you'd avoid the >> situation where most VMs are confined using CGroups but a few are >> not, which is probably not a desirable scenario. > > Yes, functionally it would work fine as a connection level setting > too, though this hides the behaviour from the anyone looking at the > guest config. We've previously punted quite a few things to the > qemu.conf because we didn't want to go through process of representing > them in the domain XML. This was OK when the qemu.conf settings were > something done once at host deployment time. > > With the embedded driver, I think this is not so desirable, as means > to get the configuration they want from a VM, they need to deal with > two distinct config files. The ideal would be that everything that > is commonly needed can be achieved solely in the domain XML, and > I think resource control backend is one such common tunable. I don't have a strong opinion either way, and as far as my current use is concerned it doesn't bother me to have to deal with a second configuration file. The reason why I thought a per-VM setting might not be desirable is that applications would then be able to override it, and so maybe VMs created with virt-manager would be registered against machinectl but VMs created using GNOME Boxes would not, and if the sysadmin likes to use machinectl to get a comprehensive view of the system they'd no longer be guaranteed that. But if that's not the kind of scenario we think we should prevent, then a per-VM setting is fine by me :)

I still don't quite see the value in machinectl (maybe because I'm not using systemd :-D), but anyway - it's a system-wide monitor of virtual machines. Therefore it makes sense to register a domain started under qemu:///embed there. I don't view embed mode as a way of starting VMs secretly. It's a way of starting VMs privately and that's a different thing. Other users might learn that my app is running a VM (plain 'ps' would give it away), but they can not mangle with it in any way, e.g. change its XML.

...

[...] >> Right now we're already doing >> >> qemu-$domid-$domname >> >> where I'm not entirely sure how much $domid actually buys us. > > IIRC $domid was a hack because at one time we had problems with > systemd not cleaning up the transient scope when QEMU was killed. > This would prevent libvirt starting the guest again thereafter. > I can't remember now if this was a bug we fixed in systemd or > libvirt or both or neither.

It was introduced by bd773e74f0d1d1b9ebbfcaa645178316b4f2265c and the commit message links to this bug: https://bugs.freedesktop.org/show_bug.cgi?id=68370 which looks like fixed in systemd.

...

I see! It would be neat if we could get rid of it, assuming of course it's no longer needed on the platforms we target.

I don't think it's that simple. Machinectl poses some limitations on the name: either it has to be a FQDN or a simple name without any dots. And according to our code the strlen() must be <= 64 (don't know if that comes from machinectl or is just our own limitation). Therefore, if you have two domains which names would clash after our processing, the domain ID guarantees unique strings are passed to machined.

...

[...] >>> Of course you can turn off virtlogd via qemu.conf >> >> That's what I'm doing right now and it works well enough, but I'm >> afraid that requiring a bunch of setup will discourage developers >> from using the embedded driver. We should aim for a reasonable out >> of the box experience instead. > > Why do you need to turn it off though ? It should already > "do the right thing" as the log files should appear under a > different location and not have any clash. Turning it off > immediately creates a denial of service CVE in your application. I was getting the same SELinux denial that Michal reported a few days back: virtlogd wants to verify it's being connected to by a process running as root, but it's only allowed by the policy to look into libvirtd's /proc/$pid for this information. So, for the same reason virtqemud can't currently connect to virtlogd when SELinux is in enforcing mode, neither can my qemu:///embed-using application. Besides that, there is the fact that a lot of people, mainly those coming from a containers background, are not happy with having extra daemons running. I'm not saying they would prefer being hit by a DoS than having virtlogd running :) but they really, really don't like daemons :) > None the less, as above I think we need common things to be > controllable via the domain XML. So either we need to make a > tunable there for use of logd or not, or we need to implement > the FIFO idea to avoid need for logd at all. It seems like the FIFO idea (though I'll admit I don't fully understand it) would be the best of both worlds. [...] >>> If you don't want to use virtnetworkd, then you won't be >>> creating such an <interface> config in the first place. >>> The app will have the option to open an embedded seconary >>> driver if desired. Some of the drivers won't really make >>> sense as embedded things though, at least not without >>> extra work. ie a embedded network or nwfilter driver has >>> no value unless your app has moved into a new network >>> namespace, as otherwise it will just fight with the >>> global network driver. >> >> Again, I think our defaults for qemu:///embed should be consistent >> and conservative: instead of having developers opt out of using >> network:///system, they should have to opt in before global >> resources are involved. > > They already opt-in to using the network driver by virtue of > configuring their guest to request its use. We don't need to > opt-in twice. For some applications that's definitely an option, but others like virt-qemu-run accept a fairly arbitrary configuration and having to figure out whether that would result in eg. virtnetworkd being used, and whether that's acceptable, is extra work as well as potential duplication of logic that already exists in libvirt... Then again, something like virt-qemu-run is probably expected to expose basically the entire feature set of libvirt, as opposed to more targeted applications which will use qemu:///embed internally and only rely on a relatively small subset. >> If we don't do that, I'm afraid developers will lose trust in the >> whole qemu:///embed idea. Speaking from my own experience, I was >> certainly surprised when I accidentally realized my qemu:///embed >> VMs were showing up in the output of machinectl, and now I'm kinda >> wondering how many other ways the application I'm working on, for >> which the use of libvirt is just an implementation detail, is poking >> at the system without my knowledge... > > First off, this mis-understanding highlights a need for better > documentation to set out what the embedded driver is and is not > expected to do. We definitely need to document this very clearly if we want qemu:///embed to gain traction. > At a high level the embedded QEMU driver > > - Isolated from any other instance of the QEMU driver Yup. > - Process context of app is inherited by QEMU (namespaces, > cgroups, CPU pinning, security context, etc) Optionally! The fact that libvirt can deal with these is a big selling point in some scenarios. > - All other functionality is unchanged. > > Your comments above are equating two distinct scenarios, one which > had a serious functional bug & violated the first two design goals, > and one which does not cause any functional issues at all. I'm not equating the two, just reporting a bunch of behaviors that I ran into while trying to use qemu:///embed in my application and that I found to be surprising, in an attempt to figure out which ones are intentional and whether even those are necessarily something that we want to keep around in their current form. > There's no question that we must fix the machined problem.

I will try to post patches for this. Michal

Andrea Bolognani

Thursday, 12 March Thu, 12 Mar

11:57 a.m.

On Wed, 2020-03-11 at 17:32 +0100, Michal Privoznik wrote:

...

I still don't quite see the value in machinectl (maybe because I'm not using systemd :-D)

Honestly, so far I haven't been able to figure out the use case for registering libvirt VMs with machined either :) Most of the operations are either not supported (login, shell, start) or do not work as expected (list --all, reboot), so all you can really do is list the subset of libvirt VMs that happen to be running and power them off. I can't really imagine that being very useful to anyone... Am I missing something?

...

but anyway - it's a system-wide monitor of virtual machines. Therefore it makes sense to register a domain started under qemu:///embed there. I don't view embed mode as a way of starting VMs secretly. It's a way of starting VMs privately and that's a different thing. Other users might learn that my app is running a VM (plain 'ps' would give it away), but they can not mangle with it in any way, e.g. change its XML.

Of course it's not about secrecy, but for the same reasons qemu:///embed VMs don't show up in the output of 'virsh list' it also makes sense for them to be omitted from that of 'machinectl list', I think. It's easier to make the case for virsh because the expectation is that, if they show up, you can perform various operations such as edit or dumpxml or whatever on them; for machinectl, since the only operation that seems to work is poweroff anyway, and that's apparently achieved not by calling libvirt APIs but by killing the QEMU process directly, then the case is not as obvious - but I think the same rationale still applies.

...

> > > Right now we're already doing > > > > > > qemu-$domid-$domname > > > > > > where I'm not entirely sure how much $domid actually buys us. > > > > IIRC $domid was a hack because at one time we had problems with > > systemd not cleaning up the transient scope when QEMU was killed. > > This would prevent libvirt starting the guest again thereafter. > > I can't remember now if this was a bug we fixed in systemd or > > libvirt or both or neither. It was introduced by bd773e74f0d1d1b9ebbfcaa645178316b4f2265c and the commit message links to this bug: https://bugs.freedesktop.org/show_bug.cgi?id=68370 which looks like fixed in systemd.

That doesn't look right.

...

From what I can gather, $domid was added much more recently,

specifically in c3bd0019c0e3f080dbf0d4bd08245ffb2daa2765, in order to addresses https://bugzilla.redhat.com/show_bug.cgi?id=1282846. While I get the overall idea behind the changes, I don't quite understand the rationale behind adding $domid specifically...

...

> I see! It would be neat if we could get rid of it, assuming of course > it's no longer needed on the platforms we target. I don't think it's that simple. Machinectl poses some limitations on the name: either it has to be a FQDN or a simple name without any dots. And according to our code the strlen() must be <= 64 (don't know if that comes from machinectl or is just our own limitation). Therefore, if you have two domains which names would clash after our processing, the domain ID guarantees unique strings are passed to machined.

... but here it is! Makes sense now :) [...]

...

> > There's no question that we must fix the machined problem. I will try to post patches for this.

That'd be awesome! Thanks in advance :) -- Andrea Bolognani / Red Hat / Virtualization

Daniel P. Berrangé

12:09 p.m.

On Thu, Mar 12, 2020 at 12:57:36PM +0100, Andrea Bolognani wrote:

...

On Wed, 2020-03-11 at 17:32 +0100, Michal Privoznik wrote: > I still don't quite see the value in machinectl (maybe because I'm not > using systemd :-D) Honestly, so far I haven't been able to figure out the use case for registering libvirt VMs with machined either :) Most of the operations are either not supported (login, shell, start) or do not work as expected (list --all, reboot), so all you can really do is list the subset of libvirt VMs that happen to be running and power them off. I can't really imagine that being very useful to anyone... Am I missing something?

Yeah, pretty much all you get is a way to report & terminate VMs via systemd commands. A few others things could be wired up, but no one ever made an effort todo so and I don't think it is worth it. So I'm getting inclined to consider machined a failed experiment from POV of VMs - still makes sense for containers. That said I'd still keep using it, because we need systemd to deal with cgroups creation no matter what, and its no worse to talk to systemd via machined than directly.

...

> but anyway - it's a system-wide monitor of virtual > machines. Therefore it makes sense to register a domain started under > qemu:///embed there. I don't view embed mode as a way of starting VMs > secretly. It's a way of starting VMs privately and that's a different > thing. Other users might learn that my app is running a VM (plain 'ps' > would give it away), but they can not mangle with it in any way, e.g. > change its XML. Of course it's not about secrecy, but for the same reasons qemu:///embed VMs don't show up in the output of 'virsh list' it also makes sense for them to be omitted from that of 'machinectl list', I think.

Yes, I agree with this. The only even slightly plausible use case for machinectl to list a full set of guest OS running on the host. This just about makes sense for traditional data center / cloud virt use case. I don't think it makes sense when KVM is merely used as an infrastructure building block embedded in applications. As such, I think we should NOT register with machined or systemd at all, for embedded VMs, without an explicit opt-in. We should flip to just inheriting the calling processes cgroups context, to align with the goal that embedded driver should generally aim to inherit all process context. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani

12:50 p.m.

On Thu, 2020-03-12 at 12:09 +0000, Daniel P. Berrangé wrote:

...

On Thu, Mar 12, 2020 at 12:57:36PM +0100, Andrea Bolognani wrote: > Honestly, so far I haven't been able to figure out the use case for > registering libvirt VMs with machined either :) > > Most of the operations are either not supported (login, shell, start) > or do not work as expected (list --all, reboot), so all you can > really do is list the subset of libvirt VMs that happen to be running > and power them off. I can't really imagine that being very useful to > anyone... Am I missing something? Yeah, pretty much all you get is a way to report & terminate VMs via systemd commands. A few others things could be wired up, but no one ever made an effort todo so and I don't think it is worth it. So I'm getting inclined to consider machined a failed experiment from POV of VMs - still makes sense for containers. That said I'd still keep using it, because we need systemd to deal with cgroups creation no matter what, and its no worse to talk to systemd via machined than directly.

Would it make sense to default to not registering with machined? It would probably be more honest of us, considering how severely limited the functionality is... Set expectations right and all that. The fact that not even reboot, one of the only three operations available through machinectl, works correctly (it will shut down the VM instead of restarting it) leads me to believe that nobody is actually using this anyway.

...

> Of course it's not about secrecy, but for the same reasons > qemu:///embed VMs don't show up in the output of 'virsh list' it > also makes sense for them to be omitted from that of 'machinectl > list', I think. Yes, I agree with this. The only even slightly plausible use case for machinectl to list a full set of guest OS running on the host. This just about makes sense for traditional data center / cloud virt use case. I don't think it makes sense when KVM is merely used as an infrastructure building block embedded in applications. As such, I think we should NOT register with machined or systemd at all, for embedded VMs, without an explicit opt-in. We should flip to just inheriting the calling processes cgroups context, to align with the goal that embedded driver should generally aim to inherit all process context.

Cool. Let's just make sure there is a way for qemu:///embed users to explicitly opt-in into qemu:///system-style CGroup handling, hopefully without machined getting in the way, before we flip the switch. Are you going to revive the old patch you linked to a few day ago? -- Andrea Bolognani / Red Hat / Virtualization

Daniel P. Berrangé

1:05 p.m.

On Thu, Mar 12, 2020 at 01:50:49PM +0100, Andrea Bolognani wrote:

...

On Thu, 2020-03-12 at 12:09 +0000, Daniel P. Berrangé wrote: > On Thu, Mar 12, 2020 at 12:57:36PM +0100, Andrea Bolognani wrote: > > Honestly, so far I haven't been able to figure out the use case for > > registering libvirt VMs with machined either :) > > > > Most of the operations are either not supported (login, shell, start) > > or do not work as expected (list --all, reboot), so all you can > > really do is list the subset of libvirt VMs that happen to be running > > and power them off. I can't really imagine that being very useful to > > anyone... Am I missing something? > > Yeah, pretty much all you get is a way to report & terminate VMs via > systemd commands. A few others things could be wired up, but no one > ever made an effort todo so and I don't think it is worth it. > > So I'm getting inclined to consider machined a failed experiment from > POV of VMs - still makes sense for containers. That said I'd still > keep using it, because we need systemd to deal with cgroups creation > no matter what, and its no worse to talk to systemd via machined > than directly. Would it make sense to default to not registering with machined? It would probably be more honest of us, considering how severely limited the functionality is... Set expectations right and all that. The fact that not even reboot, one of the only three operations available through machinectl, works correctly (it will shut down the VM instead of restarting it) leads me to believe that nobody is actually using this anyway. > > Of course it's not about secrecy, but for the same reasons > > qemu:///embed VMs don't show up in the output of 'virsh list' it > > also makes sense for them to be omitted from that of 'machinectl > > list', I think. > > Yes, I agree with this. > > The only even slightly plausible use case for machinectl to list > a full set of guest OS running on the host. This just about makes > sense for traditional data center / cloud virt use case. I don't > think it makes sense when KVM is merely used as an infrastructure > building block embedded in applications. As such, I think we should > NOT register with machined or systemd at all, for embedded VMs, without > an explicit opt-in. We should flip to just inheriting the calling > processes cgroups context, to align with the goal that embedded driver > should generally aim to inherit all process context. Cool. Let's just make sure there is a way for qemu:///embed users to explicitly opt-in into qemu:///system-style CGroup handling, hopefully without machined getting in the way, before we flip the switch. Are you going to revive the old patch you linked to a few day ago?

Dunno how well it will still apply, but yeah, something approximately the same as that would suit us Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani

Wednesday, 18 March Wed, 18 Mar

3:47 p.m.

On Mon, 2020-03-09 at 18:04 +0000, Daniel P. Berrangé wrote:

...

At a high level the embedded QEMU driver - Isolated from any other instance of the QEMU driver

Replying here because it looks as good a place as any. Now that Michal has made it so that identically-name domains defined under different qemu:///embed roots no longer clash at the machined level (thanks!), I have immediately bumped into the next issue: if I use either one of <memoryBacking> <hugepages/> </memoryBacking> <memoryBacking> <access mode='shared'/> </memoryBacking> both qemu:///embed instances try to use the same paths: /dev/hugepages/libvirt/qemu/$domid-$domname/ /dev/shm/libvirt/qemu/$domid-$domname/ which obviously breaks everything. In this case, is the application expected to set hugetlbfs_mount = "/dev/hugepages/$random" memory_backing_dir = "/dev/shm/$random" in qemu.conf beforehand (I'm not sure the former setting would even work), or should libvirt take care of it automatically in a similar way it now does for machine names? -- Andrea Bolognani / Red Hat / Virtualization

Michal Prívozník

5:01 p.m.

On 18. 3. 2020 16:47, Andrea Bolognani wrote:

...

On Mon, 2020-03-09 at 18:04 +0000, Daniel P. Berrangé wrote: > At a high level the embedded QEMU driver > > - Isolated from any other instance of the QEMU driver Replying here because it looks as good a place as any. Now that Michal has made it so that identically-name domains defined under different qemu:///embed roots no longer clash at the machined level (thanks!), I have immediately bumped into the next issue: if I use either one of <memoryBacking> <hugepages/> </memoryBacking> <memoryBacking> <access mode='shared'/> </memoryBacking> both qemu:///embed instances try to use the same paths: /dev/hugepages/libvirt/qemu/$domid-$domname/ /dev/shm/libvirt/qemu/$domid-$domname/

This is because qemu driver uses virDomainDefGetShortName() to construct the path which is not embed driver aware. In fact, we use it on a few other places: libvirt.git $ git grep -C0 virDomainDefGetShortName -- src/qemu/ \ | wc -l 15 so we probably have more broken plaes. I will investigate and post a patch. Probably something among those lines that fixed machined name generation.

...

which obviously breaks everything. In this case, is the application expected to set hugetlbfs_mount = "/dev/hugepages/$random" memory_backing_dir = "/dev/shm/$random" in qemu.conf beforehand (I'm not sure the former setting would even work), or should libvirt take care of it automatically in a similar way it now does for machine names?

The latter is true. Michal

Andrea Bolognani

Thursday, 19 March Thu, 19 Mar

9:21 a.m.

On Wed, 2020-03-18 at 18:01 +0100, Michal Prívozník wrote:

...

On 18. 3. 2020 16:47, Andrea Bolognani wrote: > if I use either one of > > <memoryBacking> > <hugepages/> > </memoryBacking> > > <memoryBacking> > <access mode='shared'/> > </memoryBacking> > > both qemu:///embed instances try to use the same paths: > > /dev/hugepages/libvirt/qemu/$domid-$domname/ > > /dev/shm/libvirt/qemu/$domid-$domname/ This is because qemu driver uses virDomainDefGetShortName() to construct the path which is not embed driver aware. In fact, we use it on a few other places: libvirt.git $ git grep -C0 virDomainDefGetShortName -- src/qemu/ \ | wc -l 15 so we probably have more broken plaes. I will investigate and post a patch. Probably something among those lines that fixed machined name generation.

I'm not sure changing virDomainDefGetShortName() to include the embed root hash all the time, which I assume is what you had in mind, is a good enough solution. That would work very well for /dev/shm and friends, but if the path you're building is *inside* the embed root, then, you just added 15 characters to the path of any file. This wouldn't be too bad if not for the fact that the path to a UNIX socket can only be up to 108 characters long, which is limiting enough already... Can we make the whole thing smart enough that the embed root hash will appear in paths that are outside of the embed root, but not in those that are inside it? -- Andrea Bolognani / Red Hat / Virtualization

Daniel P. Berrangé

9:59 a.m.

On Thu, Mar 19, 2020 at 10:21:39AM +0100, Andrea Bolognani wrote:

...

On Wed, 2020-03-18 at 18:01 +0100, Michal Prívozník wrote: > On 18. 3. 2020 16:47, Andrea Bolognani wrote: > > if I use either one of > > > > <memoryBacking> > > <hugepages/> > > </memoryBacking> > > > > <memoryBacking> > > <access mode='shared'/> > > </memoryBacking> > > > > both qemu:///embed instances try to use the same paths: > > > > /dev/hugepages/libvirt/qemu/$domid-$domname/ > > > > /dev/shm/libvirt/qemu/$domid-$domname/ > > This is because qemu driver uses virDomainDefGetShortName() to construct > the path which is not embed driver aware. In fact, we use it on a few > other places: > > libvirt.git $ git grep -C0 virDomainDefGetShortName -- src/qemu/ \ > | wc -l > 15 > > so we probably have more broken plaes. I will investigate and post a > patch. Probably something among those lines that fixed machined name > generation. I'm not sure changing virDomainDefGetShortName() to include the embed root hash all the time, which I assume is what you had in mind, is a good enough solution. That would work very well for /dev/shm and friends, but if the path you're building is *inside* the embed root, then, you just added 15 characters to the path of any file. This wouldn't be too bad if not for the fact that the path to a UNIX socket can only be up to 108 characters long, which is limiting enough already... Can we make the whole thing smart enough that the embed root hash will appear in paths that are outside of the embed root, but not in those that are inside it?

This is probably a matter of having 2 methods for building a unique name. One for use cases interacting with global resources, and one for use cases interacting with isolated resources. Then we'll need to make sure the driver calls the right one in each context. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

1935

days inactive

1948

days old

devel@lists.libvirt.org

Manage subscription

19 comments

5 participants

tags (0)

participants (5)

Andrea Bolognani
Christophe de Dinechin
Daniel P. Berrangé
Michal Privoznik
Michal Prívozník

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

qemu:///embed and isolation from global components