[libvirt] [PATCH] Fix race in starting transient VMs

From: "Daniel P. Berrange" <berrange@redhat.com> When starting a transient VM the first thing done is to check for duplicates. The check looks if there are any running VMs with the matching name/uuid. It explicitly allows there to be inactive VMs, so that a persistent VM can be temporarily booted with a different config. There is a race condition, however, where 2 or more clients try to create the same transient VM. The first client will cause a virDomainObjPtr to be added to the domain list, and it is inactive at this stage. The second client may then come along and see this inactive VM, and mistake it for a persistent VM. If the first VM fails to start its transient guest for any reason, then it'll remove the virDomainObjPtr from the list. The second client now has a virDomainObjPtr that it can try to boot, which libvirt no longer has a record of. The result can be a running QEMU process that is orphaned. It was also, however, possible for the virDomainObjPtr to be completely free'd which will cause libvirtd to crash in some scenarios. The fix is to only allow an existing inactive VM if it is marked as persistent. Signed-off-by: Daniel P. Berrange <berrange@redhat.com> --- src/conf/domain_conf.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 51c4e29..454fbfe 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2171,6 +2171,12 @@ virDomainObjListAddLocked(virDomainObjListPtr doms, vm->def->name); goto error; } + if (!vm->persistent) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("domain is being started as '%s'"), + vm->def->name); + goto error; + } } virDomainObjAssignDef(vm, -- 1.8.3.1

On 10/31/2013 12:41 PM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When starting a transient VM the first thing done is to check for duplicates. The check looks if there are any running VMs with the matching name/uuid. It explicitly allows there to be inactive VMs, so that a persistent VM can be temporarily booted with a different config.
The fix is to only allow an existing inactive VM if it is marked as persistent.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com> --- src/conf/domain_conf.c | 6 ++++++ 1 file changed, 6 insertions(+)
ACK. What a nasty bug to track down.
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 51c4e29..454fbfe 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2171,6 +2171,12 @@ virDomainObjListAddLocked(virDomainObjListPtr doms, vm->def->name); goto error; } + if (!vm->persistent) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("domain is being started as '%s'"),
The error is a bit awkward (I ask to create a domain named 'foo', and the error message is 'domain is being started as 'foo''), but no worse than the error message a couple lines earlier ('domain is already active as 'foo''). Maybe "domain '%s' is already being started" reads better. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 10/31/2013 01:00 PM, Eric Blake wrote:
On 10/31/2013 12:41 PM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When starting a transient VM the first thing done is to check for duplicates. The check looks if there are any running VMs with the matching name/uuid. It explicitly allows there to be inactive VMs, so that a persistent VM can be temporarily booted with a different config.
The fix is to only allow an existing inactive VM if it is marked as persistent.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com> --- src/conf/domain_conf.c | 6 ++++++ 1 file changed, 6 insertions(+)
ACK. What a nasty bug to track down.
If I'm correct, this bug is a regression from the time that we first converted to no longer holding the driver lock around entire API calls (commit a9e97e0, v1.0.3) -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Thu, Oct 31, 2013 at 01:02:54PM -0600, Eric Blake wrote:
On 10/31/2013 01:00 PM, Eric Blake wrote:
On 10/31/2013 12:41 PM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When starting a transient VM the first thing done is to check for duplicates. The check looks if there are any running VMs with the matching name/uuid. It explicitly allows there to be inactive VMs, so that a persistent VM can be temporarily booted with a different config.
The fix is to only allow an existing inactive VM if it is marked as persistent.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com> --- src/conf/domain_conf.c | 6 ++++++ 1 file changed, 6 insertions(+)
ACK. What a nasty bug to track down.
If I'm correct, this bug is a regression from the time that we first converted to no longer holding the driver lock around entire API calls (commit a9e97e0, v1.0.3)
No, it goes waaaaaaaaaaaaaaaaaaaaaaaaaaaaaay back. It at least affects 0.10.2, and probably all earlier versions too, for as long as the qemuDomainBeginJobWithDriver code has existed. Removing the driver lock merely made the problem worse, as it introduced a crash scenario, as well as the previous orphaned VMs problem. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, Oct 31, 2013 at 01:00:01PM -0600, Eric Blake wrote:
On 10/31/2013 12:41 PM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When starting a transient VM the first thing done is to check for duplicates. The check looks if there are any running VMs with the matching name/uuid. It explicitly allows there to be inactive VMs, so that a persistent VM can be temporarily booted with a different config.
The fix is to only allow an existing inactive VM if it is marked as persistent.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com> --- src/conf/domain_conf.c | 6 ++++++ 1 file changed, 6 insertions(+)
ACK. What a nasty bug to track down.
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 51c4e29..454fbfe 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2171,6 +2171,12 @@ virDomainObjListAddLocked(virDomainObjListPtr doms, vm->def->name); goto error; } + if (!vm->persistent) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("domain is being started as '%s'"),
The error is a bit awkward (I ask to create a domain named 'foo', and the error message is 'domain is being started as 'foo''), but no worse than the error message a couple lines earlier ('domain is already active as 'foo''). Maybe "domain '%s' is already being started" reads better.
Ok, changing it thus: @@ -2167,7 +2167,13 @@ virDomainObjListAddLocked(virDomainObjListPtr doms, /* UUID & name match, but if VM is already active, refuse it */ if (virDomainObjIsActive(vm)) { virReportError(VIR_ERR_OPERATION_INVALID, - _("domain is already active as '%s'"), + _("domain '%s' is already active"), + vm->def->name); + goto error; + } + if (!vm->persistent) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("domain '%s' is already being started"), vm->def->name); goto error; } Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (2)
-
Daniel P. Berrange
-
Eric Blake