[libvirt] [RFC PATCH] Use PAUSED state for domains that are starting up

When libvirt is starting a domain, it reports the state as SHUTOFF until it's RUNNING. This is not ideal because domain startup may take a long time (usually because of some configuration issues, firewalls blocking access to network disks, etc.) and domain lists provided by libvirt look awkward. One can see weird shutoff domains with IDs in a list of active domains or even shutoff transient domains. In any case, it looks more like a bug in libvirt than a normal state a domain goes through.

I'm not quite sure what the best way to fix this is. In this patch, I tried to use PAUSED state with STARTING_UP reason. Alternatively, we could keep using SHUTOFF state and just set STARTING_UP reason instead of UNKNOWN but it just feels wrong and wouldn't really solve the confusion when looking at virsh list.

I made the change to qemu driver only in this RFC patch, I will update all drivers once we agree on the best approach.

Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
---
 include/libvirt/libvirt-domain.h |  1 +
 src/conf/domain_conf.c           |  3 ++-
 src/qemu/qemu_process.c          | 22 ++++++++++++++--------
 tools/virsh-domain-monitor.c     |  3 ++-
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index 4dbd7f5..90150f6 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -116,6 +116,7 @@ typedef enum {
     VIR_DOMAIN_PAUSED_SHUTTING_DOWN = 8, /* paused during shutdown process */
     VIR_DOMAIN_PAUSED_SNAPSHOT = 9,      /* paused while creating a snapshot */
     VIR_DOMAIN_PAUSED_CRASHED = 10,      /* paused due to a guest crash */
+    VIR_DOMAIN_PAUSED_STARTING_UP = 11,  /* the domain is being started */
 
 # ifdef VIR_ENUM_SENTINELS
     VIR_DOMAIN_PAUSED_LAST
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index b3d63f8..2b7c5bf 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -661,7 +661,8 @@ VIR_ENUM_IMPL(virDomainPausedReason, VIR_DOMAIN_PAUSED_LAST,
               "from snapshot",
               "shutdown",
               "snapshot",
-              "panicked")
+              "panicked",
+              "starting up")
 
 VIR_ENUM_IMPL(virDomainShutdownReason, VIR_DOMAIN_SHUTDOWN_LAST,
               "unknown",
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 1d4e957..d317b19 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -3412,6 +3412,7 @@ qemuProcessUpdateState(virQEMUDriverPtr driver, virDomainObjPtr vm)
     virDomainState state;
     virDomainPausedReason reason;
     virDomainState newState = VIR_DOMAIN_NOSTATE;
+    int oldReason;
     int newReason;
     bool running;
     char *msg = NULL;
@@ -3425,9 +3426,16 @@ qemuProcessUpdateState(virQEMUDriverPtr driver, virDomainObjPtr vm)
     if (ret < 0)
         return -1;
 
-    state = virDomainObjGetState(vm, NULL);
+    state = virDomainObjGetState(vm, &oldReason);
 
-    if (state == VIR_DOMAIN_PAUSED && running) {
+    if (running &&
+        (state == VIR_DOMAIN_SHUTOFF ||
+         (state == VIR_DOMAIN_PAUSED &&
+          oldReason == VIR_DOMAIN_PAUSED_STARTING_UP))) {
+        newState = VIR_DOMAIN_RUNNING;
+        newReason = VIR_DOMAIN_RUNNING_BOOTED;
+        ignore_value(VIR_STRDUP_QUIET(msg, "finished booting"));
+    } else if (state == VIR_DOMAIN_PAUSED && running) {
         newState = VIR_DOMAIN_RUNNING;
         newReason = VIR_DOMAIN_RUNNING_UNPAUSED;
         ignore_value(VIR_STRDUP_QUIET(msg, "was unpaused"));
@@ -3446,10 +3454,6 @@ qemuProcessUpdateState(virQEMUDriverPtr driver, virDomainObjPtr vm)
             ignore_value(virAsprintf(&msg, "was paused (%s)",
                                      virDomainPausedReasonTypeToString(reason)));
         }
-    } else if (state == VIR_DOMAIN_SHUTOFF && running) {
-        newState = VIR_DOMAIN_RUNNING;
-        newReason = VIR_DOMAIN_RUNNING_BOOTED;
-        ignore_value(VIR_STRDUP_QUIET(msg, "finished booting"));
     }
 
     if (newState != VIR_DOMAIN_NOSTATE) {
@@ -3817,7 +3821,9 @@ qemuProcessReconnect(void *opaque)
         goto error;
 
     state = virDomainObjGetState(obj, &reason);
-    if (state == VIR_DOMAIN_SHUTOFF) {
+    if (state == VIR_DOMAIN_SHUTOFF ||
+        (state == VIR_DOMAIN_PAUSED &&
+         reason == VIR_DOMAIN_PAUSED_STARTING_UP)) {
         VIR_DEBUG("Domain '%s' wasn't fully started yet, killing it",
                   obj->def->name);
         goto error;
@@ -4435,7 +4441,7 @@ int qemuProcessStart(virConnectPtr conn,
 
     vm->def->id = qemuDriverAllocateID(driver);
     qemuDomainSetFakeReboot(driver, vm, false);
-    virDomainObjSetState(vm, VIR_DOMAIN_SHUTOFF, VIR_DOMAIN_SHUTOFF_UNKNOWN);
+    virDomainObjSetState(vm, VIR_DOMAIN_PAUSED, VIR_DOMAIN_PAUSED_STARTING_UP);
 
     if (virAtomicIntInc(&driver->nactive) == 1 && driver->inhibitCallback)
         driver->inhibitCallback(true, driver->inhibitOpaque);
diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
index 925eb1b..da23ace 100644
--- a/tools/virsh-domain-monitor.c
+++ b/tools/virsh-domain-monitor.c
@@ -184,7 +184,8 @@ VIR_ENUM_IMPL(vshDomainPausedReason,
               N_("from snapshot"),
               N_("shutting down"),
               N_("creating snapshot"),
-              N_("crashed"))
+              N_("crashed"),
+              N_("starting up"))
 
 VIR_ENUM_DECL(vshDomainShutdownReason)
 VIR_ENUM_IMPL(vshDomainShutdownReason,
-- 
2.3.0
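
For readers following the qemuProcessUpdateState hunks, the branch order the patch introduces can be modelled outside libvirt as a small pure function. This is only a sketch under stated assumptions: the enums, the struct and the update_state helper below are hypothetical stand-ins, not libvirt definitions, and the "running to paused" branch of the real function is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the libvirt enums; values are illustrative. */
enum st_state { ST_NOSTATE, ST_RUNNING, ST_PAUSED, ST_SHUTOFF };
enum st_paused_reason { ST_PAUSED_UNKNOWN, ST_PAUSED_USER, ST_PAUSED_STARTING_UP };
enum st_running_reason { ST_RUNNING_UNKNOWN, ST_RUNNING_BOOTED, ST_RUNNING_UNPAUSED };

struct st_transition {
    enum st_state new_state;        /* ST_NOSTATE means "leave state alone" */
    enum st_running_reason reason;
    const char *msg;                /* NULL when nothing changed */
};

/* Hypothetical helper mirroring the patched branch order: a domain that
 * was SHUTOFF or PAUSED/STARTING_UP and is now running has finished
 * booting; any other paused domain that is running was unpaused. */
static struct st_transition
update_state(enum st_state state, int old_reason, bool running)
{
    struct st_transition t = { ST_NOSTATE, ST_RUNNING_UNKNOWN, NULL };

    if (running &&
        (state == ST_SHUTOFF ||
         (state == ST_PAUSED && old_reason == ST_PAUSED_STARTING_UP))) {
        t.new_state = ST_RUNNING;
        t.reason = ST_RUNNING_BOOTED;
        t.msg = "finished booting";
    } else if (state == ST_PAUSED && running) {
        t.new_state = ST_RUNNING;
        t.reason = ST_RUNNING_UNPAUSED;
        t.msg = "was unpaused";
    }
    return t;
}
```

The point of checking STARTING_UP before the generic PAUSED case is that otherwise a freshly started domain would be reported as "unpaused" rather than "booted".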

On Mon, Feb 16, 2015 at 03:50:41PM +0100, Jiri Denemark wrote:
When libvirt is starting a domain, it reports the state as SHUTOFF until it's RUNNING. This is not ideal because domain startup may take a long time (usually because of some configuration issues, firewalls blocking access to network disks, etc.) and domain lists provided by libvirt look awkward. One can see weird shutoff domains with IDs in a list of active domains or even shutoff transient domains. In any case, it looks more like a bug in libvirt than a normal state a domain goes through.
A shutoff transient domain isn't too bad IMHO, but a shutoff domain with an ID number is definitely not expected.

Could we perhaps address it by ensuring that we always return '-1' for ID if the state is "SHUTOFF", even if def->id has a positive value?

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
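
The alternative raised here, masking the ID while the state is SHUTOFF, could look roughly like the following. It is a minimal sketch, not libvirt code: struct dom_obj, the DOM_* enum and reported_id are hypothetical names introduced only to illustrate the idea.

```c
/* Hypothetical stand-ins; not libvirt types. */
enum dom_state { DOM_NOSTATE, DOM_RUNNING, DOM_PAUSED, DOM_SHUTOFF };

struct dom_obj {
    int id;                 /* may already be allocated during startup */
    enum dom_state state;
};

/* Report -1 whenever the domain is SHUTOFF, even if an ID has been
 * allocated internally, so callers never see a shutoff domain with an ID. */
static int
reported_id(const struct dom_obj *dom)
{
    return dom->state == DOM_SHUTOFF ? -1 : dom->id;
}
```

This hides the odd-looking ID, but it does not tell a caller that a process exists behind the domain.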

On Mon, Feb 16, 2015 at 14:57:17 +0000, Daniel P. Berrange wrote:
A shutoff transient domain isn't too bad IMHO, but a shutoff domain with an ID number is definitely not expected.
Could we perhaps address it by ensuring that we always return '-1' for ID if the state is "SHUTOFF", even if def->id has a positive value ?
But we should somehow make it clear that the domain is actually there, only not completely usable. That is, one may need to actually call virsh destroy on such a domain to get rid of the leftover process if something goes wrong.

Jirka

On Mon, Feb 16, 2015 at 04:03:50PM +0100, Jiri Denemark wrote:
But we should somehow make it clear that the domain is actually there, only not completely usable. That is, one may need to actually call virsh destroy on such a domain to get rid of the leftover process if something goes wrong.
Hmm, if something goes wrong during virDomainStart though, we should be tearing down the QEMU process. IIRC we should even be kill -9'ing QEMU, so even if QEMU is stuck in an uninterruptible sleep and won't exit, once the (storage?) problem causing that sleep is resolved QEMU will exit without further intervention. Similarly, calling 'destroy' more times won't make it any more likely to quit once it has had a SIGKILL.

Regards,
Daniel

On Mon, Feb 16, 2015 at 15:07:19 +0000, Daniel P. Berrange wrote:
Hmm, if something goes wrong during virDomainStart though, we should be tearing down the QEMU process. IIRC we should even be kill -9'ing QEMU, so even if QEMU is stuck in an uninterruptible sleep and won't exit, once the (storage?) problem causing that sleep is resolved QEMU will exit without further intervention. Similarly, calling 'destroy' more times won't make it any more likely to quit once it has had a SIGKILL.
You're right of course. However, I still feel we should distinguish a shutoff domain from a domain that is being started. Considering it shutoff until we have a monitor connection may cause all sorts of confusion. Besides shutoff transient domains, one can see a shutoff domain that cannot be started because it is already running (or perhaps because acquiring a job fails). It is also impossible to distinguish a domain which was running previously and wasn't cleaned up for whatever reason (most likely a bug in libvirt) from the normal state in which libvirt is waiting for a monitor to show up...

Jirka

On Thu, Feb 19, 2015 at 05:07:45PM +0100, Jiri Denemark wrote:
You're right of course. However, I still feel we should distinguish a shutoff domain from a domain that is being started. Considering it shutoff until we have a monitor connection may cause all sorts of confusion. Besides shutoff transient domains, one can see a shutoff domain that cannot be started because it is already running (or perhaps because acquiring a job fails). It is also impossible to distinguish a domain which was running previously and wasn't cleaned up for whatever reason (most likely a bug in libvirt) from the normal state in which libvirt is waiting for a monitor to show up...
It kind of feels like it merits a new state, but I fear that would cause more problems for existing apps which won't be expecting it. So perhaps using 'paused' during startup is the least worst option ?

Regards,
Daniel
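
The compatibility argument above can be illustrated with a small client-side sketch. Everything here is hypothetical (the APP_* enums, the value 99 for an imagined new state, and both helpers); it only shows why reusing PAUSED with a new reason is less disruptive than adding a state: a client that ignores reasons still recognises the state, while a newer client can look at the reason.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the states and reasons a client might know. */
enum app_state { APP_NOSTATE, APP_RUNNING, APP_PAUSED, APP_SHUTOFF };
enum app_paused_reason { APP_PAUSED_UNKNOWN, APP_PAUSED_USER,
                         APP_PAUSED_STARTING_UP };

/* An existing client: knows a fixed set of states and ignores reasons.
 * Any state outside that set is something it was never written for. */
static bool
legacy_client_understands(int state)
{
    switch (state) {
    case APP_NOSTATE:
    case APP_RUNNING:
    case APP_PAUSED:
    case APP_SHUTOFF:
        return true;
    default:
        return false;
    }
}

/* A newer client: additionally checks the reason to detect startup. */
static bool
is_starting_up(int state, int reason)
{
    return state == APP_PAUSED && reason == APP_PAUSED_STARTING_UP;
}
```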

On Thu, Feb 19, 2015 at 16:09:55 +0000, Daniel P. Berrange wrote:
It kind of feels like it merits a new state, but I fear that would cause more problems for existing apps which won't be expecting it. So perhaps using 'paused' during startup is the least worst option ?
Exactly, and that's basically what I did in the patch we are discussing :-)

Jirka

On Mon, Feb 16, 2015 at 03:50:41PM +0100, Jiri Denemark wrote:
When libvirt is starting a domain, it reports the state as SHUTOFF until it's RUNNING. This is not ideal because domain startup may take a long time (usually because of some configuration issues, firewalls blocking access to network disks, etc.) and domain lists provided by libvirt look awkward. One can see weird shutoff domains with IDs in a list of active domains or even shutoff transient domains. In any case, it looks more like a bug in libvirt than a normal state a domain goes through.
I'm not quite sure what the best way to fix this is. In this patch, I tried to use PAUSED state with STARTING_UP reason. Alternatively, we could keep using SHUTOFF state and just set STARTING_UP reason instead of UNKNOWN but it just feels wrong and wouldn't really solve the confusion when looking at virsh list.
I made the change to qemu driver only in this RFC patch, I will update all drivers once we agree on the best approach.
So, yeah, I think this proposal is the least-worst option I see.

Regards,
Daniel
participants (2)
- Daniel P. Berrange
- Jiri Denemark