[libvirt] [RFC PATCH] qemu: Fail APIs not allowed during async job

When an asynchronous job is running while another API that is
incompatible with that job is called, we now try to wait until the job
finishes and either run the API or fail with a timeout. I guess a nicer
solution is to just fail such an API immediately and let the
application retry once the asynchronous job ends.
---
 src/qemu/THREADS.txt   |  5 ++---
 src/qemu/qemu_domain.c | 28 +++++++++++++++-------------
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/src/qemu/THREADS.txt b/src/qemu/THREADS.txt
index 3a27a85..9183f1f 100644
--- a/src/qemu/THREADS.txt
+++ b/src/qemu/THREADS.txt
@@ -69,8 +69,7 @@ There are a number of locks on various objects
     specify what kind of action it is about to take and this is checked
     against the allowed set of jobs in case an asynchronous job is
     running. If the job is incompatible with current asynchronous job,
-    it needs to wait until the asynchronous job ends and try to acquire
-    the job again.
+    the operation fails.
 
     Immediately after acquiring the virDomainObjPtr lock, any method
     which intends to update state must acquire either asynchronous or
@@ -80,7 +79,7 @@ There are a number of locks on various objects
     whenever it hits a piece of code which may sleep/wait, and
     re-acquire it after the sleep/wait. Whenever an asynchronous job
     wants to talk to the monitor, it needs to acquire nested job (a
-    special kind of normla job) to obtain exclusive access to the
+    special kind of normal job) to obtain exclusive access to the
     monitor.
 
     Since the virDomainObjPtr lock was dropped while waiting for the

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index f9755a4..b2a36ad 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -723,21 +723,25 @@ qemuDomainObjBeginJobInternal(struct qemud_driver *driver,
     if (driver_locked)
         qemuDriverUnlock(driver);
 
-retry:
-    while (!nested && !qemuDomainJobAllowed(priv, job)) {
-        if (virCondWaitUntil(&priv->job.asyncCond, &obj->lock, then) < 0)
-            goto error;
-    }
+    if (!nested && !qemuDomainJobAllowed(priv, job))
+        goto not_allowed;
 
     while (priv->job.active) {
-        if (virCondWaitUntil(&priv->job.cond, &obj->lock, then) < 0)
+        if (virCondWaitUntil(&priv->job.cond, &obj->lock, then) < 0) {
+            if (errno == ETIMEDOUT)
+                qemuReportError(VIR_ERR_OPERATION_TIMEOUT,
+                                "%s", _("cannot acquire state change lock"));
+            else
+                virReportSystemError(errno,
+                                     "%s", _("cannot acquire job mutex"));
             goto error;
+        }
     }
 
     /* No job is active but a new async job could have been started while obj
      * was unlocked, so we need to recheck it. */
     if (!nested && !qemuDomainJobAllowed(priv, job))
-        goto retry;
+        goto not_allowed;
 
     qemuDomainObjResetJob(priv);
 
@@ -759,13 +763,11 @@ retry:
 
     return 0;
 
+not_allowed:
+    qemuReportError(VIR_ERR_OPERATION_INVALID,
+                    "%s", _("incompatible job is running"));
+
 error:
-    if (errno == ETIMEDOUT)
-        qemuReportError(VIR_ERR_OPERATION_TIMEOUT,
-                        "%s", _("cannot acquire state change lock"));
-    else
-        virReportSystemError(errno,
-                             "%s", _("cannot acquire job mutex"));
     if (driver_locked) {
         virDomainObjUnlock(obj);
         qemuDriverLock(driver);
-- 
1.7.6

On 07/15/2011 08:41 AM, Jiri Denemark wrote:
> When an asynchronous job is running while another API that is
> incompatible with that job is called, we now try to wait until the job
> finishes and either run the API or fail with a timeout. I guess a
> nicer solution is to just fail such an API immediately and let the
> application retry once the asynchronous job ends.
> ---
>  src/qemu/THREADS.txt   |  5 ++---
>  src/qemu/qemu_domain.c | 28 +++++++++++++++-------------
>  2 files changed, 17 insertions(+), 16 deletions(-)
If all such APIs have a flag argument, then we could make the behavior
configurable - passing 0 blocks until possible, and passing
VIR_DOMAIN_OPERATION_NONBLOCK as a flag returns immediately. But we
probably don't have uniform flags arguments. Which APIs are affected?

If we can't control things by a flag, then returning a specific failure
seems like the best way to go (it is easier to write an app that can
retry than it is to write an app that doesn't suffer from unintended
blocking).

So this patch seems sane to me, although I'd still like a list of all
affected APIs before giving ack.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

On Fri, Jul 15, 2011 at 09:09:52 -0600, Eric Blake wrote:
> On 07/15/2011 08:41 AM, Jiri Denemark wrote:
> > When an asynchronous job is running while another API that is
> > incompatible with that job is called, we now try to wait until the
> > job finishes and either run the API or fail with a timeout. I guess
> > a nicer solution is to just fail such an API immediately and let
> > the application retry once the asynchronous job ends.
> > ---
> >  src/qemu/THREADS.txt   |  5 ++---
> >  src/qemu/qemu_domain.c | 28 +++++++++++++++-------------
> >  2 files changed, 17 insertions(+), 16 deletions(-)
> If all such APIs have a flag argument, then we could make the behavior
> configurable - passing 0 blocks until possible, and passing
> VIR_DOMAIN_OPERATION_NONBLOCK as a flag returns immediately.
> But we probably don't have uniform flags arguments. Which APIs are
> affected?
All APIs that modify state, currently the following list (qemu driver
names):

    Resume
    Shutdown
    Reboot
    SetMemoryFlags
    InjectNMI
    SetVcpusFlags
    Suspend
    Restore
    ModifyDeviceFlags
    SnapshotCreateActive
    RevertToSnapshot
    SnapshotDelete
    MonitorCommand
> If we can't control things by a flag, then returning a specific
> failure seems like the best way to go (it is easier to write an app
> that can retry than it is to write an app that doesn't suffer from
> unintended blocking).
> So this patch seems sane to me, although I'd still like a list of all
> affected APIs before giving ack.
BTW, the patch is not complete, so an ack (or nack) would only apply to
the idea.

Jirka

On Fri, Jul 15, 2011 at 04:41:38PM +0200, Jiri Denemark wrote:
> When an asynchronous job is running while another API that is
> incompatible with that job is called, we now try to wait until the
> job finishes and either run the API or fail with a timeout. I guess
> a nicer solution is to just fail such an API immediately and let the
> application retry once the asynchronous job ends.
I'm not entirely convinced this is a good idea because, IIUC, what this
is in effect doing is having a zero-second timeout. Previously we would
wait forever, currently we wait up to 30 seconds IIRC, and now we'll
wait 0 seconds. I think this will create lots of spurious timeouts for
applications to needlessly deal with.

Daniel
-- 
|: http://berrange.com         -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org          -o-           http://virt-manager.org :|
|: http://autobuild.org        -o-       http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-         http://live.gnome.org/gtk-vnc :|

On Fri, Jul 15, 2011 at 16:18:54 +0100, Daniel P. Berrange wrote:
> On Fri, Jul 15, 2011 at 04:41:38PM +0200, Jiri Denemark wrote:
> > When an asynchronous job is running while another API that is
> > incompatible with that job is called, we now try to wait until the
> > job finishes and either run the API or fail with a timeout. I guess
> > a nicer solution is to just fail such an API immediately and let
> > the application retry once the asynchronous job ends.
> I'm not entirely convinced this is a good idea because, IIUC, what
> this is in effect doing is having a zero-second timeout. Previously
> we would wait forever, currently we wait up to 30 seconds IIRC, and
> now we'll wait 0 seconds. I think this will create lots of spurious
> timeouts for applications to needlessly deal with.
I'm not sure how far in the past you are referring to by saying
"previously". Anyway, for a few years now we've been waiting up to 30
seconds. With this patch we would wait 0 seconds, but only when the
current job is migration/save/dump. If the job doesn't finish within 30
seconds, we currently just fail with a timeout and a quite unhelpful
error message; this way we would provide a better one.

Removing the timeout should be fine in this case since in most cases
(migration/save) the domain won't exist once the job finishes, so the
pending API call would fail anyway. However, if we still want to keep
the timeout, we could just improve the error message for the case where
we're waiting for an incompatible async job to finish.

Jirka