[libvirt] [PATCH 0/2] qemu: Improve error/debug messages if unable to acquire a job

*** BLURB HERE *** Michal Privoznik (2): qemuDomainObjBeginJobInternal: Log agent job too qemuDomainObjBeginJobInternal: Report agent job in error message src/qemu/qemu_domain.c | 42 +++++++++++++++++++++++++++--------------- 1 file changed, 27 insertions(+), 15 deletions(-) -- 2.16.4

If a thread is unable to start a job (e.g. because of timeout) an warning is printed into the logs. So far, the message does not contain agent job info. Add it as it might help future debugging. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index d7c0598cee..827597d5f3 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -6422,6 +6422,7 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, const char *blocker = NULL; int ret = -1; unsigned long long duration = 0; + unsigned long long agentDuration = 0; unsigned long long asyncDuration = 0; VIR_DEBUG("Starting job: job=%s agentJob=%s asyncJob=%s " @@ -6526,22 +6527,27 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, ignore_value(virTimeMillisNow(&now)); if (priv->job.active && priv->job.started) duration = now - priv->job.started; + if (priv->job.agentActive && priv->job.agentStarted) + agentDuration = now - priv->job.agentStarted; if (priv->job.asyncJob && priv->job.asyncStarted) asyncDuration = now - priv->job.asyncStarted; - VIR_WARN("Cannot start job (%s, %s) for domain %s; " - "current job is (%s, %s) " - "owned by (%llu %s, %llu %s (flags=0x%lx)) " - "for (%llus, %llus)", + VIR_WARN("Cannot start job (%s, %s, %s) for domain %s; " + "current job is (%s, %s, %s) " + "owned by (%llu %s, %llu %s, %llu %s (flags=0x%lx)) " + "for (%llus, %llus, %llus)", qemuDomainJobTypeToString(job), + qemuDomainAgentJobTypeToString(agentJob), qemuDomainAsyncJobTypeToString(asyncJob), obj->def->name, qemuDomainJobTypeToString(priv->job.active), + qemuDomainAgentJobTypeToString(priv->job.agentActive), qemuDomainAsyncJobTypeToString(priv->job.asyncJob), priv->job.owner, NULLSTR(priv->job.ownerAPI), + priv->job.agentOwner, NULLSTR(priv->job.agentOwnerAPI), priv->job.asyncOwner, NULLSTR(priv->job.asyncOwnerAPI), priv->job.apiFlags, - duration / 1000, asyncDuration / 1000); + duration / 1000, agentDuration / 1000, asyncDuration / 1000); if (nested || qemuDomainNestedJobAllowed(priv, job)) blocker = priv->job.ownerAPI; -- 2.16.4

On Wed, Jun 20, 2018 at 14:32:03 +0200, Michal Privoznik wrote:
If a thread is unable to start a job (e.g. because of timeout) an warning is printed into the logs. So far, the message does not
s/an/a/
contain agent job info. Add it as it might help future debugging.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
Reviewed-by: Jiri Denemark <jdenemar@redhat.com>

If a thread is unable to acquire a job (e.g. because of timeout) an error is reported and the error message contains reference to the other thread holding the job. Well, the error message should report agent job too as it is yet another source of possible failure. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 827597d5f3..4331b95917 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -6420,6 +6420,7 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, bool async = job == QEMU_JOB_ASYNC; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); const char *blocker = NULL; + const char *agentBlocker = NULL; int ret = -1; unsigned long long duration = 0; unsigned long long agentDuration = 0; @@ -6549,16 +6550,21 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, priv->job.apiFlags, duration / 1000, agentDuration / 1000, asyncDuration / 1000); - if (nested || qemuDomainNestedJobAllowed(priv, job)) - blocker = priv->job.ownerAPI; - else - blocker = priv->job.asyncOwnerAPI; + if (job) { + if (nested || qemuDomainNestedJobAllowed(priv, job)) + blocker = priv->job.ownerAPI; + else + blocker = priv->job.asyncOwnerAPI; + } + + if (agentJob) + agentBlocker = priv->job.agentOwnerAPI; if (errno == ETIMEDOUT) { - if (blocker) { + if (blocker || agentBlocker) { virReportError(VIR_ERR_OPERATION_TIMEOUT, - _("cannot acquire state change lock (held by %s)"), - blocker); + _("cannot acquire state change lock (held by %s %s)"), + NULLSTR(blocker), NULLSTR(agentBlocker)); } else { virReportError(VIR_ERR_OPERATION_TIMEOUT, "%s", _("cannot acquire state change lock")); @@ -6566,11 +6572,11 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, ret = -2; } else if (cfg->maxQueuedJobs && priv->jobs_queued > cfg->maxQueuedJobs) { - if (blocker) { + if (blocker || agentBlocker) { virReportError(VIR_ERR_OPERATION_FAILED, - _("cannot acquire state change lock (held by %s) " + _("cannot acquire state change lock (held by %s %s) " "due to max_queued limit"), - blocker); + NULLSTR(blocker), NULLSTR(agentBlocker)); } else { virReportError(VIR_ERR_OPERATION_FAILED, "%s", _("cannot acquire state change lock " -- 2.16.4

On Wed, Jun 20, 2018 at 14:32:04 +0200, Michal Privoznik wrote:
If a thread is unable to acquire a job (e.g. because of timeout) an error is reported and the error message contains reference to the other thread holding the job. Well, the error message should report agent job too as it is yet another source of possible failure.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 827597d5f3..4331b95917 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -6420,6 +6420,7 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, bool async = job == QEMU_JOB_ASYNC; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); const char *blocker = NULL; + const char *agentBlocker = NULL; int ret = -1; unsigned long long duration = 0; unsigned long long agentDuration = 0; @@ -6549,16 +6550,21 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, priv->job.apiFlags, duration / 1000, agentDuration / 1000, asyncDuration / 1000);
- if (nested || qemuDomainNestedJobAllowed(priv, job)) - blocker = priv->job.ownerAPI; - else - blocker = priv->job.asyncOwnerAPI; + if (job) { + if (nested || qemuDomainNestedJobAllowed(priv, job)) + blocker = priv->job.ownerAPI; + else + blocker = priv->job.asyncOwnerAPI; + } + + if (agentJob) + agentBlocker = priv->job.agentOwnerAPI;
if (errno == ETIMEDOUT) { - if (blocker) { + if (blocker || agentBlocker) { virReportError(VIR_ERR_OPERATION_TIMEOUT, - _("cannot acquire state change lock (held by %s)"), - blocker); + _("cannot acquire state change lock (held by %s %s)"), + NULLSTR(blocker), NULLSTR(agentBlocker));
Since this is an error message reported to the user I think we should make it a little bit more user friendly. It would be nice to distinguish state change lock and agent lock and only print the relevant blocker (rather than both when one of them is NULL). And when both blockers are reported, we should separate them better, e.g., "%s and %s". Jirka

On 06/22/2018 10:07 AM, Jiri Denemark wrote:
On Wed, Jun 20, 2018 at 14:32:04 +0200, Michal Privoznik wrote:
If a thread is unable to acquire a job (e.g. because of timeout) an error is reported and the error message contains reference to the other thread holding the job. Well, the error message should report agent job too as it is yet another source of possible failure.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 827597d5f3..4331b95917 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -6420,6 +6420,7 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, bool async = job == QEMU_JOB_ASYNC; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); const char *blocker = NULL; + const char *agentBlocker = NULL; int ret = -1; unsigned long long duration = 0; unsigned long long agentDuration = 0; @@ -6549,16 +6550,21 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver, priv->job.apiFlags, duration / 1000, agentDuration / 1000, asyncDuration / 1000);
- if (nested || qemuDomainNestedJobAllowed(priv, job)) - blocker = priv->job.ownerAPI; - else - blocker = priv->job.asyncOwnerAPI; + if (job) { + if (nested || qemuDomainNestedJobAllowed(priv, job)) + blocker = priv->job.ownerAPI; + else + blocker = priv->job.asyncOwnerAPI; + } + + if (agentJob) + agentBlocker = priv->job.agentOwnerAPI;
if (errno == ETIMEDOUT) { - if (blocker) { + if (blocker || agentBlocker) { virReportError(VIR_ERR_OPERATION_TIMEOUT, - _("cannot acquire state change lock (held by %s)"), - blocker); + _("cannot acquire state change lock (held by %s %s)"), + NULLSTR(blocker), NULLSTR(agentBlocker));
Since this is an error message reported to the user I think we should make it a little bit more user friendly. It would be nice to distinguish state change lock and agent lock and only print the relevant blocker (rather than both when one of them is NULL). And when both blockers are reported, we should separate them better, e.g., "%s and %s".
I thought about it too, but then decided to take the easy way out because this would end up being a spaghetti code in my sight. But okay, I will rework this to: if (blocker && agentBlocker) { ... } else if (blocker) { ... } else if (agentBlocker) { ... } else { ... } where all four bodies are merely the same (only the error message would change - hence the spaghetti code). Michal
participants (3)
-
Jiri Denemark
-
Michal Privoznik
-
Michal Prívozník