Re: [libvirt] Syslog "nested job is dangerous" message

22 Jul 2014

      On 17/07/14 17:54, Ján Tomko wrote:
...
On 07/17/2014 08:51 AM, Jiri Denemark wrote:
...
On Thu, Jul 17, 2014 at 13:29:52 +1000, Sam Bobroff wrote:
...
On 17/07/14 12:50, Eric Blake wrote:
...
On 07/16/2014 07:52 PM, Sam Bobroff wrote:
...
Hello everyone,
[Can you configure your mailer to wrap long lines?]
[No problem, done.]
...
...
After performing a migration, the syslog often contains several
messages like this:
"This thread seems to be the async job owner; entering monitor
without asking for a nested job is dangerous"
The sign of a bug that we need to fix.  What version of libvirt are
you using?  We may have already fixed it.
I've been testing on 1.2.6, but I will re-test on the latest code from git.
Hmm, it what were you doing when the message appeared in the log? I
fixed wrong usage of our job tracking APIs long time ago
(e4e28220b572cf4c71d07740c24150411f0704a6 in 1.0.3) so I guess there
must be another place we don't do it correctly.
That commit was essentially reverted by my commit 9846402 when fixing
migration crashes: https://bugzilla.redhat.com/show_bug.cgi?id=1018267
I think the message was harmless during migration as we don't allow other jobs
(except destroy).
Jan
I've re-tested this on the latest libvirt from git (1.2.7) and I still
see several occurrences of the "dangerous" message on the destination
machine after performing a live migration with --copy-storage-all.

It seems like qemuDomainObjEnterMonitorInternal() is expecting the
asyncJob parameter to have the value of the currently running async job,
and in the cases that cause the warning message, it's being called via
qemuDomainObjEnterMonitor() which hard codes it to QEMU_ASYNC_NONE.

It looks like these calls should be changed to
qemuDomainObjEnterMonitorAsync() with the right asyncJob parameter,
which I believe is QEMU_ASYNC_JOB_MIGRATION_IN for the cases I'm looking
at (and the old value of QEMU_ASYNC_NONE for all other callers). There
aren't many cases of this, so fixing it this way is fairly simple but it
does involve passing the asyncJob through some intermediate functions.

Another approach would be to change qemuDomainObjEnterMonitorInternal()
so that it can automatically do the same thing: It could see that there
was an async job running, and that it's owned by the current thread, and
then call qemuDomainObjBeginNestedJob() (or directly call
qemuDomainObjBeginJobInternal() since we know that the checks in
qemuDomainObjBeginNestedJob() are going to succeed). It seems like this
might also simplify some other code that is doing pretty much just what
I described above.

The second approach does fix all the cases I'm looking at, and would
also fix other similar cases I haven't yet found, but I'm not sure that
it would be entirely safe or if it's circumventing some sort of safety
checking.

Any comments?
Which approach should I pursue?

Cheers,
Sam.