[libvirt] Syslog "nested job is dangerous" message

Hello everyone, After performing a migration, the syslog often contains several messages like this: "This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous" The message makes it sound as if the user has done something dangerous but, after looking at the code that produces the message, it appears to be more to do with how the libvirt job code is used. The commit that added the warning (6eede368bc8e3df2c94c2ec1a913885ce08e99db) doesn't explain why it might be dangerous... qemu: Warn on possibly incorrect usage of EnterMonitor* qemuDomainObjEnterMonitor{,WithDriver} should not be called from async jobs, only EnterMonitorAsync variant is allowed. ... so I would appreciate some advice from people who understand that area. Would it be appropriate to re-word the message, or perhaps change it to a debug message so that it's not normally seen by users? Is it important to track down the cases that are generating the warning and fix them? (Could it cause some kind of significant problem?) Cheers, Sam.

On 07/16/2014 07:52 PM, Sam Bobroff wrote:
Hello everyone,
[Can you configure your mailer to wrap long lines?]
After performing a migration, the syslog often contains several messages like this:
"This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous"
The sign of a bug that we need to fix. What version of libvirt are you using? We may have already fixed it.
The message makes it sound as if the user has done something dangerous but, after looking at the code that produces the message, it appears to be more to do with how the libvirt job code is used.
Rather, libvirt itself has done something dangerous.
The commit that added the warning (6eede368bc8e3df2c94c2ec1a913885ce08e99db) doesn't explain why it might be dangerous...
Jiri can probably explain better, but it means there is a race condition where libvirt can lose track of a long-running job and cause memory corruption in its management of the guest.
Would it be appropriate to re-word the message, or perhaps change it to a debug message so that it's not normally seen by users?
No.
Is it important to track down the cases that are generating the warning and fix them? (Could it cause some kind of significant problem?)
Yes. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 17/07/14 12:50, Eric Blake wrote:
On 07/16/2014 07:52 PM, Sam Bobroff wrote:
Hello everyone,
[Can you configure your mailer to wrap long lines?]
[No problem, done.]
After performing a migration, the syslog often contains several messages like this:
"This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous"
The sign of a bug that we need to fix. What version of libvirt are you using? We may have already fixed it.
I've been testing on 1.2.6, but I will re-test on the latest code from git.
The message makes it sound as if the user has done something dangerous but, after looking at the code that produces the message, it appears to be more to do with how the libvirt job code is used.
Rather, libvirt itself has done something dangerous.
The commit that added the warning (6eede368bc8e3df2c94c2ec1a913885ce08e99db) doesn't explain why it might be dangerous...
Jiri can probably explain better, but it means there is a race condition where libvirt can lose track of a long-running job and cause memory corruption in its management of the guest.
Would it be appropriate to re-word the message, or perhaps change it to a debug message so that it's not normally seen by users?
No.
Is it important to track down the cases that are generating the warning and fix them? (Could it cause some kind of significant problem?)
Yes.
Acknowledged. If they still occur on the latest code I'll start tracking them down :-) Cheers, Sam.

On Thu, Jul 17, 2014 at 13:29:52 +1000, Sam Bobroff wrote:
On 17/07/14 12:50, Eric Blake wrote:
On 07/16/2014 07:52 PM, Sam Bobroff wrote:
Hello everyone,
[Can you configure your mailer to wrap long lines?]
[No problem, done.]
After performing a migration, the syslog often contains several messages like this:
"This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous"
The sign of a bug that we need to fix. What version of libvirt are you using? We may have already fixed it.
I've been testing on 1.2.6, but I will re-test on the latest code from git.
Hmm, it what were you doing when the message appeared in the log? I fixed wrong usage of our job tracking APIs long time ago (e4e28220b572cf4c71d07740c24150411f0704a6 in 1.0.3) so I guess there must be another place we don't do it correctly. Jirka

On 07/17/2014 08:51 AM, Jiri Denemark wrote:
On Thu, Jul 17, 2014 at 13:29:52 +1000, Sam Bobroff wrote:
On 17/07/14 12:50, Eric Blake wrote:
On 07/16/2014 07:52 PM, Sam Bobroff wrote:
Hello everyone,
[Can you configure your mailer to wrap long lines?]
[No problem, done.]
After performing a migration, the syslog often contains several messages like this:
"This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous"
The sign of a bug that we need to fix. What version of libvirt are you using? We may have already fixed it.
I've been testing on 1.2.6, but I will re-test on the latest code from git.
Hmm, it what were you doing when the message appeared in the log? I fixed wrong usage of our job tracking APIs long time ago (e4e28220b572cf4c71d07740c24150411f0704a6 in 1.0.3) so I guess there must be another place we don't do it correctly.
That commit was essentially reverted by my commit 9846402 when fixing migration crashes: https://bugzilla.redhat.com/show_bug.cgi?id=1018267 I think the message was harmless during migration as we don't allow other jobs (except destroy). Jan

On 17/07/14 17:54, Ján Tomko wrote:
On 07/17/2014 08:51 AM, Jiri Denemark wrote:
On Thu, Jul 17, 2014 at 13:29:52 +1000, Sam Bobroff wrote:
On 17/07/14 12:50, Eric Blake wrote:
On 07/16/2014 07:52 PM, Sam Bobroff wrote:
Hello everyone,
[Can you configure your mailer to wrap long lines?]
[No problem, done.]
After performing a migration, the syslog often contains several messages like this:
"This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous"
The sign of a bug that we need to fix. What version of libvirt are you using? We may have already fixed it.
I've been testing on 1.2.6, but I will re-test on the latest code from git.
Hmm, it what were you doing when the message appeared in the log? I fixed wrong usage of our job tracking APIs long time ago (e4e28220b572cf4c71d07740c24150411f0704a6 in 1.0.3) so I guess there must be another place we don't do it correctly.
That commit was essentially reverted by my commit 9846402 when fixing migration crashes: https://bugzilla.redhat.com/show_bug.cgi?id=1018267
I think the message was harmless during migration as we don't allow other jobs (except destroy).
Jan
I've re-tested this on the latest libvirt from git (1.2.7) and I still see several occurrences of the "dangerous" message on the destination machine after performing a live migration with --copy-storage-all. It seems like qemuDomainObjEnterMonitorInternal() is expecting the asyncJob parameter to have the value of the currently running async job, and in the cases that cause the warning message, it's being called via qemuDomainObjEnterMonitor() which hard codes it to QEMU_ASYNC_NONE. It looks like these calls should be changed to qemuDomainObjEnterMonitorAsync() with the right asyncJob parameter, which I believe is QEMU_ASYNC_JOB_MIGRATION_IN for the cases I'm looking at (and the old value of QEMU_ASYNC_NONE for all other callers). There aren't many cases of this, so fixing it this way is fairly simple but it does involve passing the asyncJob through some intermediate functions. Another approach would be to change qemuDomainObjEnterMonitorInternal() so that it can automatically do the same thing: It could see that there was an async job running, and that it's owned by the current thread, and then call qemuDomainObjBeginNestedJob() (or directly call qemuDomainObjBeginJobInternal() since we know that the checks in qemuDomainObjBeginNestedJob() are going to succeed). It seems like this might also simplify some other code that is doing pretty much just what I described above. The second approach does fix all the cases I'm looking at, and would also fix other similar cases I haven't yet found, but I'm not sure that it would be entirely safe or if it's circumventing some sort of safety checking. Any comments? Which approach should I pursue? Cheers, Sam.
participants (4)
-
Eric Blake
-
Jiri Denemark
-
Ján Tomko
-
Sam Bobroff