On 02/08/14 00:04, Ján Tomko wrote:
On 08/01/2014 03:12 AM, Sam Bobroff wrote:
> On 30/07/14 01:52, Ján Tomko wrote:
>> On 07/29/2014 02:41 AM, Sam Bobroff wrote:
>>> During a QEMU live migration several warning messages about job
>>> handling could be written to syslog on the destination host:
>>>
>>> "entering monitor without asking for a nested job is dangerous"
>>>
>>> The messages are written because the job handling during migration
>>> uses hard coded asyncJob values in several places that are incorrect.
>>>
>>> This patch passes the required asyncJob value around and prevents
>>> the warnings as well as any issues that the warnings may be referring
>>> to.
>>>
>>> Signed-off-by: Sam Bobroff <sam.bobroff(a)au1.ibm.com>
>>> --
>>
>> This patch seems to fix the deadlock that can happen if the migrated domain is
>> destroyed at the destination reported here:
>>
https://www.redhat.com/archives/libvir-list/2014-May/msg00236.html
>>
>> It looks good to me, but it seems there are more functions calling
>> qemuDomainObjEnterMonitor that can be called from qemuProcessStart,
>> for example qemuDomainChangeGraphicsPasswords.
>
> Yes, I was fairly sure there would be other cases; I just fixed all the
> ones that actually occurred during my tests.
>
> In fact it seems like for the cases I'm looking at here, where it's the
> async job owner thread that's using the EnterMonitor functions, passing
> asyncJob around is a waste of time anyway because we know the correct
> value of asyncJob to use: it's stored in priv->job.asyncJob.
>
> Why not have qemuDomainObjEnterMonitorInternal() automatically switch to
> creating a nested job in this case?
>
> It seems easy to do and would simplify some code as well; what do you think?
We've had it that way before - it didn't work that well:
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=193cd0f3
Interesting. I'll stick to the simpler fix for now then.
>
>>> @@ -2505,7 +2506,7 @@ qemuDomainUpdateDeviceList(virQEMUDriverPtr driver,
>>> if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_DEL_EVENT))
>>> return 0;
>>>
>>> - qemuDomainObjEnterMonitor(driver, vm);
>>> + ignore_value(qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob));
>>
>> Also, the return value of this call could be dangerous to ignore if asyncJob
>> != NONE.
>
> True, but the patch hasn't introduced this, and the full story is even
> worse ;-)
>
> void qemuDomainObjEnterMonitor(virQEMUDriverPtr driver,
> virDomainObjPtr obj)
> {
> ignore_value(qemuDomainObjEnterMonitorInternal(driver, obj,
> QEMU_ASYNC_JOB_NONE));
> }
>
qemuDomainObjEnterMonitorInternal is called with QEMU_ASYNC_JOB_NONE here. It
always returns 0 in that case and it's safe to ignore. The problem is when you
use other async jobs:
static int
qemuDomainObjEnterMonitorInternal(virQEMUDriverPtr driver,
virDomainObjPtr obj,
qemuDomainAsyncJob asyncJob)
{
qemuDomainObjPrivatePtr priv = obj->privateData;
if (asyncJob != QEMU_ASYNC_JOB_NONE) {
int ret;
if ((ret = qemuDomainObjBeginNestedJob(driver, obj, asyncJob)) < 0)
return ret;
if (!virDomainObjIsActive(obj)) {
virReportError(VIR_ERR_OPERATION_FAILED, "%s",
_("domain is no longer running"));
/* Still referenced by the containing async job. */
ignore_value(qemuDomainObjEndJob(driver, obj));
return -1;
}
}
...
Ah thanks, I see what you mean.
I'll post an updated version of my patch that handles error results and
I'll include the qemuDomainChangeGraphicsPasswords() (via
qemuProcessInitPasswords()) function you mentioned above as well.
I'd be happy to update other functions as well, but I can't see a simple
way of finding every way that qemuProcessStart() might end up calling
qemuDomainObjEnterMonitorInternal(). If you can recommend one I'll
handle them as well, but otherwise would you accept the patch with only
the known ones fixed? (It would at least be an improvement and would
make it easy to fix the other cases as they are found.)
Jan
Cheers,
Sam.