On 04/09/2018 12:12 PM, Laine Stump wrote:
On 04/06/2018 12:27 PM, John Ferlan wrote:
>
> On 04/03/2018 07:47 AM, Marc Hartmayer wrote:
>> On Tue, Mar 20, 2018 at 11:25 AM +0100, Marc Hartmayer
<mhartmay(a)linux.vnet.ibm.com> wrote:
>>> Hi,
>>>
>>> there is a race condition between 'qemuDomainCreate' and
>>> 'qemuDomainDestroy' causing a NULL pointer segmentation fault when
>>> accessing priv->monConfig. The race condition can be easily reproduced
>>> using gdb.
>>>
>>> (gdb) set non-stop on
>>> # set breakpoint on line 'mon = qemuMonitorOpen(vm, …)'
>>> (gdb) b qemu_process.c:1799
>>> # Actually, this second breakpoint is optional but it’s good to see
>>> where priv->monConfig is set to NULL
>>> # set breakpoint on line priv->monConfig = NULL;
>>> (gdb) b qemu_process.c:6589
>>> (gdb) run
>>> # continue all threads - just for the case we hit a breakpoint already
>>> (gdb) c -a
>>>
>>> Now start a domain (that is using QEMU)
>>>
>>> $ virsh start domain
>>>
>>> The first breakpoint will be hit. Now run in a second shell
>>>
>>> $ virsh destroy domain
>>>
>>> The second breakpoint will be hit. Continue the thread where the second
>>> breakpoint was hit (for this example this is thread 4)
>>>
>>> (gdb) thread apply 4 continue
>>>
>>> Now continue the thread where the first breakpoint was hit.
>>>
>>> => Segmentation fault because of a NULL pointer dereference at
>>> config->value
>>>
>>> Since I'm not very familiar with that part of the code, I wanted to ask
>>> for your advice.
>>>
>>> Thanks in advance.
>>>
>>> Beste Grüße / Kind regards
>>> Marc Hartmayer
>>>
>>> IBM Deutschland Research & Development GmbH
>>> Vorsitzende des Aufsichtsrats: Martina Koederitz
>>> Geschäftsführung: Dirk Wittkopp
>>> Sitz der Gesellschaft: Böblingen
>>> Registergericht: Amtsgericht Stuttgart, HRB 243294
>> Any ideas?
>>
> Seeing as no one else has an exact or authoritative answer...
>
> qemuDomainCreate{XML|WithFlags} (and a few others) will call
> qemuProcessBeginJob which calls qemuDomainObjBeginAsyncJob and
> qemuDomainObjSetAsyncJobMask which IIUC allows QEMU_JOB_DESTROY
> to be run.
>
> The qemuDomainDestroyFlags calls qemuProcessBeginStopJob which calls
> qemuDomainObjBeginJob (e.g. sync job) using QEMU_JOB_DESTROY, which
> again IIUC is allowed to happen alongside the Async job because of the
> mask setting.
>
> In the code where you've broken during create, the @vm object lock is
> dropped allowing destroy to obtain it. So with the perfect timing and
> storm the window of opportunity does exist that the monConfig could be
> free'd and thus priv->monConfig set to NULL before or during the create
> processing uses it in qemuMonitorOpen. If during the processing, then
> obviously the config-> would "appear" to be valid, but it may not
> actually be what was passed.
>
> The fix I believe involves using objects for virDomainChrSourceDef
> rather than VIR_ALLOC'd and VIR_FREE'd memory directly. I've put
> together a few patches and will post them shortly. Using the patches I
> don't see a core, but rather the (I believe) expected "error: internal
> error: qemu unexpectedly closed the monitor"
Isn't this just a symptom of a wider problem? Your patches fix the
problem for monConfig, but what about all the other members of the
domain object that aren't protected with the trappings of virObject?
Isn't it just as likely that they could fall prey to the same problem?
I don't think it'd be a wider problem as long as the virObjectLock is
held on the object not much can happen. In this case, there was a
perfect storm of sorts where that lock was dropped and we had an allowed
job (DESTROY) processed. That processing messed with one pointer that
was then used in a call when the lock was down. The @priv pointer used
for monJSON boolean is a @vm pointer and it's only freed if the @vm was
somehow removed - which it couldn't because we have a reference to it
(we only unlock'd and not unref'd).
So could there be more, I hope not, but I never want to say it's
impossible, because someone will open that window of opportunity.
Perhaps this discussion will cause some other poor souls that have had
to walk through that code in the past to chime in!
John