On 04/06/2018 12:27 PM, John Ferlan wrote:
On 04/03/2018 07:47 AM, Marc Hartmayer wrote:
> On Tue, Mar 20, 2018 at 11:25 AM +0100, Marc Hartmayer
<mhartmay(a)linux.vnet.ibm.com> wrote:
>> Hi,
>>
>> there is a race condition between 'qemuDomainCreate' and
>> 'qemuDomainDestroy' causing a NULL pointer segmentation fault when
>> accessing priv->monConfig. The race condition can be easily reproduced
>> using gdb.
>>
>> (gdb) set non-stop on
>> # set breakpoint on line 'mon = qemuMonitorOpen(vm, …)'
>> (gdb) b qemu_process.c:1799
>> # Actually, this second breakpoint is optional but it’s good to see
>> where priv->monConfig is set to NULL
>> # set breakpoint on line priv->monConfig = NULL;
>> (gdb) b qemu_process.c:6589
>> (gdb) run
>> # continue all threads - just for the case we hit a breakpoint already
>> (gdb) c -a
>>
>> Now start a domain (that is using QEMU)
>>
>> $ virsh start domain
>>
>> The first breakpoint will be hit. Now run in a second shell
>>
>> $ virsh destroy domain
>>
>> The second breakpoint will be hit. Continue the thread where the second
>> breakpoint was hit (for this example this is thread 4)
>>
>> (gdb) thread apply 4 continue
>>
>> Now continue the thread where the first breakpoint was hit.
>>
>> => Segmentation fault because of a NULL pointer dereference at
>> config->value
>>
>> Since I'm not very familiar with that part of the code, I wanted to ask
>> for your advice.
>>
>> Thanks in advance.
>>
>> Beste Grüße / Kind regards
>> Marc Hartmayer
>>
>> IBM Deutschland Research & Development GmbH
>> Vorsitzende des Aufsichtsrats: Martina Koederitz
>> Geschäftsführung: Dirk Wittkopp
>> Sitz der Gesellschaft: Böblingen
>> Registergericht: Amtsgericht Stuttgart, HRB 243294
> Any ideas?
>
Seeing as no one else has an exact or authoritative answer...
qemuDomainCreate{XML|WithFlags} (and a few others) will call
qemuProcessBeginJob which calls qemuDomainObjBeginAsyncJob and
qemuDomainObjSetAsyncJobMask which IIUC allows QEMU_JOB_DESTROY
to be run.
The qemuDomainDestroyFlags calls qemuProcessBeginStopJob which calls
qemuDomainObjBeginJob (e.g. sync job) using QEMU_JOB_DESTROY, which
again IIUC is allowed to happen alongside the Async job because of the
mask setting.
In the code where you've broken during create, the @vm object lock is
dropped allowing destroy to obtain it. So with the perfect timing and
storm the window of opportunity does exist that the monConfig could be
free'd and thus priv->monConfig set to NULL before or during the create
processing uses it in qemuMonitorOpen. If during the processing, then
obviously the config-> would "appear" to be valid, but it may not
actually be what was passed.
The fix I believe involves using objects for virDomainChrSourceDef
rather than VIR_ALLOC'd and VIR_FREE'd memory directly. I've put
together a few patches and will post them shortly. Using the patches I
don't see a core, but rather the (I believe) expected "error: internal
error: qemu unexpectedly closed the monitor"
Isn't this just a symptom of a wider problem? Your patches fix the
problem for monConfig, but what about all the other members of the
domain object that aren't protected with the trappings of virObject?
Isn't it just as likely that they could fall prey to the same problem?