On 02/23/2017 02:16 PM, Daniel P. Berrange wrote:
On Thu, Feb 23, 2017 at 02:07:27PM +0100, Michal Privoznik wrote:
> On 02/23/2017 11:48 AM, Daniel P. Berrange wrote:
>> On Thu, Feb 23, 2017 at 11:16:17AM +0100, Michal Privoznik wrote:
>>> On 02/23/2017 11:02 AM, Daniel P. Berrange wrote:
>>>> On Thu, Feb 23, 2017 at 10:02:48AM +0100, Michal Privoznik wrote:
>>>>> So, majority of the code is just ready as-is. Well, with one
>>>>> slight change: differentiate between dimm and nvdimm in places
>>>>> like device alias generation, generating the command line and so
>>>>> on.
>>>>>
>>>>> Speaking of the command line, we also need to append
'nvdimm=on'
>>>>> to the '-machine' argument so that the nvdimm feature is
>>>>> advertised in the ACPI tables properly.
>>>>>
>>>>> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
>>>>
>>>>
>>>>> + if (prealloc &&
>>>>> + virJSONValueObjectAdd(props,
>>>>> + "b:prealloc", true,
>>>>> + NULL) < 0)
>>>>> + goto cleanup;
>>>>
>>>> As discussed on IRC, using prealloc with QEMU causes it to memset()
>>>> the first byte of every page of memory to 0. With NVDIMM this is
>>>> obviously corrupting application data stored in the NVDIMM.
>>>>
>>>> Obviously this needs fixing in QEMU, but I would think that libvirt
>>>> needs to block use of prealloc==true when running against existing
>>>> broken QEMU versions, otherwise users are going to be rather upset
>>>> to see their data corrupted on every boot
>>>
>>> They are, but should we work around broken QEMUs? And if so - how do we
>>> detect whether the qemu we are dealing with is broken or already fixed
>>> in that respect?
>>
>> Most likely we'll just have todo a version number check I think, since
>> fixing this isn't going to involve any externally visible change to
>> QEMU.
>
> What about backports then? If some distro decides to backport your fix
> posted on the qemu list, this check here would prevent them from running
> fixed qemu. Also, I don't think we want to work around qemu bugs.
If the distro backports the QEMU fix, then they would have to adjust
their libvirt to relax the version check too. Obviously we try to
avoid this kind of scenario, but sometimes there's no other option,
and this looks like one of those times.
Okay, fair enough.
>>> On the other hand, we can just not set prealloc true for nvdimm. That
>>> will have one downside though - after qemu mmaps() the file, kernel is
>>> not forced to create a private copy of the pages.
>>
>> If the guest has requested prealloc, then I don't think we can ignore
>> it for NVDIMM, because its a clear violation of the memory guarantee
>> we're claiming to provide apps. Thus I think we've no option but to
>> report an error if we see prealloc + broken QEMU
>
> Do you mean user instead of guest? Because it's user who requests
> prealloc (well, in theory it is user; libvirt does not allow users to
> request prealloc but rather has some rules worked in that request it
> instead).
Sorry yes, i meant user. eg if they set
<memoryBacking>
<allocation mode="immediate"/>
</memoryBacking>
I guess this is fairly new syntax so probably few people/apps are using
it. IIRC, if this is not present in XML, then we just automagically
assume allocation=immediate if using huge pages.
So I guess we could say, if allocation=immediate is *not* in the guest
XML, then NVDIMMs are assumed to have allocation=ondemand by default.
It'd be inconsistent behaviour, but not wrong.
That way we'd only need to raise an error if the user had explicitly
set allocation=immediate
Frankly, I'd rather be consistent AND bug-free at the same time. Can't
we make NVDIMM work with just fixed QEMUs? At the level code, the nvdimm
capability would be set if nvdimm is found in a list of devices provided
by qemu AND if the qemu version is at least 2.8 (assuming your fix makes
it into the release).
Michal
Regards,
Daniel