Hi
On Wed, Sep 19, 2018 at 5:58 PM Michal Privoznik <mprivozn(a)redhat.com> wrote:
On 09/19/2018 12:03 PM, Marc-André Lureau wrote:
> Hi
>
> On Wed, Sep 19, 2018 at 1:41 PM Michal Privoznik <mprivozn(a)redhat.com> wrote:
>>
>> On 09/17/2018 03:14 PM, marcandre.lureau(a)redhat.com wrote:
>>> From: Marc-André Lureau <marcandre.lureau(a)redhat.com>
>>>
>>> Add a new memoryBacking source type "memfd", supported by QEMU (when
>>> the capability is available).
>>>
>>> A memfd is a specialized anonymous memory kind. As such, an anonymous
>>> source type could be automatically using a memfd. However, there are
>>> some complications when migrating from different memory backends in
>>> qemu (mainly due to the internal object naming at this point, but
>>> there could be more). For now, it is simpler and safer to simply
>>> introduce a new source type "memfd". Eventually, the "anonymous" type
>>> could learn to use memfd transparently in a separate change.
>>>
>>> The main benefits are that it doesn't need to create filesystem files,
>>> and it also enforces sealing, providing a bit more safety.
>>>
>>> Signed-off-by: Marc-André Lureau <marcandre.lureau(a)redhat.com>
>>> ---
>>> docs/formatdomain.html.in | 9 +--
>>> docs/schemas/domaincommon.rng | 1 +
>>> src/conf/domain_conf.c | 3 +-
>>> src/conf/domain_conf.h | 1 +
>>> src/qemu/qemu_command.c | 69 +++++++++++++------
>>> src/qemu/qemu_domain.c | 12 +++-
>>> .../memfd-memory-numa.x86_64-latest.args | 34 +++++++++
>>> tests/qemuxml2argvdata/memfd-memory-numa.xml | 36 ++++++++++
>>> tests/qemuxml2argvtest.c | 2 +
>>> 9 files changed, 140 insertions(+), 27 deletions(-)
>>> create mode 100644 tests/qemuxml2argvdata/memfd-memory-numa.x86_64-latest.args
>>> create mode 100644 tests/qemuxml2argvdata/memfd-memory-numa.xml
>>>
>>> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
>>> index 1f12ab5b42..eeee1f6d40 100644
>>> --- a/docs/formatdomain.html.in
>>> +++ b/docs/formatdomain.html.in
>>> @@ -1099,7 +1099,7 @@
>>> </hugepages>
>>> <nosharepages/>
>>> <locked/>
>>> - <source type="file|anonymous"/>
>>> + <source type="file|anonymous|memfd"/>
>>
>> I'm sorry but I do not think this is the way we should go. This
>> effectively avoids libvirt making the decision and exposes the backend
>> used directly. This puts unnecessary burden on mgmt applications because
>> they have to make yet another decision (track another domain attribute).
>>
>> IIUC, memfd is like memory-backend-file and -ram combined. It can do
>> hugepages or just plain malloc(). Therefore it should be our first
>> choice for freshly started domains. And only if qemu doesn't support it
>> we should fall back to either -file or -ram backends.
>
> memory-backend-memfd doesn't replace either -file or -ram though. It's
> a specialized anonymous memory kind, linux-only atm, and not widely
> available.
Well, neither libvirt nor qemu really support hugepages on anything else
than linux.
Nor it ever will? Because if we merge these patches and expose it in
domain XML, there is no turning back. We can't stop supporting it.
>
> -file should be used for nvram or complex hugepage/numa setup for ex.
How come? I can see .host-nodes and .policy attributes for -memfd
backend too. Sure, nvram is special, but for plain hugepages use case
-file and -memfd are interchangeable, aren't they?
Sorry, I think I misunderstood the problem then. The qemu mbind()
might do all the work.
David, didn't you point out a limitation of -memfd compared to -file for
NUMA setup?
-object memory-backend-memfd,id=ram-node0,\
hugetlb=yes,hugetlbsize=2097152,\
share=yes,size=15032385536,host-nodes=3,policy=preferred
-object memory-backend-file,id=ram-node0,\
path=/path/to/2M/hugetlbfs,\
size=15032385536,host-nodes=3,policy=preferred
And for -ram there is no difference from usage/libvirt POV.
-object memory-backend-memfd,id=ram-node0,\
share=yes,size=15032385536,host-nodes=3,policy=preferred
-object memory-backend-ram,id=ram-node0,\
size=15032385536,host-nodes=3,policy=preferred
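
To make the comparison above concrete, here is a minimal Python sketch that
renders these -object values from one description. The helper name
memory_backend_object and its parameters are invented for illustration and are
not libvirt or QEMU API. The property names are only the ones that appear in
the command lines above, and the hugetlbfs path in the usage note is a
placeholder.

```python
def memory_backend_object(backend, node_id, size, host_nodes, policy,
                          hugepage_size=None, path=None):
    """Render one QEMU -object value for a guest memory backend.

    Hypothetical helper: it only knows the properties quoted in this thread.
    """
    props = [f"memory-backend-{backend}", f"id={node_id}"]
    if backend == "memfd":
        # memfd carries its own hugepage settings and needs no path.
        if hugepage_size is not None:
            props += ["hugetlb=yes", f"hugetlbsize={hugepage_size}"]
        props.append("share=yes")
    elif backend == "file":
        # The file backend gets hugepages from a hugetlbfs mount point.
        if path is None:
            raise ValueError("memory-backend-file needs a path")
        props.append(f"path={path}")
    elif backend != "ram":
        raise ValueError(f"unknown backend: {backend}")
    # The NUMA binding properties are spelled the same for all three backends.
    props += [f"size={size}", f"host-nodes={host_nodes}", f"policy={policy}"]
    return ",".join(props)
```

Calling memory_backend_object("memfd", "ram-node0", 15032385536, 3,
"preferred", hugepage_size=2097152) gives the first command line above, and
passing "file" with path="/mnt/hugepages-2M" (a made-up mount point) gives the
-file form. Only the backend-specific properties differ; the host-nodes and
policy part is shared.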
>
> But it's legitimate for a VM user to request that memfd be used.
>
> The point of this patch is not to say that we shouldn't try to use
> memfd when possible, but rather let the user request specifically
> memfd, for security reasons for example. If the setup cannot be
> satisfied with -memfd, the user should get an error.
What security reasons do you have in mind?
grow/shrink sealing (and avoiding somewhat hazardous file system operations).
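
For readers unfamiliar with sealing, here is a small sketch of what grow/shrink
seals do at the kernel level. It is not QEMU's code, just the Linux memfd
interface driven from Python (it assumes Linux and Python 3.8 or newer; the
function name and the memfd name are made up). Once the seals are set, nobody
holding the fd can resize the region.

```python
import fcntl
import os


def sealed_memfd_rejects_resize(size=4096):
    """Return True if a grow/shrink-sealed memfd refuses to be resized.

    Returns None where memfd_create is not exposed (non-Linux, old Python).
    """
    if not hasattr(os, "memfd_create"):
        return None
    # Anonymous memory with a file descriptor; no file system entry is made.
    fd = os.memfd_create("guest-ram-demo",
                         os.MFD_CLOEXEC | os.MFD_ALLOW_SEALING)
    try:
        os.ftruncate(fd, size)
        # After this call the size of the region is fixed.
        fcntl.fcntl(fd, fcntl.F_ADD_SEALS,
                    fcntl.F_SEAL_GROW | fcntl.F_SEAL_SHRINK)
        for new_size in (size * 2, size // 2):
            try:
                os.ftruncate(fd, new_size)
            except PermissionError:
                continue  # the kernel refused with EPERM, as expected
            return False  # a resize went through, so the seal did not hold
        return True
    finally:
        os.close(fd)
```

A process sharing this fd can therefore not truncate the memory out from under
its peer, which is a hazard with a plain file on tmpfs or hugetlbfs.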
>
>>
>> This means we have to track what backend the domain was started with so
>> that we preserve that on migration (although, the fact that these
>> backends are not interchangeable makes me question 'backend' in their
>> name :-P). For that we can use status/migration XML as I suggested earlier.
>>
>> Once again, status XML is not editable by user [*] and is used solely by
>> libvirtd to store runtime information for a running domain (and backend
>> used falls into that category).
>
> Why not do this transparent memfd-usage in a separate series?
Depends what we want libvirt to be. If we want it to be mere XML->qemu
cmd line generator, then we can expose all qemu settings as they are. If
we want it to have some logic built in (so that mgmt applications can
offload some decisions to it), then we can't expose all qemu settings.
In my ideal world, I'd like to tell libvirt "I want a machine that uses
hugepages of this size" and let libvirt figure out the best command line
to fulfil my request (either use -file or -memfd or even -ram + -mem-path).
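
As a rough illustration of that kind of built-in logic, here is a Python sketch
of the selection order discussed in this thread: prefer memfd when QEMU has it,
otherwise fall back to -file or -ram, and fail when an explicitly requested
backend is unavailable. The function name, the capability strings and the
nvram flag are all hypothetical; this is not how libvirt implements it, and it
leaves out the -ram + -mem-path variant.

```python
def pick_backend(qemu_caps, hugepages=False, nvram=False, requested=None):
    """Choose a memory backend name: "memfd", "file" or "ram".

    qemu_caps is a set of backend names the QEMU binary supports
    (a stand-in for real capability probing).
    """
    if requested is not None:
        # An explicit request is honoured or rejected, never silently changed.
        if requested not in qemu_caps:
            raise ValueError(f"backend {requested!r} is not available")
        return requested
    if nvram:
        # nvram was called out in the thread as the case that needs -file.
        return "file"
    if "memfd" in qemu_caps:
        # First choice for freshly started domains.
        return "memfd"
    # Older QEMU: hugepages need a hugetlbfs path, plain memory does not.
    return "file" if hugepages else "ram"
```

For example, pick_backend({"file", "ram"}, hugepages=True) falls back to
"file", while the same call with "memfd" in the set picks "memfd". Whatever is
chosen would still have to be recorded for migration, as noted earlier.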
On the other hand, I don't want to discourage you from posting patches,
so this is the point where I will no longer object. I pointed out my
objections enough :-)
I see the benefit in using memfd whenever possible. But I also see a
benefit in being able to request its usage explicitly. That's why I
think the 2 approaches are compatible.
Thanks!