On 12/03/2019 20:39, Daniel Henrique Barboza wrote:
> On 3/12/19 1:05 AM, Alexey Kardashevskiy wrote:
>
> On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
>>
>> On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
>>> On 08/03/2019 04:51, Piotr Jaroszynski wrote:
>>>> On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
>>>>> On 3/7/19 12:15 PM, Erik Skultety wrote:
>>>>>> On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
>>>>>>> The NVLink2 support in QEMU implements the detection of NVLink2
>>>>>>> capable devices by verifying the attributes of the VFIO mem
>>>>>>> region QEMU allocates for the NVIDIA GPUs. To properly allocate
>>>>>>> an adequate amount of memLock, Libvirt needs this information
>>>>>>> before a QEMU instance is even created.
>>>>>>>
>>>>>>> An alternative is presented in this patch. Given a PCI device,
>>>>>>> we'll traverse the device tree at /proc/device-tree to check if
>>>>>>> the device has an NPU bridge, retrieve the node of the NVLink2
>>>>>>> bus, find the memory-node that is related to the bus and see if
>>>>>>> it's an NVLink2 bus by inspecting its 'reg' value. This logic
>>>>>>> is contained inside the 'device_is_nvlink2_capable' function,
>>>>>>> which uses other new helper functions to navigate and fetch
>>>>>>> values from the device tree nodes.
>>>>>>>
>>>>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>>>>>>> ---
>>>>>> This fails with a bunch of compilation errors, please make sure
>>>>>> you run make check and make syntax-check on each patch of the
>>>>>> series.
>>>>> Ooops, forgot to follow up make syntax-check with 'make' before
>>>>> submitting ... my bad.
>>>>>
>>>>>>>  src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>  1 file changed, 194 insertions(+)
>>>>>>>
>>>>>>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>>>>>>> index 77548c224c..97de5793e2 100644
>>>>>>> --- a/src/qemu/qemu_domain.c
>>>>>>> +++ b/src/qemu/qemu_domain.c
>>>>>>> @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> +/**
>>>>>>> + * Reads a phandle file and returns the phandle value.
>>>>>>> + */
>>>>>>> +static int
>>>>>>> +read_dt_phandle(const char* file)
>>>>>> ..we prefer camelCase naming of functions...all functions should
>>>>>> have a prefix,
>>>>>> e.g. vir,<driver>, etc.
>>>>>>
>>>>>>> +{
>>>>>>> + unsigned int buf[1];
>>>>>>> + size_t read;
>>>>>>> + FILE *f;
>>>>>>> +
>>>>>>> + f = fopen(file, "r");
>>>>>>> + if (!f)
>>>>>>> + return -1;
>>>>>>> +
>>>>>>> + read = fread(buf, sizeof(unsigned int), 1, f);
>>>>>> We already have safe helpers that take care of such operations.
>>>>>>
>>>>>>> +
>>>>>>> + if (!read) {
>>>>>>> + VIR_CLOSE(f);
>>>>>> You could use VIR_AUTOCLOSE for this
>>>>>>
>>>>>>> + return 0;
>>>>>>> + }
>>>>>>> +
>>>>>>> + VIR_CLOSE(f);
>>>>>>> + return be32toh(buf[0]);
>>>>>> Is this part of gnulib? We need to make sure it's portable
>>>>>> otherwise we'd need
>>>>>> to make the conversion ourselves, just like for le64toh
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * Reads a memory reg file and returns the first 4 int values.
>>>>>>> + *
>>>>>>> + * The caller is responsible for freeing the returned array.
>>>>>>> + */
>>>>>>> +static unsigned int *
>>>>>>> +read_dt_memory_reg(const char *file)
>>>>>>> +{
>>>>>>> + unsigned int *buf;
>>>>>>> + size_t read, i;
>>>>>>> + FILE *f;
>>>>>>> +
>>>>>>> + f = fopen(file, "r");
>>>>>>> + if (!f)
>>>>>>> + return NULL;
>>>>>>> +
>>>>>>> + if (VIR_ALLOC_N(buf, 4) < 0)
>>>>>>> + return NULL;
>>>>>>> +
>>>>>>> + read = fread(buf, sizeof(unsigned int), 4, f);
>>>>>>> +
>>>>>>> +    if (read < 4)
>>>>>>> + /* shouldn't happen */
>>>>>>> + VIR_FREE(buf);
>>>>>>> + else for (i = 0; i < 4; i++)
>>>>>>> + buf[i] = be32toh(buf[i]);
>>>>>>> +
>>>>>>> + VIR_CLOSE(f);
>>>>>>> + return buf;
>>>>>> Same comments as above...
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * This wrapper function receives arguments to be used in a
>>>>>>> + * 'find' call to retrieve the file names that match the
>>>>>>> + * criteria inside the /proc/device-tree dir.
>>>>>>> + *
>>>>>>> + * A 'find' call with '-iname phandle' inside /proc/device-tree
>>>>>>> + * provides more than a thousand matches. Adding '-path' to
>>>>>>> + * narrow it down further is necessary to keep the file
>>>>>>> + * listing sane.
>>>>>>> + *
>>>>>>> + * The caller is responsible for freeing the buffer returned by
>>>>>>> + * this function.
>>>>>>> + */
>>>>>>> +static char *
>>>>>>> +retrieve_dt_files_pattern(const char *path_pattern,
>>>>>>> +                          const char *file_pattern)
>>>>>>> +{
>>>>>>> + virCommandPtr cmd = NULL;
>>>>>>> + char *output = NULL;
>>>>>>> +
>>>>>>> + cmd = virCommandNew("find");
>>>>>>> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path",
>>>>>>> +                         path_pattern, "-iname", file_pattern,
>>>>>>> +                         NULL);
>>>>>>> + virCommandSetOutputBuffer(cmd, &output);
>>>>>>> +
>>>>>>> + if (virCommandRun(cmd, NULL) < 0)
>>>>>>> + VIR_FREE(output);
>>>>>>> +
>>>>>>> + virCommandFree(cmd);
>>>>>>> + return output;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * Helper function that receives a listing of file names and
>>>>>>> + * calls read_dt_phandle() on each one, looking for a match
>>>>>>> + * with the given phandle argument. Returns the file name if a
>>>>>>> + * match is found, NULL otherwise.
>>>>>>> + */
>>>>>>> +static char *
>>>>>>> +find_dt_file_with_phandle(char *files, int phandle)
>>>>>>> +{
>>>>>>> + char *line, *tmp;
>>>>>>> + int ret;
>>>>>>> +
>>>>>>> + line = strtok_r(files, "\n", &tmp);
>>>>>>> + do {
>>>>>>> + ret = read_dt_phandle(line);
>>>>>>> + if (ret == phandle)
>>>>>>> + break;
>>>>>>> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
>>>>>>> +
>>>>>>> + return line;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * This function receives a string that represents a PCI device,
>>>>>>> + * such as '0004:04:00.0', and tells if the device is NVLink2
>>>>>>> + * capable.
>>>>>>> + *
>>>>>>> + * The logic goes as follows:
>>>>>>> + *
>>>>>>> + * 1 - get the phandle of a nvlink of the device, reading the
>>>>>>> + *     'ibm,npu' attribute;
>>>>>>> + * 2 - find the device tree node of the nvlink bus using the
>>>>>>> + *     phandle found in (1)
>>>>>>> + * 3 - get the phandle of the memory region of the nvlink bus
>>>>>>> + * 4 - find the device tree node of the memory region using the
>>>>>>> + *     phandle found in (3)
>>>>>>> + * 5 - read the 'reg' value of the memory region. If the second
>>>>>>> + *     64 bit value is 0x20 0x00, the device is attached to a
>>>>>>> + *     NVLink2 bus.
>>>>>>> + *
>>>>>>> + * If any of these steps fails, the function returns false.
>>>>>>> + */
>>>>>>> +static bool
>>>>>>> +device_is_nvlink2_capable(const char *device)
>>>>>>> +{
>>>>>>> + char *file, *files, *tmp;
>>>>>>> + unsigned int *reg;
>>>>>>> + int phandle;
>>>>>>> +
>>>>>>> +    if ((virAsprintf(&file,
>>>>>>> +                     "/sys/bus/pci/devices/%s/of_node/ibm,npu",
>>>>>>> +                     device)) < 0)
>>>>>>> +        return false;
>>>>>>> +
>>>>>>> + /* Find phandles of nvlinks: */
>>>>>>> + if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> + /* Find a DT node for the phandle found */
>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/pci*",
>>>>>>> +                                      "phandle");
>>>>>>> + if (!files)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>>> +        goto fail;
>>>>>>> +
>>>>>>> +    /* Find a phandle of the GPU memory region of the device.
>>>>>>> +     * The file found above ends with '/phandle' - the memory
>>>>>>> +     * region of the GPU ends with '/memory-region' */
>>>>>>> + tmp = strrchr(file, '/');
>>>>>>> + *tmp = '\0';
>>>>>>> + file = strcat(file, "/memory-region");
>>>>>>> +
>>>>>>> + if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>> + goto fail;
>>>>>>> +
>>>>>>> + file = NULL;
>>>>>>> + VIR_FREE(files);
>>>>>>> +
>>>>>>> + /* Find the memory node for the phandle found above */
>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/memory*",
>>>>>>> +                                      "phandle");
>>>>>>> + if (!files)
>>>>>>> + return false;
>>>>>>> +
>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>>> +        goto fail;
>>>>>>> +
>>>>>>> +    /* And see its size in the second 64bit value of 'reg'.
>>>>>>> +     * First, the end of the file needs to be changed from
>>>>>>> +     * '/phandle' to '/reg' */
>>>>>>> + tmp = strrchr(file, '/');
>>>>>>> + *tmp = '\0';
>>>>>>> + file = strcat(file, "/reg");
>>>>>>> +
>>>>>>> + reg = read_dt_memory_reg(file);
>>>>>>> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
>>>>>>> + return true;
>>>>>> This function is very complex just to find out whether a PCI
>>>>>> device is capable of NVLink or not. Feels wrong to do it this
>>>>>> way, I believe it would be much easier if NVIDIA exposed a sysfs
>>>>>> attribute saying whether a PCI device supports NVLink so that
>>>>>> our node-device driver would take care of this during libvirtd
>>>>>> startup and then you'd only call a single API from the PPC64
>>>>>> helper you're introducing in the next patch to find out whether
>>>>>> you need the alternative formula or not.
>>>>>>
>>>>>> Honestly, apart from the coding style issues, I don't like this
>>>>>> approach and unless there's a really compelling reason for
>>>>>> libvirt to do it in a way which involves spawning a 'find'
>>>>>> process because of a complex pattern and a bunch of data
>>>>>> necessary to filter out, I'd really suggest contacting NVIDIA
>>>>>> about this. It's the same for mdevs, NVIDIA exposes a bunch of
>>>>>> attributes in sysfs which we're able to read.
>>>>> I'll contact NVIDIA and see if there is an easier way (a sysfs
>>>>> attribute, for example) and, if there isn't, try to provide a
>>>>> plausible reason to justify this detection code.
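On the be32toh portability question raised above: be32toh() comes from
<endian.h>, which is glibc/BSD territory rather than a standard header,
but the same conversion can be built from plain byte shifts so it works
on any host. A minimal sketch (the function name here is illustrative;
libvirt's virendian.h helpers such as virReadBufInt32BE take the same
approach):

```c
#include <stdint.h>

/* Build a host-order value from 4 big-endian bytes using shifts only,
 * so the result is correct regardless of host endianness.  This avoids
 * relying on the non-portable be32toh() from <endian.h>. */
static uint32_t
readBufInt32BE(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) |
           ((uint32_t)buf[3]);
}
```

Reading the phandle file as raw bytes and converting with such a helper
would drop the <endian.h> dependency from the patch entirely.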
>>>> Sorry for the delay in responding. The problem is that all the
>>>> V100 GPUs support NVLink, but it may or may not be connected up.
>>>> This is detected at runtime during GPU initialization, which seems
>>>> like much too heavy of an operation to perform as part of
>>>> passthrough initialization. And that's why vfio-pci pieces rely on
>>>> device tree information to figure it out.
>>>>
>>>> Alexey, would it be possible for vfio-pci to export the
>>>> information in a way more friendly to libvirt?
>>> The only information needed here is whether a specific GPU has RAM or
>>> not. This can easily be found from the device-tree, imho quite friendly
>>> already. VFIO gets to know about these new capabilities when the VFIO
>>> PCI device is opened, and we rather want to avoid going that far in
>>> libvirt (open a VFIO container, attach a group, get a vfio-pci fd from
>>> it, enumerate regions - 2 PCI resets on the way, delays, meh).
>>>
>>> btw the first "find" for "ibm,npu" can be skipped - NVLinks have to
>>> be passed too or the entire RAM thing won't work. "find" for the
>>> memory node can also be dropped really - if NVLink bridge OF node
>>> has "memory-region", then VFIO will most likely expose RAM and QEMU
>>> will try using it anyway.
>>
>> Hmmm I am not sure I understood it completely ..... seeing the
>> of_nodes exposed in sysfs of one of the NVLink buses we have:
>>
>> /sys/bus/pci/devices/0007:00:00.0/of_node$ ls
>> class-code           ibm,gpu       ibm,nvlink-speed           memory-region  reg
>> device-id            ibm,loc-code  ibm,pci-config-space-type  name           revision-id
>> ibm,device-tgt-addr  ibm,nvlink    interrupts                 phandle        vendor-id
>>
>> We can make a safe assumption that the V100 GPU will always be passed
>> through with at least one NVLink2 bus.
> Assumption #1: if we want GPU RAM to be available in the guest, an
> NVLink bridge must be passed through. No bridge - no RAM - no rlimit
> adjustment.
>
>
>> How many hops do we need to assert
>> that a given device is a NVLink2 bus from its of_node info?
>>
>> For example, can we say something like:
>>
>> "The device node of this device has ibm,gpu and ibm,nvlink and
>> ibm,nvlink-speed, and ibm,nvlink-speed is 0x8, so this is a NVLink2
>> bus. Since it is not
>
> ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
>
>
>> possible to pass through the bus alone, there is a V100 GPU in this
>> same IOMMU group. So this is a NVLink2 passthrough scenario"
>>
>>
>> Or perhaps:
>>
>> "It has ibm,gpu and ibm,nvlink and the of_node of its memory-region
>> has a reg value of 0x20, thus this is a nvlink2 bus and ....."
>
> 0x20 is the size of the GPU RAM window, which might change in
> different revisions of the hw as well.
>
>> Both alternatives are way simpler than what I'm doing in this
>> patch. I am not sure if they are valid though.
>
> Assumption #2: if the nvlink2 bridge's DT node has memory-region, then
> the GPU will try to use its RAM. The adjustment you'll have to make
> won't depend on anything as the existing GPU RAM placement is so that
> whatever you can possibly throw at QEMU will fit 128TiB window and we
> cannot make the window smaller anyway (next size would lower power of
> two - 64TiB - and GPU RAM lives just above that).
> Ok, so, from the of_node of an unknown device that has been
> passed through to a VM, if the DT node has:
>
>     ibm,gpu
>     ibm,nvlink
>     memory-region
>
> Is that enough to justify the rlimit adjustment for NVLink2?
This is the idea, yes. To be on the safe side, you might want to limit
the check to the IBM vendor ID (not even sure about the device id) and
to PCI bridges only.
>
> I could do better of course and allow adjusting the limit to cover,
> let's say, 66TiB instead of 128TiB, it just seems to be an
> unnecessary complication (we already allocate only the required
> amount of TCE entries on demand).
>
>
>
>>>>>> Thinking about it a bit more, since this is NVIDIA-specific,
>>>>>> having an NVIDIA-only sysfs attribute doesn't help the
>>>>>> node-device driver I mentioned above. In general, we try to
>>>>>> avoid introducing vendor-specific code, so nodedev might not be
>>>>>> the right place to check for the attribute (because it's not
>>>>>> universal and would require vendor-specific elements), but I
>>>>>> think we could have an NVLink helper reading this sysfs
>>>>>> attribute(s) elsewhere in the libvirt codebase.
>>>>>>
>>>>>> Erik
>
>
--
Alexey