On Wed, 2 Aug 2017 21:16:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>
> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>
>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>> "Gao, Ping A" <ping.a.gao(a)intel.com> wrote:
>>>
>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>> [cc +libvir-list]
>>>>>>
>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>> "Gao, Ping A" <ping.a.gao(a)intel.com> wrote:
>>>>>>
>>>>>>> The vfio-mdev framework provides the capability to let different
>>>>>>> guests share the same physical device through mediated sharing. As a
>>>>>>> result it brings a requirement about how to control the device
>>>>>>> sharing; we need a QoS-related interface for mdev to manage virtual
>>>>>>> device resources.
>>>>>>>
>>>>>>> E.g. in practical use, vGPUs assigned to different guests almost
>>>>>>> always have different performance requirements: some guests may need
>>>>>>> higher priority for real-time usage, others may need a larger portion
>>>>>>> of the GPU resource to get higher 3D performance. Correspondingly we
>>>>>>> can define interfaces like weight/cap for overall budget control and
>>>>>>> priority for single-submission control.
>>>>>>>
>>>>>>> So I suggest adding some common, vendor-agnostic attributes in the
>>>>>>> mdev core sysfs for QoS purposes.
>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>> attribute_group which a vendor can optionally include within the
>>>>>> existing mdev_parent_ops.mdev_attr_groups. The mdev core will
>>>>>> transparently enable this, but it really only provides the standard;
>>>>>> all of the support code is left to the vendor. I'm fine with that,
>>>>>> but of course the trouble with any sort of standardization is
>>>>>> arriving at an agreed-upon standard. Are there QoS knobs that are
>>>>>> generic across any mdev device type? Are there others that are more
>>>>>> specific to vGPU? Are there existing examples of this whose
>>>>>> specification we could steal?
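For illustration only, a vendor-side group could look roughly like the
sketch below. The "qos" group name, the 0-100 cap semantics, and the
my_vendor_*() helpers are all hypothetical, nothing here is an existing
interface, and all of the real policy stays in the vendor driver:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

/* Hypothetical vendor helpers: placeholders for real per-mdev state. */
unsigned int my_vendor_get_cap(struct device *dev);
void my_vendor_set_cap(struct device *dev, unsigned int cap);

static ssize_t cap_show(struct device *dev, struct device_attribute *attr,
			char *buf)
{
	/* report the per-mdev cap stored by the vendor driver */
	return sprintf(buf, "%u\n", my_vendor_get_cap(dev));
}

static ssize_t cap_store(struct device *dev, struct device_attribute *attr,
			 const char *buf, size_t count)
{
	unsigned int cap;

	if (kstrtouint(buf, 10, &cap) || cap > 100)
		return -EINVAL;
	my_vendor_set_cap(dev, cap);
	return count;
}
static DEVICE_ATTR_RW(cap);

/* weight and priority would follow the same show/store pattern */

static struct attribute *qos_attrs[] = {
	&dev_attr_cap.attr,
	NULL,
};

static const struct attribute_group qos_attr_group = {
	.name  = "qos",
	.attrs = qos_attrs,
};

static const struct attribute_group *my_mdev_groups[] = {
	&qos_attr_group,
	NULL,
};

/* plugged in via mdev_parent_ops.mdev_attr_groups = my_mdev_groups */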
>>>>> Yes, you are right, standardized QoS knobs are exactly what I wanted.
>>>>> Only when they become part of the mdev framework and libvirt can such
>>>>> a critical feature as QoS be leveraged by cloud usage. HW vendors
>>>>> then only need to focus on implementing the corresponding QoS
>>>>> algorithm in their back-end driver.
>>>>>
>>>>> The vfio-mdev framework provides the capability to share a device
>>>>> that lacks HW virtualization support among guests, no matter the
>>>>> device type; mediated sharing is essentially a time-sharing multiplex
>>>>> method. From this point of view, QoS can be taken as a generic way to
>>>>> control the time assignment for the virtual mdev devices that occupy
>>>>> the HW, so we can define QoS knobs that are generic across any device
>>>>> type. Even if the HW has some kind of built-in QoS support, I think
>>>>> it's not a problem for the back-end driver to convert the mdev
>>>>> standard QoS definition to its own specification to reach the same
>>>>> performance expectation. There seem to be no examples for us to
>>>>> follow, so we need to define it from scratch.
>>>>>
>>>>> I propose universal QoS control interfaces like the below:
>>>>>
>>>>> Cap: The cap limits the maximum percentage of time an mdev device can
>>>>> own the physical device. E.g. cap=60 means the mdev device cannot
>>>>> take more than 60% of the total physical resource.
>>>>>
>>>>> Weight: The weight defines proportional control of the mdev device
>>>>> resource between guests; it's orthogonal to Cap and targets load
>>>>> balancing. E.g. if guest 1 should take double the mdev device
>>>>> resource compared with guest 2, set the weight ratio to 2:1.
>>>>>
>>>>> Priority: The guest with higher priority gets execution first,
>>>>> targeting real-time usage and speeding up interactive response.
>>>>>
>>>>> The above QoS interfaces cover both overall budget control and single
>>>>> submission control. I will send out a detailed design later once we
>>>>> get aligned.
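Purely to make the proposal concrete, such knobs could presumably surface
as per-mdev writable sysfs attributes, something along the lines of the
layout below. The "qos" directory and the attribute names are only a
strawman, not an existing interface:

  /sys/bus/mdev/devices/<uuid>/qos/cap       # 0-100, max % of the parent device
  /sys/bus/mdev/devices/<uuid>/qos/weight    # relative share vs. other mdevs on the same parent
  /sys/bus/mdev/devices/<uuid>/qos/priority  # scheduling class for submission ordering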
>>>> Hi Alex,
>>>> Any comments about the interface mentioned above?
>>> Not really.
>>>
>>> Kirti, are there any QoS knobs that would be interesting
>>> for NVIDIA devices?
>>>
>> We have different types of vGPU for different QoS factors.
>>
>> When an mdev device is created, its resources are allocated
>> irrespective of which VM/userspace app is going to use that mdev
>> device. Any parameter we add here should be tied to the particular mdev
>> device and not to the guest/app that is going to use it. 'Cap' and
>> 'Priority' are along that line. Not all mdev devices might need/use
>> these parameters, so these can be made optional interfaces.
>
> We also define some QoS parameters in the Intel vGPU types, but that
> only provides a simplistic default. We still need a flexible approach
> that gives the user the ability to change QoS parameters freely and
> dynamically according to their requirements, not restricted to the
> current limited and static vGPU types.
>
>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>> devices on the same physical device.
>>
>> In the above example, "if guest 1 should take double mdev device
>> resource compared with guest 2", but what if guest 2 never booted? How
>> will you calculate resources?
>
> Cap tries to limit the max physical GPU resource for a vGPU; it's a
> vertical limitation, while weight is a horizontal limitation that
> defines the GPU resource consumption ratio between vGPUs. Cap is easy to
> understand as it's just a percentage. For weight, for example, if we
> define the max weight as 16, vGPU_1 with weight 8 should be assigned
> double the GPU resources compared to vGPU_2 whose weight is 4. We can
> translate that to this formula: resource_of_vGPU_1 = 8 / (8+4) *
> total_physical_GPU_resource.
>
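To make the proposed weight semantics concrete, the per-vGPU share
reduces to something like the helper below. This is a sketch only; the
function name and the percentage scaling are made up here, and weight_sum
is the sum over all mdevs currently scheduled on the parent, including
this one:

static unsigned int mdev_share_pct(unsigned int weight,
				   unsigned int weight_sum)
{
	if (!weight_sum)
		return 0;	/* avoid division by zero */
	return (weight * 100) / weight_sum;
}

/* e.g. mdev_share_pct(8, 8 + 4) == 66 and mdev_share_pct(4, 8 + 4) == 33 */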
How will the vendor driver provide the max weight to the userspace
application/libvirt? The max weight will be per physical device, right?
How would such resource allocation be reflected in 'available_instances'?
Suppose in the above example vGPU_1 is of 1G FB with weight 8, vGPU_2 of
1G FB with weight 4 and vGPU_3 of 1G FB with weight 4. Now you have 1G of
FB free but you have reached the max weight, so will you make
available_instances = 0 for all types on that physical GPU?
No, per the algorithm above, the available scheduling for the remaining
mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
we'd need to define or make the range discoverable, 16 seems rather
arbitrary). We can always add new scheduling participants. AIUI,
Intel uses round-robin scheduling now, where you could consider all
mdev devices to have the same weight. Whether we consider that to be a
weight of 16 or zero or 8 doesn't really matter.
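(Putting numbers on it just as an illustration: with 8 + 4 + 4 already
allocated, a fourth mdev added at N=16 would get 16/32 = 50% and vGPU_1's
share would drop from 8/16 = 50% to 8/32 = 25%; at N=8 both the newcomer
and vGPU_1 land at 8/24, i.e. a third each.)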
> If only one guest exists, then there is no target to compare against;
> weight becomes meaningless and the single guest enjoys the whole
> physical GPU.
>
If a single VM, say the one with vGPU_1, has been running for a long
time, i.e. it enjoys the whole GPU, but then another VM boots with weight
4, will you cut down the resources of vGPU_1 at runtime? Wouldn't that
show performance degradation for the VM with vGPU_1 at runtime?
Yes. We have this already though: vGPU_1 may enjoy the whole GPU
simply because the other vGPUs are idle, and that can change at any time
and reduce the resources available to vGPU_1. Do we want a QoS knob
for fixed scheduling slices? With only cap, weight, and priority, how
could I provide an SLA for no less than 40% of the GPU? I guess we can
get that with careful use of weight, but I wonder if we could make it
simpler for users.
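(As an illustration of why that gets awkward: if the weight range tops
out at 16, a device set to weight 16 is only guaranteed 16/(16 + 16k)
with k other maxed-out participants, i.e. 33% with two others and 25%
with three, so a 40% floor can't be expressed through weight alone unless
the admin also bounds how many mdevs the parent will host.)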
>> If libvirt/another toolstack decides to do smart allocation based on
>> the type name without taking the physical host device as input, guest 1
>> and guest 2 might get mdev devices created on different physical
>> devices. Would the weighting matter then?
>
> If what you mean is the case where two discrete GPU cards exist and the
> vGPU types can be freely allocated on them, IMO the back-end driver
> should handle such a case, as the number of physical devices is
> transparent to the tool stack, e.g. by presenting multiple physical
> devices as one logical device to mdev.
>
No, generally the toolstack is aware of the available physical devices
and could have smart logic to decide on which physical device an mdev
device should be created, i.e. to load up one physical device first or to
distribute the load across physical devices as mdev devices are created.
Libvirt doesn't have such logic now, but having it in libvirt was
discussed earlier.
In that case, as I said above, wouldn't that show perf degradation on the
running VMs at runtime?
It seems that the proposed cap, weight, and priority only handle QoS
within a single parent device. All the knobs are relative to other
scheduling participants on that parent device. The same QoS parameters
for mdev devices on separate parent devices could have wildly different
performance characteristics depending on the load the other mdev
devices are inflicting. If there's only one such parent device on the
system, this works. libvirt has already effectively rejected the idea
of automating mdev placement and perhaps this is another similar case
where we simply require some higher level management tool to have a
global view of the system. Thanks,
Alex