On 8/7/2017 1:11 PM, Gao, Ping A wrote:
On 2017/8/4 5:11, Alex Williamson wrote:
> On Thu, 3 Aug 2017 20:26:14 +0800
> "Gao, Ping A" <ping.a.gao(a)intel.com> wrote:
>
>> On 2017/8/3 0:58, Alex Williamson wrote:
>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>>>
>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>> "Gao, Ping A" <ping.a.gao(a)intel.com> wrote:
>>>>>>>
>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>
>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>> "Gao, Ping A"
<ping.a.gao(a)intel.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The vfio-mdev framework provides the capability to let
>>>>>>>>>>> different guests share the same physical device through
>>>>>>>>>>> mediated sharing. As a result it brings a requirement for
>>>>>>>>>>> how to control that sharing: we need a QoS-related
>>>>>>>>>>> interface for mdev to manage virtual device resources.
>>>>>>>>>>>
>>>>>>>>>>> E.g. in practical use, vGPUs assigned to different guests
>>>>>>>>>>> almost always have different performance requirements; some
>>>>>>>>>>> guests may need higher priority for real-time usage, others
>>>>>>>>>>> may need a larger portion of the GPU resource for higher 3D
>>>>>>>>>>> performance. Correspondingly, we can define interfaces like
>>>>>>>>>>> weight/cap for overall budget control and priority for
>>>>>>>>>>> single-submission control.
>>>>>>>>>>>
>>>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes
>>>>>>>>>>> to the mdev core sysfs for QoS purposes.
>>>>>>>>>> I think what you're asking for is just some standardization
>>>>>>>>>> of a QoS attribute_group which a vendor can optionally
>>>>>>>>>> include within the existing mdev_parent_ops.mdev_attr_groups.
>>>>>>>>>> The mdev core will transparently enable this, but it really
>>>>>>>>>> only provides the standard; all of the support code is left
>>>>>>>>>> to the vendor. I'm fine with that, but of course the trouble
>>>>>>>>>> with any sort of standardization is arriving at an agreed
>>>>>>>>>> upon standard. Are there QoS knobs that are generic across
>>>>>>>>>> any mdev device type? Are there others that are more specific
>>>>>>>>>> to vGPU? Are there existing examples of this whose
>>>>>>>>>> specification we can steal?
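>>>>>>>>>>
>>>>>>>>>> Purely as an illustration of the mechanism (the "qos" group
>>>>>>>>>> name, the my_vgpu state, and the 0-100 cap semantics are
>>>>>>>>>> assumptions, not an agreed standard), a vendor driver could
>>>>>>>>>> hang such a group off mdev_parent_ops.mdev_attr_groups:
>>>>>>>>>>
>>>>>>>>>>   #include <linux/device.h>
>>>>>>>>>>   #include <linux/kernel.h>   /* kstrtouint, sprintf */
>>>>>>>>>>
>>>>>>>>>>   static ssize_t cap_show(struct device *dev,
>>>>>>>>>>                           struct device_attribute *attr,
>>>>>>>>>>                           char *buf)
>>>>>>>>>>   {
>>>>>>>>>>           /* hypothetical lookup of vendor per-mdev state */
>>>>>>>>>>           struct my_vgpu *vgpu = dev_get_drvdata(dev);
>>>>>>>>>>
>>>>>>>>>>           return sprintf(buf, "%u\n", vgpu->cap);
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>   static ssize_t cap_store(struct device *dev,
>>>>>>>>>>                            struct device_attribute *attr,
>>>>>>>>>>                            const char *buf, size_t count)
>>>>>>>>>>   {
>>>>>>>>>>           struct my_vgpu *vgpu = dev_get_drvdata(dev);
>>>>>>>>>>           unsigned int val;
>>>>>>>>>>
>>>>>>>>>>           if (kstrtouint(buf, 10, &val) || val > 100)
>>>>>>>>>>                   return -EINVAL;
>>>>>>>>>>
>>>>>>>>>>           vgpu->cap = val; /* percent of device time */
>>>>>>>>>>           return count;
>>>>>>>>>>   }
>>>>>>>>>>   static DEVICE_ATTR_RW(cap);
>>>>>>>>>>
>>>>>>>>>>   static struct attribute *qos_attrs[] = {
>>>>>>>>>>           &dev_attr_cap.attr,
>>>>>>>>>>           NULL,
>>>>>>>>>>   };
>>>>>>>>>>
>>>>>>>>>>   static const struct attribute_group qos_group = {
>>>>>>>>>>           .name  = "qos", /* appears as .../<uuid>/qos/cap */
>>>>>>>>>>           .attrs = qos_attrs,
>>>>>>>>>>   };
>>>>>>>>>>
>>>>>>>>>> The group would then be listed in the NULL-terminated array
>>>>>>>>>> the vendor passes as mdev_attr_groups; the mdev core creates
>>>>>>>>>> the files, and all of the semantics stay in the vendor driver.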
>>>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I
>>>>>>>>> wanted. Only when it becomes part of the mdev framework and
>>>>>>>>> libvirt can such a critical feature as QoS be leveraged by
>>>>>>>>> cloud usage. HW vendors then only need to focus on
>>>>>>>>> implementing the corresponding QoS algorithm in their back-end
>>>>>>>>> driver.
>>>>>>>>>
>>>>>>>>> The vfio-mdev framework provides the capability to share a
>>>>>>>>> device that lacks HW virtualization support among guests,
>>>>>>>>> whatever the device type. Mediated sharing is essentially a
>>>>>>>>> time-sharing multiplexing method; from this point of view, QoS
>>>>>>>>> can be taken as a generic way to control the time assignment
>>>>>>>>> for the virtual mdev devices that occupy the HW. As a result
>>>>>>>>> we can define QoS knobs that are generic across any device
>>>>>>>>> type this way. Even if the HW has some kind of QoS support
>>>>>>>>> built in, I think it's not a problem for the back-end driver
>>>>>>>>> to convert the mdev standard QoS definition to its own
>>>>>>>>> specification to reach the same performance expectation. There
>>>>>>>>> seem to be no existing examples for us to follow, so we need
>>>>>>>>> to define it from scratch.
>>>>>>>>>
>>>>>>>>> I propose universal QoS control interfaces like the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> Cap: The cap limits the maximum percentage of time an mdev
>>>>>>>>> device can own the physical device. E.g. cap=60 means the mdev
>>>>>>>>> device cannot take more than 60% of the total physical
>>>>>>>>> resource.
>>>>>>>>>
>>>>>>>>> Weight: The weight defines proportional control of the mdev
>>>>>>>>> device resources between guests; it's orthogonal to Cap and
>>>>>>>>> targets load balancing. E.g. if guest 1 should get double the
>>>>>>>>> mdev device resources compared with guest 2, set the weight
>>>>>>>>> ratio to 2:1.
>>>>>>>>>
>>>>>>>>> Priority: The guest with higher priority gets execution first,
>>>>>>>>> targeting real-time usage and speeding up interactive
>>>>>>>>> response.
>>>>>>>>>
>>>>>>>>> The QoS interfaces above cover both overall budget control and
>>>>>>>>> single-submission control. I will send out a detailed design
>>>>>>>>> later once we get aligned.
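>>>>>>>>>
>>>>>>>>> To make the proposal concrete, one possible (purely
>>>>>>>>> illustrative, not yet agreed) per-mdev sysfs layout for these
>>>>>>>>> knobs could be:
>>>>>>>>>
>>>>>>>>>   /sys/bus/mdev/devices/<uuid>/qos/cap       (0-100, percent)
>>>>>>>>>   /sys/bus/mdev/devices/<uuid>/qos/weight    (relative share)
>>>>>>>>>   /sys/bus/mdev/devices/<uuid>/qos/priority  (execution order)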
>>>>>>>> Hi Alex,
>>>>>>>> Any comments about the interface mentioned above?
>>>>>>> Not really.
>>>>>>>
>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>> for NVIDIA devices?
>>>>>>>
>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>
>>>>>> When an mdev device is created, its resources are allocated
>>>>>> irrespective of which VM/userspace app is going to use that mdev
>>>>>> device. Any parameter we add here should be tied to a particular
>>>>>> mdev device and not to the guest/app that is going to use it.
>>>>>> 'Cap' and 'Priority' are along that line. Not all mdev devices
>>>>>> might need/use these parameters, so these can be made optional
>>>>>> interfaces.
>>>>> We also define some QoS parameters in the Intel vGPU types, but
>>>>> they only provide fixed defaults. We still need a flexible
>>>>> approach that gives the user the ability to change QoS parameters
>>>>> freely and dynamically according to their requirements, not
>>>>> restricted to the current limited and static vGPU types.
>>>>>
>>>>>> In the above proposal, I'm not sure how 'Weight' would work for
>>>>>> mdev devices on the same physical device.
>>>>>>
>>>>>> In the above example, "if guest 1 should take double the mdev
>>>>>> device resources compared with guest 2" - but what if guest 2
>>>>>> never boots, how will you calculate resources?
>>>>> Cap tries to limit the maximum physical GPU resource a vGPU can
>>>>> use; it's a vertical limitation. Weight is a horizontal
>>>>> limitation that defines the GPU resource consumption ratio
>>>>> between vGPUs. Cap is easy to understand as it's just a
>>>>> percentage. For weight, for example, if we define the max weight
>>>>> as 16, then vGPU_1 with weight 8 should be assigned double the
>>>>> GPU resources compared to vGPU_2 whose weight is 4. We can
>>>>> translate that into this formula: resource_of_vGPU_1 = 8 / (8+4)
>>>>> * total_physical_GPU_resource.
>>>>>
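>>>>> As a minimal sketch of that formula (my illustration only; the
>>>>> helper name and types are assumptions, not code from any posted
>>>>> patch), a back-end driver could compute a vGPU's share as:
>>>>>
>>>>>   #include <linux/math64.h>   /* div_u64 */
>>>>>   #include <linux/types.h>
>>>>>
>>>>>   /* share of a vGPU = its weight / sum of all active weights */
>>>>>   static u64 vgpu_share(u32 weight, u32 total_weight, u64 total_res)
>>>>>   {
>>>>>           /* e.g. weight = 8, total_weight = 12 -> 2/3 of total */
>>>>>           return div_u64(total_res * weight, total_weight);
>>>>>   }
>>>>>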
>>>> How will the vendor driver provide the max weight to a userspace
>>>> application/libvirt? Max weight will be per physical device,
>>>> right?
>>>>
>>>> How would such resource allocation be reflected in
>>>> 'available_instances'? Suppose in the above example vGPU_1 has 1G
>>>> FB with weight 8, vGPU_2 has 1G FB with weight 4, and vGPU_3 has
>>>> 1G FB with weight 4. Now you have 1G FB free but you have reached
>>>> the max weight, so will you make available_instances = 0 for all
>>>> types on that physical GPU?
>>> No, per the algorithm above, the available scheduling for the remaining
>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>> we'd need to define or make the range discoverable, 16 seems rather
>>> arbitrary). We can always add new scheduling participants. AIUI,
>>> Intel uses round-robin scheduling now, where you could consider all
>>> mdev devices to have the same weight. Whether we consider that to be a
>>> weight of 16 or zero or 8 doesn't really matter.
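>>>
>>> To make the arithmetic concrete: a new mdev given N = 16 would get
>>> 16 / (8 + 4 + 4 + 16) = 50% of the device, while the existing
>>> three drop to 25%, 12.5% and 12.5% respectively.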
>> QoS is to control the device's processing capability, like GPU
>> rendering/computing, which can be time-multiplexed; it is not used
>> to control dedicated, partitioned resources like FB, so there is no
>> impact on 'available_instances'.
>>
>> if vGPU_1 weight=8, vGPU_2 weight=4;
>> then vGPU_1_res = 8 / (8 + 4) * total, vGPU_2_res = 4 / (8 + 4) * total;
>> if vGPU_3 is created with weight 2;
>> then vGPU_1_res = 8 / (8 + 4 + 2) * total, vGPU_2_res =
>> 4 / (8 + 4 + 2) * total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>
>> The resource allocations of vGPU_1 and vGPU_2 change dynamically
>> after vGPU_3 is created; that's what weight does, as it defines the
>> relationship among all the vGPUs, so the performance degradation
>> meets expectations. The end-user should know about this behavior.
>>
>> However, the argument about weight makes me reflect: does the
>> end-user really need weight? Is there an actual application
>> requirement for it? Maybe cap and priority are enough?
> What sort of SLAs do you want to be able to offer? For instance if I
> want to be able to offer a GPU in 1/4 increments, how does that work?
> I might sell customers A & B 1/4 increment each and customer C a 1/2
> increment. If weight is removed, can we do better than capping A & B
> at 25% each and C at 50%? That has the downside that nobody gets to
> use the unused capacity of the other clients. The SLA is some sort of
> "up to X% (and no more)" model. With weighting it's as simple as
> making sure customer C's vGPU has twice the weight of that given to
> A or B.
> Then you get an "at least X%" SLA model and any customer can use up to
> 100% if the others are idle. Combining weight and cap, we can do "at
> least X%, but no more than Y%".
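>
> To make that concrete: giving A and B weight 4 each and C weight 8
> (total 16), with caps of 40/40/80, yields "at least 25%, at most
> 40%" for A and B, and "at least 50%, at most 80%" for C.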
>
> All of this feels really similar to how cpusets must work since we're
> just dealing with QoS relative to scheduling and we should not try to
> reinvent scheduling QoS. Thanks,
>
Yeah, those were also my original thoughts.
Since we are aligned on the basic QoS definition, I'm going to
prepare the code on the kernel side. How about the corresponding part
in libvirt? Should it be implemented separately after the kernel
interface is finalized?
Ok. These interfaces should be optional, since not all mdev vendor
drivers may support such QoS.
Thanks,
Kirti.