On 2019/12/12 1:47 PM, Yan Zhao wrote:
On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote:
> On 2019/12/6 8:49 PM, Yan Zhao wrote:
>> On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
>>> On 2019/12/6 4:22 PM, Yan Zhao wrote:
>>>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>>>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>>>>> Hi:
>>>>>>>
>>>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>>>>> For SRIOV devices, VFs are passed through to the guest directly without
>>>>>>>> host driver mediation. However, when VMs migrate with passed-through VFs,
>>>>>>>> dynamic host mediation is required to (1) get device states and (2) get
>>>>>>>> dirty pages. Since device states, as well as other critical information
>>>>>>>> required for dirty page tracking of VFs, are usually retrieved from PFs,
>>>>>>>> it is handy to provide an extension in the PF driver to centrally control
>>>>>>>> VFs' migration.
>>>>>>>>
>>>>>>>> Therefore, in order to (1) pass through VFs at normal times, (2)
>>>>>>>> dynamically trap VFs' BARs for dirty page tracking, and
>>>>>>> A silly question: what's the reason for doing this? Is this a must for
>>>>>>> dirty page tracking?
>>>>>>>
>>>>>> For performance reasons. VFs' BARs should be passed through at normal
>>>>>> times and only enter the trap state when needed.
>>>>> Right, but how does this matter for the case of dirty page tracking?
>>>>>
>>>> Take a NIC as an example: to track its VF's dirty pages, a software method
>>>> is required to trap every write to the ring tail, which resides in BAR0.
>>> Interesting, but it looks like we need to:
>>> - decode the instruction
>>> - mediate all accesses to BAR0
>>> All of which seems like a great burden for the VF driver. I wonder whether
>>> doing interrupt relay and tracking the head is better in this case.
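For concreteness, here is a minimal sketch of the tail-trap idea being
discussed. The ring layout, the struct names and the mark_iova_dirty() helper
are assumptions made purely for illustration, not code from this series; it
assumes normal kernel context:

/* Hypothetical bookkeeping a mediate driver could keep per VF ring. */
struct vf_ring_desc {
        u64 buf_iova;                   /* guest IOVA of the posted buffer */
        u32 buf_len;
};

struct vf_ring {
        struct vf_ring_desc *desc;      /* shadow of the descriptor ring */
        u32 size;                       /* number of descriptors */
        u32 tail;                       /* last tail value seen from the guest */
        void __iomem *tail_reg;         /* real tail register inside BAR0 */
};

void mark_iova_dirty(u64 iova, u32 len);        /* assumed helper */

/* Called when a trapped guest write to the ring tail in BAR0 is decoded. */
static void handle_tail_write(struct vf_ring *ring, u32 new_tail)
{
        u32 idx;

        /*
         * Descriptors between the old and new tail point to buffers the
         * device may DMA into, so mark those IOVAs dirty.
         */
        for (idx = ring->tail; idx != new_tail; idx = (idx + 1) % ring->size)
                mark_iova_dirty(ring->desc[idx].buf_iova,
                                ring->desc[idx].buf_len);

        ring->tail = new_tail;

        /* Forward the write to the real register so the device keeps going. */
        writel(new_tail, ring->tail_reg);
}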
>>>
>> Hi Jason,
>>
>> I'm not familiar with the way you mentioned. Could you elaborate more?
>
> It looks to me that you want to intercept the BAR that contains the head.
> Then you can figure out the buffers submitted by the driver, but you still
> need to decide a proper time to mark them as dirty.
>
No need to be accurate, right? Just a superset of the real dirty bitmap is
enough.
If the superset is too large compared with the real dirty pages, it will lead
to a lot of side effects.
> What I meant is: intercept the interrupt, then you can still figure out the
> buffers which have been modified by the device and mark them as dirty.
>
> Then there's no need to trap BAR and do decoding/emulation etc.
>
> But it will still be tricky to get this correct...
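And a correspondingly minimal sketch of the interrupt-relay alternative
described above, reusing the hypothetical vf_ring from the earlier sketch with
head_reg, last_head and an eventfd trigger added; again just an illustration of
the idea, not code from any series:

/* Relay handler for the VF's completion interrupt on the host side. */
static irqreturn_t vf_irq_relay(int irq, void *data)
{
        struct vf_ring *ring = data;
        u32 head = readl(ring->head_reg);       /* device's write pointer */
        u32 idx;

        /*
         * Buffers completed since the last interrupt have been written by
         * the device, so mark them dirty before the guest sees the interrupt.
         */
        for (idx = ring->last_head; idx != head; idx = (idx + 1) % ring->size)
                mark_iova_dirty(ring->desc[idx].buf_iova,
                                ring->desc[idx].buf_len);
        ring->last_head = head;

        eventfd_signal(ring->trigger, 1);       /* inject into the guest */
        return IRQ_HANDLED;
}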
>
Intercepting the interrupt is a little hard if posted interrupts are enabled.
We don't care too much about performance in this case. Can we simply
disable it?
I think what you are worried about here is the timing of marking dirty pages,
right? Upon receiving an interrupt, you regard the DMAs as finished and it is
safe to mark them dirty.
But with the BAR trap way, we can at least keep those dirtied pages marked as
dirty until the device stops. Of course, we have other methods to optimize it.
I'm not sure this won't confuse the migration convergence time estimation.
>>>> There's
>>>> still no IOMMU Dirty bit available.
>>>>>>>> (3) centralize
>>>>>>>> VF critical state retrieval and VF controls into one driver, we propose
>>>>>>>> to introduce mediate ops on top of the current vfio-pci device driver.
>>>>>>>>
>>>>>>>>
>>>>>>>>                                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>>>>  __________  register mediate ops |  ___________        ___________   |
>>>>>>>> |          |<----------------------|     VF     |      |           |  |
>>>>>>>> | vfio-pci |                      ||   mediate  |      | PF driver |  |
>>>>>>>> |__________|---------------------->|   driver   |      |___________|  |
>>>>>>>>      |           open(pdev)       |  -----------             |        |
>>>>>>>>      |                            |_ _ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>>>>     \|/                                                     \|/
>>>>>>>>  -----------                                            ------------
>>>>>>>> |    VF     |                                           |    PF    |
>>>>>>>>  -----------                                            ------------
>>>>>>>>
>>>>>>>>
>>>>>>>> The VF mediate driver could be a standalone driver that does not bind to
>>>>>>>> any devices (as in the demo code in patches 5-6) or it could be a built-in
>>>>>>>> extension of the PF driver (as in patches 7-9).
>>>>>>>>
>>>>>>>> Rather than directly binding to a VF, the VF mediate driver registers a
>>>>>>>> mediate ops into vfio-pci at driver init. vfio-pci maintains a list of
>>>>>>>> such mediate ops.
>>>>>>>> (Note that a VF mediate driver can register mediate ops into vfio-pci
>>>>>>>> before vfio-pci binds to any devices, and a VF mediate driver can
>>>>>>>> support mediating multiple devices.)
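To make the flow described here easier to follow, below is a rough sketch of
what such a mediate ops structure and its registration interface could look
like. The callback names come from this cover letter, but the exact signatures
and the register/unregister helper names are illustrative guesses, not the
series' actual definitions:

struct vfio_pci_mediate_ops {
        struct module *owner;
        const char *name;
        /*
         * Return a mediate handle (>= 0) if this driver supports @pdev,
         * or a negative errno so that vfio-pci tries the next registered ops.
         */
        int (*open)(struct pci_dev *pdev);
        void (*release)(int handle);
        int (*get_region_info)(int handle, struct vfio_region_info *info,
                               struct vfio_info_cap *caps);
        /* 'pt' tells vfio-pci whether to also pass the access through to hw */
        ssize_t (*rw)(int handle, char __user *buf, size_t count,
                      loff_t *ppos, bool iswrite, bool *pt);
        int (*mmap)(int handle, struct vm_area_struct *vma, bool *pt);
};

int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops);
void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops);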
>>>>>>>>
>>>>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>>>>> list and calls each vfio_pci_mediate_ops->open() with the pdev of the
>>>>>>>> opening device as a parameter.
>>>>>>>> The VF mediate driver should return success or failure depending on
>>>>>>>> whether it supports the pdev or not.
>>>>>>>> E.g. a VF mediate driver would compare its supported VF devfn with the
>>>>>>>> devfn of the passed-in pdev.
>>>>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>>>>> stop querying other mediate ops and bind the opening device to this
>>>>>>>> mediate ops using the returned mediate handle.
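For illustration only, a VF mediate driver's ->open() could implement the
devfn match mentioned above roughly like this (supported_vfs[] and
alloc_handle() are made-up helpers):

static int my_mediate_open(struct pci_dev *pdev)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(supported_vfs); i++) {
                if (pdev->vendor == supported_vfs[i].vendor &&
                    pdev->device == supported_vfs[i].device &&
                    pdev->devfn == supported_vfs[i].devfn)
                        return alloc_handle(pdev);  /* >= 0: mediate handle */
        }

        return -ENODEV;         /* not ours, vfio-pci queries the next ops */
}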
>>>>>>>>
>>>>>>>> Further vfio-pci ops (the VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on
>>>>>>>> the VF will be intercepted into the VF mediate driver as
>>>>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>>>>> vfio_pci_mediate_ops->rw, and
>>>>>>>> vfio_pci_mediate_ops->mmap, and get customized there.
>>>>>>>> vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap will additionally
>>>>>>>> return 'pt' to indicate whether vfio-pci should further pass the access
>>>>>>>> through to hardware.
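A sketch of how vfio-pci's rw path could consume that 'pt' flag;
vdev->mediate_ops, vdev->mediate_handle and default_rw() are placeholder
names, not necessarily what the patches use:

static ssize_t vfio_pci_mediated_rw(struct vfio_pci_device *vdev,
                                    char __user *buf, size_t count,
                                    loff_t *ppos, bool iswrite)
{
        bool pt = true;
        ssize_t ret = 0;

        if (vdev->mediate_ops && vdev->mediate_ops->rw)
                ret = vdev->mediate_ops->rw(vdev->mediate_handle, buf,
                                            count, ppos, iswrite, &pt);
        if (!pt)
                return ret;     /* fully handled by the mediate driver */

        /* otherwise fall through to the normal passthrough access */
        return default_rw(vdev, buf, count, ppos, iswrite);
}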
>>>>>>>>
>>>>>>>> When vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
>>>>>>>> with the mediate handle as a parameter.
>>>>>>>>
>>>>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets the VF
>>>>>>>> mediate driver differentiate between two opened VFs with the same device
>>>>>>>> id and vendor id.
>>>>>>>>
>>>>>>>> When the VF mediate driver exits, it unregisters its mediate ops from
>>>>>>>> vfio-pci.
>>>>>>>>
>>>>>>>>
>>>>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>>>>> (1) call mediate ops to allow the vendor driver to customize the default
>>>>>>>> region info/rw/mmap of a region.
>>>>>>>> (2) provide a migration region to support migration
>>>>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>>>>> the region to be accessed directly from the guest. Could we simply extend
>>>>>>> the device fd ioctl for doing such things?
>>>>>>>
>>>>>> You may take a look at the mdev live migration discussions in
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>>>>
>>>>>> or the previous discussion at
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>>>>> which has a kernel side implementation at
>>>>>> https://patchwork.freedesktop.org/series/56876/
>>>>>>
>>>>>> Generally speaking, the QEMU part of live migration is consistent between
>>>>>> the vfio-pci + mediate ops way and the mdev way.
>>>>> So in mdev, do you still have a mediate driver? Or do you expect the parent
>>>>> to implement the region?
>>>>>
>>>> No, currently it's only for vfio-pci.
>>> And specific to PCI.
>>>
>>>> The mdev parent driver is free to customize its regions and hence does not
>>>> require these mediate ops hooks.
>>>>
>>>>>> The region is only a channel for QEMU and the kernel to communicate
>>>>>> information without introducing new IOCTLs.
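For readers who have not followed those threads: the migration region in the
referenced proposals begins with a small header that QEMU reads and writes
instead of issuing new ioctls, roughly of the following shape (field names
follow the uAPI being discussed there; the exact v8/v9 layout may differ):

struct vfio_device_migration_info {
        __u32 device_state;     /* running / saving / resuming state bits */
        __u32 reserved;
        __u64 pending_bytes;    /* device data still left to read */
        __u64 data_offset;      /* offset of the data section in the region */
        __u64 data_size;        /* amount of valid data currently staged */
};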
>>>>> Well, at least you introduce a new type of region in the uapi. So this does
>>>>> not answer why a region is better than an ioctl. If the region will only be
>>>>> used by QEMU, using an ioctl is much easier and more straightforward.
>>>>>
>>>> It's not introduced by me :)
>>>> mdev live migration is actually using this way; I'm just keeping
>>>> compatible with that uapi.
>>> I meant e.g. VFIO_REGION_TYPE_MIGRATION.
>>>
>> Here's the history of vfio live migration:
>> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg05564.html
>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html
>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>
>> If you have any concerns about this region way, feel free to comment on the
>> latest v9 patchset:
>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>
>> The patchset here will always be kept compatible with it.
>
> Sure.
>
>
>>>> From my own perspective, my answer is that a region is more flexible
>>>> compared to an ioctl. The vendor driver can freely define the size,
>>>>
>>> Probably not since it's an ABI I think.
>>>
>> That's why I need to define VFIO_REGION_TYPE_MIGRATION here in this
>> patchset, as it's not upstreamed yet.
>> Maybe I should make it into a prerequisite patch, indicating it is not
>> introduced by this patchset.
>
> Yes.
>
>
>>>> and the mmap capability of its data subregion.
>>>>
>>> It doesn't help much unless it can be mapped into the guest (which I don't
>>> think is the case here).
>>>
>> It's accessed by host QEMU, the same way a Linux app accesses mmaped
>> memory. The mmap here is to reduce memory copies from kernel to user space.
>> There is no need for it to be mapped into the guest.
>
> But copy_to_user() is not a bad choice. If I read the code correctly
> only the dirty bitmap was mmaped. This means you probably need to deal
> with dcache carefully on some archs. [1]
>
> Note that KVM doesn't use a shared dirty bitmap; it uses copy_to_user().
>
> [1] https://lkml.org/lkml/2019/4/9/5
>
On those platforms, mmap can be safely disabled by the vendor driver at will.
Then your driver needs to detect the arch and behave differently accordingly.
Also, when mmap is disabled, copy_to_user() is used in the region way as well.
Anyway, please raise your concern in Kirti's thread for this common part.
Well, if I read the code correctly, they are all invented in this series;
Kirti's thread is just a user.
>>>> Also, there're already too many ioctls in vfio.
>>> Probably not :) We have a bunch of subsystems that have many more
>>> ioctls than VFIO (e.g. DRM).
>>>
>>>>>>>> (3) provide a dynamic trap bar info region to allow the vendor driver to
>>>>>>>> control trapping/untrapping of device PCI BARs
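Purely as an illustration of the idea, and not the uAPI actually proposed by
the series, such a region could expose entries describing which BAR ranges
should currently be trapped or passed through:

struct dyn_trap_bar_entry {
        __u32 bar;      /* BAR index, 0-5 */
        __u32 trap;     /* 1: trap accesses to this range, 0: passthrough/mmap */
        __u64 offset;   /* start offset of the range within the BAR */
        __u64 size;     /* length of the range */
};

When the vendor driver flips 'trap' for a range, vfio-pci and QEMU would be
expected to re-set up the corresponding memory mapping so that accesses to
that range start (or stop) being trapped.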
>>>>>>>>
>>>>>>>> This vfio-pci + mediate ops way differs from the mdev way in that
>>>>>>>> (1) the mdev way needs to create a 1:1 mdev device on top of one VF, and a
>>>>>>>> device-specific mdev parent driver is bound to the VF directly.
>>>>>>>> (2) the vfio-pci + mediate ops way does not create mdev devices, and the VF
>>>>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>>>>
>>>>>>>> The reasons why we don't choose the way of writing an mdev parent driver
>>>>>>>> are that
>>>>>>>> (1) VFs are almost all the time directly passed through. Directly binding
>>>>>>>> to vfio-pci can make most of the code shared/reused.
>>>>>>> Can we split out the common parts from vfio-pci?
>>>>>>>
>>>>>> That's very attractive, but one cannot implement a vfio-pci without
>>>>>> exporting everything in it as the common part :)
>>>>> Well, I think it should not be hard to do that. E.g. you can route it
>>>>> back like:
>>>>>
>>>>> vfio -> vfio_mdev -> parent -> vfio_pci
>>>>>
>>>> It's desired for us to have the mediate driver bind to the PF device,
>>>> so once a VF device is created, only the PF driver and vfio-pci are
>>>> required, just the same as what needs to be done for a normal VF
>>>> passthrough.
>>>> Otherwise, a separate parent driver binding to the VF is required.
>>>> Also, this parent driver has many drawbacks, as I mentioned in this
>>>> cover letter.
>>> Well, as discussed, there's no need to duplicate the code; the bar trick
>>> should still work. The main issues I see with this proposal are:
>>>
>>> 1) It's PCI specific; other buses may need something similar.
>> vfio-pci is only for PCI of course.
>
> I meant that if what is proposed here makes sense, other bus drivers like
> vfio-platform may want something similar.
>
Sure, they can follow.
>>> 2) The function is duplicated with mdev, and mdev can do even more.
>>>
>> Could you elaborate on how mdev can solve the above-mentioned problems?
>
> Well, I think both of us agree that mdev can do what mediate ops does; the
> mdev device implementation just needs to add the direct PCI access part.
>
>
>>>>>>>> If we write a
>>>>>>>> vendor-specific mdev parent driver, most of the code (like the passthrough
>>>>>>>> style of rw/mmap) still needs to be copied from the vfio-pci driver, which
>>>>>>>> is actually duplicated and tedious work.
>>>>>>> The mediate ops look quite similar to what vfio-mdev does. And it looks to
>>>>>>> me that we need to consider live migration for mdev as well. In that case,
>>>>>>> do we still expect mediate ops through VFIO directly?
>>>>>>>
>>>>>>>
>>>>>>>> (2) For features like dynamically trapping/untrapping PCI BARs, if they
>>>>>>>> are in vfio-pci, they can be available to most people without repeated
>>>>>>>> code copying and re-testing.
>>>>>>>> (3) With a 1:1 mdev driver which passes through VFs most of the time,
>>>>>>>> people have to decide whether to bind VFs to vfio-pci or to the mdev
>>>>>>>> parent driver before they run into a real migration need. However, if
>>>>>>>> vfio-pci is bound initially, they have no chance to do live migration
>>>>>>>> when there's a need later.
>>>>>>> We can teach the management layer to do this.
>>>>>>>
>>>>>> No, not possible, as vfio-pci by default has no migration region, and
>>>>>> dirty page tracking needs the vendor's mediation, at least for most
>>>>>> passthrough devices now.
>>>>> I'm not quite sure I get it here, but in this case, just teach them to use
>>>>> the driver that has migration support?
>>>>>
>>>> That's a way, but as more and more passthrough devices have demands and
>>>> capabilities to do migration, will vfio-pci be used at all in the future?
>>> This should not be a problem:
>>> - If we introduce a common mdev for vfio-pci, we can just bind that
>>> driver always
>> What is a common mdev for vfio-pci? A common mdev parent driver that has
>> the same implementation as vfio-pci?
>
> The common part is not PCI, of course. The common part is that both mdev
> and mediate ops want to do some kind of mediation. Mdev is bus agnostic,
> but what you propose here is PCI specific while it should be bus agnostic as
> well. Assume we implement a bus-agnostic mediate ops; mdev could even be
> built on top.
>
I believe Alex has already replied to the above better than I could.
>> There's actually already a solution of creating only one mdev on top
>> of each passthrough device and making the mdev share the same iommu group
>> with it. We've also made an implementation of it already. Here's a
>> sample one made by Yi at https://patchwork.kernel.org/cover/11134695/.
>>
>> But, as I said, it's desired to re-use vfio-pci directly for SRIOV,
>> which is straightforward :)
>
> Can we have a device that is capable of both SRIOV and function slicing?
> If yes, does it mean you need to provide two drivers? One for mdev,
> another for mediate ops?
>
What do you mean by "function slicing"? SIOV?
SIOV could be one of the solutions.
For the vendor driver, in SRIOV:
- with the mdev approach, two drivers are required: one mdev parent driver on
the VF, and one PF driver.
- with mediate ops + vfio-pci: one driver on the PF.
In SIOV, only one driver on the PF is needed in both cases.
The point is, e.g. if you have a card that supports both SRIOV and SIOV
(I don't think you will ship a card with SIOV only), then when SRIOV is used,
you need to go for mediate ops, and when SIOV is used, you need to go for
e.g. mdev. It means you need to prepare two sets of drivers for migration.
>>> - The most straightforward way to support dirty page tracking is via the
>>> IOMMU instead of device-specific operations.
>>>
>> No such IOMMU exists yet. And all kinds of platforms should be taken care of, right?
>
> Or the device can track dirty pages by itself; otherwise it would be
> very hard to implement dirty page tracking correctly without the help of
> switching to a software datapath (or maybe you can post the part of BAR0
I think you mixed up "correct" and "accurate".
Nope, it depends on how you define correct. We need to try our best to be
"accurate" instead of just reporting a superset of the dirty pages.
DMA pre-inspection is a long-existing term, and we have implemented and
verified it in a NIC for both the precopy and postcopy cases.
Interesting; for postcopy, you probably need to mediate the whole datapath,
I guess, but again there's no code to demonstrate how it works in this
series.
Though I can't promise there are 100% no bugs, the method is right.
I fully believe mediation is correct.
Also, whether to trap BARs for dirty pages is vendor specific and is not
something this interface part should care about.
The problem is that you introduce a generic interface, so it needs to be
proven useful for other devices. (E.g. for virtio migration, it doesn't
help.)
> mediation and dirty page tracking which is missing in this series?)
>
Currently, that part of the code is owned by Shaopeng's team. The code I
posted is only for demonstrating how to use the interface. Shaopeng's
team is responsible for upstreaming their part on their own schedule.
It would be very hard to evaluate a framework without any real users. If
possible, please post the driver code in the next version.
Thanks
Thanks
Yan
>>>>>>>> In this patchset,
>>>>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by a vendor
>>>>>>>>   driver to mediate/customize region info/rw/mmap.
>>>>>>>>
>>>>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>>>>>   for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>>>>>   what devices it supports via its pciidlist. It also demonstrates how to
>>>>>>>>   dynamically trap a device's PCI BARs. (By adding more pciids to its
>>>>>>>>   pciidlist, this sample driver is actually not necessarily limited to
>>>>>>>>   supporting IGDs.)
>>>>>>>>
>>>>>>>> - patches 7-9 provide a sample on the i40e driver that supports the Intel(R)
>>>>>>>>   Ethernet Controller XL710 family of devices. It supports VF precopy live
>>>>>>>>   migration on Intel's 710 SRIOV. (But we commented out the real
>>>>>>>>   implementation of the dirty page tracking and device state retrieving
>>>>>>>>   parts to focus on demonstrating the framework part. We will send them out
>>>>>>>>   in future versions.)
>>>>>>>>   patch 7 registers/unregisters VF mediate ops when the PF driver
>>>>>>>>   probes/removes. It specifies its supported VFs via
>>>>>>>>   vfio_pci_mediate_ops->open(pdev)
>>>>>>>>
>>>>>>>>   patch 8 reports a device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>>>>>   provides a sample implementation of the migration region.
>>>>>>>>   The QEMU part of vfio migration is based on v8:
>>>>>>>>   https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>>>>>   We are not based on the recent v9 because we think there are still open
>>>>>>>>   issues in the dirty page tracking part of that series.
>>>>>>>>
>>>>>>>>   patch 9 reports a device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>>>>>   provides an example of how to trap part of BAR0 when migration starts
>>>>>>>>   and pass through this part of BAR0 again when migration fails.
>>>>>>>>
>>>>>>>> Yan Zhao (9):
>>>>>>>>   vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>>>>>   vfio/pci: test existence before calling region->ops
>>>>>>>>   vfio/pci: register a default migration region
>>>>>>>>   vfio-pci: register default dynamic-trap-bar-info region
>>>>>>>>   samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>>>>>   sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>>>>>   i40e/vf_migration: register mediate_ops to vfio-pci
>>>>>>>>   i40e/vf_migration: mediate migration region
>>>>>>>>   i40e/vf_migration: support dynamic trap of bar0
>>>>>>>>
>>>>>>>>  drivers/net/ethernet/intel/Kconfig            |   2 +-
>>>>>>>>  drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>>>>>>>>  drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
>>>>>>>>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>>>>>>>  .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
>>>>>>>>  .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
>>>>>>>>  drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
>>>>>>>>  drivers/vfio/pci/vfio_pci_private.h           |   2 +
>>>>>>>>  include/linux/vfio.h                          |  18 +
>>>>>>>>  include/uapi/linux/vfio.h                     | 160 +++++
>>>>>>>>  samples/Kconfig                               |   6 +
>>>>>>>>  samples/Makefile                              |   1 +
>>>>>>>>  samples/vfio-pci/Makefile                     |   2 +
>>>>>>>>  samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
>>>>>>>>  14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>>>>>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>>>>>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>>>>>  create mode 100644 samples/vfio-pci/Makefile
>>>>>>>>  create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>>>>