On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
On Wed, 4 Dec 2019 22:25:36 -0500
Yan Zhao <yan.y.zhao(a)intel.com> wrote:
> when vfio-pci is bound to a physical device, almost all the hardware
> resources are passthroughed.
> Sometimes, vendor driver of this physcial device may want to mediate some
> hardware resource access for a short period of time, e.g. dirty page
> tracking during live migration.
>
> Here we introduce mediate ops in vfio-pci for this purpose.
>
> Vendor driver can register a mediate ops to vfio-pci.
> But rather than directly bind to the passthroughed device, the
> vendor driver is now either a module that does not bind to any device or
> a module binds to other device.
> E.g. when passing through a VF device that is bound to vfio-pci modules,
> PF driver that binds to PF device can register to vfio-pci to mediate
> VF's regions, hence supporting VF live migration.
>
> The sequence goes like this:
> 1. Vendor driver register its vfio_pci_mediate_ops to vfio-pci driver
>
> 2. vfio-pci maintains a list of those registered vfio_pci_mediate_ops
>
> 3. Whenever vfio-pci opens a device, it searches the list and call
> vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> mediating this device.
> Upon a success return value of from vfio_pci_mediate_ops->open(),
> vfio-pci will stop list searching and store a mediate handle to
> represent this open into vendor driver.
> (so if multiple vendor drivers support mediating a device through
> vfio_pci_mediate_ops, only one will win, depending on their registering
> sequence)
>
> 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> ops, it will chain into vfio_pci_mediate_ops->get_region_info(), so that
> vendor driver is able to override a region's default flags and caps,
> e.g. adding a sparse mmap cap to passthrough only sub-regions of a whole
> region.
>
> 5. vfio_pci_rw()/vfio_pci_mmap() first calls into
> vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmaps().
> if pt=true is rteturned, vfio_pci_rw()/vfio_pci_mmap() will further
> passthrough this read/write/mmap to physical device, otherwise it just
> returns without touch physical device.
>
> 6. When vfio-pci closes a device, vfio_pci_release() chains into
> vfio_pci_mediate_ops->release() to close the reference in vendor driver.
>
> 7. Vendor driver unregister its vfio_pci_mediate_ops when driver exits
>
> Cc: Kevin Tian <kevin.tian(a)intel.com>
>
> Signed-off-by: Yan Zhao <yan.y.zhao(a)intel.com>
> ---
> drivers/vfio/pci/vfio_pci.c | 146 ++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_private.h | 2 +
> include/linux/vfio.h | 16 +++
> 3 files changed, 164 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 02206162eaa9..55080ff29495 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(disable_idle_d3,
> "Disable using the PCI D3 low power state for idle, unused devices");
>
> +static LIST_HEAD(mediate_ops_list);
> +static DEFINE_MUTEX(mediate_ops_list_lock);
> +struct vfio_pci_mediate_ops_list_entry {
> + struct vfio_pci_mediate_ops *ops;
> + int refcnt;
> + struct list_head next;
> +};
> +
> static inline bool vfio_vga_disabled(void)
> {
> #ifdef CONFIG_VFIO_PCI_VGA
> @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> if (!(--vdev->refcnt)) {
> vfio_spapr_pci_eeh_release(vdev->pdev);
> vfio_pci_disable(vdev);
> + if (vdev->mediate_ops && vdev->mediate_ops->release) {
> + vdev->mediate_ops->release(vdev->mediate_handle);
> + vdev->mediate_ops = NULL;
> + }
> }
>
> mutex_unlock(&vdev->reflck->lock);
> @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> {
> struct vfio_pci_device *vdev = device_data;
> int ret = 0;
> + struct vfio_pci_mediate_ops_list_entry *mentry;
>
> if (!try_module_get(THIS_MODULE))
> return -ENODEV;
> @@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
> goto error;
>
> vfio_spapr_pci_eeh_open(vdev->pdev);
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry(mentry, &mediate_ops_list, next) {
> + u64 caps;
> + u32 handle;
Wouldn't it seem likely that the ops provider might use this handle as
a pointer, so we'd want it to be an opaque void*?
yes, you are right, handle as a pointer is much better. will change it.
Thanks :)
> +
> + memset(&caps, 0, sizeof(caps));
@caps has no purpose here, add it if/when we do something with it.
It's also a standard type, why are we memset'ing it rather than just
=0??
> + ret = mentry->ops->open(vdev->pdev, &caps, &handle);
> + if (!ret) {
> + vdev->mediate_ops = mentry->ops;
> + vdev->mediate_handle = handle;
> +
> + pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for
%x:%x\n",
> + vdev->mediate_ops->name, caps,
> + handle, vdev->pdev->vendor,
> + vdev->pdev->device);
Generally not advisable to make user accessible printks.
ok.
> + /*
> + * only find the first matching mediate_ops,
> + * and add its refcnt
> + */
> + mentry->refcnt++;
> + break;
> + }
> + }
> + mutex_unlock(&mediate_ops_list_lock);
> }
> vdev->refcnt++;
> error:
> @@ -736,6 +773,14 @@ static long vfio_pci_ioctl(void *device_data,
> info.size = pdev->cfg_size;
> info.flags = VFIO_REGION_INFO_FLAG_READ |
> VFIO_REGION_INFO_FLAG_WRITE;
> +
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
These would be a lot cleaner if we could just call a helper function:
void vfio_pci_region_info_mediation_hook(vdev, info, caps, etc...)
{
if (vdev->mediate_ops
vdev->mediate_ops->get_region_info)
vdev->mediate_ops->get_region_info(vdev->mediate_handle,
&info, &caps, NULL);
}
I'm not thrilled with all these hooks, but not open coding every one of
them might help.
ok. got it.
> +
> break;
> case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> @@ -756,6 +801,13 @@ static long vfio_pci_ioctl(void *device_data,
> }
> }
>
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
> +
> break;
> case VFIO_PCI_ROM_REGION_INDEX:
> {
> @@ -794,6 +846,14 @@ static long vfio_pci_ioctl(void *device_data,
> }
>
> pci_write_config_word(pdev, PCI_COMMAND, orig_cmd);
> +
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
> +
> break;
> }
> case VFIO_PCI_VGA_REGION_INDEX:
> @@ -805,6 +865,13 @@ static long vfio_pci_ioctl(void *device_data,
> info.flags = VFIO_REGION_INFO_FLAG_READ |
> VFIO_REGION_INFO_FLAG_WRITE;
>
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, NULL);
> + }
> +
> break;
> default:
> {
> @@ -839,6 +906,13 @@ static long vfio_pci_ioctl(void *device_data,
> if (ret)
> return ret;
> }
> +
> + if (vdev->mediate_ops &&
> + vdev->mediate_ops->get_region_info) {
> + vdev->mediate_ops->get_region_info(
> + vdev->mediate_handle,
> + &info, &caps, &cap_type);
> + }
> }
> }
>
> @@ -1151,6 +1225,16 @@ static ssize_t vfio_pci_rw(void *device_data, char __user
*buf,
> if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> return -EINVAL;
>
> + if (vdev->mediate_ops && vdev->mediate_ops->rw) {
> + int ret;
> + bool pt = true;
> +
> + ret = vdev->mediate_ops->rw(vdev->mediate_handle,
> + buf, count, ppos, iswrite, &pt);
> + if (!pt)
> + return ret;
> + }
> +
> switch (index) {
> case VFIO_PCI_CONFIG_REGION_INDEX:
> return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> @@ -1200,6 +1284,15 @@ static int vfio_pci_mmap(void *device_data, struct
vm_area_struct *vma)
> u64 phys_len, req_len, pgoff, req_start;
> int ret;
>
> + if (vdev->mediate_ops && vdev->mediate_ops->mmap) {
> + int ret;
> + bool pt = true;
> +
> + ret = vdev->mediate_ops->mmap(vdev->mediate_handle, vma, &pt);
> + if (!pt)
> + return ret;
> + }
There must be a better way to do all these. Do we really want to call
into ops for every rw or mmap, have the vendor code decode a region,
and maybe or maybe not have it handle it? It's pretty ugly. Do we
do you think below flow is good ?
1. in mediate_ops->open(), return
(1) region[] indexed by region index, if a mediate driver supports mediating
region[i], region[i].ops->get_region_info, regions[i].ops->rw, or
regions[i].ops->mmap is not null.
(2) irq_info[] indexed by irq index, if a mediate driver supports mediating
irq_info[i], irq_info[i].ops->get_irq_info or irq_info[i].ops->set_irq_info
is not null.
Then, vfio_pci_rw/vfio_pci_mmap/vfio_pci_ioctl only call into those
non-null hooks.
need the mediation provider to be able to dynamically setup the ops
per
May I confirm that you are not saying dynamic registering mediate ops
after vfio-pci already opened a device, right?
region and export the default handlers out for them to call?
could we still keep checking return value of the hooks rather than
export default handlers? Otherwise at least vfio_pci_default_ioctl(),
vfio_pci_default_rw(), and vfio_pci_default_mmap() need to be exported.
> +
> index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> if (vma->vm_end < vma->vm_start)
> @@ -1629,8 +1722,17 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device
*vdev)
>
> static void __exit vfio_pci_cleanup(void)
> {
> + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> +
> pci_unregister_driver(&vfio_pci_driver);
> vfio_pci_uninit_perm_bits();
> +
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> + list_del(&mentry->next);
> + kfree(mentry);
> + }
> + mutex_unlock(&mediate_ops_list_lock);
Is it even possible to unload vfio-pci while there are mediation
drivers registered? I don't think the module interactions are well
thought out here, ex. do you really want i40e to have build and runtime
dependencies on vfio-pci? I don't think so.
Currently, yes, i40e has build dependency on vfio-pci.
It's like this, if i40e decides to support SRIOV and compiles in vf
related code who depends on vfio-pci, it will also have build dependency
on vfio-pci. isn't it natural?
> }
>
> static void __init vfio_pci_fill_ids(void)
> @@ -1697,6 +1799,50 @@ static int __init vfio_pci_init(void)
> return ret;
> }
>
> +int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops)
> +{
> + struct vfio_pci_mediate_ops_list_entry *mentry;
> +
> + mutex_lock(&mediate_ops_list_lock);
> + mentry = kzalloc(sizeof(*mentry), GFP_KERNEL);
> + if (!mentry) {
> + mutex_unlock(&mediate_ops_list_lock);
> + return -ENOMEM;
> + }
> +
> + mentry->ops = ops;
> + mentry->refcnt = 0;
It's kZalloc'd, this is unnecessary.
right :)
> + list_add(&mentry->next, &mediate_ops_list);
Check for duplicates?
ok. will do it.
> +
> + pr_info("registered dm ops %s\n", ops->name);
> + mutex_unlock(&mediate_ops_list_lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(vfio_pci_register_mediate_ops);
> +
> +void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops)
> +{
> + struct vfio_pci_mediate_ops_list_entry *mentry, *n;
> +
> + mutex_lock(&mediate_ops_list_lock);
> + list_for_each_entry_safe(mentry, n, &mediate_ops_list, next) {
> + if (mentry->ops != ops)
> + continue;
> +
> + mentry->refcnt--;
Whose reference is this removing?
I intended to prevent mediate driver from calling unregister mediate ops
while there're still opened devices in it.
after a successful mediate_ops->open(), mentry->refcnt++.
after calling mediate_ops->release(). mentry->refcnt--.
(seems in this RFC, I missed a mentry->refcnt-- after calling
mediate_ops->release())
> + if (!mentry->refcnt) {
> + list_del(&mentry->next);
> + kfree(mentry);
> + } else
> + pr_err("vfio_pci unregister mediate ops %s error\n",
> + mentry->ops->name);
This is bad, we should hold a reference to the module providing these
ops for each use of it such that the module cannot be removed while
it's in use. Otherwise we enter a very bad state here and it's
trivially accessible by an admin remove the module while in use.
mediate driver is
supposed to ref its own module on a success
mediate_ops->open(), and deref its own module on mediate_ops->release().
so, it can't be accidentally removed.
Thanks
Yan
Thanks,
Alex
> + }
> + mutex_unlock(&mediate_ops_list_lock);
> +
> +}
> +EXPORT_SYMBOL(vfio_pci_unregister_mediate_ops);
> +
> module_init(vfio_pci_init);
> module_exit(vfio_pci_cleanup);
>
> diff --git a/drivers/vfio/pci/vfio_pci_private.h
b/drivers/vfio/pci/vfio_pci_private.h
> index ee6ee91718a4..bad4a254360e 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -122,6 +122,8 @@ struct vfio_pci_device {
> struct list_head dummy_resources_list;
> struct mutex ioeventfds_lock;
> struct list_head ioeventfds_list;
> + struct vfio_pci_mediate_ops *mediate_ops;
> + u32 mediate_handle;
> };
>
> #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711a2800..0265e779acd1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -195,4 +195,20 @@ extern int vfio_virqfd_enable(void *opaque,
> void *data, struct virqfd **pvirqfd, int fd);
> extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>
> +struct vfio_pci_mediate_ops {
> + char *name;
> + int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
> + void (*release)(int handle);
> + void (*get_region_info)(int handle,
> + struct vfio_region_info *info,
> + struct vfio_info_cap *caps,
> + struct vfio_region_info_cap_type *cap_type);
> + ssize_t (*rw)(int handle, char __user *buf,
> + size_t count, loff_t *ppos, bool iswrite, bool *pt);
> + int (*mmap)(int handle, struct vm_area_struct *vma, bool *pt);
> +
> +};
> +extern int vfio_pci_register_mediate_ops(struct vfio_pci_mediate_ops *ops);
> +extern void vfio_pci_unregister_mediate_ops(struct vfio_pci_mediate_ops *ops);
> +
> #endif /* VFIO_H */