On Thu, Dec 12, 2019 at 11:07:42AM +0800, Alex Williamson wrote:
On Wed, 11 Dec 2019 21:02:40 -0500
Yan Zhao <yan.y.zhao(a)intel.com> wrote:
> On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > On Wed, 11 Dec 2019 01:25:55 -0500
> > Yan Zhao <yan.y.zhao(a)intel.com> wrote:
> >
> > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > Yan Zhao <yan.y.zhao(a)intel.com> wrote:
> > > >
> > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > Yan Zhao <yan.y.zhao(a)intel.com> wrote:
> > > > > >
> > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex
Williamson wrote:
> > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > Yan Zhao <yan.y.zhao(a)intel.com> wrote:
> > > > > > > >
> > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800,
Alex Williamson wrote:
> > > > > > > > > > On Wed, 4 Dec 2019 22:26:50 -0500
> > > > > > > > > > Yan Zhao <yan.y.zhao(a)intel.com>
wrote:
> > > > > > > > > >
> > > > > > > > > > > Dynamic trap bar info region is a
channel for QEMU and vendor driver to
> > > > > > > > > > > communicate dynamic trap info. It
is of type
> > > > > > > > > > >
VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > >
VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > >
> > > > > > > > > > > This region has two fields: dt_fd
and trap.
> > > > > > > > > > > When QEMU detects a device regions
of this type, it will create an
> > > > > > > > > > > eventfd and write its eventfd id
to dt_fd field.
> > > > > > > > > > > When vendor drivre signals this
eventfd, QEMU reads trap field of this
> > > > > > > > > > > info region.
> > > > > > > > > > > - If trap is true, QEMU would
search the device's PCI BAR
> > > > > > > > > > > regions and disable all the sparse
mmaped subregions (if the sparse
> > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > - If trap is false, QEMU would
re-enable those subregions.
> > > > > > > > > > >
> > > > > > > > > > > A typical usage is
> > > > > > > > > > > 1. vendor driver first cuts its
bar 0 into several sections, all in a
> > > > > > > > > > > sparse mmap array. So initally,
all its bar 0 are passthroughed.
> > > > > > > > > > > 2. vendor driver specifys part of
bar 0 sections to be disablable.
> > > > > > > > > > > 3. on migration starts, vendor
driver signals dt_fd and set trap to true
> > > > > > > > > > > to notify QEMU disabling the bar 0
sections of disablable flags on.
> > > > > > > > > > > 4. QEMU disables those bar 0
section and hence let vendor driver be able
> > > > > > > > > > > to trap access of bar 0 registers
and make dirty page tracking possible.
> > > > > > > > > > > 5. on migration failure, vendor
driver signals dt_fd to QEMU again.
> > > > > > > > > > > QEMU reads trap field of this info
region which is false and QEMU
> > > > > > > > > > > re-passthrough the whole bar 0
region.
> > > > > > > > > > >
> > > > > > > > > > > Vendor driver specifies whether it
supports dynamic-trap-bar-info region
> > > > > > > > > > > through cap
VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > >
> > > > > > > > > > > If vfio-pci detects this cap, it
will create a default
> > > > > > > > > > > dynamic_trap_bar_info region on
behalf of vendor driver with region len=0
> > > > > > > > > > > and region->ops=null.
> > > > > > > > > > > Vvendor driver should override
this region's len, flags, rw, mmap in its
> > > > > > > > > > > vfio_pci_mediate_ops.
> > > > > > > > > >
> > > > > > > > > > TBH, I don't like this interface at
all. Userspace doesn't pass data
> > > > > > > > > > to the kernel via INFO ioctls. We have
a SET_IRQS ioctl for
> > > > > > > > > > configuring user signaling with
eventfds. I think we only need to
> > > > > > > > > > define an IRQ type that tells the user
to re-evaluate the sparse mmap
> > > > > > > > > > information for a region. The user
would enumerate the device IRQs via
> > > > > > > > > > GET_IRQ_INFO, find one of this type
where the IRQ info would also
> > > > > > > > > > indicate which region(s) should be
re-evaluated on signaling. The user
> > > > > > > > > > would enable that signaling via
SET_IRQS and simply re-evaluate the
> > > > > > > > > ok. I'll try to switch to this way.
Thanks for this suggestion.
> > > > > > > > >
> > > > > > > > > > sparse mmap capability for the
associated regions when signaled.
> > > > > > > > >
> > > > > > > > > Do you like the "disablable" flag
of sparse mmap ?
> > > > > > > > > I think it's a lightweight way for user
to switch mmap state of a whole region,
> > > > > > > > > otherwise going through a complete flow of
GET_REGION_INFO and re-setup
> > > > > > > > > region might be too heavy.
> > > > > > > >
> > > > > > > > No, I don't like the disable-able flag. At
what frequency do we expect
> > > > > > > > regions to change? It seems like we'd only
change when switching into
> > > > > > > > and out of the _SAVING state, which is rare. It
seems easy for
> > > > > > > > userspace, at least QEMU, to drop the entire mmap
configuration and
> > > > > > > ok. I'll try this way.
> > > > > > >
> > > > > > > > re-read it. Another concern here is how do we
synchronize the event?
> > > > > > > > Are we assuming that this event would occur when
a user switch to
> > > > > > > > _SAVING mode on the device? That operation is
synchronous, the device
> > > > > > > > must be in saving mode after the write to device
state completes, but
> > > > > > > > it seems like this might be trying to add an
asynchronous dependency.
> > > > > > > > Will the write to device_state only complete once
the user handles the
> > > > > > > > eventfd? How would the kernel know when the mmap
re-evaluation is
> > > > > > > > complete. It seems like there are gaps here that
the vendor driver
> > > > > > > > could miss traps required for migration because
the user hasn't
> > > > > > > > completed the mmap transition yet. Thanks,
> > > > > > > >
> > > > > > > > Alex
> > > > > > >
> > > > > > > yes, this asynchronous event notification will cause
vendor driver miss
> > > > > > > traps. But it's supposed to be of very short
period time. That's also a
> > > > > > > reason for us to wish the re-evaluation to be
lightweight. E.g. if it's
> > > > > > > able to be finished before the first iterate, it's
still safe.
> > > > > >
> > > > > > Making the re-evaluation lightweight cannot solve the race,
it only
> > > > > > masks it.
> > > > > >
> > > > > > > But I agree, the timing is not guaranteed, and so
it's best for kernel
> > > > > > > to wait for mmap re-evaluation to complete.
> > > > > > >
> > > > > > > migration_thread
> > > > > > > |->qemu_savevm_state_setup
> > > > > > > | |->ram_save_setup
> > > > > > > | | |->migration_bitmap_sync
> > > > > > > | | |->kvm_log_sync
> > > > > > > | | |->vfio_log_sync
> > > > > > > | |
> > > > > > > | |->vfio_save_setup
> > > > > > > | |->set_device_state(_SAVING)
> > > > > > > |
> > > > > > > |->qemu_savevm_state_pending
> > > > > > > | |->ram_save_pending
> > > > > > > | | |->migration_bitmap_sync
> > > > > > > | | |->kvm_log_sync
> > > > > > > | | |->vfio_log_sync
> > > > > > > | |->vfio_save_pending
> > > > > > > |
> > > > > > > |->qemu_savevm_state_iterate
> > > > > > > | |->ram_save_iterate //send pages
> > > > > > > | |->vfio_save_iterate
> > > > > > > ...
> > > > > > >
> > > > > > >
> > > > > > > Actually, we previously let qemu trigger the
re-evaluation when migration starts.
> > > > > > > And now the reason for we to wish kernel to trigger
the mmap re-evaluation is that
> > > > > > > there're other two possible use cases:
> > > > > > > (1) keep passing through devices when migration starts
and track dirty pages
> > > > > > > using hardware IOMMU. Then when migration is about
to complete, stop the
> > > > > > > device and start trap PCI BARs for software
emulation. (we made some
> > > > > > > changes to let device stop ahead of vcpu )
> > > > > >
> > > > > > How is that possible? I/O devices need to continue to work
until the
> > > > > > vCPU stops otherwise the vCPU can get blocked on the
device. Maybe QEMU
> > > > > hi Alex
> > > > > For devices like DSA [1], it can support SVM mode. In this mode,
when a
> > > > > page fault happens, the Intel DSA device blocks until the page
fault is
> > > > > resolved, if PRS is enabled; otherwise it is reported as an
error.
> > > > >
> > > > > Therefore, to pass through DSA into guest and do live migration
with it,
> > > > > it is desired to stop DSA before stopping vCPU, as there may be
an
> > > > > outstanding page fault to be resolved.
> > > > >
> > > > > During the period when DSA is stopped and vCPUs are still
running, all the
> > > > > pass-through resources are trapped and emulated by host
mediation driver until
> > > > > vCPUs stop.
> > > >
> > > > If the DSA is stopped and resources are trapped and emulated, then
is
> > > > the device really stopped from a QEMU perspective or has it simply
> > > > switched modes underneath QEMU? If the device is truly stopped,
then
> > > > I'd like to understand how a vCPU doing a PIO read from the
device
> > > > wouldn't wedge the VM.
> > > >
> > > It doesn't matter if the device is truly stopped or not (although
from
> > > my point of view, just draining commands and keeping device running is
> > > better as it handles live migration failure better).
> > > PIOs also need to be trapped and emulated if a vCPU accesses them.
> >
> > We seem to be talking around each other here. If PIOs are trapped and
> > emulated then the device is not "stopped" as far as QEMU is
concerned,
> > right? "Stopping" a device suggests to me that a running vCPU doing
a
> > PIO read from the device would block and cause problems in the still
> > running VM. So I think you're suggesting some sort of mode switch in
> > the device where direct access is disabled an emulation takes over
> > until the vCPUs are stopped.
>
> sorry for this confusion.
> yes, it's a kind of mode switch from a QEMU perspective.
> Currently, its implementation in our local branch is like that:
> 1. before migration thread stopping vCPUs, a migration state
> (COMPLETING) notification is sent to vfio migration state notifier, and
> this notifier would put device state to !RUNNING, and put all BARs to trap
> state.
> 2. in the kernel, when device state is set to !RUNNING, draining all
> pending device requests, and starts emulation.
>
> This implementation has two issues:
> 1. it requires hardcode in QEMU to put all BARs trapped and the time
> spending on revoking mmaps is not necessary for devices that do not need it.
> 2. !RUNNING state here is not accurate and it will confuse vendor
> drivers who stop devices after vCPUs stop.
>
> For the 2nd issue, I think we can propose a new device state like
> PRE-STOPPING.
Yes, this is absolutely abusing the !RUNNING state, if the device is
still processing accesses by the vCPU, it's still running.
> But for the 1st issue, not sure how to fix it right now.
> Maybe we can still add an asynchronous kernel notification and wait until
> QEMU have switched the region mmap state?
It seems like you're preemptively trying to optimize the SAVING state
before we even have migration working. Shouldn't SAVING be the point
at which you switch to trapping the device in order to track it?
Thanks,
But for some devices, start trapping on entering SAVING state is too
early. They don't really need the trapping until PRE_STOPPING stage.
E.g. for DSA, it can get dirty pages without trapping. The intention for
it to enter trap is not for SAVING, but for emulation.
Thanks
Yan