On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> e) Someone made a good suggestion (sorry can't remember who) - that the
> RDMA migration structure was the wrong way around - it should be the
> destination which initiates an RDMA read, rather than the source
> doing a write; then things might become a LOT simpler; you just need
> to send page ranges to the destination and it can pull it.
> That might work nicely for postcopy.
I'm not sure whether it'll still be a problem if the rdma recv side is based
on zero-copy. It would be a matter of whether atomicity can be guaranteed,
so that the guest vcpus never see a partially copied page during in-flight
DMAs. UFFDIO_COPY (or friends) is currently the only solution for that.
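For reference, the atomic install mentioned above is the userfaultfd UFFDIO_COPY ioctl, which copies a buffer into a registered range and installs the PTE in a single step. A minimal sketch (the `run_copy_demo` helper and buffer names are illustrative, not QEMU code, and it skips gracefully where userfaultfd is unavailable or unprivileged):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Sketch: install one received page atomically with UFFDIO_COPY.
 * Returns 0 on success, or when the kernel lacks support (skip). */
int run_copy_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) {
        printf("userfaultfd unavailable, skipping\n");
        return 0;
    }

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) < 0)
        return 0;

    /* Anonymous region standing in for guest RAM. */
    char *guest = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guest == MAP_FAILED)
        return 1;

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)guest, .len = (unsigned long)page },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return 0;

    /* Buffer standing in for a page received off the wire. */
    char *recv_buf = malloc(page);
    memset(recv_buf, 0x5a, page);

    /* The copy and the PTE install happen inside one ioctl, so no
     * vcpu thread can ever observe a partially filled page. */
    struct uffdio_copy copy = {
        .dst = (unsigned long)guest,
        .src = (unsigned long)recv_buf,
        .len = (unsigned long)page,
    };
    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
        return 0;

    return guest[0] == 0x5a ? 0 : 1;
}
```

Note the copy: the data has to land in a private buffer first, which is exactly why UFFDIO_COPY can't be zero-copy.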
And when thinking about this (given UFFDIO_COPY's nature of not being able
to do zero-copy...), the only way this can do zero-copy is to use
file-backed memory (shmem/hugetlbfs), because the page cache can be
prepopulated: when we do DMA we write into the page cache, which can be
mapped at another virtual address besides the one the vcpus are using.
Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do an atomic
update of the vcpu pgtables, avoiding the copy. QEMU doesn't have that yet,
but it looks like one more reason we may want to make better use of
shmem than anonymous memory. I actually added CONTINUE support when
working on 4k faults on 1G hugetlb:
https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899...
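A sketch of the double-map idea described above (names are illustrative, and the body is guarded since UFFDIO_CONTINUE and shmem minor faults need Linux 5.13+ headers): the shmem page cache is populated through one mapping, where a zero-copy recv or DMA would land, and UFFDIO_CONTINUE then atomically installs the already-present page-cache page into the vcpu-facing mapping with no copy:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Sketch: populate the shmem page cache via an alias mapping, then use
 * UFFDIO_CONTINUE to map the same physical page into the "vcpu" view
 * without copying.  Returns 0 on success or when support is missing. */
int run_continue_demo(void)
{
#if !defined(UFFDIO_CONTINUE) || !defined(UFFD_FEATURE_MINOR_SHMEM)
    printf("headers lack UFFDIO_CONTINUE, skipping\n");
    return 0;
#else
    long page = sysconf(_SC_PAGESIZE);

    int mfd = memfd_create("guest-ram", 0);
    if (mfd < 0 || ftruncate(mfd, page) < 0)
        return 0;

    /* Two mappings of the same shmem page: `alias` is where received
     * data lands; `guest` is what the vcpus would use. */
    char *alias = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    char *guest = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    if (alias == MAP_FAILED || guest == MAP_FAILED)
        return 1;

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return 0;  /* unavailable or unprivileged: skip */

    struct uffdio_api api = {
        .api = UFFD_API, .features = UFFD_FEATURE_MINOR_SHMEM,
    };
    if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
        !(api.features & UFFD_FEATURE_MINOR_SHMEM))
        return 0;

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)guest, .len = (unsigned long)page },
        .mode = UFFDIO_REGISTER_MODE_MINOR,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return 0;

    /* "Zero-copy recv": the data goes straight into the page cache
     * through the alias mapping. */
    memset(alias, 0x5a, page);

    /* Atomically install the existing page-cache page into the vcpu
     * mapping; no data is copied. */
    struct uffdio_continue cont = {
        .range = { .start = (unsigned long)guest, .len = (unsigned long)page },
    };
    if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0)
        return 0;

    return guest[0] == 0x5a ? 0 : 1;
#endif
}
```

The same pattern would let an RDMA or multifd recv path target the alias mapping directly, with CONTINUE doing only the atomic pgtable update.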
Maybe it's worthwhile on its own now, because it also means we can use it
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue here of directly copying data
into guest pages). It should also work with things like rdma, I think, in
similar ways. It's just that it won't work on anonymous memory.
I've definitely hijacked the thread to somewhere too far away. I'll stop
here..
Thanks,
--
Peter Xu