
Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +0000, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgalaxy@akamai.com) wrote:
> > One thing to keep in mind here (despite me not having any hardware to test) is that one of the original goals of the RDMA implementation was not raw throughput or raw latency, but the lack of CPU utilization in kernel space thanks to the offload. While it is entirely possible that newer hardware with TCP might compete, the significant reduction in TCP/IP stack CPU usage was a big win at the time.
> >
> > Just something to consider while you're doing the testing...
> I just noticed this thread; some random notes from a somewhat fragmented memory of this:
>
> a) Long long ago, I also tried rsocket: https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html; as I remember, the library was quite flaky at the time.
Hmm, interesting. It also looks like there's a thread doing rpoll(). By the way, not sure whether you noticed, but there's a series posted for the latest rsocket conversion here: https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gonglei@hua... I hope Lei and his team have tested >4G mem; otherwise it's definitely worth checking. Lei also mentioned in the cover letter that they found rsocket bugs, but I'm not sure what that's about.
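
For reference, rsockets (from librdmacm) are meant as a drop-in replacement for the socket API, so a receiver thread blocking in rpoll() looks almost exactly like a poll() loop. A minimal client sketch; the address and port are placeholders, and error handling is abbreviated:

/*
 * Minimal rsocket client sketch: same shape as BSD sockets, but every
 * call goes through librdmacm's rsockets layer (link with -lrdmacm).
 * The address and port below are placeholders.
 */
#include <stdio.h>
#include <sys/types.h>
#include <poll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(4444),                        /* placeholder */
    };
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr); /* placeholder */

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("rsocket/rconnect");
        return 1;
    }

    /* rpoll() mirrors poll(); a dedicated thread can block here until
     * the stream becomes readable. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (rpoll(&pfd, 1, 5000 /* ms */) > 0 && (pfd.revents & POLLIN)) {
        char buf[4096];
        ssize_t n = rrecv(fd, buf, sizeof(buf), 0);
        printf("received %zd bytes\n", n);
    }

    rclose(fd);
    return 0;
}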
> b) A lot of the complexity in the rdma migration code comes from emulating a stream to carry the migration control data and interleaving that with the actual RAM copy. I believe the original design used a separate TCP socket for the control data and just used RDMA for the data; that should be a lot simpler (but alas, it was rejected in review early on).
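
For what it's worth, the split design is easy to picture: a trivial control protocol over TCP plus an RDMA data channel. A purely hypothetical sketch; none of these names exist in QEMU:

/*
 * Hypothetical sketch of the split-channel design: control messages go
 * over a plain TCP stream, RAM pages go over RDMA writes keyed by
 * (guest address, length, rkey).  Purely illustrative, not QEMU code.
 */
#include <stdint.h>

enum ctrl_cmd {
    CTRL_RAM_BLOCK_REG,    /* advertise a RAM block and its rkey */
    CTRL_PAGE_RANGE,       /* describe pages about to be DMAed */
    CTRL_DONE,             /* RAM transfer finished, switch over */
};

struct ctrl_msg {
    uint32_t cmd;          /* enum ctrl_cmd */
    uint64_t guest_addr;   /* start of the range in guest physical space */
    uint64_t length;       /* bytes covered by this message */
    uint32_t rkey;         /* remote key for the RDMA data channel */
} __attribute__((packed));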
> c) I can't remember the last benchmarks I did, but I think I did manage to beat RDMA with multifd; but yes, multifd does eat host CPU, whereas RDMA barely uses a whisper.
I think my first impression on this matter came from you. :)
> d) The 'zero-copy-send' option in migrate may well get some of that CPU time back; but if I remember correctly, we were still bottlenecked on the receive side. (I can't remember if zero-copy-send worked with multifd?)
Yes, and zero-copy requires multifd for now. I think that's because we didn't want to complicate the header processing in the migration stream, where it may not be page aligned.
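
For context, zero-copy-send is built on the kernel's MSG_ZEROCOPY mechanism: the kernel pins the user pages instead of copying them and reports completion later on the socket error queue, so the buffer must stay untouched until then. That fits multifd's page-aligned, fixed-size payloads much better than the interleaved header stream. A rough sketch on a connected TCP socket; error handling is abbreviated and this is not QEMU code:

/*
 * Rough sketch of Linux MSG_ZEROCOPY on a connected TCP socket.
 * Error handling is abbreviated; illustrative only.
 */
#include <errno.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

static int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -errno;

    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -errno;

    /*
     * Completion arrives as a sock_extended_err on the error queue;
     * real code matches serr->ee_info/ee_data sequence numbers and
     * waits with poll(POLLERR) instead of spinning.
     */
    char ctrl[CMSG_SPACE(sizeof(struct sock_extended_err))];
    struct msghdr msg = {
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };
    while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 && errno == EAGAIN)
        continue;
    return 0;
}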
> e) Someone made a good suggestion (sorry, can't remember who): the RDMA migration structure was the wrong way around. It should be the destination which initiates an RDMA read, rather than the source doing a write; then things might become a LOT simpler. You just need to send page ranges to the destination and it can pull them. That might work nicely for postcopy.
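
For concreteness, a sketch of what the "destination pulls" idea could look like with libibverbs: given a page range advertised by the source (remote address plus rkey), the destination posts an RDMA READ into its own guest RAM mapping. The QP, local MR, and completion handling are assumed to be set up elsewhere; the function name is illustrative, not QEMU's:

#include <stdint.h>
#include <infiniband/verbs.h>

static int pull_page_range(struct ibv_qp *qp,
                           void *local_addr, uint32_t lkey,
                           uint64_t remote_addr, uint32_t rkey,
                           uint32_t length)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_addr,
        .length = length,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)local_addr,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED, /* completion = page landed */
    };
    struct ibv_send_wr *bad_wr = NULL;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);   /* 0 on success */
}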
I'm not sure whether it would still be a problem if the rdma recv side were based on zero-copy. It becomes a question of whether atomicity can be guaranteed, because we don't want the guest vCPUs to see a partially copied page during in-flight DMAs. UFFDIO_COPY (or a friend) is currently the only solution for that.
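
To illustrate why UFFDIO_COPY is the atomicity primitive here: the kernel fills the page and maps it in one step, so a vCPU faulting on that page is held until the copy completes and can never observe partial data. A minimal sketch, assuming uffd is a userfaultfd already registered over the guest RAM range:

#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int place_page_atomically(int uffd, void *guest_page,
                                 const void *src_page, size_t page_size)
{
    struct uffdio_copy copy = {
        .dst  = (unsigned long)guest_page,   /* must be page aligned */
        .src  = (unsigned long)src_page,
        .len  = page_size,
        .mode = 0,
    };

    /* Atomic from the guest's point of view: vCPUs faulting on this
     * page wait until the copy + map flip completes in the kernel. */
    return ioctl(uffd, UFFDIO_COPY, &copy) < 0 ? -1 : 0;
}

Thanks,

--
Peter Xu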