
Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +0000, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgalaxy@akamai.com) wrote:
> > One thing to keep in mind here (despite me not having any hardware to test) is that one of the original goals of the RDMA implementation was not raw throughput or raw latency, but the lack of CPU utilization in kernel space thanks to the offload. While it is entirely possible that newer hardware with TCP might compete, the significant reduction in TCP/IP stack CPU usage was a big win at the time.
> >
> > Just something to consider while you're doing the testing...
> I just noticed this thread; some random notes from a somewhat fragmented memory of this:
>
> a) Long long ago, I also tried rsocket: https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html; as I remember, the library was quite flaky at the time.
Hmm, interesting. It also looks like there's a thread doing rpoll(). By the way, not sure whether you noticed, but there's a series posted for the latest rsocket conversion here: https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gonglei@hua... I hope Lei and his team have tested >4G mem; otherwise it's definitely worth checking. Lei also mentioned in the cover letter that they found rsocket bugs, but I'm not sure what that's about.
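
For reference, rsockets (from librdmacm) are meant as a drop-in replacement for the socket API, so a receiver thread blocking in rpoll() looks almost exactly like a poll() loop. A minimal client sketch; the address and port are placeholders, and error handling is abbreviated:

/*
 * Minimal rsocket client sketch: same shape as BSD sockets, but every
 * call goes through librdmacm's rsockets layer (link with -lrdmacm).
 * The address and port below are placeholders.
 */
#include <stdio.h>
#include <sys/types.h>
#include <poll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(4444),                        /* placeholder */
    };
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr); /* placeholder */

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("rsocket/rconnect");
        return 1;
    }

    /* rpoll() mirrors poll(); a dedicated thread can block here until
     * the stream becomes readable. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (rpoll(&pfd, 1, 5000 /* ms */) > 0 && (pfd.revents & POLLIN)) {
        char buf[4096];
        ssize_t n = rrecv(fd, buf, sizeof(buf), 0);
        printf("received %zd bytes\n", n);
    }

    rclose(fd);
    return 0;
}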
> b) A lot of the complexity in the rdma migration code comes from emulating a stream to carry the migration control data and interleaving that with the actual RAM copy. I believe the original design used a separate TCP socket for the control data and just used RDMA for the data; that should be a lot simpler (but alas, it was rejected in review early on).
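
For what it's worth, the split design is easy to picture: a trivial control protocol over TCP plus an RDMA data channel. A purely hypothetical sketch; none of these names exist in QEMU:

/*
 * Hypothetical sketch of the split-channel design: control messages go
 * over a plain TCP stream, RAM pages go over RDMA writes keyed by
 * (guest address, length, rkey).  Purely illustrative, not QEMU code.
 */
#include <stdint.h>

enum ctrl_cmd {
    CTRL_RAM_BLOCK_REG,    /* advertise a RAM block and its rkey */
    CTRL_PAGE_RANGE,       /* describe pages about to be DMAed */
    CTRL_DONE,             /* RAM transfer finished, switch over */
};

struct ctrl_msg {
    uint32_t cmd;          /* enum ctrl_cmd */
    uint64_t guest_addr;   /* start of the range in guest physical space */
    uint64_t length;       /* bytes covered by this message */
    uint32_t rkey;         /* remote key for the RDMA data channel */
} __attribute__((packed));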
> c) I can't remember the last benchmarks I did, but I think I did manage to beat RDMA with multifd; but yes, multifd does eat host CPU, whereas RDMA barely uses a whisper.
I think my first impression on this matter came from you. :)
> d) The 'zero-copy-send' option in migrate may well get some of that CPU time back; but if I remember correctly, we were still bottlenecked on the receive side. (I can't remember if zero-copy-send worked with multifd?)
Yes, and zero-copy requires multifd for now. I think that's because we didn't want to complicate the header processing in the migration stream, where it may not be page aligned.
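
For context, zero-copy-send is built on the kernel's MSG_ZEROCOPY mechanism: the kernel pins the user pages instead of copying them and reports completion later on the socket error queue, so the buffer must stay untouched until then. That fits multifd's page-aligned, fixed-size payloads much better than the interleaved header stream. A rough sketch on a connected TCP socket; error handling is abbreviated and this is not QEMU code:

/*
 * Rough sketch of Linux MSG_ZEROCOPY on a connected TCP socket.
 * Error handling is abbreviated; illustrative only.
 */
#include <errno.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

static int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -errno;

    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -errno;

    /*
     * Completion arrives as a sock_extended_err on the error queue;
     * real code matches serr->ee_info/ee_data sequence numbers and
     * waits with poll(POLLERR) instead of spinning.
     */
    char ctrl[CMSG_SPACE(sizeof(struct sock_extended_err))];
    struct msghdr msg = {
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
    };
    while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0 && errno == EAGAIN)
        continue;
    return 0;
}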
> e) Someone made a good suggestion (sorry, can't remember who): the RDMA migration structure was the wrong way around. It should be the destination which initiates an RDMA read, rather than the source doing a write; then things might become a LOT simpler. You just need to send page ranges to the destination and it can pull them. That might work nicely for postcopy.
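
For concreteness, a sketch of what the "destination pulls" idea could look like with libibverbs: given a page range advertised by the source (remote address plus rkey), the destination posts an RDMA READ into its own guest RAM mapping. The QP, local MR, and completion handling are assumed to be set up elsewhere; the function name is illustrative, not QEMU's:

#include <stdint.h>
#include <infiniband/verbs.h>

static int pull_page_range(struct ibv_qp *qp,
                           void *local_addr, uint32_t lkey,
                           uint64_t remote_addr, uint32_t rkey,
                           uint32_t length)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_addr,
        .length = length,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)local_addr,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED, /* completion = page landed */
    };
    struct ibv_send_wr *bad_wr = NULL;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);   /* 0 on success */
}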
I'm not sure whether it would still be a problem if the rdma recv side were based on zero-copy. It becomes a question of whether atomicity can be guaranteed, because we don't want the guest vCPUs to see a partially copied page during in-flight DMAs. UFFDIO_COPY (or a friend) is currently the only solution for that.
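
To illustrate why UFFDIO_COPY is the atomicity primitive here: the kernel fills the page and maps it in one step, so a vCPU faulting on that page is held until the copy completes and can never observe partial data. A minimal sketch, assuming uffd is a userfaultfd already registered over the guest RAM range:

#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int place_page_atomically(int uffd, void *guest_page,
                                 const void *src_page, size_t page_size)
{
    struct uffdio_copy copy = {
        .dst  = (unsigned long)guest_page,   /* must be page aligned */
        .src  = (unsigned long)src_page,
        .len  = page_size,
        .mode = 0,
    };

    /* Atomic from the guest's point of view: vCPUs faulting on this
     * page wait until the copy + map flip completes in the kernel. */
    return ioctl(uffd, UFFDIO_COPY, &copy) < 0 ? -1 : 0;
}

Thanks,

--
Peter Xu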