Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2 May 2024

      On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
...
On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
...
Peter Xu <peterx@redhat.com> writes:
...
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
...
Hi All (and Peter),
Hi, Michael,
...
My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
(highly irregular for a male) and yes, that's my real last name:
https://www.linkedin.com/in/mrgalaxy/)
I'm the original author of the RDMA implementation. I've been discussing
with Yu Zhang for a little bit about potentially handing over maintainership
of the codebase to his team.
I simply have zero access to RoCE or Infiniband hardware at all,
unfortunately. so I've never been able to run tests or use what I wrote at
work, and as all of you know, if you don't have a way to test something,
then you can't maintain it.
Yu Zhang put a (very kind) proposal forward to me to ask the community if
they feel comfortable training his team to maintain the codebase (and run
tests) while they learn about it.
The "while learning" part is fine at least to me.  IMHO the "ownership" to
the code, or say, taking over the responsibility, may or may not need 100%
mastering the code base first.  There should still be some fundamental
confidence to work on the code though as a starting point, then it's about
serious use case to back this up, and careful testings while getting more
familiar with it.
How much experience we expect of maintainers depends on the subsystem
and other circumstances.  The hard requirement isn't experience, it's
trust.  See the recent attack on xz.
I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
I'm merely reminding y'all what's at stake.
I think we shouldn't overly obsess[1] about 'xz', because the overwhealmingly
common scenario is that volunteer maintainers are honest people. QEMU is
in a massively better peer review situation. With xz there was basically no
oversight of the new maintainer. With QEMU, we have oversight from 1000's
of people on the list, a huge pool of general maintainers, the specific
migration maintainers, and the release manager merging code.
With a lack of historical experiance with QEMU maintainership, I'd suggest
that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
file with only the 'Reviewer' classification. The main migration maintainers
would still handle pull requests, but wait for a R-b from one of the RMDA
volunteers. After some period of time the RDMA folks could graduate to full
maintainer status if the migration maintainers needed to reduce their load.
I suspect that might prove unneccesary though, given RDMA isn't an area of
code with a high turnover of patches.
Right, and we can do that as a start, it also follows our normal rules of
starting from Reviewers to maintain something.  I even considered Zhijian
to be the previous rdma goto guy / maintainer no matter what role he used
to have in the MAINTAINERS file.

Here IMHO it's more about whether any company would like to stand up and
provide help, without yet binding that to be able to send pull requests in
the near future or even longer term.

What I worry more is whether this is really what we want to keep rdma in
qemu, and that's also why I was trying to request for some serious
performance measurements comparing rdma v.s. nics.  And here when I said
"we" I mean both QEMU community and any company that will support keeping
rdma around.

The problem is if NICs now are fast enough to perform at least equally
against rdma, and if it has a lower cost of overall maintenance, does it
mean that rdma migration will only be used by whoever wants to keep them in
the products and existed already?  In that case we should simply ask new
users to stick with tcp, and rdma users should only drop but not increase.

It seems also destined that most new migration features will not support
rdma: see how much we drop old features in migration now (which rdma
_might_ still leverage, but maybe not), and how much we add mostly multifd
relevant which will probably not apply to rdma at all.  So in general what
I am worrying is a both-loss condition, if the company might be easier to
either stick with an old qemu (depending on whether other new features are
requested to be used besides RDMA alone), or do periodic rebase with RDMA
downstream only.

So even if we want to keep RDMA around I hope with this chance we can at
least have clear picture on when we should still suggest any new user to
use RDMA (with the reasons behind).  Or we simply shouldn't suggest any new
user to use RDMA at all (because at least it'll lose many new migration
features).

Thanks,

-- 
Peter Xu