Hi Peter,
On Tue, Apr 9, 2024 at 9:47 PM Peter Xu <peterx(a)redhat.com> wrote:
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
>
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu <peterx(a)redhat.com> wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu <peterx(a)redhat.com> wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter and Zhijian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit
> > > > > surprised at the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late. Since it looks like we do have users not yet notified
> > > > of this, we'll redo the deprecation procedure even if it'll be the final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and
> > > > > > whether they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration lacked testing (unit/CI tests), which led to a
> > > > > > few obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test
> > > > > coverage for this part. As soon as 8.2 was released, I saw that many of
> > > > > the migration test cases failed and came to realize that there might be
> > > > > a bug between 8.1 and 8.2, but was unable to confirm and report it
> > > > > quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly; it's just that we need real users of
> > > > RDMA taking some care of it. Having it broken easily for >1 releases is
> > > > definitely a sign of a lack of users. It is a signal to the community
> > > > that we should consider dropping some features so that we can make the
> > > > best use of the community's resources for the things that may have a
> > > > broader audience.
> > > >
> > > > One thing majorly missing is an RDMA tester to guard all the merges from
> > > > breaking the RDMA paths, hopefully in CI. That should not rely on RDMA
> > > > hardware but just sanity-check that the migration+rdma code runs fine.
> > > > RDMA taught us that lesson, so we're requesting CI coverage for all other
> > > > new features to be merged, at least in the migration subsystem; the plan
> > > > is not to merge anything that is not covered by CI in the future unless
> > > > extremely necessary.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start
> > > > with it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will force a few QEMU users (not sure how many)
> > > > > like us either to stick to RDMA migration by using an increasingly older
> > > > > version of QEMU, or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway. If there's a specific use case for
> > > > RDMA migration, would it work if such a scenario uses the old binary? Is
> > > > it possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCAs from Nvidia for years; our
> > > experience is that RDMA migration works better than tcp (over ipoib).
> >
> > Please bear with me, as I know little about rdma stuff.
> >
> > I'm actually pretty confused (and have been for a long time..) about why we
> > need to operate with rdma contexts when ipoib seems to provide all the tcp
> > layers. I mean, can it work with the current "tcp:" protocol with ipoib
> > even if there's rdma/ib hardware underneath? Is it because of performance
> > improvements that we must use a separate path compared to the generic
> > "tcp:" protocol here?
> Using the rdma protocol with ib verbs, we can leverage the full benefit of
> RDMA by talking directly to the NIC, which bypasses the kernel overhead and
> gives lower cpu utilization and better performance.
>
> IPoIB, on the other hand, is more for compatibility with applications using
> tcp, and can't get the full benefit of RDMA. When you have mixed generations
> of IB devices, there are performance issues on IPoIB: we've seen a 40G HCA
> reach only 2 Gb/s on IPoIB, while with raw RDMA it can reach full line speed.
>
> I just ran a simple iperf3 test via ipoib and ib_send_bw on the same hosts:
>
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> -----------------------------------------------------------
> Server listening on 5201
> -----------------------------------------------------------
> Time: Tue, 09 Apr 2024 06:55:02 GMT
> Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
> Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
> TCP MSS: 0 (default)
> [ 5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
> 2a02:247f:401:4:2:0:b:3 port 41136
> Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
> 0 seconds, 10 second test, tos 0
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.80 GBytes 15.5 Gbits/sec
> [ 5] 1.00-2.00 sec 1.85 GBytes 15.9 Gbits/sec
> [ 5] 2.00-3.00 sec 1.88 GBytes 16.2 Gbits/sec
> [ 5] 3.00-4.00 sec 1.87 GBytes 16.1 Gbits/sec
> [ 5] 4.00-5.00 sec 1.88 GBytes 16.2 Gbits/sec
> [ 5] 5.00-6.00 sec 1.93 GBytes 16.6 Gbits/sec
> [ 5] 6.00-7.00 sec 2.00 GBytes 17.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.93 GBytes 16.6 Gbits/sec
> [ 5] 8.00-9.00 sec 1.86 GBytes 16.0 Gbits/sec
> [ 5] 9.00-10.00 sec 1.95 GBytes 16.8 Gbits/sec
> [ 5] 10.00-10.04 sec 85.2 MBytes 17.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> Test Complete. Summary Results:
> [ ID] Interval Transfer Bitrate
> [ 5] (sender statistics not available)
> [ 5] 0.00-10.04 sec 19.0 GBytes 16.3 Gbits/sec receiver
> rcv_tcp_congestion cubic
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> -----------------------------------------------------------
> Server listening on 5201
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> 1 jwang@ps404a-3.stg:~$ sudo ib_send_bw -F -a
>
> ************************************
> * Waiting for client to connect... *
> ************************************
>
> ---------------------------------------------------------------------------------------
> Send BW Test
> Dual-port : OFF Device : mlx5_0
> Number of qps : 1 Transport type : IB
> Connection type : RC Using SRQ : OFF
> PCIe relax order: ON
> ibv_wr* API : ON
> RX depth : 512
> CQ Moderation : 100
> Mtu : 4096[B]
> Link type : IB
> Max inline data : 0[B]
> rdma_cm QPs : OFF
> Data ex. method : Ethernet
>
> ---------------------------------------------------------------------------------------
> local address: LID 0x24 QPN 0x0174 PSN 0x300138
> remote address: LID 0x17 QPN 0x004a PSN 0xc54d6f
>
> ---------------------------------------------------------------------------------------
> #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
> 2 1000 0.00 6.46 3.385977
> 4 1000 0.00 10.38 2.721894
> 8 1000 0.00 25.69 3.367830
> 16 1000 0.00 41.46 2.716859
> 32 1000 0.00 102.98 3.374577
> 64 1000 0.00 206.12 3.377053
> 128 1000 0.00 405.03 3.318007
> 256 1000 0.00 821.52 3.364939
> 512 1000 0.00 2150.78 4.404803
> 1024 1000 0.00 4288.13 4.391044
> 2048 1000 0.00 8518.25 4.361346
> 4096 1000 0.00 11440.77 2.928836
> 8192 1000 0.00 11526.45 1.475385
> 16384 1000 0.00 11526.06 0.737668
> 32768 1000 0.00 11524.86 0.368795
> 65536 1000 0.00 11331.84 0.181309
> 131072 1000 0.00 11524.75 0.092198
> 262144 1000 0.00 11525.82 0.046103
> 524288 1000 0.00 11524.70 0.023049
> 1048576 1000 0.00 11510.84 0.011511
> 2097152 1000 0.00 11524.58 0.005762
> 4194304 1000 0.00 11514.26 0.002879
> 8388608 1000 0.00 11511.01 0.001439
>
> ---------------------------------------------------------------------------------------
>
> You can see that with ipoib it reaches ~16 Gbits/sec using TCP (1 stream,
> 131072-byte blocks), while with RDMA it reaches ~11,500 MB/sec, close to the
> 100 Gb/s line rate, at 4k+ message sizes.
I get it now, thank you!
>
>
> >
> > >
> > > Switching back to TCP would lead us back to the old problems which were
> > > solved by RDMA migration.
> >
> > Can you elaborate on the problems, and why tcp won't work in this case? They
> > may not be directly relevant to the issue we're discussing, but I'm happy
> > to learn more.
> >
> > What are the NICs you were testing with before? Was the test carried out
> > with modern ones (50Gbps-200Gbps NICs), or was it done when such hardware
> > was not common?
> We use Mellanox/NVidia IB HCAs from 40 Gb/s to 200 Gb/s, mixed
> generations, across the globe.
> >
> > Per my recent knowledge of the new Intel hardware, at least the parts that
> > support QPL, it's easy to achieve 50Gbps+ on a single core.
> In good cases, I've also seen 50 Gbps+ on Mellanox HCAs.
I see. Have you compared the HCAs vs. the modern NICs? NICs can now achieve
similar performance on paper, as I said; I am not sure how they perform in
real life, but it may be worth trying. I only tried a 100G NIC and I remember
I could hit 70+Gbps with multifd migrations at peak bandwidth. Have you tried
that before?
Yes, I recently tried a 100 G Ethernet NIC, but only with iperf, not yet with
qemu migration. Yes, iperf can reach 90 Gbps with multiple streams.
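For reference, a rough sketch of how a multifd TCP run could be compared
against the rdma path from the QEMU monitor once we get to it; the
destination address, port and channel count below are just placeholders:
  # destination side, started with either of:
  #   qemu-system-x86_64 ... -incoming tcp:<dest-ip>:4444
  #   qemu-system-x86_64 ... -incoming rdma:<dest-ip>:4444
  # source side, multifd over TCP:
  (qemu) migrate_set_capability multifd on
  (qemu) migrate_set_parameter multifd-channels 8
  (qemu) migrate -d tcp:<dest-ip>:4444
  # or, in a separate run, the rdma path for comparison:
  (qemu) migrate -d rdma:<dest-ip>:4444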
Note that here I didn't want to compare the performance between the two and
find a winner. The issue we're facing now is that RDMA migration mostly has
its own path all over the place, while the other protocols (socket, fd, file,
etc.) all share the same code.
Then, _if_ modern NICs can work similarly to rdma, I don't yet see a good
reason to keep it. It could be that technology just improved so we can use
less code to do as well. It's good news if we can help QEMU evolve by
dropping unused code.
For some details on the rdma complications for migration:
(1) RDMA is the only protocol that doesn't yet support QIOChannel, while
migration uses QIOChannels mostly everywhere now, e.g. in multifd. It
means it won't easily support any new things built on QIOChannels.
(2) RDMA is the only protocol that is mostly hard-coded everywhere in the
RAM migration code, polluting the core logic with much more code
internally to support this protocol.
For (1), see migrate_fd_connect() from rdma_start_outgoing_migration(),
while the other protocols all go via migration_channel_connect().
For (2), see all the "rdma_*" functions in migration/ram.c; I don't think
they are common to a protocol - most of the other protocols don't need
those hard-coded hooks. migration/rdma.c has 4000+ LOC for this, while,
to make a not-so-fair comparison, migration/fd.c only has <100 LOC.
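(For reference, the size gap can be eyeballed from a QEMU checkout with
something like the following; the exact numbers will of course vary by
version:
  wc -l migration/rdma.c migration/fd.c
  grep -c rdma migration/ram.c
i.e. count the lines of the two protocol files and the rdma-specific
references that leak into the generic RAM migration code.)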
Then, we found we don't even know who's using it.
I hope I explained why people started this idea, and also why I think that
makes sense at least to me.
Yes, I can understand that rdma migration has become more of a burden for
upstream maintainers.
> >
> >
> > https://lore.kernel.org/r/PH7PR11MB5941A91AC1E514BCC32896A6A3342@PH7PR11M...
> >
> > Quote from Yuan:
> >
> > Yes, I use iperf3 to check the bandwidth for one core; the bandwidth is
> > 60Gbps.
> > [ ID] Interval Transfer Bitrate Retr Cwnd
> > [ 5] 0.00-1.00 sec 7.00 GBytes 60.1 Gbits/sec 0 2.87 MBytes
> > [ 5] 1.00-2.00 sec 7.05 GBytes 60.6 Gbits/sec 0 2.87 Mbytes
> >
> > And in the live migration test, a multifd thread's CPU utilization is
> > almost 100%
> >
> > It boils down to what the old problems with tcp were in the first place, though.
> Yeah, this is the key reason we use RDMA (lower cpu utilization and
> better performance).
> >
> > >
> > > >
> > > > Per our best knowledge, RDMA users are rare, so please let anyone know
> > > > if you are aware of such users. IIUC the major reason why RDMA stopped
> > > > being the trend is that the network is not like it was ten years ago; I
> > > > don't think I have good knowledge of RDMA or networking at all, but my
> > > > understanding is that it's pretty easy to fetch a modern NIC that
> > > > outperforms RDMA, so it may make little sense to maintain multiple
> > > > protocols, considering the RDMA migration code is so special that it
> > > > has the most custom code compared to other protocols.
> > > +cc some guys from Huawei.
> > >
> > > I'm surprised RDMA users are rare; I guess maybe many are just
> > > working with a different code base.
> >
> > Yes, please cc whoever might be interested (or surprised.. :) to know this,
> > and let's be open to all possibilities.
> >
> > I don't think it makes sense to deprecate a feature without a good reason
> > when there are a lot of users of it. However, there's always the resource
> > limitation issue we're facing, so it's still possible that this gets
> > deprecated if nobody is working on it in our upstream branch. Say, if
> > people use private branches anyway to support rdma without collaborating
> > upstream, keeping such a feature upstream may not make much sense either,
> > unless there's some way to collaborate. We'll see.
>
> Is there a document/link about the unit tests/CI for migration? Why are
> those tests missing?
> Is it hard or very special to set up an environment for that? Maybe we
> can help in this regard.
See tests/qtest/migration-test.c. We put most of our migration tests
there and that's covered in CI.
Yu is looking into that to see if we can run the CI on our side.
I think one major issue is CI systems don't normally have rdma devices.
Can the rdma migration test be carried out without real hardware?
As Zhijian mentioned, we can use SoftRoCE (rxe).
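For illustration, a rough sketch of what that could look like in a CI job,
assuming a recent kernel with the rdma_rxe module and the iproute2 "rdma"
tool; the netdev name is a placeholder, and the last step only shows where an
rdma-enabled migration test would slot in (no such coverage exists yet):
  # create a software RDMA device on top of an ordinary netdev
  sudo modprobe rdma_rxe
  sudo rdma link add rxe0 type rxe netdev eth0
  rdma link show   # rxe0 should now be listed as an RDMA link
  # then run the migration tests against the address bound to eth0, e.g.
  QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test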
> >
> > It seems there can still be people joining this discussion. I'll hold off
> > a bit on merging this patch to provide enough of a window for anyone to
> > chime in.
>
> Thx for discussion and understanding.
Thanks for all these inputs so far. They can help us take a wiser and
clearer step no matter which way we choose.
--
Peter Xu
Thx!