
On Wed, May 29, 2024 at 11:35 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
-----Original Message-----
From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
Sent: Wednesday, May 29, 2024 5:18 PM
To: Gonglei (Arei) <arei.gonglei@huawei.com>
Cc: Greg Sword <gregsword0@gmail.com>; Peter Xu <peterx@redhat.com>; Yu Zhang <yu.zhang@ionos.com>; Michael Galaxy <mgalaxy@akamai.com>; Elmar Gerdes <elmar.gerdes@ionos.com>; zhengchuan <zhengchuan@huawei.com>; Daniel P. Berrangé <berrange@redhat.com>; Markus Armbruster <armbru@redhat.com>; Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>; qemu-devel@nongnu.org; Yuval Shaia <yuval.shaia.ml@gmail.com>; Kevin Wolf <kwolf@redhat.com>; Prasanna Kumar Kalever <prasanna.kalever@redhat.com>; Cornelia Huck <cohuck@redhat.com>; Michael Roth <michael.roth@amd.com>; Prasanna Kumar Kalever <prasanna4324@gmail.com>; Paolo Bonzini <pbonzini@redhat.com>; qemu-block@nongnu.org; devel@lists.libvirt.org; Hanna Reitz <hreitz@redhat.com>; Michael S. Tsirkin <mst@redhat.com>; Thomas Huth <thuth@redhat.com>; Eric Blake <eblake@redhat.com>; Song Gao <gaosong@loongson.cn>; Marc-André Lureau <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>; Wainer dos Santos Moschetta <wainersm@redhat.com>; Beraldo Leal <bleal@redhat.com>; Pannengyuan <pannengyuan@huawei.com>; Xiexiangyou <xiexiangyou@huawei.com>; Fabiano Rosas <farosas@suse.de>; RDMA mailing list <linux-rdma@vger.kernel.org>; shefty@nvidia.com; Haris Iqbal <haris.iqbal@ionos.com>
Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

Hi Gonglei,
On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
-----Original Message-----
From: Greg Sword [mailto:gregsword0@gmail.com]
Sent: Wednesday, May 29, 2024 2:06 PM
To: Jinpu Wang <jinpu.wang@ionos.com>
Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

On Wed, May 29, 2024 at 12:33 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
Hi,
> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Tuesday, May 28, 2024 11:55 PM
>
> > > > Exactly, not so compelling, as I did it first only on servers
> > > > widely used for production in our data center. The network
> > > > adapters are
> > > >
> > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > BCM5720 2-port Gigabit Ethernet PCIe
> > >
> > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > > reasonable.
> > >
> > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15wVAqk81vXtKzx-LfJQ@mail.gmail.com/
> > >
> > > Appreciate a lot for everyone helping on the testings.
> > >
> > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > [ConnectX-5]
> > > >
> > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > Ethernet on these two hosts. One is standby while the other is
> > > > active.
> > > >
> > > > Now I'll try on a server with more recent Ethernet and
> > > > InfiniBand network adapters. One of them has:
> > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > >
> > > > The comparison between RDMA and TCP on the same NIC could make
> > > > more sense.
> > >
> > > It looks to me NICs are powerful now, but again as I mentioned I
> > > don't think it's a reason we need to deprecate rdma, especially if
> > > QEMU's rdma migration has the chance to be refactored using
> > > rsocket.
> > >
> > > Is there anyone who started looking into that direction? Would it
> > > make sense we start some PoC now?
> > >
> >
> > My team has finished the PoC refactoring which works well.
> >
> > Progress:
> > 1. Implement io/channel-rdma.c,
> > 2. Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> >    is successful,
> > 3. Remove the original code from migration/rdma.c,
> > 4. Rewrite the rdma_start_outgoing_migration and
> >    rdma_start_incoming_migration logic,
> > 5. Remove all rdma_xxx functions from migration/ram.c. (to prevent
> >    RDMA live migration from polluting the core logic of live
> >    migration),
> > 6. The soft-RoCE implemented by software is used to test the RDMA
> >    live migration. It's successful.
> >
> > We will be submit the patchset later.
>
> That's great news, thank you!
>
> --
> Peter Xu

For RDMA programming, the current mainstream approach is to use rdma_cm to establish a connection and then use verbs to transmit data.
rdma_cm and ibverbs each create their own FD, and the two FDs have different responsibilities: the rdma_cm fd notifies connection-establishment events, and the verbs fd notifies new CQEs. Polling the rdma_cm fd directly with poll/epoll reports only a POLLIN event, which means that an rdma_cm event has occurred. Polling the verbs fd directly likewise reports only a POLLIN event, which means that a new CQE has been generated.
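
To make the two-FD model above concrete, here is a minimal sketch, not taken from QEMU or rdma-core. It waits on both descriptors with plain poll() and reports which one became readable. The function name wait_for_rdma_events and the EV_CM / EV_CQ flags are hypothetical. In a real program cm_fd would be the fd of the rdma_event_channel and cq_fd the fd of the ibv_comp_channel; the sketch accepts any readable descriptor, so it can be tried with pipes and without RDMA hardware.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Bit flags reported by wait_for_rdma_events(). Hypothetical names. */
enum {
    EV_CM = 1 << 0, /* stand-in for "an rdma_cm event is pending" */
    EV_CQ = 1 << 1  /* stand-in for "a new CQE was generated"     */
};

/*
 * Wait on two descriptors at once and report which became readable.
 * Assumption: cm_fd plays the role of rdma_event_channel.fd and cq_fd
 * the role of ibv_comp_channel.fd; any readable fd (e.g. a pipe) works.
 * Returns a mask of EV_CM / EV_CQ, 0 on timeout, or -1 if poll() fails.
 */
static int wait_for_rdma_events(int cm_fd, int cq_fd, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = cm_fd, .events = POLLIN },
        { .fd = cq_fd, .events = POLLIN },
    };
    int mask = 0;

    assert(cm_fd >= 0 && cq_fd >= 0);

    if (poll(fds, 2, timeout_ms) < 0)
        return -1;
    if (fds[0].revents & POLLIN)
        mask |= EV_CM;
    if (fds[1].revents & POLLIN)
        mask |= EV_CQ;
    return mask;
}
```

An event loop that is handed only one of the two descriptors can never observe the other flag, which is the limitation discussed in the rest of this mail.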
Rsocket is a sub-module of the rdma_cm library that provides RDMA calls closely mirroring the socket interfaces. However, the library returns only the rdma_cm fd, which is used for listening to link-setup events, and does not expose the verbs fd (whose readable and writable events are needed for listening to data). Only the rpoll interface provided by rsocket can be used to listen to those events. QEMU, however, uses the ppoll interface on the rdma_cm fd (obtained through the raccept API), so it cannot listen to verbs fd events.

I'm confused. Look at rs_poll_arm: https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L3290
For STREAM, rpoll sets up fds for both the cq fd and the cm fd.

Do you guys have any ideas? Thanks.
+cc linux-rdma
Why include rdma community?
Can rdma/rsocket provide an API to expose the verbs fd?

Why do we need the verbs fd? It looks like rsocket already handles any new completion during rsend/rrecv via rs_get_comp.

Actually, I gave the reason in the previous mail. Here are some of the relevant header declarations:
/* verbs.h */
struct ibv_comp_channel {
    struct ibv_context *context;
    int fd;
    int refcnt;
};

/* rdma_cma.h */
struct rdma_event_channel {
    int fd;
};

/* rdma_cma.h */
struct rdma_cm_id {
    struct ibv_context *verbs;
    struct rdma_event_channel *channel;       // ==> it can be gotten by rsocket.h
    void *context;
    struct ibv_qp *qp;
    struct rdma_route route;
    enum rdma_port_space ps;
    uint8_t port_num;
    struct rdma_cm_event *event;
    struct ibv_comp_channel *send_cq_channel; // ==> can't be gotten so that Qemu can't read the CQE data

OK, but send_cq_channel is set to the same channel as recv_cq_channel: https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L855 and the same CQ is also used as both recv_cq and send_cq.

    struct ibv_cq *send_cq;
    struct ibv_comp_channel *recv_cq_channel;
    struct ibv_cq *recv_cq;
    struct ibv_srq *srq;
    struct ibv_pd *pd;
    enum ibv_qp_type qp_type;
};

/* rsocket.h */
int raccept(int socket, struct sockaddr *addr, socklen_t *addrlen);
int rpoll(struct pollfd *fds, nfds_t nfds, int timeout);

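
As an illustration of the point about send_cq_channel and recv_cq_channel being the same object inside rsocket, the sketch below counts how many distinct fds an event loop would need to watch for one connection. It is only a sketch under stated assumptions. The mock_* structs are hypothetical, trimmed mirrors of the declarations listed above, not the real rdma-core types, and collect_watch_fds is an invented helper, not an rsocket API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed mirrors of the real structs; only the fields used here. */
struct mock_comp_channel { int fd; };
struct mock_event_channel { int fd; };
struct mock_cm_id {
    struct mock_event_channel *channel;
    struct mock_comp_channel *send_cq_channel;
    struct mock_comp_channel *recv_cq_channel;
};

/*
 * Collect the distinct fds an event loop would have to watch for one
 * connection. Writes up to 3 fds into out[] and returns how many were written.
 */
static size_t collect_watch_fds(const struct mock_cm_id *id, int out[3])
{
    size_t n = 0;

    assert(id && id->channel && id->recv_cq_channel && id->send_cq_channel);

    out[n++] = id->channel->fd;          /* connection events */
    out[n++] = id->recv_cq_channel->fd;  /* completion events */
    /* Skip the send channel when it aliases the receive channel. */
    if (id->send_cq_channel != id->recv_cq_channel)
        out[n++] = id->send_cq_channel->fd;
    return n;
}
```

With both channel pointers aliased, the helper returns two fds: the cm fd and one shared completion fd. Whether those fds are reachable from outside rsocket is exactly the open question in this thread.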
Another question on my mind: Daniel suggested a slightly different way of using rsocket: https://lore.kernel.org/qemu-devel/ZjtOreamN8xF9FDE@redhat.com/
Have you considered that?

We did use the rsocket APIs to refactor the RDMA code in QEMU, and that is where we encountered this issue.

Regards,
-Gonglei