Hi Peter
On Thu, May 2, 2024 at 6:20 PM Peter Xu <peterx(a)redhat.com> wrote:
> On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> > Hi Michael, Hi Peter,
> >
> >
> > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy <mgalaxy(a)akamai.com> wrote:
> > >
> > > Yu Zhang / Jinpu,
> > >
> > > Any possibility (at your leisure, and within the disclosure rules of
> > > your company, IONOS) if you could share any of your performance
> > > information to educate the group?
> > >
> > > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > > their disposal. Some people don't.
> > Our staging env is with 100 Gb/s IB environment.
> > We will have a new setup in the coming months with Ethernet (RoCE), we
> > will run some performance comparison when we have the environment ready.
> Thanks both. Please keep us posted.
>
> Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
> involved, am I right?
Kind of. Our new hardware is RDMA capable, so we can configure it to run
either the "rdma" transport or the "tcp" transport, which makes for a more
direct comparison. When the "rdma" transport is used, RoCE is involved,
i.e. the rdma-core/ibverbs/rdmacm/vendor verbs driver stack is exercised.
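
To make that concrete, here is a rough, untested sketch of the librdmacm
connection setup that the "rdma" transport path ultimately sits on top of.
The host, port, queue sizes and file name below are just placeholders for
this mail; it is not QEMU code. Whether the fabric underneath is IB or RoCE
only changes the vendor driver, this API surface stays the same:

/*
 * Sketch only, not QEMU code: minimal active-side connection setup
 * through librdmacm, the layer the "rdma" transport relies on.
 * Build (assumption): gcc rdma_cm_sketch.c -lrdmacm -libverbs
 */
#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(int argc, char **argv)
{
    /* Placeholder peer address/port, not a description of our setup. */
    const char *host = argc > 1 ? argv[1] : "192.0.2.1";
    const char *port = argc > 2 ? argv[2] : "4444";

    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP };
    struct rdma_addrinfo *res = NULL;
    struct rdma_cm_id *id = NULL;
    struct rdma_conn_param conn = { .retry_count = 7 };
    struct ibv_qp_init_attr attr = {
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* reliable connection */
    };

    /* Address resolution works the same way on IB and RoCE. */
    if (rdma_getaddrinfo(host, port, &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }
    /* Create the CM id plus QP in one step (synchronous rdmacm API). */
    if (rdma_create_ep(&id, res, NULL, &attr)) {
        perror("rdma_create_ep");
        rdma_freeaddrinfo(res);
        return 1;
    }
    /* From here on, data moves via ibverbs work requests on the QP. */
    if (rdma_connect(id, &conn)) {
        perror("rdma_connect");
    } else {
        printf("connected to %s:%s through the verbs stack\n", host, port);
        rdma_disconnect(id);
    }
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}

So the RDMA side of the comparison should stay apples to apples; only the
link layer differs between the IB staging setup and the new RoCE one.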
> The other note is that the comparison needs to be with multifd enabled for
> the "tcp:" case. I'd suggest we start with 8 threads if it's 100Gbps.
>
> I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
> without even waiting for months. If you want I can try to see how we can
> test together. And btw I don't think we need a cluster, IIUC we simply
> need two hosts, 100G nic on both sides? IOW, it seems to me we only need
> two cards just for experiments, systems that can drive the cards, and a
> wire supporting 100G?
Yes, the simple setup can be just two hosts directly connected. This reminds
me that I may also be able to find a test setup with a 100G NIC in our lab;
I will keep you posted.

Regards!
>
> >
> > >
> > > - Michael
> >
> > Thx!
> > Jinpu
> > >
> > > On 5/1/24 11:16, Peter Xu wrote:
> > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > > >>> What I worry more is whether this is really what we want to keep
> > > >>> rdma in qemu, and that's also why I was trying to request for some
> > > >>> serious performance measurements comparing rdma v.s. nics. And here
> > > >>> when I said "we" I mean both QEMU community and any company that
> > > >>> will support keeping rdma around.
> > > >>>
> > > >>> The problem is if NICs now are fast enough to perform at least
> > > >>> equally against rdma, and if it has a lower cost of overall
> > > >>> maintenance, does it mean that rdma migration will only be used by
> > > >>> whoever wants to keep them in the products and existed already? In
> > > >>> that case we should simply ask new users to stick with tcp, and rdma
> > > >>> users should only drop but not increase.
> > > >>>
> > > >>> It seems also destined that most new migration features will not
> > > >>> support rdma: see how much we drop old features in migration now
> > > >>> (which rdma _might_ still leverage, but maybe not), and how much we
> > > >>> add mostly multifd relevant which will probably not apply to rdma at
> > > >>> all. So in general what I am worrying is a both-loss condition, if
> > > >>> the company might be easier to either stick with an old qemu
> > > >>> (depending on whether other new features are requested to be used
> > > >>> besides RDMA alone), or do periodic rebase with RDMA downstream only.
> > > >> I don't know much about the origins of RDMA support in QEMU and why
> > > >> this particular design was taken. It is indeed a huge maint burden to
> > > >> have a completely different code flow for RDMA with 4000+ lines of
> > > >> custom protocol signalling which is barely understandable.
> > > >>
> > > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > > >> type could almost[1] trivially have supported RDMA. There would have
> > > >> been almost no RDMA code required in the migration subsystem, and all
> > > >> the modern features like compression, multifd, post-copy, etc would
> > > >> "just work".
> > > >>
> > > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > > >> performance gains, but it might still have been a better tradeoff
> > > >> to have not quite so good peak performance, but with massively
> > > >> less maint burden.
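
Not arguing with the conclusion, but to illustrate the 1-1 match for anyone
who has not looked at rsockets before: below is a toy client I sketched from
the librdmacm man pages. The peer address and port are placeholders and none
of this is QEMU code; apart from the r- prefix and rpoll() it is the plain
sockets API:

/*
 * Sketch only, not from QEMU: a trivial rsockets client showing the 1-1
 * mapping onto the kernel sockets API.
 * Build (assumption): gcc rsocket_sketch.c -lrdmacm
 */
#include <stdio.h>
#include <poll.h>
#include <sys/socket.h>
#include <netdb.h>
#include <rdma/rsocket.h>

int main(void)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
    struct addrinfo *res = NULL;
    const char msg[] = "ping";
    char buf[64];

    /* Placeholder peer; any rsockets-speaking server would do. */
    if (getaddrinfo("192.0.2.1", "4444", &hints, &res)) {
        return 1;
    }

    /* Same shape as socket()/connect()/send(), just r-prefixed. */
    int fd = rsocket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || rconnect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("rsocket/rconnect");
        freeaddrinfo(res);
        return 1;
    }
    rsend(fd, msg, sizeof(msg), 0);

    /* The footnote [1] caveat: this fd is not a kernel fd, so readiness
     * has to go through rpoll() instead of plain poll(). */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (rpoll(&pfd, 1, 5000) > 0) {
        ssize_t n = rrecv(fd, buf, sizeof(buf), 0);
        printf("got %zd bytes back\n", n);
    }

    rclose(fd);
    freeaddrinfo(res);
    return 0;
}

So a QIOChannel backend on top of this would presumably look very much like
the existing socket channel, modulo the fd caveat in footnote [1] below.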
> > > > My understanding so far is RDMA is solely for performance but nothing
> > > > else, then it's a question on whether existing rdma users would like
> > > > to do so if it will run slower.
> > > >
> > > > Jinpu mentioned the explicit usages of ib verbs but I am just mostly
> > > > quoting that word as I don't really know such details:
> > > >
> > > > https://lore.kernel.org/qemu-devel/CAMGffEm2T...
> > > >
> > > > So not sure whether that applies here too, in that having qiochannel
> > > > wrapper may not allow direct access to those ib verbs.
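
To make the "explicit ib verbs" point a bit more concrete: what the current
rdma migration path leans on, and what a sockets-style wrapper would hide, is
roughly the pattern below: register a RAM block once, then push pages with
one-sided RDMA writes. This is only an illustrative fragment written for this
mail (pd, qp, remote_addr and rkey are assumed to come from an already
established connection); it is not the actual migration/rdma.c code:

/*
 * Illustrative fragment only (not migration/rdma.c): register a RAM block
 * and post a one-sided RDMA write for a single page.  pd, qp, remote_addr
 * and rkey are assumed to exist from an established connection.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int push_page(struct ibv_pd *pd, struct ibv_qp *qp,
              void *ram_block, size_t block_len,
              size_t page_offset, size_t page_size,
              uint64_t remote_addr, uint32_t rkey)
{
    /* One registration covers the whole RAM block, so the HCA can pull
     * pages directly from guest memory without extra copies. */
    struct ibv_mr *mr = ibv_reg_mr(pd, ram_block, block_len,
                                   IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return -1;
    }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)ram_block + page_offset,
        .length = page_size,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = page_offset,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,   /* one-sided: no copy on the peer */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr + page_offset;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    int ret = ibv_post_send(qp, &wr, &bad);
    if (ret) {
        fprintf(stderr, "ibv_post_send failed: %d\n", ret);
        ibv_dereg_mr(mr);
        return -1;
    }
    /* Completions are reaped from the CQ elsewhere; in practice the MR
     * would stay registered for the whole migration, not per page. */
    return 0;
}

Going through a stream-style channel instead would replace this with plain
writes into a socket, which is presumably where any remaining peak
performance gap would come from.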
> > > >
> > > > Thanks,
> > > >
> > > >> With regards,
> > > >> Daniel
> > > >>
> > > >> [1] "almost" trivially, because the poll() integration for rsockets
> > > >>     requires a bit more magic sauce since rsockets FDs are not
> > > >>     really FDs from the kernel's POV. Still, QIOChannel likely can
> > > >>     abstract that problem.
> > > >>
> >
>
> --
> Peter Xu
>