Reviewed-by: Michael Galaxy <mgalaxy(a)akamai.com>
Thanks Yu Zhang and Peter.
- Michael
On 4/29/24 15:45, Yu Zhang wrote:
> Hello Michael and Peter,
>
> We are very glad to have received your quick and kind replies about our plan
> to take over the maintenance of your code. This message presents our plan
> for working together.
> If we were able to obtain the maintainer's role, our plan is:
>
> 1. Create the necessary unit-test cases and get them integrated into
> the current QEMU GitLab-CI pipeline
> 2. Review and test the code changes by other developers to ensure that
> nothing is broken in the changed code before being merged by the
> community
> 3. Based on our current practice and application scenario, look for
> possible improvements when necessary
>
> Besides that, a patch is attached to announce this change in the community.
>
> With your generous support, we hope that the development community
> will make a positive decision for us.
>
> Kind regards,
> Yu Zhang @ IONOS Cloud
>
> On Mon, Apr 29, 2024 at 4:57 PM Peter Xu <peterx(a)redhat.com> wrote:
>> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
>>> Hi All (and Peter),
>> Hi, Michael,
>>
>>> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
>>> (highly irregular for a male) and yes, that's my real last name:
>>>
>>> https://urldefense.com/v3/__https://www.linkedin.com/in/mrgalaxy/__;!!Gjv...
>>>
>>> I'm the original author of the RDMA implementation. I've been discussing
>>> with Yu Zhang for a little bit about potentially handing over maintainership
>>> of the codebase to his team.
>>>
>>> I simply have zero access to RoCE or InfiniBand hardware at all,
>>> unfortunately, so I've never been able to run tests or use what I wrote at
>>> work, and as all of you know, if you don't have a way to test something,
>>> then you can't maintain it.
>>>
>>> Yu Zhang put a (very kind) proposal forward to me to ask the community if
>>> they feel comfortable training his team to maintain the codebase (and run
>>> tests) while they learn about it.
>> The "while learning" part is fine at least to me. IMHO the "ownership" to
>> the code, or say, taking over the responsibility, may or may not need 100%
>> mastering the code base first. There should still be some fundamental
>> confidence to work on the code though as a starting point, then it's about
>> serious use case to back this up, and careful testing while getting more
>> familiar with it.
>>
>>> If you don't mind, I'd like to let him send over his (very detailed)
>>> proposal,
>> Yes please, it's exactly the time to share the plan. The hope is we try to
>> reach a consensus before or around the middle of this release (9.1).
>> Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
>> not yet out, but I think it means we make a decision before or around
>> middle of June.
>>
>> Thanks,
>>
>>> - Michael
>>>
>>> On 4/11/24 11:36, Yu Zhang wrote:
>>>>> 1) Either a CI test covering at least the major RDMA paths, or at least
>>>>> periodic tests for each QEMU release will be needed.
>>>> We use a batch of regression test cases for the stack, which covers the
>>>> test for QEMU. I did such test for most of the QEMU releases planned as
>>>> candidates for rollout.
>>>>
>>>> The migration test needs a pair of (either physical or virtual) servers with
>>>> InfiniBand network, which makes it difficult to do on a single server. The
>>>> nested VM could be a possible approach, for which we may need virtual
>>>> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
>>>>
>>>> [1] https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/artic...
>>>>
>>>> Thanks and best regards!
>>>>
>>>> On Thu, Apr 11, 2024 at 4:20 PM Peter Xu <peterx(a)redhat.com> wrote:
>>>>> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
>>>>>> On Wed, Apr 10, 2024 at 02:28:59AM +0000, Zhijian Li (Fujitsu) via wrote:
>>>>>>> on 4/10/2024 3:46 AM, Peter Xu wrote:
>>>>>>>
>>>>>>>>> Is there a document/link about the unit tests/CI for migration tests? Why
>>>>>>>>> are those tests missing?
>>>>>>>>> Is it hard or very special to set up an environment for that? Maybe we
>>>>>>>>> can help in this regard.
>>>>>>>> See tests/qtest/migration-test.c. We put most of our migration tests
>>>>>>>> there and that's covered in CI.
>>>>>>>>
>>>>>>>> I think one major issue is CI systems don't normally have rdma devices.
>>>>>>>> Can rdma migration test be carried out without real hardware?
>>>>>>> Yeah, RXE, a.k.a. Soft-RoCE, is able to emulate RDMA, for example:
>>>>>>> $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host
>>>>>>> then we can get a new RDMA interface "rxe_eth0".
>>>>>>> This new RDMA interface is able to do the QEMU RDMA migration.
>>>>>>>
>>>>>>> Also, the loopback (lo) device is able to emulate the RDMA interface
>>>>>>> "rxe_lo"; however, when I tried (years ago) to do RDMA migration over
>>>>>>> this interface (rdma:127.0.0.1:3333), something went wrong.
>>>>>>> So I gave up enabling the RDMA migration qtest at that time.
>>>>>> Thanks, Zhijian.
>>>>>>
>>>>>> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
>>>>>> Maybe someone more familiar with how CI works can chime in.
>>>>> Some people got dropped on the cc list for unknown reason, I'm adding them
>>>>> back (Fabiano, Peter Maydell, Phil). Let's make sure nobody is dropped by
>>>>> accident.
>>>>>
>>>>> I'll try to summarize what is still missing, and I think these will be
>>>>> greatly helpful if we don't want to deprecate rdma migration:
>>>>>
>>>>> 1) Either a CI test covering at least the major RDMA paths, or at least
>>>>> periodic tests for each QEMU release will be needed.
>>>>>
>>>>> 2) Some performance tests between modern RDMA and NIC devices are
>>>>> welcome. The current knowledge is that modern NICs can perform similarly
>>>>> to RDMA, in which case it's debatable why we still maintain so much
>>>>> RDMA-specific code.
>>>>>
>>>>> 3) No need to be solid patchsets for this one, but some plan to improve
>>>>> RDMA migration code so that it is not almost isolated from the rest of
>>>>> the protocols.
>>>>>
>>>>> 4) Someone to look after this code for real.
>>>>>
>>>>> For 2) and 3) more info is here:
>>>>>
>>>>>
>>>>> https://urldefense.com/v3/__https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1...
>>>>>
>>>>> Here 4) can be the most important as Markus pointed out. We just didn't
>>>>> get there yet on the discussions, but maybe Markus is right that we should
>>>>> talk about that first.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Peter Xu
>>>>>
>> --
>> Peter Xu
>>