Heya,
We’ve recently hit a number of failures on QEMU P2P live migrations
which appear to be caused by transient networking disconnects at
different points in the migration process. We would like to implement
smarter retry logic in our control plane to ensure such issues don’t
stall critical workflows. On the other hand, we cannot blindly retry
every failed migration, because doing so greatly lengthens the time it
takes high-level automation to fail when there is a real problem.
Are there currently any generally understood best practices for retrying
migrations from a control plane perspective? Ideally we would decide
whether to retry based on error codes, but the QEMU P2P migration path
in particular returns many generic codes. For example, see [1], where
we attempted to improve the error code for a set of failure cases that
are likely retryable.
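
To make the question concrete, here is a rough sketch of the kind of
policy we mean: retry only when the failure carries a code known to be
transient, cap the number of attempts, and fail fast on everything else.
All names here (MigrationError, the RETRYABLE labels, migrate_with_retry)
are hypothetical placeholders for illustration, not libvirt APIs or real
libvirt error codes.

```python
import time
from typing import Callable

# Hypothetical labels for transient failures; NOT real libvirt error codes.
RETRYABLE = {"NETWORK_DISCONNECT", "CONNECTION_RESET"}


class MigrationError(Exception):
    """Placeholder for whatever error the migration call surfaces."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def migrate_with_retry(
    migrate: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run migrate(), retrying only on codes listed in RETRYABLE.

    Returns the number of attempts used. Re-raises immediately on a
    non-retryable code, or once max_attempts is exhausted, so that a
    real problem still fails quickly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            migrate()
            return attempt
        except MigrationError as err:
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise
            # Exponential backoff between attempts: 1x, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The weak point is exactly the RETRYABLE set: with generic codes we
cannot populate it reliably, which is why we are asking.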
[1] https://listman.redhat.com/archives/libvir-list/2022-January/msg00217.html
Thanks,
Raphael