About virsh(1) and Postcopy migration

Hello,

* virsh(1) offers multiple options to initiate Postcopy migration:

    1) virsh migrate --postcopy --postcopy-after-precopy
    2) virsh migrate --postcopy + virsh migrate-postcopy
    3) virsh migrate --postcopy --timeout <N> --timeout-postcopy

  When Postcopy migration is invoked via options (2) or (3) above, the migrated guest on the destination host hangs sometimes. But such a hang is not reproducible with option (1) above.

* When using option (1) above, libvirtd(8) waits for the first pass of pre-copy to finish before enabling postcopy migration.

* Does the same waiting happen when using options (2) and (3) above?

===
2024-07-24 14:16:27.448+0000: msg={"execute":"migrate"
2024-07-24 14:16:29.318+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:28:39.737+0000: msg={"execute":"migrate"
2024-07-24 14:28:41.119+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:44:11.684+0000: msg={"execute":"migrate"
2024-07-24 14:44:12.835+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:48:00.675+0000: msg={"execute":"migrate"
2024-07-24 14:48:02.319+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 15:03:36.110+0000: msg={"execute":"migrate"
2024-07-24 15:03:37.341+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 16:05:25.602+0000: msg={"execute":"migrate"
2024-07-24 16:05:26.756+0000: msg={"execute":"migrate-start-postcopy"
===

* While running migration tests with options (2) and (3) above, switch to postcopy appears to happen within 2 seconds of starting migration.
    - Is that reasonable time to switch from pre-copy to postcopy?
    - Is there an ideal time to wait before switching to postcopy?

* The feature page below suggests to wait until one cycle of RAM migration has completed
    -> https://wiki.qemu.org/Features/PostCopyLiveMigration

* I'd much appreciate any clarification/confirmation about this.

Thank you.
---
  -Prasad

On Thu, Aug 29, 2024 at 10:11:05 +0000, Prasad Pandit wrote:
Hello,
* virsh(1) offers multiple options to initiate Postcopy migration:
    1) virsh migrate --postcopy --postcopy-after-precopy
    2) virsh migrate --postcopy + virsh migrate-postcopy
    3) virsh migrate --postcopy --timeout <N> --timeout-postcopy
When Postcopy migration is invoked via options (2) or (3) above, the migrated guest on the destination host hangs sometimes. But such a hang is not reproducible with option (1) above.
* When using option (1) above, libvirtd(8) waits for the first pass of pre-copy to finish before enabling postcopy migration.
Right.
* Does the same waiting happen when using options (2) and (3) above?
No. The explicit "virsh migrate-postcopy" request expects the user to decide when to switch to post-copy by monitoring the migration.
===
2024-07-24 14:16:27.448+0000: msg={"execute":"migrate"
2024-07-24 14:16:29.318+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:28:39.737+0000: msg={"execute":"migrate"
2024-07-24 14:28:41.119+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:44:11.684+0000: msg={"execute":"migrate"
2024-07-24 14:44:12.835+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:48:00.675+0000: msg={"execute":"migrate"
2024-07-24 14:48:02.319+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 15:03:36.110+0000: msg={"execute":"migrate"
2024-07-24 15:03:37.341+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 16:05:25.602+0000: msg={"execute":"migrate"
2024-07-24 16:05:26.756+0000: msg={"execute":"migrate-start-postcopy"
===
* While running migration tests with options (2) and (3) above, switch to postcopy appears to happen within 2 seconds of starting migration.
    - Is that reasonable time to switch from pre-copy to postcopy?
No, that's not very reasonable. Basically every memory page access would have to be delayed until the page is transferred from the source host.
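For reference, the "within 2 seconds" figure can be checked directly from the monitor log quoted above. The following is only a small sketch: the log text is copied from the report, while the helper name `switch_delays` and the regular expression are illustrative and not part of libvirt or QEMU.

```python
import re
from datetime import datetime

# Monitor log excerpt copied from the original report.
LOG = '''\
2024-07-24 14:16:27.448+0000: msg={"execute":"migrate"
2024-07-24 14:16:29.318+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:28:39.737+0000: msg={"execute":"migrate"
2024-07-24 14:28:41.119+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:44:11.684+0000: msg={"execute":"migrate"
2024-07-24 14:44:12.835+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 14:48:00.675+0000: msg={"execute":"migrate"
2024-07-24 14:48:02.319+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 15:03:36.110+0000: msg={"execute":"migrate"
2024-07-24 15:03:37.341+0000: msg={"execute":"migrate-start-postcopy"
2024-07-24 16:05:25.602+0000: msg={"execute":"migrate"
2024-07-24 16:05:26.756+0000: msg={"execute":"migrate-start-postcopy"
'''

# Captures the timestamp and the QMP command name of each log line.
LINE = re.compile(r'^(\S+ \S+): msg=\{"execute":"([^"]+)"')


def switch_delays(log_text):
    """Return the seconds between each "migrate" and the next "migrate-start-postcopy"."""
    delays = []
    started = None
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f%z")
        command = match.group(2)
        if command == "migrate":
            started = when
        elif command == "migrate-start-postcopy" and started is not None:
            delays.append((when - started).total_seconds())
            started = None
    return delays


DELAYS = switch_delays(LOG)
for delay in DELAYS:
    print(f"{delay:.3f} s")
```

All six delays come out between roughly 1.15 and 1.87 seconds, which matches the observation in the report.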
- Is there an ideal time to wait before switching to postcopy?
Not really. As the name suggests this is meant as a timeout, i.e., switch to post-copy if pre-copy migration is taking too long and thus is unlikely to ever converge. So logically the timeout should be long enough to give pre-copy migration time to do its job. In this case, switching to post-copy is an alternative approach to CPU throttling for helping migration to converge.
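To get a feeling for what "long enough" might mean, one rough back-of-the-envelope estimate is the time a single full pass over guest RAM would take. This is only a sketch under simplifying assumptions that do not come from the thread: every page is sent in full (no zero-page or compression savings) and the bandwidth is constant. The function name and the 16 GiB / 1 GiB/s figures are hypothetical.

```python
GIB = 1024 ** 3


def full_pass_seconds(guest_ram_bytes, bandwidth_bytes_per_s):
    """Estimate the time for one full pre-copy pass over guest RAM.

    Assumes every page is transferred in full at a constant rate; real
    migrations will differ (zero pages, compression, varying bandwidth).
    """
    if guest_ram_bytes < 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("RAM size must be >= 0 and bandwidth must be > 0")
    return guest_ram_bytes / bandwidth_bytes_per_s


# Hypothetical guest: 16 GiB of RAM migrated at 1 GiB/s.
EXAMPLE = full_pass_seconds(16 * GIB, 1 * GIB)
print(f"one full pass: about {EXAMPLE:.0f} s")
```

Under these assumptions, a switch after less than 2 seconds would leave most of the guest memory still on the source host.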
* The feature page below suggests to wait until one cycle of RAM migration has completed -> https://wiki.qemu.org/Features/PostCopyLiveMigration
Right, that's definitely a good approach as only memory pages that changed during migration will have to be transferred from the source.

Jirka
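For option (2), where the user decides when to switch, the "wait for one cycle" rule could be sketched as a check on the job statistics. The sketch below makes assumptions that should be verified against the installed libvirt version: that `virsh domjobinfo` prints an `Iteration:` line during migration, and that the counter is 1 while the first pass is running, so a value of 2 or more means the first pass has finished. The sample text and helper names are hypothetical.

```python
import re

# Hypothetical excerpt of "virsh domjobinfo <domain>" output during migration.
# The exact field layout is an assumption, not taken from the thread.
SAMPLE = """\
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     1870         ms
Iteration:        1
"""


def completed_passes(domjobinfo_text):
    """Return the number of finished pre-copy passes, or None if unknown.

    Assumes the "Iteration:" counter is 1 during the first pass.
    """
    match = re.search(r"^Iteration:\s+(\d+)", domjobinfo_text, re.MULTILINE)
    if match is None:
        return None
    return max(int(match.group(1)) - 1, 0)


def ready_for_postcopy(domjobinfo_text):
    """True once at least one full pre-copy pass has completed."""
    passes = completed_passes(domjobinfo_text)
    return passes is not None and passes >= 1


print(ready_for_postcopy(SAMPLE))
```

A caller would poll `virsh domjobinfo <domain>` periodically and run `virsh migrate-postcopy <domain>` only once `ready_for_postcopy` returns true, instead of issuing it right after starting the migration.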

Hello Jiri,

On Thursday 29 August, 2024 at 04:58:39 pm IST, Jiri Denemark <jdenemar@redhat.com> wrote:
* Does the same waiting happen when using options (2) and (3) above?
No. The explicit "virsh migrate-postcopy" request expects the user to decide when to switch to post-copy by monitoring the migration.
* While running migration tests switch to postcopy appears to happen within 2 seconds of starting migration.
    - Is that reasonable time to switch from pre-copy to postcopy?
No, that's not very reasonable. Basically every memory page access would have to be delayed until the page is transferred from the source host.
- Is there an ideal time to wait before switching to postcopy?
Not really. As the name suggests this is meant as a timeout, i.e., switch to post-copy if pre-copy migration is taking too long and thus is unlikely to ever converge. So logically the timeout should be long enough to give pre-copy migration time to do its job. In this case, switching to post-copy is an alternative approach to CPU throttling for helping migration to converge.
* The feature page below suggests to wait until one cycle of RAM migration has completed -> https://wiki.qemu.org/Features/PostCopyLiveMigration
Right, that's definitely a good approach as only memory pages that changed during migration will have to be transferred from the source.
* Okay, great! Thanks so much for these details, I appreciate it.

Thank you.
---
  - Prasad

Hello Jiri, David

[+cc dgilbert]

On Thursday 29 August, 2024 at 04:58:39 pm IST, Jiri Denemark <jdenemar@redhat.com> wrote:
- Is there an ideal time to wait before switching to postcopy?
Not really. As the name suggests this is meant as a timeout, [...] So logically the timeout should be long enough to give pre-copy migration time to do its job.
* The feature page below suggests to wait until one cycle of RAM migration has completed -> https://wiki.qemu.org/Features/PostCopyLiveMigration
Right, that's definitely a good approach as only memory pages that changed during migration will have to be transferred from the source.
Some follow-up questions:

* Why is it recommended to complete at least one cycle of RAM migration before switching to Postcopy mode?
* What are the risks if we switch to postcopy before completing one cycle of RAM migration?
* Is it plausible to start migration in postcopy mode from the beginning, i.e. without the first pre-copy pass?
* I'm asking because even when we switch to postcopy before completing the first (pre-copy) pass, the migration seems to work fine most of the time. Errors occur occasionally; it does not _always_ fail with errors.
    - Could we file bugs for this occasional failure? Or is that expected because we did not wait to complete the first (pre-copy) round of migration?
* I'd appreciate it if you could help to clarify this.

Thank you.
---
  -Prasad