Hi,
We are running libvirt 8.0.0, and live migration sometimes fails to finish
because the guest dirties memory too fast. We implemented a monitor that
increases the max downtime when it observes "Data Remaining" going up.
However, we see a strange sequence of events from the monitor that ends
with a paused domain on the destination hypervisor.
The monitor sees Data Remaining going up and increases max downtime, up to
20 seconds. The odd part is that after a while it starts reporting that
"Data Remaining" and "Data Total" are both 0, even though the migration
job is still unfinished:
"Migration in progress - DataTotal: 85904728064, DataRemaining: 22201458688, TimeElapsed: 20005, MaxDowntime: 500, DirtyRate: 0"
"Migration in progress - DataTotal: 85904728064, DataRemaining: 43801825280, TimeElapsed: 10005, MaxDowntime: 500, DirtyRate: 0"
"Migration in progress - DataTotal: 85904728064, DataRemaining: 52382912512, TimeElapsed: 10004, MaxDowntime: 500, DirtyRate: 0" (DataRemaining bumps up, we start increasing max downtime)
"Migration in progress - DataTotal: 85904728064, DataRemaining: 4219596800, TimeElapsed: 40004, MaxDowntime: 1500, DirtyRate: 0" (Last poll where we see the job info)
After that, the monitor logs:
"Migration in progress - DataTotal: 0, DataRemaining: 0, TimeElapsed: 40004, MaxDowntime: 13500, DirtyRate: 0"
The domain keeps running on the source hypervisor throughout, but the
destination hypervisor is left with a domain that has been paused since it
was started.
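
To make the monitor's behavior concrete, the policy described above can be
sketched roughly as follows. This is an illustrative sketch only, not the
actual monitor code: the names `Sample` and `next_action`, the 1000 ms step,
and treating 20 seconds as a hard cap are all assumptions. It deliberately
contains no libvirt calls; a real monitor would fill in the samples from the
domain's job info and apply the result with the max-downtime API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    """One poll of the migration job statistics (hypothetical type)."""
    data_total: int       # bytes, as reported in "DataTotal"
    data_remaining: int   # bytes, as reported in "DataRemaining"


def next_action(prev: Optional[Sample], cur: Sample, max_downtime_ms: int,
                step_ms: int = 1000,      # assumed increment
                cap_ms: int = 20000       # assumed cap of 20 seconds
                ) -> Tuple[str, int]:
    """Decide what to do after a poll; returns (action, new max downtime)."""
    # Both counters at zero: the statistics no longer describe a running
    # job, so raising the downtime further is pointless. Flag it instead.
    if cur.data_total == 0 and cur.data_remaining == 0:
        return ("no_job_stats", max_downtime_ms)
    # Remaining data grew since the last poll: the guest is dirtying memory
    # faster than it is being transferred, so allow more downtime.
    if prev is not None and cur.data_remaining > prev.data_remaining:
        return ("raise_downtime", min(max_downtime_ms + step_ms, cap_ms))
    # Otherwise the migration is making progress; leave the setting alone.
    return ("wait", max_downtime_ms)
```

With the values from the logs above, the third poll (DataRemaining going
from 43801825280 to 52382912512) would raise the downtime, and the final
poll with both counters at 0 would be flagged rather than treated as a
migration still in progress.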
We are trying to understand what might have happened:
- Is this a known issue with live migrating guests that have high memory
activity, or with the way we interact with libvirt?
- What is the recommended way to ensure that a started live migration
always runs to completion if we don't care about downtime?
We would appreciate any help here.
Yangchen Ye