Migration with "--p2p --tunnelled" hanging in v6.9.0

Hi,

I'm wondering about the best next steps to debug a migration issue. What I found is that with libvirt v6.9.0 a migration hangs when invoked like:

$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test \
    qemu+ssh://testkvm-hirsute-to/system

Just "--live --p2p" works fine. A bunch of other migration option combinations also work with the same build/setup; only p2p+tunnelled fails. If either source or target is not on 6.9 the migration works as well (the former version I used is v6.6). I looked at alternating qemu versions (5.0 / 5.1), but that had no impact; only libvirt being on the new version on both source and target triggers the issue.

Unfortunately there is no crash or error to debug into, it just gets stuck with "virsh migrate" hanging on the source and the guest never leaving the "paused" state on the target.

I have compared setups with the least amount of "change":
good: Qemu 5.1 + Libvirt 6.9 -> Qemu 5.1 / Libvirt 6.6
bad:  Qemu 5.1 + Libvirt 6.9 -> Qemu 5.1 / Libvirt 6.9

[1] has the debug logs of those, beginning with a libvirtd restart that one can likely skip and then going into the migration that hangs in the bad case. But I failed to see an obvious reason in the logs.

In git/news I only found these changes which sounded relevant:
f51cbe92c0 qemu: Allow migration over UNIX socket
c69915ccaf peer2peer migration: allow connecting to local sockets
But I'm not using unix:, and in the logs the only unix: mentions are for the qemu monitor and the qemu-guest-agent.

I wanted to ask:
- if something related was recently changed that comes to mind?
- if someone else sees anything in the linked logs that I missed?
- if someone else has seen/reproduced the same?
- for best practices to debug a hanging migration

Thanks in advance!

[1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437...

--
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
In git/news I only found these changes which sounded to be relevant:
f51cbe92c0 qemu: Allow migration over UNIX socket
c69915ccaf peer2peer migration: allow connecting to local sockets
But I'm not using unix: and in the logs the only unix: mentions are for the qemu monitor and qemu-guest-agent.
One very important part of '--tunnelled' migration is the use of virStream APIs to transport the migration data. Perhaps something there is broken, since it doesn't reproduce when not using the tunnel.
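(For context on the stream transport Peter mentions: a tunnelled migration moves the QEMU migration data through libvirt's public virStream API rather than over a direct QEMU-to-QEMU connection. Below is only a minimal sketch of the send-side pattern, assuming a blocking stream; the function name, chunk size and error handling are invented for illustration and this is not the actual libvirt migration code.)

/* Sketch: read migration data from a local fd and push it over a
 * virStream; the remote driver carries the stream across the RPC
 * connection (here: ssh / virt-ssh-helper). Illustration only. */
#include <unistd.h>
#include <libvirt/libvirt.h>

static int
pump_fd_to_stream(int fd, virStreamPtr st)
{
    char buf[64 * 1024];                 /* chunk size chosen arbitrarily */
    ssize_t got;

    while ((got = read(fd, buf, sizeof(buf))) > 0) {
        size_t off = 0;
        while (off < (size_t)got) {
            /* virStreamSend may write less than requested, so loop */
            int sent = virStreamSend(st, buf + off, (size_t)got - off);
            if (sent < 0) {
                virStreamAbort(st);
                return -1;
            }
            off += (size_t)sent;
        }
    }
    if (got < 0) {
        virStreamAbort(st);
        return -1;
    }
    return virStreamFinish(st);          /* signal clean end-of-stream */
}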

On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
BTW, to reduce the scope of what to think about: I have rebuilt 6.8 as well and it works. Thereby I can confirm that the offending change must be somewhere between 6.8.0 and 6.9.0.
In git/news I only found these changes which sounded to be relevant:
f51cbe92c0 qemu: Allow migration over UNIX socket
c69915ccaf peer2peer migration: allow connecting to local sockets
But I'm not using unix: and in the logs the only unix: mentions are for the qemu monitor and qemu-guest-agent.
One very important part of '--tunnelled' migration is the use of virStream APIs to transport the migration data. Perhaps something there is broken since it doesn't reproduce when not using the tunnel
Thanks for the hint Peter. I was now looking there as well, but other than the switch to g_new0 there is neither a change carrying the word "stream" nor one affecting the related files that implement virStream*.

--
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

On Wed, Nov 25, 2020 at 10:55 AM Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:
On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
BTW, to reduce the scope of what to think about: I have rebuilt 6.8 as well and it works. Thereby I can confirm that the offending change must be somewhere between 6.8.0 and 6.9.0.

I was able to get this working with git bisect builds between v6.8 / v6.9 and identified the following offending commit:
7d959c30 rpc: Fix virt-ssh-helper detection

Ok, that makes a bit of sense: first we had in 6.8
f8ec7c84 rpc: use new virt-ssh-helper binary for remote tunnelling
which makes this related to tunnelling and matches our broken use case. The identified commit "7d959c30 rpc: Fix virt-ssh-helper detection" might finally really enable the new helper, and that is then what is broken?

With that knowledge I was able to confirm that it really is the native mode:

$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
<works>
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=native
<hangs>

I recently discussed with Andrea whether we'd need apparmor rules for virt-ssh-helper, but there are no denials nor libvirt log entries related to virt-ssh-helper. We don't need such rules anyway, since it is spawned on the ssh login and not under libvirtd itself.

PS output of the hanging receiving virt-ssh-helper (looks not too unhappy):

Source:
4 0 41305 1 20 0 1627796 23360 poll_s Ssl ? 0:05 /usr/sbin/libvirtd
0 0 41523 41305 20 0 9272 4984 poll_s S ? 0:02 \_ ssh -T -e none -- testkvm-hirsute-to sh -c 'virt-ssh-helper 'qemu:///system''

Target:
4 0 213 1 20 0 13276 4132 poll_s Ss ? 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 250-500 startups
4 0 35148 213 20 0 19048 11320 poll_s Ss ? 0:02 \_ sshd: root@notty
4 0 35206 35148 20 0 2584 544 do_wai Ss ? 0:00 \_ sh -c virt-ssh-helper qemu:///system
0 0 35207 35206 20 0 81348 26684 - R ? 0:34 \_ virt-ssh-helper qemu:///system

I've looked at it with strace [1] and gdb for backtraces [2] - it is not dead or stuck and keeps working. Could it be just so slow that it appears to hang until it times out? Or is the event mechanism having issues so that it wakes up too rarely?

Also, did anyone else see the same with >=v6.9.0?

[1]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1904584/comments/12
[2]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1904584/comments/13

In git/news I only found these changes which sounded to be relevant:
f51cbe92c0 qemu: Allow migration over UNIX socket
c69915ccaf peer2peer migration: allow connecting to local sockets
But I'm not using unix: and in the logs the only unix: mentions are for the qemu monitor and qemu-guest-agent.
One very important part of '--tunnelled' migration is the use of virStream APIs to transport the migration data. Perhaps something there is broken since it doesn't reproduce when not using the tunnel
Thanks for the hint Peter. I was now looking there as well, but other than the switch to g_new0 there is neither a change carrying the word "stream" nor one affecting the related files that implement virStream*.
--
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

On Wed, Nov 25, 2020 at 01:28:09PM +0100, Christian Ehrhardt wrote:
On Wed, Nov 25, 2020 at 10:55 AM Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:
On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
BTW, to reduce the scope of what to think about: I have rebuilt 6.8 as well and it works. Thereby I can confirm that the offending change must be somewhere between 6.8.0 and 6.9.0.
I was able to get this working in git bisect builds from git between v6.8 / v6.9. I identified the following offending commit:
7d959c30 rpc: Fix virt-ssh-helper detection
Ok that makes a bit of sense, first we had in 6.8
f8ec7c84 rpc: use new virt-ssh-helper binary for remote tunnelling
That makes it related to tunnelling which matches our broken use-case.
The identified commit "7d959c30 rpc: Fix virt-ssh-helper detection" might finally really enable the new helper and that is then broken?
With that knowledge I was able to confirm that it really is the native mode
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
<works>
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=native
<hangs>
I recently discussed with Andrea if we'd need apparmor rules for virt-ssh-helper, but there are no denials nor libvirt log entries related to virt-ssh-helper. But we don't need such rules since it is spawned on the ssh login and not under libvirtd itself.
PS output of the hanging receiving virt-ssh-helper (looks not too unhappy):
Source:
4 0 41305 1 20 0 1627796 23360 poll_s Ssl ? 0:05 /usr/sbin/libvirtd
0 0 41523 41305 20 0 9272 4984 poll_s S ? 0:02 \_ ssh -T -e none -- testkvm-hirsute-to sh -c 'virt-ssh-helper 'qemu:///system''
Target:
4 0 213 1 20 0 13276 4132 poll_s Ss ? 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 250-500 startups
4 0 35148 213 20 0 19048 11320 poll_s Ss ? 0:02 \_ sshd: root@notty
4 0 35206 35148 20 0 2584 544 do_wai Ss ? 0:00 \_ sh -c virt-ssh-helper qemu:///system
0 0 35207 35206 20 0 81348 26684 - R ? 0:34 \_ virt-ssh-helper qemu:///system
I've looked at it with strace [1] and gdb for backtraces [2] - it is not dead or stuck and keeps working. Could it be just so slow that it appears to hang until it times out? Or is the event mechanism having issues and it wakes up too rarely?
Let's take migration out of the picture. What if you simply do

virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native list

does that work?

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, Nov 25, 2020 at 1:38 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Nov 25, 2020 at 01:28:09PM +0100, Christian Ehrhardt wrote:
On Wed, Nov 25, 2020 at 10:55 AM Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:
On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
BTW, to reduce the scope of what to think about: I have rebuilt 6.8 as well and it works. Thereby I can confirm that the offending change must be somewhere between 6.8.0 and 6.9.0.
I was able to get this working in git bisect builds from git between v6.8 / v6.9. I identified the following offending commit:
7d959c30 rpc: Fix virt-ssh-helper detection
Ok that makes a bit of sense, first we had in 6.8
f8ec7c84 rpc: use new virt-ssh-helper binary for remote tunnelling
That makes it related to tunnelling which matches our broken use-case.
The identified commit "7d959c30 rpc: Fix virt-ssh-helper detection" might finally really enable the new helper and that is then broken?
With that knowledge I was able to confirm that it really is the native mode
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
<works>
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=native
<hangs>
I recently discussed with Andrea if we'd need apparmor rules for virt-ssh-helper, but there are no denials nor libvirt log entries related to virt-ssh-helper. But we don't need such rules since it is spawned on the ssh login and not under libvirtd itself.
PS output of the hanging receiving virt-ssh-helper (looks not too unhappy):
Source:
4 0 41305 1 20 0 1627796 23360 poll_s Ssl ? 0:05 /usr/sbin/libvirtd
0 0 41523 41305 20 0 9272 4984 poll_s S ? 0:02 \_ ssh -T -e none -- testkvm-hirsute-to sh -c 'virt-ssh-helper 'qemu:///system''
Target:
4 0 213 1 20 0 13276 4132 poll_s Ss ? 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 250-500 startups
4 0 35148 213 20 0 19048 11320 poll_s Ss ? 0:02 \_ sshd: root@notty
4 0 35206 35148 20 0 2584 544 do_wai Ss ? 0:00 \_ sh -c virt-ssh-helper qemu:///system
0 0 35207 35206 20 0 81348 26684 - R ? 0:34 \_ virt-ssh-helper qemu:///system
I've looked at it with strace [1] and gdb for backtraces [2] - it is not dead or stuck and keeps working. Could it be just so slow that it appears to hang until it times out? Or is the event mechanism having issues and it wakes up too rarely?
Let's take migration out of the picture. What if you simply do
virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native list
does that work?
Yes it does, no hang and proper results. I migrated the system there (with ?proxy=netcat) to ensure there is something to report, and indeed some data is coming back through this.

root@testkvm-hirsute-from:~# virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
root@testkvm-hirsute-from:~# virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native list
 Id   Name          State
-----------------------------
 6    h-migr-test   running
--
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

On Wed, Nov 25, 2020 at 02:33:44PM +0100, Christian Ehrhardt wrote:
On Wed, Nov 25, 2020 at 1:38 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Nov 25, 2020 at 01:28:09PM +0100, Christian Ehrhardt wrote:
On Wed, Nov 25, 2020 at 10:55 AM Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:
On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
BTW, to reduce the scope of what to think about: I have rebuilt 6.8 as well and it works. Thereby I can confirm that the offending change must be somewhere between 6.8.0 and 6.9.0.
I was able to get this working in git bisect builds from git between v6.8 / v6.9. I identified the following offending commit:
7d959c30 rpc: Fix virt-ssh-helper detection
Ok that makes a bit of sense, first we had in 6.8
f8ec7c84 rpc: use new virt-ssh-helper binary for remote tunnelling
That makes it related to tunnelling which matches our broken use-case.
The identified commit "7d959c30 rpc: Fix virt-ssh-helper detection" might finally really enable the new helper and that is then broken?
With that knowledge I was able to confirm that it really is the native mode
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
<works>
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=native
<hangs>
I recently discussed with Andrea if we'd need apparmor rules for virt-ssh-helper, but there are no denials nor libvirt log entries related to virt-ssh-helper. But we don't need such rules since it is spawned on the ssh login and not under libvirtd itself.
PS output of the hanging receiving virt-ssh-helper (looks not too unhappy):
Source:
4 0 41305 1 20 0 1627796 23360 poll_s Ssl ? 0:05 /usr/sbin/libvirtd
0 0 41523 41305 20 0 9272 4984 poll_s S ? 0:02 \_ ssh -T -e none -- testkvm-hirsute-to sh -c 'virt-ssh-helper 'qemu:///system''
Target:
4 0 213 1 20 0 13276 4132 poll_s Ss ? 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 250-500 startups
4 0 35148 213 20 0 19048 11320 poll_s Ss ? 0:02 \_ sshd: root@notty
4 0 35206 35148 20 0 2584 544 do_wai Ss ? 0:00 \_ sh -c virt-ssh-helper qemu:///system
0 0 35207 35206 20 0 81348 26684 - R ? 0:34 \_ virt-ssh-helper qemu:///system
I've looked at it with strace [1] and gdb for backtraces [2] - it is not dead or stuck and keeps working. Could it be just so slow that it appears to hang until it times out? Or is the event mechanism having issues and it wakes up too rarely?
Let's take migration out of the picture. What if you simply do
virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native list
does that work?
Yes it does, no hang and proper results
Ok, so that shows virt-ssh-helper is not completely broken at least. Makes me think there is possibly something related to the streams code that causes the issue.

You might try the virsh "console" or "vol-upload" commands to test the streams stuff in isolation. If that also works, then the problem is specific to migration, and we'll probably want to collect debug level logs from src+dst hosts.

Regards,
Daniel
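(For reference, "vol-download"/"vol-upload" go through exactly the same virStream transport with no migration logic involved, which is why they isolate the streams code. Below is a rough sketch of what `virsh vol-download` does via the public API; the connection URI, pool/volume/file names are placeholders, and NULL/error checks are omitted for brevity - it is an illustration, not virsh's actual implementation.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libvirt/libvirt.h>

/* Sink callback invoked by virStreamRecvAll() for every chunk received. */
static int
sink_to_fd(virStreamPtr st, const char *data, size_t nbytes, void *opaque)
{
    int fd = *(int *)opaque;
    (void)st;                            /* unused here */
    while (nbytes > 0) {
        ssize_t w = write(fd, data, nbytes);
        if (w < 0)
            return -1;
        data += w;
        nbytes -= (size_t)w;
    }
    return 0;
}

int main(void)
{
    /* NULL/error checks omitted for brevity; names are placeholders. */
    virConnectPtr conn = virConnectOpen("qemu+ssh://127.0.0.1/system?proxy=native");
    virStoragePoolPtr pool = virStoragePoolLookupByName(conn, "default");
    virStorageVolPtr vol = virStorageVolLookupByName(pool, "disk.qcow2");
    virStreamPtr st = virStreamNew(conn, 0);
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    /* Ask the daemon to feed the whole volume into the stream ... */
    if (virStorageVolDownload(vol, st, 0, 0, 0) < 0 ||
        /* ... and pull it out chunk by chunk into the local file. */
        virStreamRecvAll(st, sink_to_fd, &fd) < 0) {
        virStreamAbort(st);
        fprintf(stderr, "download failed\n");
        return 1;
    }
    virStreamFinish(st);

    close(fd);
    virStreamFree(st);
    virStorageVolFree(vol);
    virStoragePoolFree(pool);
    virConnectClose(conn);
    return 0;
}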

On Wed, Nov 25, 2020 at 2:47 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Nov 25, 2020 at 02:33:44PM +0100, Christian Ehrhardt wrote:
On Wed, Nov 25, 2020 at 1:38 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Nov 25, 2020 at 01:28:09PM +0100, Christian Ehrhardt wrote:
On Wed, Nov 25, 2020 at 10:55 AM Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:
On Tue, Nov 24, 2020 at 4:30 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Nov 24, 2020 at 16:05:53 +0100, Christian Ehrhardt wrote:
Hi,
[...]
BTW, to reduce the scope of what to think about: I have rebuilt 6.8 as well and it works. Thereby I can confirm that the offending change must be somewhere between 6.8.0 and 6.9.0.
I was able to get this working in git bisect builds from git between v6.8 / v6.9. I identified the following offending commit:
7d959c30 rpc: Fix virt-ssh-helper detection
Ok that makes a bit of sense, first we had in 6.8
f8ec7c84 rpc: use new virt-ssh-helper binary for remote tunnelling
That makes it related to tunnelling which matches our broken use-case.
The identified commit "7d959c30 rpc: Fix virt-ssh-helper detection" might finally really enable the new helper and that is then broken?
With that knowledge I was able to confirm that it really is the native mode
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=netcat
<works>
$ virsh migrate --unsafe --live --p2p --tunnelled h-migr-test qemu+ssh://testkvm-hirsute-to/system?proxy=native
<hangs>
I recently discussed with Andrea if we'd need apparmor rules for virt-ssh-helper, but there are no denials nor libvirt log entries related to virt-ssh-helper. But we don't need such rules since it is spawned on the ssh login and not under libvirtd itself.
PS output of the hanging receiving virt-ssh-helper (looks not too unhappy):
Source:
4 0 41305 1 20 0 1627796 23360 poll_s Ssl ? 0:05 /usr/sbin/libvirtd
0 0 41523 41305 20 0 9272 4984 poll_s S ? 0:02 \_ ssh -T -e none -- testkvm-hirsute-to sh -c 'virt-ssh-helper 'qemu:///system''
Target:
4 0 213 1 20 0 13276 4132 poll_s Ss ? 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 250-500 startups
4 0 35148 213 20 0 19048 11320 poll_s Ss ? 0:02 \_ sshd: root@notty
4 0 35206 35148 20 0 2584 544 do_wai Ss ? 0:00 \_ sh -c virt-ssh-helper qemu:///system
0 0 35207 35206 20 0 81348 26684 - R ? 0:34 \_ virt-ssh-helper qemu:///system
I've looked at it with strace [1] and gdb for backtraces [2] - it is not dead or stuck and keeps working. Could it be just so slow that it appears to hang until it times out? Or is the event mechanism having issues and it wakes up too rarely?
Let's take migration out of the picture. What if you simply do
virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native list
does that work?
Yes it does, no hang and proper results
Ok, so that shows virt-ssh-helper is not completely broken at least.
Makes me think there is possibly something related to streams code that causes the issue.
You might try the virsh "console" or "vol-upload" commands to test the streams stuff in isolation. If that also works, then the problem is specific to migration,
Thanks for the hint Daniel, it is indeed not migration specific - it seems that virt-ssh-helper is just very slow.

rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42285
-rw-r--r-- 1 root root 24M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 220M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 396M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 558M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 756M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 868M Nov 25 15:20 testfile
[1]+ Done    virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile

rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42307
-rw-r--r-- 1 root root 1.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 6.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 9.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 13M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 15M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 16M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 18M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 19M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 21M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 22M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 23M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 25M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 26M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 27M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 28M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 29M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 30M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 31M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile

That is ~150-200 MB/s vs 1-5 MB/s, and as seen it not only starts slow but degrades further. I'm now at 90 MB overall and down to ~150 KB/s.

and we'll probably want to collect debug level logs from src+dst hosts.

I already had debug level logs of the migration [1] attached to the launchpad bug I use to take my notes on this. Taken with these configs:
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
log_outputs="1:file:/var/log/libvirtd.log"

You can fetch the logs (of a migration), but I'm happy to generate logs of any other command (or with other log settings) as you'd prefer them.

The network used in this case is a bridge between two containers, but we can cut out more components. I found that the same vol-download vs 127.0.0.1 gives the same results. That in turn makes it easier to gather results as we only need one system.

Here are logs of:
virsh -c qemu+ssh://127.0.0.1/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile
=> ~150-220 MB/s [2]
vs
virsh -c qemu+ssh://127.0.0.1/system?proxy=native vol-download --pool uvtool h-migr-test.qcow testfile
=> 5 MB/s degrading to ~200 KB/s and less [3]

[1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437...
[2]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437...
[3]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437...

--
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd

On Wed, Nov 25, 2020 at 04:49:14PM +0100, Christian Ehrhardt wrote:
Thanks for the hint Daniel, it is indeed not migration specific - it seems that virt-ssh-helper is just very slow.
rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42285
-rw-r--r-- 1 root root 24M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 220M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 396M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 558M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 756M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 868M Nov 25 15:20 testfile
[1]+ Done    virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile
rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42307
-rw-r--r-- 1 root root 1.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 6.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 9.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 13M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 15M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 16M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 18M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 19M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 21M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 22M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 23M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 25M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 26M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 27M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 28M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 29M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 30M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 31M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
That is ~150-200 MB/s vs 1-5 MB/s and as seen it seems to start slow AND degrades further. I'm now at 90 MB overall and down to ~150 KB/s.
and we'll probably want to collect debug level logs from src+dst hosts.
I already had debug level logs of the migration [1] attached to the launchpad bug I use to take my notes on this. Taken with these configs:
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
log_outputs="1:file:/var/log/libvirtd.log"
You can fetch the logs (of a migration), but I'm happy to generate you logs of any other command (or with other log settings) as you'd prefer them.
The network used in this case is a bridge between two containers, but we can cut out more components. I found that the same vol-download vs 127.0.0.1 gives the same results. That in turn makes it easier to gather results as we only need one system.
Yep, that's useful, I'm able to reproduce this problem myself too now. Will do some local tests and report back...
Here are logs of:
virsh -c qemu+ssh://127.0.0.1/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile
=> ~150-220 MB/s [2]
vs
virsh -c qemu+ssh://127.0.0.1/system?proxy=native vol-download --pool uvtool h-migr-test.qcow testfile
=> 5 MB/s degrading to ~200 KB/s and less [3]
The degradation is particularly weird, as it doesn't seem like we're doing anything different as time goes by, based on strace logs.
[1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437... [2]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437... [3]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1904584/+attachment/5437...
Regards,
Daniel

On Wed, Nov 25, 2020 at 04:36:39PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:49:14PM +0100, Christian Ehrhardt wrote:
Thanks for the hint Daniel, it is indeed not migration specific - it seems that virt-ssh-helper is just very slow.
rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42285
-rw-r--r-- 1 root root 24M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 220M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 396M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 558M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 756M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 868M Nov 25 15:20 testfile
[1]+ Done    virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile
rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42307
-rw-r--r-- 1 root root 1.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 6.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 9.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 13M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 15M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 16M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 18M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 19M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 21M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 22M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 23M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 25M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 26M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 27M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 28M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 29M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 30M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 31M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
That is ~150-200 MB/s vs 1-5 MB/s and as seen it seems to start slow AND degrades further. I'm now at 90 MB overall and down to ~150 KB/s.
and we'll probably want to collect debug level logs from src+dst hosts.
I already had debug level logs of the migration [1] attached to the launchpad bug I use to take my notes on this. Taken with these configs:
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
log_outputs="1:file:/var/log/libvirtd.log"
You can fetch the logs (of a migration), but I'm happy to generate you logs of any other command (or with other log settings) as you'd prefer them.
The network used in this case is a bridge between two containers, but we can cut out more components. I found that the same vol-download vs 127.0.0.1 gives the same results. That in turn makes it easier to gather results as we only need one system.
Yep, that's useful, I'm able to reproduce this problem myself too now. Will do some local tests and report back...
Sigh, the problem is way too many reallocs, repeatedly growing and shrinking the buffer we use for I/O.

I guess we never noticed this awfulness in the virsh console code it was copied from, as the data volumes there are lower.

Switching to a fixed size buffer makes it massively faster. I'll prep a patch asap.

Regards,
Daniel

On Wed, Nov 25, 2020 at 05:00:20PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:36:39PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:49:14PM +0100, Christian Ehrhardt wrote:
Thanks for the hint Daniel, it is indeed not migration specific - it seems that virt-ssh-helper is just very slow.
rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42285
-rw-r--r-- 1 root root 24M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 220M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 396M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 558M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 756M Nov 25 15:20 testfile
-rw-r--r-- 1 root root 868M Nov 25 15:20 testfile
[1]+ Done    virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=netcat vol-download --pool uvtool h-migr-test.qcow testfile
rm testfile; virsh -c qemu+ssh://testkvm-hirsute-to/system?proxy=native vol-download --pool uvtool h-migr-test.qcow testfile & for i in $(seq 1 20); do sleep 1s; ll -laFh testfile; done
[1] 42307
-rw-r--r-- 1 root root 1.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 6.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 9.8M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 13M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 15M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 16M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 18M Nov 25 15:21 testfile
-rw-r--r-- 1 root root 19M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 21M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 22M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 23M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 25M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 26M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 27M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 28M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 29M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 30M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 31M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
-rw-r--r-- 1 root root 32M Nov 25 15:22 testfile
That is ~150-200 MB/s vs 1-5 MB/s and as seen it seems to start slow AND degrades further. I'm now at 90 MB overall and down to ~150 KB/s.
and we'll probably want to collect debug level logs from src+dst hosts.
I already had debug level logs of the migration [1] attached to the launchpad bug I use to take my notes on this. Taken with these configs:
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
log_outputs="1:file:/var/log/libvirtd.log"
You can fetch the logs (of a migration), but I'm happy to generate you logs of any other command (or with other log settings) as you'd prefer them.
The network used in this case is a bridge between two containers, but we can cut out more components. I found that the same vol-download vs 127.0.0.1 gives the same results. That in turn makes it easier to gather results as we only need one system.
Yep, that's useful, I'm able to reproduce this problem myself too now. Will do some local tests and report back...
Sigh, the problem is way too many reallocs, repeatedly growing and shrinking the buffer we use for I/O.
I guess we never noticed this awfulness in the virsh console code it was copied from, as the data volumes are lower.
Switching to a fixed size buffer makes it massively faster. I'll prep a patch asap.
Actually, it is also important to have a bigger buffer. Using a 1 MB buffer makes all the difference in throughput.

Regards,
Daniel
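(To illustrate the two findings above with a sketch - made-up helper names, not the actual virt-ssh-helper code: the slow pattern re-sizes its I/O buffer on every pass, while the fast one allocates a single large buffer once and reuses it.)

#include <stdlib.h>
#include <unistd.h>

/* Write out exactly n bytes, looping over short writes. */
static int write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Slow pattern: the buffer is grown before every read and shrunk back
 * after every write, so each chunk moved costs two realloc() calls. */
static int copy_slow(int infd, int outfd)
{
    char *buf = NULL;
    ssize_t got = -1;

    for (;;) {
        char *tmp = realloc(buf, 4096);      /* grow for the next read */
        if (!tmp) { got = -1; break; }
        buf = tmp;
        got = read(infd, buf, 4096);
        if (got <= 0)
            break;
        if (write_all(outfd, buf, (size_t)got) < 0) { got = -1; break; }
        tmp = realloc(buf, 1);               /* ...and shrink it back again */
        if (tmp)
            buf = tmp;
    }
    free(buf);
    return got < 0 ? -1 : 0;
}

/* Fast pattern: one fixed buffer allocated up front and reused; making it
 * large (1 MiB, as discussed above) also means far fewer read/write
 * round trips per MB transferred. */
static int copy_fast(int infd, int outfd)
{
    enum { BUF_SIZE = 1024 * 1024 };
    char *buf = malloc(BUF_SIZE);
    ssize_t got;

    if (!buf)
        return -1;
    while ((got = read(infd, buf, BUF_SIZE)) > 0) {
        if (write_all(outfd, buf, (size_t)got) < 0) { got = -1; break; }
    }
    free(buf);
    return got < 0 ? -1 : 0;
}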

On Wed, Nov 25, 2020 at 05:20:41PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 05:00:20PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:36:39PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:49:14PM +0100, Christian Ehrhardt wrote:
I found that the same vol-download vs 127.0.0.1 gives the same results. That in turn makes it easier to gather results as we only need one system.
Yep, that's useful, I'm able to reproduce this problem myself too now. Will do some local tests and report back...
Sigh, the problem is way too many reallocs, repeatedly growing and shrinking the buffer we use for I/O.
I guess we never noticed this awfulness in the virsh console code it was copied from, as the data volumes are lower.
Switching to a fixed size buffer makes it massively faster. I'll prep a patch asap.
Actually, it is also important to have a bigger buffer. Using a 1 MB buffer makes all the difference in throughput.
I've sent two patches to improve the performance; with them, vol-download through our ssh helper beats netcat in my testing. I've not had a chance to test with tunnelled migration yet though, so if you want to try the patches with your migration test scenario, that'd be useful confirmation.

Regards,
Daniel

On Wed, Nov 25, 2020 at 7:06 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Wed, Nov 25, 2020 at 05:20:41PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 05:00:20PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:36:39PM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 25, 2020 at 04:49:14PM +0100, Christian Ehrhardt wrote:
I found that the same vol-download vs 127.0.0.1 gives the same results. That in turn makes it easier to gather results as we only need one system.
Yep, that's useful, I'm able to reproduce this problem myself too now. Will do some local tests and report back...
Sigh, the problem is way too many reallocs, repeatedly growing and shrinking the buffer we use for I/O.
I guess we never noticed this awfulness in the virsh console code it was copied from, as the data volumes are lower.
Switching to a fixed size buffer makes it massively faster. I'll prep a patch asap.
Actually, it is also important to have a bigger buffer. Using a 1 MB buffer makes all the difference in throughput.
I've sent two patches to improve the performance; with them, vol-download through our ssh helper beats netcat in my testing. I've not had a chance to test with tunnelled migration yet though, so if you want to try the patches with your migration test scenario, that'd be useful confirmation.
Thank you Daniel, I've done extensive testing (all good) and replied with details and a Tested-by to the patch submission.
--
Christian Ehrhardt
Staff Engineer, Ubuntu Server
Canonical Ltd
Participants (3):
- Christian Ehrhardt
- Daniel P. Berrangé
- Peter Krempa