[libvirt] [PATCH] virsh: use keepalive protocol in virsh for migration

Bugzilla:https://bugzilla.redhat.com/show_bug.cgi?id=822839 For non-p2p migration, if the network is down in the process of migrate, The virsh client will hang up for a fair long time. The patch will add keepalive into virsh to determine the status of network connection with remote libvirtd, aboring migration job after 30 seconds later since disconnection. --- tools/virsh.c | 29 +++++++++++++++++++++++------ 1 files changed, 23 insertions(+), 6 deletions(-) diff --git a/tools/virsh.c b/tools/virsh.c index 5226bd8..9099328 100644 --- a/tools/virsh.c +++ b/tools/virsh.c @@ -414,13 +414,14 @@ typedef struct __vshCtrlData { vshControl *ctl; const vshCmd *cmd; int writefd; + virConnectPtr dconn; } vshCtrlData; typedef void (*jobWatchTimeoutFunc) (vshControl *ctl, virDomainPtr dom, void *opaque); static bool -vshWatchJob(vshControl *ctl, +vshWatchJob(vshCtrlData *data, virDomainPtr dom, bool verbose, int pipe_fd, @@ -3277,6 +3278,7 @@ cmdSave(vshControl *ctl, const vshCmd *cmd) data.ctl = ctl; data.cmd = cmd; data.writefd = p[1]; + data.dconn = NULL; if (virThreadCreate(&workerThread, true, @@ -3284,7 +3286,7 @@ cmdSave(vshControl *ctl, const vshCmd *cmd) &data) < 0) goto cleanup; - ret = vshWatchJob(ctl, dom, verbose, p[0], 0, NULL, NULL, _("Save")); + ret = vshWatchJob(&data, dom, verbose, p[0], 0, NULL, NULL, _("Save")); virThreadJoin(&workerThread); @@ -3584,6 +3586,7 @@ cmdManagedSave(vshControl *ctl, const vshCmd *cmd) data.ctl = ctl; data.cmd = cmd; data.writefd = p[1]; + data.dconn = NULL; if (virThreadCreate(&workerThread, true, @@ -3591,7 +3594,7 @@ cmdManagedSave(vshControl *ctl, const vshCmd *cmd) &data) < 0) goto cleanup; - ret = vshWatchJob(ctl, dom, verbose, p[0], 0, + ret = vshWatchJob(&data, dom, verbose, p[0], 0, NULL, NULL, _("Managedsave")); virThreadJoin(&workerThread); @@ -4062,6 +4065,7 @@ cmdDump(vshControl *ctl, const vshCmd *cmd) data.ctl = ctl; data.cmd = cmd; data.writefd = p[1]; + data.dconn = NULL; if (virThreadCreate(&workerThread, true, @@ -4069,7 +4073,7 @@ cmdDump(vshControl *ctl, const vshCmd *cmd) &data) < 0) goto cleanup; - ret = vshWatchJob(ctl, dom, verbose, p[0], 0, NULL, NULL, _("Dump")); + ret = vshWatchJob(&data, dom, verbose, p[0], 0, NULL, NULL, _("Dump")); virThreadJoin(&workerThread); @@ -7189,6 +7193,10 @@ doMigrate (void *opaque) dconn = virConnectOpenAuth (desturi, virConnectAuthPtrDefault, 0); if (!dconn) goto out; + data->dconn = dconn; + if (virConnectSetKeepAlive(dconn, 5, 5) < 0) + vshDebug(ctl, VSH_ERR_WARNING, "migrate: Failed to start keepalive\n"); + ddom = virDomainMigrate2(dom, dconn, xml, flags, dname, migrateuri, 0); if (ddom) { virDomainFree(ddom); @@ -7244,7 +7252,7 @@ vshMigrationTimeout(vshControl *ctl, } static bool -vshWatchJob(vshControl *ctl, +vshWatchJob(vshCtrlData *data, virDomainPtr dom, bool verbose, int pipe_fd, @@ -7262,6 +7270,7 @@ vshWatchJob(vshControl *ctl, char retchar; bool functionReturn = false; sigset_t sigmask, oldsigmask; + vshControl *ctl = data->ctl; sigemptyset(&sigmask); sigaddset(&sigmask, SIGINT); @@ -7305,6 +7314,13 @@ repoll: goto cleanup; } + if (data->dconn && virConnectIsAlive(data->dconn) <= 0) { + virDomainAbortJob(dom); + vshError(ctl, "%s", + _("Lost connection to destination host")); + goto cleanup; + } + GETTIMEOFDAY(&curr); if (timeout && (((int)(curr.tv_sec - start.tv_sec) * 1000 + (int)(curr.tv_usec - start.tv_usec) / 1000) > @@ -7378,13 +7394,14 @@ cmdMigrate(vshControl *ctl, const vshCmd *cmd) data.ctl = ctl; data.cmd = cmd; data.writefd = p[1]; + data.dconn = NULL; if (virThreadCreate(&workerThread, true, doMigrate, &data) < 0) goto cleanup; - functionReturn = vshWatchJob(ctl, dom, verbose, p[0], timeout, + functionReturn = vshWatchJob(&data, dom, verbose, p[0], timeout, vshMigrationTimeout, NULL, _("Migration")); virThreadJoin(&workerThread); -- 1.7.7.5

On Mon, Jun 04, 2012 at 21:51:02 +0800, Guannan Ren wrote:
Bugzilla:https://bugzilla.redhat.com/show_bug.cgi?id=822839 For non-p2p migration, if the network is down in the process of migrate, The virsh client will hang up for a fair long time. The patch will add keepalive into virsh to determine the status of network connection with remote libvirtd, aboring migration job after 30 seconds later since disconnection.
This turns on keepalive only for the connection to destination daemon during migrations. I think we want to use keepalive for every connection open by virsh, i.e., even for the main one. And even more importantly, we don't want to hardcode keepalive parameters, users should be able to set them using virsh command line options. Jirka

On 06/04/2012 10:13 PM, Jiri Denemark wrote:
On Mon, Jun 04, 2012 at 21:51:02 +0800, Guannan Ren wrote:
Bugzilla:https://bugzilla.redhat.com/show_bug.cgi?id=822839 For non-p2p migration, if the network is down in the process of migrate, The virsh client will hang up for a fair long time. The patch will add keepalive into virsh to determine the status of network connection with remote libvirtd, aboring migration job after 30 seconds later since disconnection. This turns on keepalive only for the connection to destination daemon during migrations. I think we want to use keepalive for every connection open by virsh, i.e., even for the main one. And even more importantly, we don't want to hardcode keepalive parameters, users should be able to set them using virsh command line options.
Jirka
I consider the keepalive parameters as one of the options to migrate command. There are two reasons I choose hardcode value. one is that the newly-added option works only for non-p2p migrate. two is that based on my testing, if the network disconnection time exceeds 30 seconds. the migration will fail even though we recover the network service. I think 30 seconds is enough amount of time. I could add command line options if we need it. the above is only my thoughts. Guannan Ren

On Mon, Jun 04, 2012 at 22:41:00 +0800, Guannan Ren wrote:
On 06/04/2012 10:13 PM, Jiri Denemark wrote:
On Mon, Jun 04, 2012 at 21:51:02 +0800, Guannan Ren wrote:
Bugzilla:https://bugzilla.redhat.com/show_bug.cgi?id=822839 For non-p2p migration, if the network is down in the process of migrate, The virsh client will hang up for a fair long time. The patch will add keepalive into virsh to determine the status of network connection with remote libvirtd, aboring migration job after 30 seconds later since disconnection. This turns on keepalive only for the connection to destination daemon during migrations. I think we want to use keepalive for every connection open by virsh, i.e., even for the main one. And even more importantly, we don't want to hardcode keepalive parameters, users should be able to set them using virsh command line options.
Jirka
I consider the keepalive parameters as one of the options to migrate command. There are two reasons I choose hardcode value. one is that the newly-added option works only for non-p2p migrate.
I think they should be general virsh options similar to -c and they should be applied to any connection virsh makes. Applying them only to migration doesn't make much sense. Any libvirt daemon can hang or lose network connection, it's not in any way to daemons that serve as destination during non-p2p migrations.
two is that based on my testing, if the network disconnection time exceeds 30 seconds. the migration will fail even though we recover the network service. I think 30 seconds is enough amount of time.
Yes, it can be a good default value. Did you try increasing the keepalive timeout on destination deamon (/etc/libvirt/libvirtd.conf) too? Anyway, as with any timeout, no matter what value we choose someone will need to increase it :-) Jirka
participants (2)
-
Guannan Ren
-
Jiri Denemark