[libvirt] [PATCH 0/2] Fix hang during concurrent guest migrations

older
[libvirt] [RFC PATCH 0/3+] cleanup...

jdenemar＠redhat.com

12 May 2010 12 May '10

7:10 p.m.

From: Jiri Denemark <jdenemar@redhat.com> As described in https://bugzilla.redhat.com/show_bug.cgi?id=582278 libvirt may hang during concurrent P2P migration of several KVM guests. These two patches fix that for me. Hopefully the fix will be confirmed by the reporter. Jirka Jiri Denemark (2): Remove watches before calling REMOTE_PROC_CLOSE Fix monitor ref counting when adding event handle src/qemu/qemu_monitor.c | 11 ++++++++++- src/remote/remote_driver.c | 10 +++++----- 2 files changed, 15 insertions(+), 6 deletions(-)

Show replies by date

jdenemar＠redhat.com

12 May 12 May

7:10 p.m.

New subject: [libvirt] [PATCH 1/2] Remove watches before calling REMOTE_PROC_CLOSE

From: Jiri Denemark <jdenemar@redhat.com> First calling REMOTE_PROC_CLOSE and then removing watches might lead to a hang as HANGUP event can be triggered before the watches are actually removed but after virConnectPtr is already freed. As a result of that remoteDomainEventFired() would try to lock uninitialized mutex, which would hang for ever. --- src/remote/remote_driver.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-) diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index e4a68ad..990bfce 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -1418,11 +1418,6 @@ verify_certificate (virConnectPtr conn ATTRIBUTE_UNUSED, static int doRemoteClose (virConnectPtr conn, struct private_data *priv) { - if (call (conn, priv, 0, REMOTE_PROC_CLOSE, - (xdrproc_t) xdr_void, (char *) NULL, - (xdrproc_t) xdr_void, (char *) NULL) == -1) - return -1; - if (priv->eventFlushTimer >= 0) { /* Remove timeout */ virEventRemoveTimeout(priv->eventFlushTimer); @@ -1431,6 +1426,11 @@ doRemoteClose (virConnectPtr conn, struct private_data *priv) priv->watch = -1; } + if (call (conn, priv, 0, REMOTE_PROC_CLOSE, + (xdrproc_t) xdr_void, (char *) NULL, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + return -1; + /* Close socket. */ if (priv->uses_tls && priv->session) { gnutls_bye (priv->session, GNUTLS_SHUT_RDWR); -- 1.7.1

Daniel Veillard

8:40 p.m.

New subject: [libvirt] [PATCH 1/2] Remove watches before calling REMOTE_PROC_CLOSE

On Wed, May 12, 2010 at 12:10:18PM +0200, jdenemar@redhat.com wrote:

...

From: Jiri Denemark <jdenemar@redhat.com>

First calling REMOTE_PROC_CLOSE and then removing watches might lead to a hang as HANGUP event can be triggered before the watches are actually removed but after virConnectPtr is already freed. As a result of that remoteDomainEventFired() would try to lock uninitialized mutex, which would hang for ever. --- src/remote/remote_driver.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index e4a68ad..990bfce 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -1418,11 +1418,6 @@ verify_certificate (virConnectPtr conn ATTRIBUTE_UNUSED, static int doRemoteClose (virConnectPtr conn, struct private_data *priv) { - if (call (conn, priv, 0, REMOTE_PROC_CLOSE, - (xdrproc_t) xdr_void, (char *) NULL, - (xdrproc_t) xdr_void, (char *) NULL) == -1) - return -1; - if (priv->eventFlushTimer >= 0) { /* Remove timeout */ virEventRemoveTimeout(priv->eventFlushTimer); @@ -1431,6 +1426,11 @@ doRemoteClose (virConnectPtr conn, struct private_data *priv) priv->watch = -1; }

+ if (call (conn, priv, 0, REMOTE_PROC_CLOSE, + (xdrproc_t) xdr_void, (char *) NULL, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + return -1; + /* Close socket. */ if (priv->uses_tls && priv->session) { gnutls_bye (priv->session, GNUTLS_SHUT_RDWR);

Sounds reasonnable, ACK, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Chris Lalancette

10:23 p.m.

New subject: [libvirt] [PATCH 1/2] Remove watches before calling REMOTE_PROC_CLOSE

On 05/12/2010 06:10 AM, jdenemar@redhat.com wrote:

...

From: Jiri Denemark <jdenemar@redhat.com>

First calling REMOTE_PROC_CLOSE and then removing watches might lead to a hang as HANGUP event can be triggered before the watches are actually removed but after virConnectPtr is already freed. As a result of that remoteDomainEventFired() would try to lock uninitialized mutex, which would hang for ever. --- src/remote/remote_driver.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index e4a68ad..990bfce 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -1418,11 +1418,6 @@ verify_certificate (virConnectPtr conn ATTRIBUTE_UNUSED, static int doRemoteClose (virConnectPtr conn, struct private_data *priv) { - if (call (conn, priv, 0, REMOTE_PROC_CLOSE, - (xdrproc_t) xdr_void, (char *) NULL, - (xdrproc_t) xdr_void, (char *) NULL) == -1) - return -1; - if (priv->eventFlushTimer >= 0) { /* Remove timeout */ virEventRemoveTimeout(priv->eventFlushTimer); @@ -1431,6 +1426,11 @@ doRemoteClose (virConnectPtr conn, struct private_data *priv) priv->watch = -1; }

+ if (call (conn, priv, 0, REMOTE_PROC_CLOSE, + (xdrproc_t) xdr_void, (char *) NULL, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + return -1; + /* Close socket. */ if (priv->uses_tls && priv->session) { gnutls_bye (priv->session, GNUTLS_SHUT_RDWR);

Ah, nasty. Good catch. ACK -- Chris Lalancette

Daniel P. Berrange

10:38 p.m.

New subject: [libvirt] [PATCH 1/2] Remove watches before calling REMOTE_PROC_CLOSE

On Wed, May 12, 2010 at 12:10:18PM +0200, jdenemar@redhat.com wrote:

...

From: Jiri Denemark <jdenemar@redhat.com>

First calling REMOTE_PROC_CLOSE and then removing watches might lead to a hang as HANGUP event can be triggered before the watches are actually removed but after virConnectPtr is already freed. As a result of that remoteDomainEventFired() would try to lock uninitialized mutex, which would hang for ever. --- src/remote/remote_driver.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index e4a68ad..990bfce 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -1418,11 +1418,6 @@ verify_certificate (virConnectPtr conn ATTRIBUTE_UNUSED, static int doRemoteClose (virConnectPtr conn, struct private_data *priv) { - if (call (conn, priv, 0, REMOTE_PROC_CLOSE, - (xdrproc_t) xdr_void, (char *) NULL, - (xdrproc_t) xdr_void, (char *) NULL) == -1) - return -1; - if (priv->eventFlushTimer >= 0) { /* Remove timeout */ virEventRemoveTimeout(priv->eventFlushTimer); @@ -1431,6 +1426,11 @@ doRemoteClose (virConnectPtr conn, struct private_data *priv) priv->watch = -1; }

+ if (call (conn, priv, 0, REMOTE_PROC_CLOSE, + (xdrproc_t) xdr_void, (char *) NULL, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + return -1; + /* Close socket. */ if (priv->uses_tls && priv->session) { gnutls_bye (priv->session, GNUTLS_SHUT_RDWR);

ACK, this will prevent us seeing the HANGUP condition. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

jdenemar＠redhat.com

7:10 p.m.

New subject: [libvirt] [PATCH 2/2] Fix monitor ref counting when adding event handle

From: Jiri Denemark <jdenemar@redhat.com> When closing a monitor using qemuMonitorClose(), we are aware of the possibility the monitor is still being used somewhere: /* NB: ordinarily one might immediately set mon->watch to -1 * and mon->fd to -1, but there may be a callback active * that is still relying on these fields being valid. So * we merely close them, but not clear their values and * use this explicit 'closed' flag to track this state */ but since we call virEventAddHandle() on that monitor without increasing its ref counter, the monitor is still freed which makes possible users of it quite unhappy. The unhappiness can lead to a hang if qemuMonitorIO tries to lock mutex which no longer exists. --- src/qemu/qemu_monitor.c | 11 ++++++++++- 1 files changed, 10 insertions(+), 1 deletions(-) diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index abf1338..7517e39 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -223,6 +223,14 @@ int qemuMonitorUnref(qemuMonitorPtr mon) return mon->refs; } +static void +qemuMonitorUnwatch(void *monitor) +{ + qemuMonitorPtr mon = monitor; + + qemuMonitorLock(mon); + qemuMonitorUnref(mon); +} static int qemuMonitorOpenUnix(const char *monitor) @@ -648,11 +656,12 @@ qemuMonitorOpen(virDomainObjPtr vm, VIR_EVENT_HANDLE_ERROR | VIR_EVENT_HANDLE_READABLE, qemuMonitorIO, - mon, NULL)) < 0) { + mon, qemuMonitorUnwatch)) < 0) { qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("unable to register monitor events")); goto cleanup; } + qemuMonitorRef(mon); VIR_DEBUG("New mon %p fd =%d watch=%d", mon, mon->fd, mon->watch); qemuMonitorUnlock(mon); -- 1.7.1

Daniel Veillard

8:45 p.m.

New subject: [libvirt] [PATCH 2/2] Fix monitor ref counting when adding event handle

On Wed, May 12, 2010 at 12:10:19PM +0200, jdenemar@redhat.com wrote:

...

From: Jiri Denemark <jdenemar@redhat.com>

When closing a monitor using qemuMonitorClose(), we are aware of the possibility the monitor is still being used somewhere:

/* NB: ordinarily one might immediately set mon->watch to -1 * and mon->fd to -1, but there may be a callback active * that is still relying on these fields being valid. So * we merely close them, but not clear their values and * use this explicit 'closed' flag to track this state */

but since we call virEventAddHandle() on that monitor without increasing its ref counter, the monitor is still freed which makes possible users of it quite unhappy. The unhappiness can lead to a hang if qemuMonitorIO tries to lock mutex which no longer exists. --- src/qemu/qemu_monitor.c | 11 ++++++++++- 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index abf1338..7517e39 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -223,6 +223,14 @@ int qemuMonitorUnref(qemuMonitorPtr mon) return mon->refs; }

+static void +qemuMonitorUnwatch(void *monitor) +{ + qemuMonitorPtr mon = monitor; + + qemuMonitorLock(mon); + qemuMonitorUnref(mon); +}

static int qemuMonitorOpenUnix(const char *monitor) @@ -648,11 +656,12 @@ qemuMonitorOpen(virDomainObjPtr vm, VIR_EVENT_HANDLE_ERROR | VIR_EVENT_HANDLE_READABLE, qemuMonitorIO, - mon, NULL)) < 0) { + mon, qemuMonitorUnwatch)) < 0) { qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("unable to register monitor events")); goto cleanup; } + qemuMonitorRef(mon);

VIR_DEBUG("New mon %p fd =%d watch=%d", mon, mon->fd, mon->watch); qemuMonitorUnlock(mon);

I was wondering if we should instead qemuMonitorRef() before calling virEventAddHandle(), but as long as we are in the function there is no risk I think so that's fine, ACK, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Jiri Denemark

10:45 p.m.

New subject: [libvirt] [PATCH 2/2] Fix monitor ref counting when adding event handle

...

...
@@ -648,11 +656,12 @@ qemuMonitorOpen(virDomainObjPtr vm, VIR_EVENT_HANDLE_ERROR | VIR_EVENT_HANDLE_READABLE, qemuMonitorIO, - mon, NULL)) < 0) { + mon, qemuMonitorUnwatch)) < 0) { qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("unable to register monitor events")); goto cleanup; } + qemuMonitorRef(mon);

VIR_DEBUG("New mon %p fd =%d watch=%d", mon, mon->fd, mon->watch); qemuMonitorUnlock(mon);

I was wondering if we should instead qemuMonitorRef() before calling virEventAddHandle(), but as long as we are in the function there is no risk I think so that's fine,

Yeah, the monitor is locked all the time so it doesn't really matter. And if called qemuMonitorRef() before virEventAddHandle(), we would have to unref the monitor in case of failure. That is, this version is one line shorter :-) Jirka

Chris Lalancette

10:27 p.m.

New subject: [libvirt] [PATCH 2/2] Fix monitor ref counting when adding event handle

On 05/12/2010 06:10 AM, jdenemar@redhat.com wrote:

...

From: Jiri Denemark <jdenemar@redhat.com>

When closing a monitor using qemuMonitorClose(), we are aware of the possibility the monitor is still being used somewhere:

/* NB: ordinarily one might immediately set mon->watch to -1 * and mon->fd to -1, but there may be a callback active * that is still relying on these fields being valid. So * we merely close them, but not clear their values and * use this explicit 'closed' flag to track this state */

but since we call virEventAddHandle() on that monitor without increasing its ref counter, the monitor is still freed which makes possible users of it quite unhappy. The unhappiness can lead to a hang if qemuMonitorIO tries to lock mutex which no longer exists. --- src/qemu/qemu_monitor.c | 11 ++++++++++- 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index abf1338..7517e39 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -223,6 +223,14 @@ int qemuMonitorUnref(qemuMonitorPtr mon) return mon->refs; }

+static void +qemuMonitorUnwatch(void *monitor) +{ + qemuMonitorPtr mon = monitor; + + qemuMonitorLock(mon); + qemuMonitorUnref(mon); +}

static int qemuMonitorOpenUnix(const char *monitor) @@ -648,11 +656,12 @@ qemuMonitorOpen(virDomainObjPtr vm, VIR_EVENT_HANDLE_ERROR | VIR_EVENT_HANDLE_READABLE, qemuMonitorIO, - mon, NULL)) < 0) { + mon, qemuMonitorUnwatch)) < 0) { qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("unable to register monitor events")); goto cleanup; } + qemuMonitorRef(mon);

VIR_DEBUG("New mon %p fd =%d watch=%d", mon, mon->fd, mon->watch); qemuMonitorUnlock(mon);

Also looks reasonable. ACK -- Chris Lalancette

Daniel P. Berrange

10:39 p.m.

New subject: [libvirt] [PATCH 2/2] Fix monitor ref counting when adding event handle

On Wed, May 12, 2010 at 12:10:19PM +0200, jdenemar@redhat.com wrote:

...

From: Jiri Denemark <jdenemar@redhat.com>

When closing a monitor using qemuMonitorClose(), we are aware of the possibility the monitor is still being used somewhere:

/* NB: ordinarily one might immediately set mon->watch to -1 * and mon->fd to -1, but there may be a callback active * that is still relying on these fields being valid. So * we merely close them, but not clear their values and * use this explicit 'closed' flag to track this state */

but since we call virEventAddHandle() on that monitor without increasing its ref counter, the monitor is still freed which makes possible users of it quite unhappy. The unhappiness can lead to a hang if qemuMonitorIO tries to lock mutex which no longer exists. --- src/qemu/qemu_monitor.c | 11 ++++++++++- 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index abf1338..7517e39 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -223,6 +223,14 @@ int qemuMonitorUnref(qemuMonitorPtr mon) return mon->refs; }

+static void +qemuMonitorUnwatch(void *monitor) +{ + qemuMonitorPtr mon = monitor; + + qemuMonitorLock(mon); + qemuMonitorUnref(mon); +}

static int qemuMonitorOpenUnix(const char *monitor) @@ -648,11 +656,12 @@ qemuMonitorOpen(virDomainObjPtr vm, VIR_EVENT_HANDLE_ERROR | VIR_EVENT_HANDLE_READABLE, qemuMonitorIO, - mon, NULL)) < 0) { + mon, qemuMonitorUnwatch)) < 0) { qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("unable to register monitor events")); goto cleanup; } + qemuMonitorRef(mon);

VIR_DEBUG("New mon %p fd =%d watch=%d", mon, mon->fd, mon->watch); qemuMonitorUnlock(mon);

ACK Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|