[libvirt] [RESEND] libvirt: xen: Shutdown detection fixes (RH-BZ#746007)

Resending this as there was no feedback overall and at least the first change fixes the following bug: https://bugzilla.redhat.com/show_bug.cgi?id=746007 --- The issue is easy to observe when using virt-manager as a frontend. Its main window seems to open connections for the up/down main display once and then only repeats DomainGetInfo on the same connection to check whether the instance is up or shut down. The initial open on the connection sets the domain id to -1 for inactive managed domains. And once running, the hypervisor driver returns the status. But after shutdown, the domain cannot be found by the hypervisor driver and the domain info is obtained by the xend driver. However, that expects the domain id to be -1 for inactive domains. This is not true (anymore). In testing I found that domain/status seems to be always present and better suited to check whether a domain is running or not (0 for shutdown and 2 for running, there are probably other values, too). Beside the change, which does fix this problem, the question is whether there should be a mechanism that resets domain->id to -1 when the instance is found to be shut down. And if that should happen, should it be done by any sub-driver or by the unified driver. The second problem I observed when I shut down an instance from the vnc console provided by virt-manager. As soon as the instance was down, the log produced internal errors about every second. Found this again, to be related to the unified driver approach. The call that causes the many error messages is GetVcpus. While the instance is up, the hypervisor driver returns those. But after shutdown, the domain cannot be found by that driver. The xend driver works, though. So what happens is that the hypervisor driver logs an internal error but the call succeeds because the xend driver returns successful. This generally opens the question whether a sub-driver should log any errors (at least not on error level, maybe debug)? The second patch only quiets down the one message I saw repeated that often. There are some other errors logged without any obvious problems interacting with the instances, but those are by far less often. -Stefan Please cc: me on replies as I am (yeah still) not subscribed to this m-l. Thanks.

xend_internal: Use domain/status for shutdown check On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that. One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running. BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626 Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) { + /* As far as I can see the domain->id is a bad sign for checking + * inactive domains as this is inaccurate after the domain has + * been running once. However domain/status from xend seems to + * be always present and 0 for inactive domains. */ state = VIR_DOMAIN_SHUTOFF; }

On 04/02/2012 10:38 AM, Stefan Bader wrote:
xend_internal: Use domain/status for shutdown check
On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that.
One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) {
Maybe this should be (domain->id < 0 || sexpr_int(root, ... Just to be sure we preserve behavior for older xen. Also, please send patches with [PATCH] prefix. Thanks, Cole
+ /* As far as I can see the domain->id is a bad sign for checking + * inactive domains as this is inaccurate after the domain has + * been running once. However domain/status from xend seems to + * be always present and 0 for inactive domains. */ state = VIR_DOMAIN_SHUTOFF; }

On 04/08/2012 03:08 PM, Cole Robinson wrote:
On 04/02/2012 10:38 AM, Stefan Bader wrote:
xend_internal: Use domain/status for shutdown check
On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that.
One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) {
Maybe this should be
(domain->id < 0 || sexpr_int(root, ...
It would not matter. Since the status is zero for all non-running domains it covers those with domain->id < 0 as well.
Just to be sure we preserve behavior for older xen.
Also, please send patches with [PATCH] prefix.
Will re-send this but first would like to resolve the comment you had. -Stefan
Thanks, Cole
+ /* As far as I can see the domain->id is a bad sign for checking + * inactive domains as this is inaccurate after the domain has + * been running once. However domain/status from xend seems to + * be always present and 0 for inactive domains. */ state = VIR_DOMAIN_SHUTOFF; }

On 04/10/2012 04:46 AM, Stefan Bader wrote:
On 04/08/2012 03:08 PM, Cole Robinson wrote:
On 04/02/2012 10:38 AM, Stefan Bader wrote:
xend_internal: Use domain/status for shutdown check
On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that.
One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) {
Maybe this should be
(domain->id < 0 || sexpr_int(root, ...
It would not matter. Since the status is zero for all non-running domains it covers those with domain->id < 0 as well.
Even for RHEL5 vintage xen? Since we historically try to maintain compatibility with that. It may well work, but unless it's tested I don't think there's much harm in keeping the id < 0 check to preserve old behavior. Thanks, Cole

On 10.04.2012 15:22, Cole Robinson wrote:
On 04/10/2012 04:46 AM, Stefan Bader wrote:
On 04/08/2012 03:08 PM, Cole Robinson wrote:
On 04/02/2012 10:38 AM, Stefan Bader wrote:
xend_internal: Use domain/status for shutdown check
On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that.
One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) {
Maybe this should be
(domain->id < 0 || sexpr_int(root, ...
It would not matter. Since the status is zero for all non-running domains it covers those with domain->id < 0 as well.
Even for RHEL5 vintage xen? Since we historically try to maintain compatibility with that. It may well work, but unless it's tested I don't think there's much harm in keeping the id < 0 check to preserve old behavior.
Thanks, Cole
I checked against CentOS5.5 (close enough). But right, it should not harm to have it. I re-submit the patch as soon as I have recovered my failed attempt to recover a raid failure... :/ -Stefan

On 04/10/2012 09:32 AM, Stefan Bader wrote:
On 10.04.2012 15:22, Cole Robinson wrote:
On 04/10/2012 04:46 AM, Stefan Bader wrote:
On 04/08/2012 03:08 PM, Cole Robinson wrote:
On 04/02/2012 10:38 AM, Stefan Bader wrote:
xend_internal: Use domain/status for shutdown check
On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that.
One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) {
Maybe this should be
(domain->id < 0 || sexpr_int(root, ...
It would not matter. Since the status is zero for all non-running domains it covers those with domain->id < 0 as well.
Even for RHEL5 vintage xen? Since we historically try to maintain compatibility with that. It may well work, but unless it's tested I don't think there's much harm in keeping the id < 0 check to preserve old behavior.
Thanks, Cole
I checked against CentOS5.5 (close enough). But right, it should not harm to have it. I re-submit the patch as soon as I have recovered my failed attempt to recover a raid failure... :/
If you checked against a centos5 host, I'm fine with that. So ACK to this patch. - Cole

On 10.04.2012 15:33, Cole Robinson wrote:
On 04/10/2012 09:32 AM, Stefan Bader wrote:
On 10.04.2012 15:22, Cole Robinson wrote:
On 04/10/2012 04:46 AM, Stefan Bader wrote:
On 04/08/2012 03:08 PM, Cole Robinson wrote:
On 04/02/2012 10:38 AM, Stefan Bader wrote:
xend_internal: Use domain/status for shutdown check
On newer xend (v3.x and after) there is no state and domid reported for inactive domains. When initially creating connections this is handled in various places by assigning domain->id = -1. But once an instance has been running, the id is set to the current domain id. And it does not change when the instance is shut down. So when querying the domain info, the hypervisor driver, which gets asked first will indicate it cannot find information, then the xend driver is asked and will set the status to NOSTATE because it checks for the -1 domain id. Checking domain/status for 0 seems to be more reliable for that.
One note: I am not sure whether the domain->id also should get set back to -1 whenever any sub-driver thinks the instance is no longer running.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=746007 BugLink: http://bugs.launchpad.net/bugs/929626
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xend_internal.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xend_internal.c 2011-12-04 08:15:00.000000000 +0100 +++ libvirt-0.9.8/src/xen/xend_internal.c 2012-03-23 11:07:43.575529377 +0100 @@ -989,9 +989,11 @@ state = VIR_DOMAIN_BLOCKED; else if (strchr(flags, 'r')) state = VIR_DOMAIN_RUNNING; - } else if (domain->id < 0) { - /* Inactive domains don't have a state reported, so - mark them SHUTOFF, rather than NOSTATE */ + } else if (sexpr_int(root, "domain/status") == 0) {
Maybe this should be
(domain->id < 0 || sexpr_int(root, ...
It would not matter. Since the status is zero for all non-running domains it covers those with domain->id < 0 as well.
Even for RHEL5 vintage xen? Since we historically try to maintain compatibility with that. It may well work, but unless it's tested I don't think there's much harm in keeping the id < 0 check to preserve old behavior.
Thanks, Cole
I checked against CentOS5.5 (close enough). But right, it should not harm to have it. I re-submit the patch as soon as I have recovered my failed attempt to recover a raid failure... :/
If you checked against a centos5 host, I'm fine with that. So ACK to this patch.
- Cole
Thinking about it I remember that my installation uses a Xen version from gitco (3.4) while the original one was 3.0 iirc. So better have that safety check in, just to be extra safe. -Stefan

xen-hypervisor: GetVcpus may not find a certain domain Observed on connections that have been running and then shut down. The hypervisor subdriver fills the log with internal errors while the xend driver actually can handle the query. This only handles the case which I observed after shutting down an instance via virt-manager and leaving its console window open. The same reasoning would probably be true for other internal errors as long as they potentially get recovered by other sub-drivers. BugLink: http://bugs.launchpad.net/bugs/963006 Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Index: libvirt-0.9.8/src/xen/xen_driver.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xen_driver.c 2011-12-02 04:59:50.000000000 +0100 +++ libvirt-0.9.8/src/xen/xen_driver.c 2012-03-23 11:31:53.982620050 +0100 @@ -1190,6 +1190,8 @@ if (ret > 0) return ret; } + + xenUnifiedError(VIR_ERR_NO_SUPPORT, __FUNCTION__); return -1; } Index: libvirt-0.9.8/src/xen/xen_hypervisor.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xen_hypervisor.c 2012-03-23 10:58:29.000000000 +0100 +++ libvirt-0.9.8/src/xen/xen_hypervisor.c 2012-03-23 11:26:50.741137585 +0100 @@ -3612,8 +3612,11 @@ &dominfo); if ((ret < 0) || (XEN_GETDOMAININFO_DOMAIN(dominfo) != domain->id)) { - virXenErrorFunc(VIR_ERR_INTERNAL_ERROR, __FUNCTION__, - _("cannot get domain details"), 0); + /* This can happen if an instance is just shut down. It is probably + * better to leave the shouting to the unified caller. + * virXenErrorFunc(VIR_ERR_INTERNAL_ERROR, __FUNCTION__, + * _("cannot get domain details"), 0); + */ return (-1); } nbinfo = XEN_GETDOMAININFO_CPUCOUNT(dominfo) + 1;

On 04/02/2012 10:38 AM, Stefan Bader wrote:
xen-hypervisor: GetVcpus may not find a certain domain
Observed on connections that have been running and then shut down. The hypervisor subdriver fills the log with internal errors while the xend driver actually can handle the query.
This only handles the case which I observed after shutting down an instance via virt-manager and leaving its console window open. The same reasoning would probably be true for other internal errors as long as they potentially get recovered by other sub-drivers.
BugLink: http://bugs.launchpad.net/bugs/963006
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xen_driver.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xen_driver.c 2011-12-02 04:59:50.000000000 +0100 +++ libvirt-0.9.8/src/xen/xen_driver.c 2012-03-23 11:31:53.982620050 +0100 @@ -1190,6 +1190,8 @@ if (ret > 0) return ret; } + + xenUnifiedError(VIR_ERR_NO_SUPPORT, __FUNCTION__); return -1; }
I don't think this piece is correct, since this will overwrite any error raised by sub drivers. It should just be dropped.
Index: libvirt-0.9.8/src/xen/xen_hypervisor.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xen_hypervisor.c 2012-03-23 10:58:29.000000000 +0100 +++ libvirt-0.9.8/src/xen/xen_hypervisor.c 2012-03-23 11:26:50.741137585 +0100 @@ -3612,8 +3612,11 @@ &dominfo);
if ((ret < 0) || (XEN_GETDOMAININFO_DOMAIN(dominfo) != domain->id)) { - virXenErrorFunc(VIR_ERR_INTERNAL_ERROR, __FUNCTION__, - _("cannot get domain details"), 0); + /* This can happen if an instance is just shut down. It is probably + * better to leave the shouting to the unified caller. + * virXenErrorFunc(VIR_ERR_INTERNAL_ERROR, __FUNCTION__, + * _("cannot get domain details"), 0); + */ return (-1); }
This is kind of a hack. The whole xen driver arch is pretty weird though so error handling is just strange as it is. But isn't this all a side effect of the bug that your previous patch fixes? virt-manager won't call GetVCPUs on an inactive VM, it's only doing so in this case because libvirt and xen get out of sync. So I think fixing the previous bug should stop the error log spew. Thanks, Cole

On 04/08/2012 03:19 PM, Cole Robinson wrote:
On 04/02/2012 10:38 AM, Stefan Bader wrote:
xen-hypervisor: GetVcpus may not find a certain domain
Observed on connections that have been running and then shut down. The hypervisor subdriver fills the log with internal errors while the xend driver actually can handle the query.
This only handles the case which I observed after shutting down an instance via virt-manager and leaving its console window open. The same reasoning would probably be true for other internal errors as long as they potentially get recovered by other sub-drivers.
BugLink: http://bugs.launchpad.net/bugs/963006
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Index: libvirt-0.9.8/src/xen/xen_driver.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xen_driver.c 2011-12-02 04:59:50.000000000 +0100 +++ libvirt-0.9.8/src/xen/xen_driver.c 2012-03-23 11:31:53.982620050 +0100 @@ -1190,6 +1190,8 @@ if (ret > 0) return ret; } + + xenUnifiedError(VIR_ERR_NO_SUPPORT, __FUNCTION__); return -1; }
I don't think this piece is correct, since this will overwrite any error raised by sub drivers. It should just be dropped.
Index: libvirt-0.9.8/src/xen/xen_hypervisor.c =================================================================== --- libvirt-0.9.8.orig/src/xen/xen_hypervisor.c 2012-03-23 10:58:29.000000000 +0100 +++ libvirt-0.9.8/src/xen/xen_hypervisor.c 2012-03-23 11:26:50.741137585 +0100 @@ -3612,8 +3612,11 @@ &dominfo);
if ((ret < 0) || (XEN_GETDOMAININFO_DOMAIN(dominfo) != domain->id)) { - virXenErrorFunc(VIR_ERR_INTERNAL_ERROR, __FUNCTION__, - _("cannot get domain details"), 0); + /* This can happen if an instance is just shut down. It is probably + * better to leave the shouting to the unified caller. + * virXenErrorFunc(VIR_ERR_INTERNAL_ERROR, __FUNCTION__, + * _("cannot get domain details"), 0); + */ return (-1); }
This is kind of a hack. The whole xen driver arch is pretty weird though so error handling is just strange as it is.
But isn't this all a side effect of the bug that your previous patch fixes? virt-manager won't call GetVCPUs on an inactive VM, it's only doing so in this case because libvirt and xen get out of sync. So I think fixing the previous bug should stop the error log spew.
I know it is definitely a hack. But it really happened _after_ the other fix was applied. But only when the terminal/vnc viewer window is open. Something in vit-manager must be using GetVCPUs even after shutdown. But I agree that the architecture is weird the first place. Having multiple subdrivers called in sequence it seems a general problem to have any errors reported in one. The next one can succeed and then its not an error. At least in my understanding. -Stefan
Thanks, Cole
participants (2)
-
Cole Robinson
-
Stefan Bader