[libvirt] [PATCH 0/2] don't kill qemu process on restart if networkNotify fails

The individual patches explain themselves. Laine Stump (2): qemu: don't kill qemu process on restart if networkNotify fails network: better log message when network is inactive during reconnect src/network/bridge_driver.c | 7 +++++++ src/qemu/qemu_process.c | 9 +++------ 2 files changed, 10 insertions(+), 6 deletions(-) -- 2.9.3

Nothing that could happen during networkNotifyActualDevice() could justify unceremoniously killing the qemu process, but that's what we were doing. In particular, new code added in commit 85bcc022 (first appearred in libvirt-3.2.0) attempts to reattach tap devices to their assigned bridge devices when libvirtd restarts (to make it easier to recover from a restart of a libvirt network). But if the network has been stopped and *not* restarted, the bridge device won't exist and networkNotifyActualDevice() will fail. This patch changes qemuProcessNotifyNets() to return void, so that qemuProcessReconnect() will soldier on regardless of what happens (any errors will still be logged, but we won't terminate the guest). Partially resolves: https://bugzilla.redhat.com/1442700 --- NB: there are many other things that might fail during reconnection of a qemu process that shouldn't lead to killing the qemu process. qemuProcessReconnect() could use a good audit to change what happens when certain items fail. src/qemu/qemu_process.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 53170d7..fcb5763 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2825,7 +2825,7 @@ int qemuProcessStopCPUs(virQEMUDriverPtr driver, -static int +static void qemuProcessNotifyNets(virDomainDefPtr def) { size_t i; @@ -2840,10 +2840,8 @@ qemuProcessNotifyNets(virDomainDefPtr def) if (virDomainNetGetActualType(net) == VIR_DOMAIN_NET_TYPE_DIRECT) ignore_value(virNetDevMacVLanReserveName(net->ifname, false)); - if (networkNotifyActualDevice(def, net) < 0) - return -1; + networkNotifyActualDevice(def, net); } - return 0; } static int @@ -3480,8 +3478,7 @@ qemuProcessReconnect(void *opaque) if (qemuSecurityReserveLabel(driver->securityManager, obj->def, obj->pid) < 0) goto error; - if (qemuProcessNotifyNets(obj->def) < 0) - goto error; + qemuProcessNotifyNets(obj->def); if (qemuProcessFiltersInstantiate(obj->def)) goto error; -- 2.9.3

On 04/25/2017 12:33 PM, Laine Stump wrote:
Nothing that could happen during networkNotifyActualDevice() could justify unceremoniously killing the qemu process, but that's what we were doing.
So that galaxy in far far away land circa commit id '04711a0f' when all this was added -
In particular, new code added in commit 85bcc022 (first appearred in
appeared
libvirt-3.2.0) attempts to reattach tap devices to their assigned bridge devices when libvirtd restarts (to make it easier to recover from a restart of a libvirt network). But if the network has been stopped and *not* restarted, the bridge device won't exist and networkNotifyActualDevice() will fail.
This patch changes qemuProcessNotifyNets() to return void, so that qemuProcessReconnect() will soldier on regardless of what happens (any errors will still be logged, but we won't terminate the guest).
Partially resolves: https://bugzilla.redhat.com/1442700 ---
NB: there are many other things that might fail during reconnection of a qemu process that shouldn't lead to killing the qemu process. qemuProcessReconnect() could use a good audit to change what happens when certain items fail.
Nose game, not me ;-)
src/qemu/qemu_process.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 53170d7..fcb5763 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2825,7 +2825,7 @@ int qemuProcessStopCPUs(virQEMUDriverPtr driver,
-static int +static void qemuProcessNotifyNets(virDomainDefPtr def) { size_t i; @@ -2840,10 +2840,8 @@ qemuProcessNotifyNets(virDomainDefPtr def) if (virDomainNetGetActualType(net) == VIR_DOMAIN_NET_TYPE_DIRECT) ignore_value(virNetDevMacVLanReserveName(net->ifname, false));
- if (networkNotifyActualDevice(def, net) < 0) - return -1; + networkNotifyActualDevice(def, net);
Seems like networkNotifyActualDevice should then return a void too? And have it's comments/content changed... IIUC thought, essentially if the network goes away, the domain isn't necessarily shut down - it gets notified somehow that the network isn't there any more and soldiers on. So from that aspect I can certain agree that just because we discover an error when libvirtd restarts doesn't mean we should now kill the domain. I think the concept behind the patch is fine - it's the "how much more" should be done... (and yes with the eye on keeping backports to a minimum, yep understood). John
} - return 0; }
static int @@ -3480,8 +3478,7 @@ qemuProcessReconnect(void *opaque) if (qemuSecurityReserveLabel(driver->securityManager, obj->def, obj->pid) < 0) goto error;
- if (qemuProcessNotifyNets(obj->def) < 0) - goto error; + qemuProcessNotifyNets(obj->def);
if (qemuProcessFiltersInstantiate(obj->def)) goto error;

On 04/26/2017 02:39 PM, John Ferlan wrote:
On 04/25/2017 12:33 PM, Laine Stump wrote:
Nothing that could happen during networkNotifyActualDevice() could justify unceremoniously killing the qemu process, but that's what we were doing.
So that galaxy in far far away land circa commit id '04711a0f' when all this was added -
Sure, in 2011. I think the statute of limitations has passed on that though (actually I'm surprised there's still any evidence of it in the git blame, what with all the code reorganization etc going on - that tends to obscure meaningful/functional changes...)
In particular, new code added in commit 85bcc022 (first appearred in
appeared
libvirt-3.2.0) attempts to reattach tap devices to their assigned bridge devices when libvirtd restarts (to make it easier to recover from a restart of a libvirt network). But if the network has been stopped and *not* restarted, the bridge device won't exist and networkNotifyActualDevice() will fail.
This patch changes qemuProcessNotifyNets() to return void, so that qemuProcessReconnect() will soldier on regardless of what happens (any errors will still be logged, but we won't terminate the guest).
Partially resolves: https://bugzilla.redhat.com/1442700 ---
NB: there are many other things that might fail during reconnection of a qemu process that shouldn't lead to killing the qemu process. qemuProcessReconnect() could use a good audit to change what happens when certain items fail.
Nose game, not me ;-)
Nor me. I've done my part by pointing it out :-P
src/qemu/qemu_process.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 53170d7..fcb5763 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2825,7 +2825,7 @@ int qemuProcessStopCPUs(virQEMUDriverPtr driver,
-static int +static void qemuProcessNotifyNets(virDomainDefPtr def) { size_t i; @@ -2840,10 +2840,8 @@ qemuProcessNotifyNets(virDomainDefPtr def) if (virDomainNetGetActualType(net) == VIR_DOMAIN_NET_TYPE_DIRECT) ignore_value(virNetDevMacVLanReserveName(net->ifname, false));
- if (networkNotifyActualDevice(def, net) < 0) - return -1; + networkNotifyActualDevice(def, net);
Seems like networkNotifyActualDevice should then return a void too? And have it's comments/content changed...
That depends. If we go through the entire call stack and obliterate the return codes, then in the future if we decide there is something that we really do want to fail on, there will be more stuff to "un-undo". Also, I'm not as comfortable obliterating the return code of a function that called from another file. That's partially just superstition though.
IIUC thought, essentially if the network goes away, the domain isn't necessarily shut down - it gets notified somehow that the network isn't there any more
Yeah, it's "notified" mainly by the fact that it's no longer receiving any traffic. We *might* want to think about setting the guest's interface ~IFF_UP, but who knows what unwanted side effects that could have in the case of someone just restarting a network.
and soldiers on. So from that aspect I can certain agree that just because we discover an error when libvirtd restarts doesn't mean we should now kill the domain.
Yeah, it's*very* passive-agressive (nay, agressive-agressive) to kill the entire guest just because one little part of it has stopped working. Much nicer to leave it running so that some amount of sane shutdown might be performed even if the non-working piece can't be restored.
I think the concept behind the patch is fine - it's the "how much more" should be done... (and yes with the eye on keeping backports to a minimum, yep understood).
I guess I can also nuke the return value from networkNotiyActualDevice(). I'll redo it that way and resend.
John
} - return 0; }
static int @@ -3480,8 +3478,7 @@ qemuProcessReconnect(void *opaque) if (qemuSecurityReserveLabel(driver->securityManager, obj->def, obj->pid) < 0) goto error;
- if (qemuProcessNotifyNets(obj->def) < 0) - goto error; + qemuProcessNotifyNets(obj->def);
if (qemuProcessFiltersInstantiate(obj->def)) goto error;

If the network isn't active during networkNotifyActualDevice(), we would log an error message stating that the bridge device didn't exist. This patch adds a check to see if the network is active, making the logs more useful in the case that it isn't. Partially resolves: https://bugzilla.redhat.com/1442700 --- src/network/bridge_driver.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index d2d8557..e06f81b 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -4676,6 +4676,13 @@ networkNotifyActualDevice(virDomainDefPtr dom, } netdef = network->def; + if (!virNetworkObjIsActive(network)) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("network '%s' is not active"), + netdef->name); + goto error; + } + /* if we're restarting libvirtd after an upgrade from a version * that didn't save bridge name in actualNetDef for * actualType==network, we need to copy it in so that it will be -- 2.9.3

On 04/25/2017 12:34 PM, Laine Stump wrote:
If the network isn't active during networkNotifyActualDevice(), we would log an error message stating that the bridge device didn't exist. This patch adds a check to see if the network is active, making the logs more useful in the case that it isn't.
Partially resolves: https://bugzilla.redhat.com/1442700 --- src/network/bridge_driver.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index d2d8557..e06f81b 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -4676,6 +4676,13 @@ networkNotifyActualDevice(virDomainDefPtr dom, } netdef = network->def;
+ if (!virNetworkObjIsActive(network)) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("network '%s' is not active"), + netdef->name); + goto error; + } +
/me wonders whether this should just a goto cleanup - IOW: if the network isn't active, so what, why error. Once someone attempts to start it, they'll get errors I assume... Not that goto error or cleanup matters since commit id '4fee4e0' changed the goto cleanup to goto error and added: + +error: + goto cleanup; I guess I don't have the answer readily available in my head as to how much of the subsequent code would be called at network start time? John
/* if we're restarting libvirtd after an upgrade from a version * that didn't save bridge name in actualNetDef for * actualType==network, we need to copy it in so that it will be

On 04/26/2017 02:39 PM, John Ferlan wrote:
On 04/25/2017 12:34 PM, Laine Stump wrote:
If the network isn't active during networkNotifyActualDevice(), we would log an error message stating that the bridge device didn't exist. This patch adds a check to see if the network is active, making the logs more useful in the case that it isn't.
Partially resolves: https://bugzilla.redhat.com/1442700 --- src/network/bridge_driver.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index d2d8557..e06f81b 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -4676,6 +4676,13 @@ networkNotifyActualDevice(virDomainDefPtr dom, } netdef = network->def;
+ if (!virNetworkObjIsActive(network)) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("network '%s' is not active"), + netdef->name); + goto error; + } +
/me wonders whether this should just a goto cleanup - IOW: if the network isn't active, so what, why error. Once someone attempts to start it, they'll get errors I assume...
Not that goto error or cleanup matters since commit id '4fee4e0' changed the goto cleanup to goto error and added:
+ +error: + goto cleanup;
That's missing the point of that commit. "cleanup:" is there only as a place for error: to goto the cleanup code that is common to both the "success:" exit and the "error:" exit (and the three labels are all there so that that this function is structured similarly to networkAllocateActualDevice() - I wanted them to be as close to the same as possible). In the body of the function you either declare the allocate/notification a success (with "goto success") in which case the connect count for the network goes up, or you declare it a failure (with "goto error") in which case the connect count for the network remains unchanged, but you never directly goto the noncommittal "cleanup:".
I guess I don't have the answer readily available in my head as to how much of the subsequent code would be called at network start time?
None of this is called when the network is started - we have no way for the networkStart() function to learn which interfaces in which domains are supposed to be connected to a particular network. That needs more infrastructure change than I wanted to put in right now (we need some sort of event or callback that will allow the network driver to notify all active hypervisor drivers whenever a network is started (or destroyed?), giving the hypervisor driver a chance to call networkNotifyActualDevice() for any affected domain interface). So have I explained myself well enough?
John
/* if we're restarting libvirtd after an upgrade from a version * that didn't save bridge name in actualNetDef for * actualType==network, we need to copy it in so that it will be

On 04/26/2017 05:57 PM, Laine Stump wrote:
On 04/26/2017 02:39 PM, John Ferlan wrote:
On 04/25/2017 12:34 PM, Laine Stump wrote:
If the network isn't active during networkNotifyActualDevice(), we would log an error message stating that the bridge device didn't exist. This patch adds a check to see if the network is active, making the logs more useful in the case that it isn't.
Partially resolves: https://bugzilla.redhat.com/1442700 --- src/network/bridge_driver.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index d2d8557..e06f81b 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -4676,6 +4676,13 @@ networkNotifyActualDevice(virDomainDefPtr dom, } netdef = network->def;
+ if (!virNetworkObjIsActive(network)) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("network '%s' is not active"), + netdef->name); + goto error; + } +
/me wonders whether this should just a goto cleanup - IOW: if the network isn't active, so what, why error. Once someone attempts to start it, they'll get errors I assume...
Not that goto error or cleanup matters since commit id '4fee4e0' changed the goto cleanup to goto error and added:
+ +error: + goto cleanup;
That's missing the point of that commit. "cleanup:" is there only as a place for error: to goto the cleanup code that is common to both the "success:" exit and the "error:" exit (and the three labels are all there so that that this function is structured similarly to networkAllocateActualDevice() - I wanted them to be as close to the same as possible). In the body of the function you either declare the allocate/notification a success (with "goto success") in which case the connect count for the network goes up, or you declare it a failure (with "goto error") in which case the connect count for the network remains unchanged, but you never directly goto the noncommittal "cleanup:".
OK understood - it's just one of those odd looking constructs to have a goto error and goto cleanup immediately following it, but your reasoning for having the logic this way is understandable.
I guess I don't have the answer readily available in my head as to how much of the subsequent code would be called at network start time?
None of this is called when the network is started - we have no way for the networkStart() function to learn which interfaces in which domains are supposed to be connected to a particular network. That needs more infrastructure change than I wanted to put in right now (we need some sort of event or callback that will allow the network driver to notify all active hypervisor drivers whenever a network is started (or destroyed?), giving the hypervisor driver a chance to call networkNotifyActualDevice() for any affected domain interface).
So have I explained myself well enough?
Sure, but for my context... I guess I was putting myself in the mindset of a libvirtd reconnect where "who knows" what was done with each network prior to reconnection and this is the discovery of that. For some paths it's a hard death issue - the physical connection had problems or went away. For other paths it's more of a soft or state issue - someone stopped/destroyed, but didn't undefine a network. I was thinking more that in this latter case, does an error make sense? John
John
/* if we're restarting libvirtd after an upgrade from a version * that didn't save bridge name in actualNetDef for * actualType==network, we need to copy it in so that it will be

On 04/27/2017 09:06 AM, John Ferlan wrote:
On 04/26/2017 05:57 PM, Laine Stump wrote:
On 04/26/2017 02:39 PM, John Ferlan wrote:
On 04/25/2017 12:34 PM, Laine Stump wrote:
If the network isn't active during networkNotifyActualDevice(), we would log an error message stating that the bridge device didn't exist. This patch adds a check to see if the network is active, making the logs more useful in the case that it isn't.
Partially resolves: https://bugzilla.redhat.com/1442700 --- src/network/bridge_driver.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index d2d8557..e06f81b 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -4676,6 +4676,13 @@ networkNotifyActualDevice(virDomainDefPtr dom, } netdef = network->def;
+ if (!virNetworkObjIsActive(network)) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("network '%s' is not active"), + netdef->name); + goto error; + } +
/me wonders whether this should just a goto cleanup - IOW: if the network isn't active, so what, why error. Once someone attempts to start it, they'll get errors I assume...
Not that goto error or cleanup matters since commit id '4fee4e0' changed the goto cleanup to goto error and added:
+ +error: + goto cleanup;
That's missing the point of that commit. "cleanup:" is there only as a place for error: to goto the cleanup code that is common to both the "success:" exit and the "error:" exit (and the three labels are all there so that that this function is structured similarly to networkAllocateActualDevice() - I wanted them to be as close to the same as possible). In the body of the function you either declare the allocate/notification a success (with "goto success") in which case the connect count for the network goes up, or you declare it a failure (with "goto error") in which case the connect count for the network remains unchanged, but you never directly goto the noncommittal "cleanup:".
OK understood - it's just one of those odd looking constructs to have a goto error and goto cleanup immediately following it, but your reasoning for having the logic this way is understandable.
I guess I don't have the answer readily available in my head as to how much of the subsequent code would be called at network start time?
None of this is called when the network is started - we have no way for the networkStart() function to learn which interfaces in which domains are supposed to be connected to a particular network. That needs more infrastructure change than I wanted to put in right now (we need some sort of event or callback that will allow the network driver to notify all active hypervisor drivers whenever a network is started (or destroyed?), giving the hypervisor driver a chance to call networkNotifyActualDevice() for any affected domain interface).
So have I explained myself well enough?
Sure, but for my context...
I guess I was putting myself in the mindset of a libvirtd reconnect where "who knows" what was done with each network prior to reconnection and this is the discovery of that. For some paths it's a hard death issue - the physical connection had problems or went away. For other paths it's more of a soft or state issue - someone stopped/destroyed, but didn't undefine a network. I was thinking more that in this latter case, does an error make sense?
I think it makes sense to log an error, but it doesn't make sense to kill the guest. I think it would only make sense to kill the guest if an error condition led to 1) a guest that was completely non-functioning and/or unreachable anyway, so it would eventually need to be killed anyway no matter what you tried to do, or 2) the guest being left exposed/vulnerable in some serious way. Neither of those is the case here.
participants (2)
-
John Ferlan
-
Laine Stump