[libvirt] [PATCH 0/4] improve virConnectListAllInterfaces()

There's a bit of background about this here: https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html In short, virt-manager is calling the virInterface APIs and that ties up a libvirt thread (and CPU core) for a very long time on hosts that have a large number of interfaces. These patches don't cure the problem (I don't know that there really is a cure other than "Don't DO that!"), but they do fix a couple of bugs I found while investigating, and make a substantial improvement in the amount of time used by virConnectListAllInterfaces(). One thing that I wondered about while investigating this - a big use of CPU by virConnectListAllInterfaces() comes from the need to retrieve the MAC address of every interface. The MAC addresses are both 1) returned to the caller in the interface objects and 2) sent to the policykit ACL checking to decide which interfaces to include in the list. I'm 90% confident that 1) most callers don't really care that they're getting the MAC address along with interface name (virt-manager, for example, follows up with a virInterfaceGetXMLDesc() anyway)), and 2) there is not even a single host *anywhere* that is using libvirt policykit ACLs to limit the list of host interfaces viewable by a client. So we could add a flag to not return MAC addresses, which would allow cutting down the time to list all devices to something < 1 second). But this would be at the expense of no longer having the possibility to limit the list with policykit according to MAC address. Since all host interface information is available to all users via the file system, for example, I don't see this as a huge issue, but it would change behavior, so I don't feel comfortable doing it. I don't like the idea of a single API call taking > 1 minute to return either, though. Does anyone have an opinion about this? Laine Stump (4): interface: fail on OOM from virGetInterface() interface: report correct interface count when not returning list interface: re-use name and mac address rather than re-retrieving interface: let netcf pre-filter for active vs. inactive src/interface/interface_backend_netcf.c | 56 +++++++++++++++------------------ 1 file changed, 25 insertions(+), 31 deletions(-) -- 2.4.3

--- src/interface/interface_backend_netcf.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/interface/interface_backend_netcf.c b/src/interface/interface_backend_netcf.c index 947f1e2..9f74541 100644 --- a/src/interface/interface_backend_netcf.c +++ b/src/interface/interface_backend_netcf.c @@ -2,7 +2,7 @@ * interface_backend_netcf.c: backend driver methods to handle physical * interface configuration using the netcf library. * - * Copyright (C) 2006-2014 Red Hat, Inc. + * Copyright (C) 2006-2015 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -629,8 +629,9 @@ netcfConnectListAllInterfaces(virConnectPtr conn, } if (ifaces) { - iface_obj = virGetInterface(conn, ncf_if_name(iface), - ncf_if_mac_string(iface)); + if (!(iface_obj = virGetInterface(conn, ncf_if_name(iface), + ncf_if_mac_string(iface)))) + goto cleanup; tmp_iface_objs[niface_objs++] = iface_obj; } -- 2.4.3

On Fri, Sep 25, 2015 at 11:13:53AM -0400, Laine Stump wrote:
--- src/interface/interface_backend_netcf.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
ACK Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

The spec for virConnectListAllInterfaces says that if the pointer that is supposed to hold the list of interfaces is NULL, the function should just return the count of interfaces that matched the filter, but the code never increments the count if the list pointer is NULL. --- src/interface/interface_backend_netcf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/interface/interface_backend_netcf.c b/src/interface/interface_backend_netcf.c index 9f74541..a01fbd6 100644 --- a/src/interface/interface_backend_netcf.c +++ b/src/interface/interface_backend_netcf.c @@ -632,8 +632,9 @@ netcfConnectListAllInterfaces(virConnectPtr conn, if (!(iface_obj = virGetInterface(conn, ncf_if_name(iface), ncf_if_mac_string(iface)))) goto cleanup; - tmp_iface_objs[niface_objs++] = iface_obj; + tmp_iface_objs[niface_objs] = iface_obj; } + niface_objs++; ncf_if_free(iface); iface = NULL; -- 2.4.3

On Fri, Sep 25, 2015 at 11:13:54AM -0400, Laine Stump wrote:
The spec for virConnectListAllInterfaces says that if the pointer that is supposed to hold the list of interfaces is NULL, the function should just return the count of interfaces that matched the filter, but the code never increments the count if the list pointer is NULL. --- src/interface/interface_backend_netcf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
ACK Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Getting the MAC address of an interface is actually fairly expensive, and we've already gotten it and stored it into def, so just keep def around a bit longer and retrieve it from there. This reduces the time for "virsh iface-list --all" from 28 to 23 seconds when there are 400 interfaces. --- src/interface/interface_backend_netcf.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/interface/interface_backend_netcf.c b/src/interface/interface_backend_netcf.c index a01fbd6..169ca57 100644 --- a/src/interface/interface_backend_netcf.c +++ b/src/interface/interface_backend_netcf.c @@ -583,6 +583,7 @@ netcfConnectListAllInterfaces(virConnectPtr conn, for (i = 0; i < count; i++) { virInterfaceDefPtr def; + iface = ncf_lookup_by_name(driver->netcf, names[i]); if (!iface) { const char *errmsg, *details; @@ -615,27 +616,26 @@ netcfConnectListAllInterfaces(virConnectPtr conn, virInterfaceDefFree(def); continue; } - virInterfaceDefFree(def); - /* XXX: Filter the result, need to be split once new filter flags * except active|inactive are supported. */ if (MATCH(VIR_CONNECT_LIST_INTERFACES_FILTERS_ACTIVE) && !((MATCH(VIR_CONNECT_LIST_INTERFACES_ACTIVE) && active) || (MATCH(VIR_CONNECT_LIST_INTERFACES_INACTIVE) && !active))) { + virInterfaceDefFree(def); ncf_if_free(iface); iface = NULL; continue; } if (ifaces) { - if (!(iface_obj = virGetInterface(conn, ncf_if_name(iface), - ncf_if_mac_string(iface)))) + if (!(iface_obj = virGetInterface(conn, def->name, def->mac))) goto cleanup; tmp_iface_objs[niface_objs] = iface_obj; } niface_objs++; + virInterfaceDefFree(def); ncf_if_free(iface); iface = NULL; } @@ -698,7 +698,7 @@ static virInterfacePtr netcfInterfaceLookupByName(virConnectPtr conn, if (virInterfaceLookupByNameEnsureACL(conn, def) < 0) goto cleanup; - ret = virGetInterface(conn, ncf_if_name(iface), ncf_if_mac_string(iface)); + ret = virGetInterface(conn, def->name, def->mac); cleanup: ncf_if_free(iface); @@ -746,7 +746,7 @@ static virInterfacePtr netcfInterfaceLookupByMACString(virConnectPtr conn, if (virInterfaceLookupByMACStringEnsureACL(conn, def) < 0) goto cleanup; - ret = virGetInterface(conn, ncf_if_name(iface), ncf_if_mac_string(iface)); + ret = virGetInterface(conn, def->name, def->mac); cleanup: ncf_if_free(iface); -- 2.4.3

On Fri, Sep 25, 2015 at 11:13:55AM -0400, Laine Stump wrote:
Getting the MAC address of an interface is actually fairly expensive, and we've already gotten it and stored it into def, so just keep def around a bit longer and retrieve it from there.
This reduces the time for "virsh iface-list --all" from 28 to 23 seconds when there are 400 interfaces. --- src/interface/interface_backend_netcf.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
ACK Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

If a system has a large number of active or active interfaces, it can be a big waste of time to retrieve and qualify all interfaces if the caller only wanted one subset. Since netcf has a simple flag for this, translate the libvirt flag into a netcf flag and let netcf pre-filter. --- src/interface/interface_backend_netcf.c | 40 +++++++++++++-------------------- 1 file changed, 16 insertions(+), 24 deletions(-) diff --git a/src/interface/interface_backend_netcf.c b/src/interface/interface_backend_netcf.c index 169ca57..0181635 100644 --- a/src/interface/interface_backend_netcf.c +++ b/src/interface/interface_backend_netcf.c @@ -528,10 +528,10 @@ netcfConnectListAllInterfaces(virConnectPtr conn, { int count; size_t i; + unsigned int ncf_flags = 0; struct netcf_if *iface = NULL; virInterfacePtr *tmp_iface_objs = NULL; virInterfacePtr iface_obj = NULL; - bool active; int niface_objs = 0; int ret = -1; char **names = NULL; @@ -543,14 +543,20 @@ netcfConnectListAllInterfaces(virConnectPtr conn, virObjectLock(driver); - /* List all interfaces, in case of we might support new filter flags - * except active|inactive in future. - */ - count = ncf_num_of_interfaces(driver->netcf, NETCF_IFACE_ACTIVE | - NETCF_IFACE_INACTIVE); - if (count < 0) { + /* let netcf pre-filter for this flag to save time */ + if (MATCH(VIR_CONNECT_LIST_INTERFACES_FILTERS_ACTIVE)) { + if (MATCH(VIR_CONNECT_LIST_INTERFACES_ACTIVE)) + ncf_flags |= NETCF_IFACE_ACTIVE; + if (MATCH(VIR_CONNECT_LIST_INTERFACES_INACTIVE)) + ncf_flags |= NETCF_IFACE_INACTIVE; + } else { + ncf_flags = NETCF_IFACE_ACTIVE | NETCF_IFACE_INACTIVE; + } + + if ((count = ncf_num_of_interfaces(driver->netcf, ncf_flags)) < 0) { const char *errmsg, *details; int errcode = ncf_error(driver->netcf, &errmsg, &details); + virReportError(netcf_to_vir_err(errcode), _("failed to get number of host interfaces: %s%s%s"), errmsg, details ? " - " : "", @@ -566,11 +572,11 @@ netcfConnectListAllInterfaces(virConnectPtr conn, if (VIR_ALLOC_N(names, count) < 0) goto cleanup; - if ((count = ncf_list_interfaces(driver->netcf, count, names, - NETCF_IFACE_ACTIVE | - NETCF_IFACE_INACTIVE)) < 0) { + if ((count = ncf_list_interfaces(driver->netcf, count, + names, ncf_flags)) < 0) { const char *errmsg, *details; int errcode = ncf_error(driver->netcf, &errmsg, &details); + virReportError(netcf_to_vir_err(errcode), _("failed to list host interfaces: %s%s%s"), errmsg, details ? " - " : "", @@ -604,9 +610,6 @@ netcfConnectListAllInterfaces(virConnectPtr conn, } } - if (netcfInterfaceObjIsActive(iface, &active) < 0) - goto cleanup; - if (!(def = netcfGetMinimalDefForDevice(iface))) goto cleanup; @@ -616,17 +619,6 @@ netcfConnectListAllInterfaces(virConnectPtr conn, virInterfaceDefFree(def); continue; } - /* XXX: Filter the result, need to be split once new filter flags - * except active|inactive are supported. - */ - if (MATCH(VIR_CONNECT_LIST_INTERFACES_FILTERS_ACTIVE) && - !((MATCH(VIR_CONNECT_LIST_INTERFACES_ACTIVE) && active) || - (MATCH(VIR_CONNECT_LIST_INTERFACES_INACTIVE) && !active))) { - virInterfaceDefFree(def); - ncf_if_free(iface); - iface = NULL; - continue; - } if (ifaces) { if (!(iface_obj = virGetInterface(conn, def->name, def->mac))) -- 2.4.3

On Fri, Sep 25, 2015 at 11:13:56AM -0400, Laine Stump wrote:
If a system has a large number of active or active interfaces, it can
s/active/inactive/ for one of them presuambly.
be a big waste of time to retrieve and qualify all interfaces if the caller only wanted one subset. Since netcf has a simple flag for this, translate the libvirt flag into a netcf flag and let netcf pre-filter. --- src/interface/interface_backend_netcf.c | 40 +++++++++++++-------------------- 1 file changed, 16 insertions(+), 24 deletions(-)
ACK Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Sep 25, 2015 at 11:13:56AM -0400, Laine Stump wrote:
If a system has a large number of active or active interfaces, it can
*active or inactive Jan
be a big waste of time to retrieve and qualify all interfaces if the caller only wanted one subset. Since netcf has a simple flag for this, translate the libvirt flag into a netcf flag and let netcf pre-filter. --- src/interface/interface_backend_netcf.c | 40 +++++++++++++-------------------- 1 file changed, 16 insertions(+), 24 deletions(-)

On Fri, Sep 25, 2015 at 11:13:52AM -0400, Laine Stump wrote:
There's a bit of background about this here:
https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html
In short, virt-manager is calling the virInterface APIs and that ties up a libvirt thread (and CPU core) for a very long time on hosts that have a large number of interfaces. These patches don't cure the problem (I don't know that there really is a cure other than "Don't DO that!"), but they do fix a couple of bugs I found while investigating, and make a substantial improvement in the amount of time used by virConnectListAllInterfaces().
One thing that I wondered about while investigating this - a big use of CPU by virConnectListAllInterfaces() comes from the need to retrieve the MAC address of every interface. The MAC addresses are both
1) returned to the caller in the interface objects and
2) sent to the policykit ACL checking to decide which interfaces to include in the list.
I'm 90% confident that
1) most callers don't really care that they're getting the MAC address along with interface name (virt-manager, for example, follows up with a virInterfaceGetXMLDesc() anyway)), and
2) there is not even a single host *anywhere* that is using libvirt policykit ACLs to limit the list of host interfaces viewable by a client.
So we could add a flag to not return MAC addresses, which would allow cutting down the time to list all devices to something < 1 second). But this would be at the expense of no longer having the possibility to limit the list with policykit according to MAC address. Since all host interface information is available to all users via the file system, for example, I don't see this as a huge issue, but it would change behavior, so I don't feel comfortable doing it. I don't like the idea of a single API call taking > 1 minute to return either, though. Does anyone have an opinion about this?
Ultimately 500 interfaces, each ifcfg-XXX file 300 bytes in size on average is only 150 KB of data. Given the amount of data we are consuming, here I think it is reasonable to expect we can process than in a tiny fraction of a second. So there's clearly a serious algorithmic / data structure flaw here if its taking minutes. By the sounds of the thread you quote, its in augeas itself, so I think we really need to focus on addressing that root cause as a priority rather than try to work around it. As a side note, we might consider adding new API to netcf so that we can fetch the entire interface set + macs in one api call to netcf, though I doubt it'd chance performance that much. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Sep 25, 2015 at 05:22:30PM +0100, Daniel P. Berrange wrote:
On Fri, Sep 25, 2015 at 11:13:52AM -0400, Laine Stump wrote:
There's a bit of background about this here:
https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html
In short, virt-manager is calling the virInterface APIs and that ties up a libvirt thread (and CPU core) for a very long time on hosts that have a large number of interfaces. These patches don't cure the problem (I don't know that there really is a cure other than "Don't DO that!"), but they do fix a couple of bugs I found while investigating, and make a substantial improvement in the amount of time used by virConnectListAllInterfaces().
One thing that I wondered about while investigating this - a big use of CPU by virConnectListAllInterfaces() comes from the need to retrieve the MAC address of every interface. The MAC addresses are both
1) returned to the caller in the interface objects and
2) sent to the policykit ACL checking to decide which interfaces to include in the list.
I'm 90% confident that
1) most callers don't really care that they're getting the MAC address along with interface name (virt-manager, for example, follows up with a virInterfaceGetXMLDesc() anyway)), and
2) there is not even a single host *anywhere* that is using libvirt policykit ACLs to limit the list of host interfaces viewable by a client.
So we could add a flag to not return MAC addresses, which would allow cutting down the time to list all devices to something < 1 second). But this would be at the expense of no longer having the possibility to limit the list with policykit according to MAC address. Since all host interface information is available to all users via the file system, for example, I don't see this as a huge issue, but it would change behavior, so I don't feel comfortable doing it. I don't like the idea of a single API call taking > 1 minute to return either, though. Does anyone have an opinion about this?
Ultimately 500 interfaces, each ifcfg-XXX file 300 bytes in size on average is only 150 KB of data. Given the amount of data we are consuming, here I think it is reasonable to expect we can process than in a tiny fraction of a second. So there's clearly a serious algorithmic / data structure flaw here if its taking minutes.
By the sounds of the thread you quote, its in augeas itself, so I think we really need to focus on addressing that root cause as a priority rather than try to work around it.
As a side note, we might consider adding new API to netcf so that we can fetch the entire interface set + macs in one api call to netcf, though I doubt it'd chance performance that much.
So, I instrumented the netcf and augeas code to checking timings. The aug_get calls time less than a millisecond, as do the various other calls. I found the bulk of the time is actually coming from the netcf function "get_augeas", which in turns call "aug_load" for every single damn netcf function call. So when we have 500 interfaces, we're telling augeas to load all the config files 1000 times. That's where the slowness is coming from.... Either we need to stop loading congfig files on every fnuction call in netcf, or we need to add a netcf bulk data API call, so that libvirt can load /all/ the data it needs in 1 single API call. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 09/25/2015 01:27 PM, Daniel P. Berrange wrote:
On Fri, Sep 25, 2015 at 05:22:30PM +0100, Daniel P. Berrange wrote:
There's a bit of background about this here:
https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html
In short, virt-manager is calling the virInterface APIs and that ties up a libvirt thread (and CPU core) for a very long time on hosts that have a large number of interfaces. These patches don't cure the problem (I don't know that there really is a cure other than "Don't DO that!"), but they do fix a couple of bugs I found while investigating, and make a substantial improvement in the amount of time used by virConnectListAllInterfaces().
One thing that I wondered about while investigating this - a big use of CPU by virConnectListAllInterfaces() comes from the need to retrieve the MAC address of every interface. The MAC addresses are both
1) returned to the caller in the interface objects and
2) sent to the policykit ACL checking to decide which interfaces to include in the list.
I'm 90% confident that
1) most callers don't really care that they're getting the MAC address along with interface name (virt-manager, for example, follows up with a virInterfaceGetXMLDesc() anyway)), and
2) there is not even a single host *anywhere* that is using libvirt policykit ACLs to limit the list of host interfaces viewable by a client.
So we could add a flag to not return MAC addresses, which would allow cutting down the time to list all devices to something < 1 second). But this would be at the expense of no longer having the possibility to limit the list with policykit according to MAC address. Since all host interface information is available to all users via the file system, for example, I don't see this as a huge issue, but it would change behavior, so I don't feel comfortable doing it. I don't like the idea of a single API call taking > 1 minute to return either, though. Does anyone have an opinion about this? Ultimately 500 interfaces, each ifcfg-XXX file 300 bytes in size on average is only 150 KB of data. Given the amount of data we are consuming, here I think it is reasonable to expect we can
On Fri, Sep 25, 2015 at 11:13:52AM -0400, Laine Stump wrote: process than in a tiny fraction of a second. So there's clearly a serious algorithmic / data structure flaw here if its taking minutes.
By the sounds of the thread you quote, its in augeas itself, so I think we really need to focus on addressing that root cause as a priority rather than try to work around it.
As a side note, we might consider adding new API to netcf so that we can fetch the entire interface set + macs in one api call to netcf, though I doubt it'd chance performance that much. So, I instrumented the netcf and augeas code to checking timings.
What did you use? I tried using perf and oprofile, but all I could get them to tell me was that a ton of time was being spent in strcmp(), so either it couldn't figure out who was the caller due to missing stack frame pointers, or I just didn't know the right commandline options. (The last time I did any serious profiling I used some custom code (written by someone else at a previous employer) that massaged xml format output from oprofile. A lot has changed since then.)
The aug_get calls time less than a millisecond, as do the various other calls. I found the bulk of the time is actually coming from the netcf function "get_augeas", which in turns call "aug_load" for every single damn netcf function call.
I remember David Lutterkort talking about exactly that problem several years ago and *thought* I remembered that he had put something into augeas to only reread the files if they had changed. Has my memory failed me again? Or is augeas doing something and netcf just isn't taking advantage of it?
Either we need to stop loading congfig files on every fnuction call in netcf, or we need to add a netcf bulk data API call, so that libvirt can load /all/ the data it needs in 1 single API call.
I much prefer (1) :-)

On Fri, Sep 25, 2015 at 01:48:41PM -0400, Laine Stump wrote:
On 09/25/2015 01:27 PM, Daniel P. Berrange wrote:
On Fri, Sep 25, 2015 at 05:22:30PM +0100, Daniel P. Berrange wrote:
There's a bit of background about this here:
https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html
In short, virt-manager is calling the virInterface APIs and that ties up a libvirt thread (and CPU core) for a very long time on hosts that have a large number of interfaces. These patches don't cure the problem (I don't know that there really is a cure other than "Don't DO that!"), but they do fix a couple of bugs I found while investigating, and make a substantial improvement in the amount of time used by virConnectListAllInterfaces().
One thing that I wondered about while investigating this - a big use of CPU by virConnectListAllInterfaces() comes from the need to retrieve the MAC address of every interface. The MAC addresses are both
1) returned to the caller in the interface objects and
2) sent to the policykit ACL checking to decide which interfaces to include in the list.
I'm 90% confident that
1) most callers don't really care that they're getting the MAC address along with interface name (virt-manager, for example, follows up with a virInterfaceGetXMLDesc() anyway)), and
2) there is not even a single host *anywhere* that is using libvirt policykit ACLs to limit the list of host interfaces viewable by a client.
So we could add a flag to not return MAC addresses, which would allow cutting down the time to list all devices to something < 1 second). But this would be at the expense of no longer having the possibility to limit the list with policykit according to MAC address. Since all host interface information is available to all users via the file system, for example, I don't see this as a huge issue, but it would change behavior, so I don't feel comfortable doing it. I don't like the idea of a single API call taking > 1 minute to return either, though. Does anyone have an opinion about this? Ultimately 500 interfaces, each ifcfg-XXX file 300 bytes in size on average is only 150 KB of data. Given the amount of data we are consuming, here I think it is reasonable to expect we can
On Fri, Sep 25, 2015 at 11:13:52AM -0400, Laine Stump wrote: process than in a tiny fraction of a second. So there's clearly a serious algorithmic / data structure flaw here if its taking minutes.
By the sounds of the thread you quote, its in augeas itself, so I think we really need to focus on addressing that root cause as a priority rather than try to work around it.
As a side note, we might consider adding new API to netcf so that we can fetch the entire interface set + macs in one api call to netcf, though I doubt it'd chance performance that much. So, I instrumented the netcf and augeas code to checking timings.
What did you use? I tried using perf and oprofile, but all I could get them to tell me was that a ton of time was being spent in strcmp(), so either it couldn't figure out who was the caller due to missing stack frame pointers, or I just didn't know the right commandline options. (The last time I did any serious profiling I used some custom code (written by someone else at a previous employer) that massaged xml format output from oprofile. A lot has changed since then.)
The aug_get calls time less than a millisecond, as do the various other calls. I found the bulk of the time is actually coming from the netcf function "get_augeas", which in turns call "aug_load" for every single damn netcf function call.
I remember David Lutterkort talking about exactly that problem several years ago and *thought* I remembered that he had put something into augeas to only reread the files if they had changed. Has my memory failed me again? Or is augeas doing something and netcf just isn't taking advantage of it?
It's at least in the releas notes: 0.7.3 - 2010-08-06 aug_load: only reparse files that have actually changed; greatly speeds up reloading Cheers, -- Guido
Either we need to stop loading congfig files on every fnuction call in netcf, or we need to add a netcf bulk data API call, so that libvirt can load /all/ the data it needs in 1 single API call.
I much prefer (1) :-)
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Fri, Sep 25, 2015 at 01:48:41PM -0400, Laine Stump wrote:
On 09/25/2015 01:27 PM, Daniel P. Berrange wrote:
On Fri, Sep 25, 2015 at 05:22:30PM +0100, Daniel P. Berrange wrote:
There's a bit of background about this here:
https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html
In short, virt-manager is calling the virInterface APIs and that ties up a libvirt thread (and CPU core) for a very long time on hosts that have a large number of interfaces. These patches don't cure the problem (I don't know that there really is a cure other than "Don't DO that!"), but they do fix a couple of bugs I found while investigating, and make a substantial improvement in the amount of time used by virConnectListAllInterfaces().
One thing that I wondered about while investigating this - a big use of CPU by virConnectListAllInterfaces() comes from the need to retrieve the MAC address of every interface. The MAC addresses are both
1) returned to the caller in the interface objects and
2) sent to the policykit ACL checking to decide which interfaces to include in the list.
I'm 90% confident that
1) most callers don't really care that they're getting the MAC address along with interface name (virt-manager, for example, follows up with a virInterfaceGetXMLDesc() anyway)), and
2) there is not even a single host *anywhere* that is using libvirt policykit ACLs to limit the list of host interfaces viewable by a client.
So we could add a flag to not return MAC addresses, which would allow cutting down the time to list all devices to something < 1 second). But this would be at the expense of no longer having the possibility to limit the list with policykit according to MAC address. Since all host interface information is available to all users via the file system, for example, I don't see this as a huge issue, but it would change behavior, so I don't feel comfortable doing it. I don't like the idea of a single API call taking > 1 minute to return either, though. Does anyone have an opinion about this? Ultimately 500 interfaces, each ifcfg-XXX file 300 bytes in size on average is only 150 KB of data. Given the amount of data we are consuming, here I think it is reasonable to expect we can
On Fri, Sep 25, 2015 at 11:13:52AM -0400, Laine Stump wrote: process than in a tiny fraction of a second. So there's clearly a serious algorithmic / data structure flaw here if its taking minutes.
By the sounds of the thread you quote, its in augeas itself, so I think we really need to focus on addressing that root cause as a priority rather than try to work around it.
As a side note, we might consider adding new API to netcf so that we can fetch the entire interface set + macs in one api call to netcf, though I doubt it'd chance performance that much. So, I instrumented the netcf and augeas code to checking timings.
What did you use? I tried using perf and oprofile, but all I could get them to tell me was that a ton of time was being spent in strcmp(), so either it couldn't figure out who was the caller due to missing stack frame pointers, or I just didn't know the right commandline options. (The last time I did any serious profiling I used some custom code (written by someone else at a previous employer) that massaged xml format output from oprofile. A lot has changed since then.)
When I said "instrumented" what I mean is that I put gettimeofday() calls either side of the function calls I thought were interesting/ suspicious, and then printf() the delta :-)
The aug_get calls time less than a millisecond, as do the various other calls. I found the bulk of the time is actually coming from the netcf function "get_augeas", which in turns call "aug_load" for every single damn netcf function call.
I remember David Lutterkort talking about exactly that problem several years ago and *thought* I remembered that he had put something into augeas to only reread the files if they had changed. Has my memory failed me again? Or is augeas doing something and netcf just isn't taking advantage of it?
Either we need to stop loading congfig files on every fnuction call in netcf, or we need to add a netcf bulk data API call, so that libvirt can load /all/ the data it needs in 1 single API call.
I much prefer (1) :-)
The main difficulty with doing (1) is that IIRC, the libvirtd daemon holds is augeas connection open permanently, so we do need some way to periodically refresh the interface data. We can't just do it when a client calls virCOnnectOpen, because apps like openstack one a single connection and then keep it open forever. I guess doing it on every function call was the easy way to "solve" this. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 09/25/2015 01:27 PM, Daniel P. Berrange wrote:
So, I instrumented the netcf and augeas code to checking timings.
The aug_get calls time less than a millisecond, as do the various other calls. I found the bulk of the time is actually coming from the netcf function "get_augeas", which in turns call "aug_load" for every single damn netcf function call. So when we have 500 interfaces, we're telling augeas to load all the config files 1000 times. That's where the slowness is coming from....
Either we need to stop loading config files on every fnuction call in netcf,
Right you are! netcf has a variable "load_augeas" that is set each time an API call is started. I removed the line that sets it, and the time for virsh iface-list --all for 309 toplevel interfaces (300 vlans connected to 300 bridges + whatever is really on the host) went from 22.2 sec. to 1.47 sec! But of course we can't blindly re-use the initial data forever. Looking at aug_load() itself, I see a lot of comments about everything it does to avoid unnecessary re-loading of files. This was added in 2010, and is probably what I remember David talking about: commit 5ee8163051be8214507c13c86171ac90ca7cb91f Author: David Lutterkort <lutter@redhat.com> Date: Tue Jun 29 15:32:44 2010 -0700 Avoid unnecessary file parsing when reloading the tree We used to reparse every file we knew about upon aug_load. Now, we only reparse files if the file has changed on disk. We test a few scenarios to make sure aug_load retains its behavior of obliterating the tree and filling it with the latest from disk. This includes throwing away unsaved changes or trees that have been deleted. So now the question is whether something can be done to improve aug_load() (maybe it used to perform better and there has been a regression?), or if netcf should derive a list of files from the list sent to augeas, and do its own checking of the timestamps. (After this message I think we can remove libvir-list from the Cc, since it's clear that everything is between netcf and augeas).
participants (4)
-
Daniel P. Berrange
-
Guido Günther
-
Ján Tomko
-
Laine Stump