[libvirt] [RFC PATCH 0/2] Bandaid to "fix" bulk stats API on big hosts

newer
[libvirt] [PATCH 0/2] fix labeling...

Peter Krempa

23 May 2017 23 May '17

1 a.m.

Make the API fail less often by bumping the RPC message size and telling users not to request stats for too many VMs until we figure out a better solution to get a lot of stats. Peter Krempa (2): rpc: Bump maximum message size to 32M lib: Add note that bulk stats API queries may overrun RPC buffers src/libvirt-domain.c | 7 +++++++ src/rpc/virnetprotocol.x | 4 ++-- 2 files changed, 9 insertions(+), 2 deletions(-) -- 2.12.2

Show replies by date

Peter Krempa

23 May 23 May

1 a.m.

New subject: [libvirt] [RFC PATCH 1/2] rpc: Bump maximum message size to 32M

While most of the APIs are okay with 16M messages, the bulk stats API can run into the limit in big configurations. Before we devise a new plan for this, bump this limit slightly to accomodate some more configs. --- src/rpc/virnetprotocol.x | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/rpc/virnetprotocol.x b/src/rpc/virnetprotocol.x index ee9899059..901c67159 100644 --- a/src/rpc/virnetprotocol.x +++ b/src/rpc/virnetprotocol.x @@ -40,13 +40,13 @@ const VIR_NET_MESSAGE_INITIAL = 65536; const VIR_NET_MESSAGE_LEGACY_PAYLOAD_MAX = 262120; /* Maximum total message size (serialised). */ -const VIR_NET_MESSAGE_MAX = 16777216; +const VIR_NET_MESSAGE_MAX = 33554432; /* Size of struct virNetMessageHeader (serialised)*/ const VIR_NET_MESSAGE_HEADER_MAX = 24; /* Size of message payload */ -const VIR_NET_MESSAGE_PAYLOAD_MAX = 16777192; +const VIR_NET_MESSAGE_PAYLOAD_MAX = 33554408; /* Size of message length field. Not counted in VIR_NET_MESSAGE_MAX * and VIR_NET_MESSAGE_INITIAL. -- 2.12.2

Martin Kletzander

4:38 p.m.

New subject: [libvirt] [RFC PATCH 1/2] rpc: Bump maximum message size to 32M

On Mon, May 22, 2017 at 06:00:12PM +0200, Peter Krempa wrote:

...

While most of the APIs are okay with 16M messages, the bulk stats API can run into the limit in big configurations. Before we devise a new plan for this, bump this limit slightly to accomodate some more configs. --- src/rpc/virnetprotocol.x | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/rpc/virnetprotocol.x b/src/rpc/virnetprotocol.x index ee9899059..901c67159 100644 --- a/src/rpc/virnetprotocol.x +++ b/src/rpc/virnetprotocol.x @@ -40,13 +40,13 @@ const VIR_NET_MESSAGE_INITIAL = 65536; const VIR_NET_MESSAGE_LEGACY_PAYLOAD_MAX = 262120;

/* Maximum total message size (serialised). */ -const VIR_NET_MESSAGE_MAX = 16777216; +const VIR_NET_MESSAGE_MAX = 33554432;

ACK. Although if you changed it to something nicer to read so that it doesn't take someone too long to figure out it's actually just 32 MiB, that would be even better. Whatever suits you, be it (1 << 25) or (32 * 2 << 20) or (32 * 1024 * 1024) or anything else. I know other numbers could be changed as well, but this one is most updated, currently.

...

/* Size of struct virNetMessageHeader (serialised)*/ const VIR_NET_MESSAGE_HEADER_MAX = 24;

/* Size of message payload */ -const VIR_NET_MESSAGE_PAYLOAD_MAX = 16777192; +const VIR_NET_MESSAGE_PAYLOAD_MAX = 33554408;

and this could be: const VIR_NET_MESSAGE_PAYLOAD_MAX = VIR_NET_MESSAGE_MAX - VIR_NET_MESSAGE_HEADER_MAX; or is none of that supported by the parser? In that case, just leave it like that.

...

/* Size of message length field. Not counted in VIR_NET_MESSAGE_MAX * and VIR_NET_MESSAGE_INITIAL. -- 2.12.2

-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Peter Krempa

24 May 24 May

7:34 p.m.

New subject: [libvirt] [RFC PATCH 1/2] rpc: Bump maximum message size to 32M

On Tue, May 23, 2017 at 09:38:01 +0200, Martin Kletzander wrote:

...

On Mon, May 22, 2017 at 06:00:12PM +0200, Peter Krempa wrote:

...
While most of the APIs are okay with 16M messages, the bulk stats API can run into the limit in big configurations. Before we devise a new plan for this, bump this limit slightly to accomodate some more configs. --- src/rpc/virnetprotocol.x | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/rpc/virnetprotocol.x b/src/rpc/virnetprotocol.x index ee9899059..901c67159 100644 --- a/src/rpc/virnetprotocol.x +++ b/src/rpc/virnetprotocol.x @@ -40,13 +40,13 @@ const VIR_NET_MESSAGE_INITIAL = 65536; const VIR_NET_MESSAGE_LEGACY_PAYLOAD_MAX = 262120;

/* Maximum total message size (serialised). */ -const VIR_NET_MESSAGE_MAX = 16777216; +const VIR_NET_MESSAGE_MAX = 33554432;

ACK.

Although if you changed it to something nicer to read so that it doesn't take someone too long to figure out it's actually just 32 MiB, that would be even better. Whatever suits you, be it (1 << 25) or (32 * 2 << 20) or (32 * 1024 * 1024) or anything else. I know other numbers could be changed as well, but this one is most updated, currently.

I thought about that, but rpcgen is not clever enough: GEN rpc/virnetprotocol.h const VIR_NET_MESSAGE_MAX = 32 * 1024 * 1024; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ rpc/virnetprotocol.x, line 43: expected ';' cannot shutdown /usr/bin/rpcgen: at ./rpc/genprotocol.pl line 139.

...

...
/* Size of struct virNetMessageHeader (serialised)*/ const VIR_NET_MESSAGE_HEADER_MAX = 24;

/* Size of message payload */ -const VIR_NET_MESSAGE_PAYLOAD_MAX = 16777192; +const VIR_NET_MESSAGE_PAYLOAD_MAX = 33554408;

and this could be:

const VIR_NET_MESSAGE_PAYLOAD_MAX = VIR_NET_MESSAGE_MAX - VIR_NET_MESSAGE_HEADER_MAX;

And this does not work in the same manner.

...

or is none of that supported by the parser? In that case, just leave it like that.

Welp ...

Peter Krempa

23 May 23 May

1 a.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. -- 2.12.2

Martin Kletzander

5:12 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...

Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried. + *

Makes sense to me, now we can at least point people somewhere when this happens again. ACK from me, but feel free to wait if you want to really make this an RFC and you want more opinions. Is there a discussion about how to continue for the future? I remember Michal starting some discussion, but I'm not sure whether that was about the same thing. Anyway, that's just a rhetorical question so that I can say the two ideas I have: 1) Just increase the limit over time. Computers and networks are getting faster, there's more storage space and memory, and so on. It only makes sense to do scale other things respectively. 2) Have a new API that streams the data back over virStream. We can then do it continuously (like every X seconds), that might help management apps as well because I suspect that's how they use this API anyway. Thanks for listening ;)

...

* Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. -- 2.12.2

-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Peter Krempa

5:14 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Tue, May 23, 2017 at 10:12:06 +0200, Martin Kletzander wrote:

...

On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...
Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

[...]

...

Is there a discussion about how to continue for the future? I remember Michal starting some discussion, but I'm not sure whether that was about the same thing. Anyway, that's just a rhetorical question so that I can say the two ideas I have:

1) Just increase the limit over time. Computers and networks are getting faster, there's more storage space and memory, and so on. It only makes sense to do scale other things respectively.

2) Have a new API that streams the data back over virStream. We can then do it continuously (like every X seconds), that might help management apps as well because I suspect that's how they use this API anyway.

3) danpb also pointed out that we could use a chunk of shared memory to transport the stats for local connections, so that we skip the RPC layer altoghether and also avoid having two copies of the same data. But that's something we need to design first.

Daniel P. Berrange

5:26 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Tue, May 23, 2017 at 10:12:06AM +0200, Martin Kletzander wrote:

...

On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...
Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried. + *

Makes sense to me, now we can at least point people somewhere when this happens again. ACK from me, but feel free to wait if you want to really make this an RFC and you want more opinions.

Is there a discussion about how to continue for the future? I remember Michal starting some discussion, but I'm not sure whether that was about the same thing. Anyway, that's just a rhetorical question so that I can say the two ideas I have:

1) Just increase the limit over time. Computers and networks are getting faster, there's more storage space and memory, and so on. It only makes sense to do scale other things respectively.

2) Have a new API that streams the data back over virStream. We can then do it continuously (like every X seconds), that might help management apps as well because I suspect that's how they use this API anyway.

Thanks for listening ;)

Before we do either of those we should consider just changing the RPC message format for the APIs in question. Essentially have, say, 10 statistics we are grabbing for every VM. On the wire we are encoding <name 1>: <value for stat 1, VM 1> <name 2>: <value for stat 2, VM 1> ... <name N>: <value for stat N, VM 1> <name 1>: <value for stat 1, VM 2> <name 2>: <value for stat 2, VM 2> ... <name N>: <value for stat N, VM 2> <name 1>: <value for stat 1, VM 3> <name 2>: <value for stat 2, VM 3> ... <name N>: <value for stat N, VM 3> where <name XX> is a long string that is basically the same for every VM, while <value> is just an int64. We could change this so that we have a table of names at the start <name 1>: <index 1> <name 2>: <index 2> ... <name N>: <index N> and then for each VM we have <index 1>: <value for stat 1, VM 1> <index 2>: <value for stat 2, VM 1> ... <index N>: <value for stat N, VM 1> Soo the name string is no longer repeated, and in its place is an integer index. If every name is say 12 characters long, and the index is a 4 byte int, we've reduced a 20 byte packet per stat to 12 bytes. That's just one idea, possibly/probably not even the best - there's various other ways we could encode the data to make it more efficient. We could perhaps even consider changing the public API too, because the typed parameters are a really inefficient way to exporting the data in the API too If you want to get more radical with a push based solution, then I would suggest we consider setting up shared memory segment between libvirt & the client app where we just continuously update the data, avoiding RPC entirely. This would only work for apps running locally of course, but at a large scale, it seems the major apps using libvirt have all taken the approach of having a local libvirtd connection, rather than trying to use our remote RPC at data center scale levels from a central host. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Daniel P. Berrange

5:32 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...

Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried.

With the way this is worded, applications have no guidance on what is a sensible max number of VMs they can safely use, so I don't think it is particularly useful, except as a way to say "we told you so" when apps report a bug. I'd be inclined to explicitly put a limit of 100 VMs[1] in the API, and document that hard limit, so apps immediately know to write their code to chunk in 100 VM blocks. Regards, Daniel [1] 100 is entirely arbitrary for this email - we'd actually pick a number based on what we reasonably expect to fit in the RPC message size. -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Martin Kletzander

5:51 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Tue, May 23, 2017 at 09:32:03AM +0100, Daniel P. Berrange wrote:

...

On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...
Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried.

With the way this is worded, applications have no guidance on what is a sensible max number of VMs they can safely use, so I don't think it is particularly useful, except as a way to say "we told you so" when apps report a bug.

I'd be inclined to explicitly put a limit of 100 VMs[1] in the API, and document that hard limit, so apps immediately know to write their code to chunk in 100 VM blocks.

Or we can give a hint on what the limit is and let users figure out the sensible number. It is not only based on the number of VMs. I believe you can hit the limit with one or two VMs if you have lot of devices that we report statistics for. Limiting the number of VMs to a particular number would not help as much in this case. But we can combine both approaches.

...

Regards, Daniel

[1] 100 is entirely arbitrary for this email - we'd actually pick a number based on what we reasonably expect to fit in the RPC message size. -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Daniel P. Berrange

6:04 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Tue, May 23, 2017 at 10:51:47AM +0200, Martin Kletzander wrote:

...

On Tue, May 23, 2017 at 09:32:03AM +0100, Daniel P. Berrange wrote:

...
On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...
Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried.

With the way this is worded, applications have no guidance on what is a sensible max number of VMs they can safely use, so I don't think it is particularly useful, except as a way to say "we told you so" when apps report a bug.

I'd be inclined to explicitly put a limit of 100 VMs[1] in the API, and document that hard limit, so apps immediately know to write their code to chunk in 100 VM blocks.

Or we can give a hint on what the limit is and let users figure out the sensible number. It is not only based on the number of VMs. I believe you can hit the limit with one or two VMs if you have lot of devices that we report statistics for. Limiting the number of VMs to a particular number would not help as much in this case. But we can combine both approaches.

Urgh, that's even worse. Apps can't simply split queries into blocks of N vms, because any single VM in that list might have huge number of disks. So to use this at all reliably you have to query the XML config of every guest to see what devices are present and then split up the queries into variable number of VMs :-( This just adds to my feeling that we should consider this API a failed experiment Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Martin Kletzander

6:19 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Tue, May 23, 2017 at 10:04:44AM +0100, Daniel P. Berrange wrote:

...

On Tue, May 23, 2017 at 10:51:47AM +0200, Martin Kletzander wrote:

...
On Tue, May 23, 2017 at 09:32:03AM +0100, Daniel P. Berrange wrote:

...
On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...
Hint that the users should limit the number of VMs queried in the bulk stats API. --- src/libvirt-domain.c | 7 +++++++ 1 file changed, 7 insertions(+)

diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 310b91b37..b01f2705c 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -11315,6 +11315,10 @@ virConnectGetDomainCapabilities(virConnectPtr conn, * VIR_CONNECT_GET_ALL_DOMAINS_STATS_SHUTOFF and/or * VIR_CONNECT_GET_ALL_DOMAINS_STATS_OTHER for all other states. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to use virDomainListGetStats with a + * reasonable list of VMs as the argument. + * * Returns the count of returned statistics structures on success, -1 on error. * The requested data are returned in the @retStats parameter. The returned * array should be freed by the caller. See virDomainStatsRecordListFree. @@ -11386,6 +11390,9 @@ virConnectGetAllDomainStats(virConnectPtr conn, * Note that any of the domain list filtering flags in @flags may be rejected * by this function. * + * Note that this API is prone to exceeding maximum RPC message size on hosts + * with lots of VMs so it's suggested to limit the number of VMs queried.

With the way this is worded, applications have no guidance on what is a sensible max number of VMs they can safely use, so I don't think it is particularly useful, except as a way to say "we told you so" when apps report a bug.

I'd be inclined to explicitly put a limit of 100 VMs[1] in the API, and document that hard limit, so apps immediately know to write their code to chunk in 100 VM blocks.

Or we can give a hint on what the limit is and let users figure out the sensible number. It is not only based on the number of VMs. I believe you can hit the limit with one or two VMs if you have lot of devices that we report statistics for. Limiting the number of VMs to a particular number would not help as much in this case. But we can combine both approaches.

Urgh, that's even worse. Apps can't simply split queries into blocks of N vms, because any single VM in that list might have huge number of disks. So to use this at all reliably you have to query the XML config of every guest to see what devices are present and then split up the queries into variable number of VMs :-(

I meant something along the lines of: This API might transfer lot of data over the connection which needs to be limited. The limit is currently 32MB with all the headers and the output format is described here, so one should be able to make a pretty good assumptions based on the average deployment they have. In case you get error message saying "RPC: message to big", you need to limit the number of VMs you are requesting statistics for. And then put a limit in the API that returns additional error if more than 100 VMs are in the list.

...

This just adds to my feeling that we should consider this API a failed experiment

I'm not saying no... But we should be able to do such things, just in a different way, let's say. The idea with shared memory is nice. Even remotely the libvirt API could expose it as such and underneath we could be changing how that update routine works.

...

Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Peter Krempa

6:50 p.m.

New subject: [libvirt] [RFC PATCH 2/2] lib: Add note that bulk stats API queries may overrun RPC buffers

On Tue, May 23, 2017 at 10:04:44 +0100, Daniel Berrange wrote:

...

On Tue, May 23, 2017 at 10:51:47AM +0200, Martin Kletzander wrote:

...
On Tue, May 23, 2017 at 09:32:03AM +0100, Daniel P. Berrange wrote:

...
On Mon, May 22, 2017 at 06:00:13PM +0200, Peter Krempa wrote:

...
Hint that the users should limit the number of VMs queried in the bulk stats API.

[...]

...

...
Or we can give a hint on what the limit is and let users figure out the sensible number. It is not only based on the number of VMs. I believe you can hit the limit with one or two VMs if you have lot of devices that we report statistics for. Limiting the number of VMs to a particular number would not help as much in this case. But we can combine both approaches.

Urgh, that's even worse. Apps can't simply split queries into blocks of N vms, because any single VM in that list might have huge number of disks. So to use this at all reliably you have to query the XML config of every guest to see what devices are present and then split up the queries into variable number of VMs :-(

This just adds to my feeling that we should consider this API a failed experiment

Given this metric, any API in libvirt that takes XML is failed in the same sense. You can easily have a VM that exceeds the 4MiB string RPC limit (and 16, or 32) MiB in this sense anyways. Along with metadata, you can reach the limit even easier. A disk information dump returned with this API has slightly above 600 bytes, so let's say 1k, with this metric you can have a VM with 16k disks/nics/whatever, which is reasonable. With the bump to 32MiB, you can obviously have even more. Since this API makes sense (saves time) even if called for a singe VM I don't really think this is a big problem. If you say that the number of VMs you should query is let's say 2 or 4, you can have giant guests. The only non-scalable part is if you have lots of giant guests. You can't really fix that.

3110

Age (days ago)

3112

Last active (days ago)

List overview

Download

12 comments

3 participants

participants (3)

Daniel P. Berrange
Martin Kletzander
Peter Krempa

[libvirt] [RFC PATCH 0/2] Bandaid to "fix" bulk stats API on big hosts

tags

participants (3)