On Mon, Jul 10, 2006 at 02:37:36PM +0100, Daniel P. Berrange wrote:
> So my previous post considered performance when doing a single iteration
> over all active domains. I've now moved on to look at what happens when
> you do multiple iterations - eg a monitoring application sampling state
> every 5 seconds.
> Take code which connects to the HV, and then periodically lists all domains
> and gets their status, eg
>     conn = virConnectOpen(NULL);
>     for (j = 0 ; j < 5 ; j++) {
>         nid = virConnectListDomains(conn, ids, sizeof(ids)/sizeof(int));
>         for (i = 0 ; i < nid ; i++) {
>             int id = ids[i];
>             dom = virDomainLookupByID(conn, id);
>             ret = virDomainGetInfo(dom, &info);
>             printf("Domain %d: %d CPUs\n", id, info.nrVirtCpu);
>         }
>     }
That code is wrong: if you do a domain lookup, which brings back a new domain
pointer, then you must call the associated freeing routine; you can't
just forget the dom pointer. And if you do free it, then your 'optimization'
below will break, because the internal reference count for the domain will
drop back to 0 and the associated virDomain will be freed (check the value
of dom->uses as you iterate through the loop!).
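For instance, the inner loop would need to be more like this (a minimal
sketch):

    for (i = 0 ; i < nid ; i++) {
        int id = ids[i];
        dom = virDomainLookupByID(conn, id);
        if (dom == NULL)
            continue;          /* the domain may have gone away meanwhile */
        ret = virDomainGetInfo(dom, &info);
        if (ret == 0)
            printf("Domain %d: %d CPUs\n", id, info.nrVirtCpu);
        virDomainFree(dom);    /* release the reference taken by the lookup */
    }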
Note that from Python such a loop would not leak that way, because the
virDomain::__del__ destructor implicitly calls virDomainFree, but that means
the same loop done in Python would get no benefit from your patch ;-)
> Then, every single call to virDomainLookupByID will trigger an HTTP GET
> to resolve id -> name,uuid. Now the virConnect data structure contains
> a hash table storing all known domains, however, the xenDaemonLookupByID
> method only uses the cache once it's done the HTTP GET to resolve id -> name.
Okay,
> This is obviously needed the first time around, however, subsequent calls
> to xenDaemonLookupByID have no need to do the HTTP GET since the cache
> already contains a virDomain instance with id, name, uuid resolved.
Well, that is on the edge of correctness. The problem is that the ID will
change: for example, if a separate application does a suspend to disk and
later a resume operation, the domain will still exist, but its ID will have
changed. If you just cache here and never invalidate, you may generate false
positives: you report the domain as existing when it does not.
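Concretely, something like this sequence would go wrong with a pure
ID-keyed cache that is never invalidated (a sketch of the scenario):

    dom1 = virDomainLookupByID(conn, 5);   /* resolved via HTTP GET, cached */
    /* ... meanwhile another application suspends the guest to disk and
     * resumes it; the same guest now runs under a different ID ... */
    dom2 = virDomainLookupByID(conn, 5);   /* cache hit on the stale entry:
                                            * reports a domain with ID 5 even
                                            * though no such domain runs */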
> If you measure the above code snippet on a machine with 20 domains, then
> 5 iterations of the outer loop will require 101 HTTP GETs and take about
> 5 seconds to complete.
> If we modify the xenDaemonLookupByID method to check the cache immediately,
> doing a cache lookup on ID, rather than name, we only need to do the HTTP
> GET request the very first time a domain is looked up. Thus, only the first
> iteration of the outer loop will result in HTTP calls - you can do as many
> iterations as you like and there will only ever be 21 HTTP GETs performed,
> and total runtime for 5 iterations drops to 1 second. The first iteration
> does HTTP GETs to resolve name & uuid, but all subsequent iterations only
> need to do HyperCalls so are lightning fast.
> I'm attaching a patch which implements this speedup. The only issue not
> being dealt with here is cache eviction. Xen domain IDs always increase;
> no ID is re-used until the IDs wrap around - I can't remember whether the
> ID space is 16 or 32 bits, but even with 16 bits we'd need 32,000 domains to
> have been created before wrap-around becomes an issue. We'd need to address
> it long term, but for now I think we can live with this. Cache eviction
> could probably be done in the virListDomains method - i.e. evict any entries
> with IDs not currently running.
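For context, as I read it the core of the patch amounts to something like
this sketch (the cache helper name here is illustrative, not the actual
libvirt internals):

    virDomainPtr
    xenDaemonLookupByID(virConnectPtr conn, int id) {
        virDomainPtr dom;

        /* New behaviour: consult the per-connection hash first, keyed
         * on ID (cacheLookupByID is an illustrative name) */
        if ((dom = cacheLookupByID(conn, id)) != NULL) {
            dom->uses++;       /* take a new reference on the cached entry */
            return dom;        /* no HTTP GET needed at all */
        }

        /* Cache miss: fall back to the HTTP GET to resolve
         * id -> name, uuid and populate the cache as before */
        ...
    }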
I think that as-is the patch is wrong. First, you will get the gain
that you see only if you make the error of not freeing the domain object
in the loop, and second, it may generate false positive lookups.
At this point it's better to be safe and slower until we have a better
way to implement this correctly. This would require:
 - adding a graceful delay on domain deallocation in libvirt, to
   actually benefit from the caching effect when apps query and release
   repeatedly
 - adding some invalidation mechanism: for example, each time the ID
   list is queried we know we have a full snapshot, so some cleanup
   should be done there (see the sketch below)
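Roughly, that invalidation pass could look like this (a sketch only; the
cache helpers are illustrative names, not actual libvirt internals):

    /* To be run each time virConnectListDomains has returned the
     * authoritative list of running IDs */
    static void
    evictStaleDomains(virConnectPtr conn, const int *ids, int nid) {
        int i;

        for (i = 0; i < cacheSize(conn); i++) {   /* illustrative accessors */
            virDomainPtr dom = cacheEntry(conn, i);

            /* drop entries whose ID is gone and which nobody references */
            if (!idIsListed(virDomainGetID(dom), ids, nid) && dom->uses == 0)
                cacheEvict(conn, dom);  /* remove from hash, free name/uuid */
        }
    }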
So yes, there is some optimization to be done there, but as-is the patch
doesn't really seem okay to me; there is a bit more work needed before
we get to a correct solution :-)
Daniel
--
Daniel Veillard      | Red Hat  http://redhat.com/
veillard(a)redhat.com | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/