On Thu, May 16, 2013 at 1:18 PM, Daniel P. Berrange <berrange(a)redhat.com> wrote:
On Thu, May 16, 2013 at 01:00:15PM -0400, Peter Feiner wrote:
> > How many CPU cores are you testing on ? That's a good improvement,
> > but I'd expect the improvement to be greater as # of core is larger.
>
> I'm testing on 12 cores x 2 hyperthreads per core. As I'm working on teasing out
> software bottlenecks, I'm intentionally running fewer tasks (20 parallel
> creations) than the number of logical cores (24). The memory, disk and
> network are also well over provisioned.
>
> > Also did you tune /etc/libvirt/libvirtd.conf at all ? By default we
> > limit a single connection to only 5 RPC calls. Beyond that calls
> > queue up, even if libvirtd is otherwise idle. OpenStack uses a
> > single connection for everything so will hit this. I suspect this
> > would be why virConnectGetLibVersion would appear to be slow. That
> > API does absolutely nothing of any consequence, so the only reason
> > I'd expect that to be slow is if you're hitting a libvirtd RPC
> > limit causing the API to be queued up.
>
> I hadn't tuned libvirtd.conf at all. I have just increased
> max_{clients,workers,requests,client_requests} to 50 and repeated my
> experiment. As you expected, virConnectGetLibVersion is now very fast.
> Unfortunately, the median VM creation time didn't change.
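
For reference, the knobs in question live in /etc/libvirt/libvirtd.conf; a
minimal sketch of the tuning described above (the value 50 is just what was
used in this experiment, not a recommendation):

max_clients = 50          # simultaneous client connections
max_workers = 50          # worker threads servicing RPC calls
max_requests = 50         # total concurrent RPC calls across all connections
max_client_requests = 50  # concurrent RPC calls per connection (default is 5)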
>
> > I'm not actively doing anything in this area. Mostly because I've got no
> > clear data on where any remaining bottlenecks are.
>
> Unless there are other parameters to tweak, I believe I'm still hitting a
> bottleneck. Booting 1 VM vs booting 20 VMs in parallel, the times for libvirt
> calls are
>
> virConnectDefineXML*: 13ms vs 4.5s
> virDomainCreateWithFlags*: 1.8s vs 20s
>
> * I had said that virConnectDefineXML wasn't serialized in my first email. I
> based that observation on a single trace I looked at :-) In the average case,
> virConnectDefineXML is affected by a bottleneck.
virConnectDefineXML would at least hit the possible bottleneck on
the virDomainObjListAddLocked method. In fact that's pretty much
the only contended lock I'd expect it to hit. Nothing else that
it runs has any serious locking involved.
Okay cool, I'll measure this. I'll also try to figure out what
virDomainCreateWithFlags is waiting on.
> Note that when I took these measurements, I also monitored CPU & disk
> utilization. During the 20 VM test, both CPU & disk were well below 100%
> for 97% of the test (i.e., over the 60s test duration, measuring utilization
> with atop at a 2 second interval, CPU was pegged for only 2s).
>
> > One theory I had was that the virDomainObjListSearchName method could
> > be a bottleneck, because that acquires a lock on every single VM. This
> > is invoked when starting a VM, when we call virDomainObjListAddLocked.
> > I tried removing this locking though & didn't see any performance
> > benefit, so never pursued this further. Before trying things like
> > this again, I think we'd need to find a way to actually identify where
> > the true bottlenecks are, rather than guesswork.
>
> Testing your hypothesis would be straightforward. I'll add some instrumentation
> to measure the time spent waiting for the locks and repeat my 20 VM experiment.
> Or, if there's some systematic lock profiling in place, then I can turn that on
> and report the results.
There's no lock profiling support built into libvirt. I'm not sure of the
best way to introduce such support without it impacting the very thing
we're trying to test. Suggestions welcome.
A straightforward way to keep lock statistics with low overhead and
w/out affecting concurrency would be to use thread local storage
(TLS). At the end of a run, or periodically, the stats could be
aggregated and reported. Since the stats don't have to be precise,
it's OK to do the aggregation racily.
Simple statistics to keep are
* For each lock L, the time spent waiting.
* For each lock L and callsite C, the time spent waiting.
It would probably be sufficient to identify L as the lock's parent
class name. If per-instance stats are necessary, then we could add the
address of the object to the identity of L.
So pseudocode would look something like this:

struct lock_stats {
    map of (lock_class) to unsigned long: wait_time;
    map of (lock_class, stack_trace) to unsigned long: callsite_wait_time;
};

__thread struct lock_stats *lock_stats;

thread_local_storage_init() {
    lock_stats = new lock_stats;
}

/* return microseconds elapsed since some arbitrary start time */
unsigned long timestamp(void) {
    struct timespec timespec;
    clock_gettime(CLOCK_MONOTONIC, &timespec);
    return timespec.tv_sec * 1000000 + timespec.tv_nsec / 1000;
}

void virObjectLock(void *anyobj) {
    unsigned long start, elapsed;
    virObjectLockablePtr obj = anyobj;

    start = timestamp();
    virMutexLock(&obj->lock);
    elapsed = timestamp() - start;

    lock_stats->wait_time[obj->parent.klass->name] += elapsed;
    lock_stats->callsite_wait_time[obj->parent.klass->name, get_stack_trace()] += elapsed;
}
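
To make this concrete, here is a minimal, self-contained sketch of the TLS
bookkeeping as a plain pthread wrapper (timed_mutex_lock, site_stat and the
string-label keys are made up for illustration; a real patch would hook
virObjectLock as in the pseudocode above and key on the class name and
stack trace):

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_SITES 64

struct site_stat {
    const char *label;            /* call-site / lock-class identifier */
    unsigned long long wait_us;   /* accumulated time spent waiting */
};

/* Per-thread table: no synchronization needed on the lock path. */
static __thread struct site_stat stats[MAX_SITES];
static __thread int nstats;

static unsigned long long timestamp_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;
}

/* Lock 'm', charging any wait time to 'label' in this thread's table. */
static void timed_mutex_lock(pthread_mutex_t *m, const char *label)
{
    unsigned long long start = timestamp_us();
    pthread_mutex_lock(m);
    unsigned long long elapsed = timestamp_us() - start;

    for (int i = 0; i < nstats; i++) {
        if (strcmp(stats[i].label, label) == 0) {
            stats[i].wait_us += elapsed;
            return;
        }
    }
    if (nstats < MAX_SITES) {
        stats[nstats].label = label;
        stats[nstats].wait_us = elapsed;
        nstats++;
    }
}

/* Dump the calling thread's table; a real version would aggregate
 * (racily is fine, per the above) across all threads at exit. */
static void dump_lock_stats(void)
{
    for (int i = 0; i < nstats; i++)
        fprintf(stderr, "%-30s %llu us\n", stats[i].label, stats[i].wait_us);
}

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    timed_mutex_lock(&m, "demo-lock");
    pthread_mutex_unlock(&m);

    dump_lock_stats();
    return 0;
}

Since each thread only writes its own table, the instrumented lock path costs
two clock_gettime() calls plus a small scan, which shouldn't perturb the
contention being measured by much.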
Perhaps a systemtap script would do a reasonable job at it though,
e.g. record any stack traces associated with long futex_wait() system
calls, or something like that.
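
Something along those lines might look like the sketch below (untested; it
simply times every futex() call made by libvirtd and prints a user-space
backtrace for waits longer than an arbitrary 10ms threshold, so stap needs
libvirtd's debuginfo for the backtraces to be symbolic):

global entry

probe syscall.futex {
    if (pid() == target())
        entry[tid()] = gettimeofday_us()
}

probe syscall.futex.return {
    t = entry[tid()]
    delete entry[tid()]
    if (t && gettimeofday_us() - t > 10000) {
        printf("futex wait of %d us:\n", gettimeofday_us() - t)
        print_ubacktrace()
    }
}

Invoked with something like "stap -d /usr/sbin/libvirtd --ldd -x $(pidof
libvirtd) futex-wait.stp" (the script name is illustrative).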
Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|