How many CPU cores are you testing on? That's a good improvement, but I'd
expect the improvement to be greater as the # of cores gets larger.
I'm testing on 12 cores x 2 HT per core. As I'm working on teasing out
software bottlenecks, I'm intentionally running fewer tasks (20 parallel
creations) than the number of logical cores (24). The memory, disk and
network are also well over-provisioned.
Also did you tune /etc/libvirt/libvirtd.conf at all ? By default we
limit a single connection to only 5 RPC calls. Beyond that calls
queue up, even if libvirtd is otherwise idle. OpenStack uses a
single connection for everything, so it will hit this. I suspect this
would be why virConnectGetLibVersion would appear to be slow. That
API does absolutely nothing of any consequence, so the only reason
I'd expect that to be slow is if you're hitting a libvirtd RPC
limit causing the API to be queued up.
I hadn't tuned libvirtd.conf at all. I have just increased
max_{clients,workers,requests,client_requests} to 50 and repeated my
experiment. As you expected, virConnectGetLibVersion is now very fast.
Unfortunately, the median VM creation time didn't change.
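For reference, this is roughly what I now have in /etc/libvirt/libvirtd.conf
(I simply bumped all four limits to 50 rather than tuning them individually):

    # previously all of these were at their (lower) defaults
    max_clients = 50
    max_workers = 50
    max_requests = 50
    max_client_requests = 50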
I'm not actively doing anything in this area, mostly because I've got no
clear data on where any remaining bottlenecks are.
Unless there are other parameters to tweak, I believe I'm still hitting a
bottleneck. Booting 1 VM vs booting 20 VMs in parallel, the times for the
libvirt calls are:

  virConnectDefineXML*:        13 ms vs 4.5 s
  virDomainCreateWithFlags*:   1.8 s vs 20 s
* In my first email I had said that virConnectDefineXML wasn't serialized. I
based that observation on a single trace I looked at :-) In the average case,
virConnectDefineXML is affected by a bottleneck.
Note that when I took these measurements, I also monitored CPU & disk
utilization. During the 20 VM test, both CPU and disk were well below 100%
for 97% of the test (the test ran for 60s, I measured utilization with atop
at a 2-second interval, and the CPU was pegged for only 2s of that).
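In case it's useful to see what is being timed, here is a minimal sketch of
one define + boot cycle against the C API. This is not my actual measurement
harness; the connection URI, the placeholder domain XML and the build line
are purely illustrative.

    /* Illustrative only: time one define + start cycle against libvirtd.
     * Build with: gcc timing.c -o timing $(pkg-config --cflags --libs libvirt)
     */
    #include <stdio.h>
    #include <time.h>
    #include <libvirt/libvirt.h>

    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        virDomainPtr dom = NULL;
        double t0, t1, t2;

        if (!conn)
            return 1;

        /* placeholder: the real test uses a full <domain> definition */
        const char *xml = "<domain type='kvm'>...</domain>";

        t0 = now_ms();
        dom = virDomainDefineXML(conn, xml);      /* "define" step */
        t1 = now_ms();
        if (dom)
            virDomainCreateWithFlags(dom, 0);     /* "create"/boot step */
        t2 = now_ms();

        printf("define: %.1f ms  create: %.1f ms\n", t1 - t0, t2 - t1);

        if (dom)
            virDomainFree(dom);
        virConnectClose(conn);
        return 0;
    }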
One theory I had was that the virDomainObjListSearchName method
could
be a bottleneck, because that acquires a lock on every single VM. This
is invoked when starting a VM, when we call virDomainObjListAddLocked.
I tried removing this locking though & didn't see any performance
benefit, so never pursued this further. Before trying things like
this again, I think we'd need to find a way to actually identify where
the true bottlenecks are, rather than guesswork.
Testing your hypothesis would be straightforward. I'll add some
instrumentation to
measure the time spent waiting for the locks and repeat my 20 VM experiment. Or,
if there's some systematic lock profiling in place, then I can turn
that on and report
the results.
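Concretely, the sort of instrumentation I have in mind is just a timed
wrapper around the lock acquisitions in question. The wrapper name, the
logging and the 10 ms threshold below are mine, not libvirt code; in
libvirtd it would presumably sit around the per-domain locks taken in
virDomainObjListSearchName / virDomainObjListAddLocked.

    /* Sketch: measure how long we block acquiring a mutex and complain
     * if it exceeds a threshold.  Purely illustrative. */
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>

    #define SLOW_LOCK_MS 10.0   /* arbitrary threshold for reporting */

    static double elapsed_ms(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000.0 +
               (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    static void timed_lock(pthread_mutex_t *m, const char *what)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        pthread_mutex_lock(m);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double waited = elapsed_ms(start, end);
        if (waited > SLOW_LOCK_MS)
            fprintf(stderr, "waited %.1f ms for lock '%s'\n", waited, what);
    }

Summing those waits per lock over the 20 VM run should show whether the
per-VM locks are where the time actually goes.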
Thanks,
Peter