On 12/12/2012 04:45 PM, Daniel P. Berrange wrote:
Many years ago when the QEMU driver was first written for libvirt, the
daemon was single threaded, so we didn't have to worry about locking
at all. Then we introduced threads and so we had to have locking.
Since then locking has been done at two levels. The big QEMU driver
lock must be acquired first, then the per-VM lock could be acquired.
If no field in the driver struct was used, we could quickly
release the QEMU driver lock and only hold the per-VM lock.
Over time though, our code has got more & more complicated to the
point where we have to hold the big QEMU driver lock in the vast
majority of methods. We mitigate this by releasing the locks when
sleeping on monitor API calls, but this is still a huge bottleneck.
This is particularly apparent when you look at concurrency when
starting/stopping guests.
This scalability limitation is at a point where it is unacceptable
for us to continue as we do today.
Band-aids no longer suffice. To fix this for the long term, we need
to dramatically change the locking approach we use in the QEMU driver.
The only way we can achieve this is by dramatically changing the way
we update/access QEMU driver state.
The core problem is that the qemu driver struct holds a vast array
of (often unrelated) data, with differing access patterns. It is
clear that a single lock is not a suitable level of granularity for
this. Looking at what is in the struct we can classify data into a
number of buckets
1. Read-only data that never changes for the lifetime of libvirtd.
eg
- char *configDir
- char *libDir
- bool privileged;
- const char *uri;
- virThreadPoolPtr workerPool
2. Read-only data that never changes for as long as the config
file is not reloaded. (Currently equivalent to previous bucket
since we don't support reloading qemu.conf - we need to support
that in the future)
eg
- uid_t user;
- uid_t group;
- int dynamicOwnership;
- unsigned int vncTLS :1;
- unsigned int vncSASL :1;
- char *vncListen;
3. Read-write data that changes at arbitrary times
eg
- virDomainObjList domains
- size_t nactive
- pciDeviceList *activePciHostdevs;
- usbDeviceList *activeUsbHostdevs;
- virHashTablePtr closeCallbacks;
My proposal for dealing with things is as follows
1. Read-only data that never changes for the lifetime of libvirtd.
Turn the current driver mutex into a driver RW-lock. All API
calls will always acquire a read-lock at start, hold it for
the lifetime of their execution and release it at the end.
API calls will never directly acquire write locks.
The QEMU driver startup / shutdown global initializers will
acquire write-locks. This ensures the daemon can't shutdown
while any APIs are being executed.
2. Read-only data that never changes for as long as the config
file is not reloaded.
Move all of this data out into a new virQEMUDriverConfigPtr
struct, which is an instance of virObject. The virQEMUDriverPtr
will hold the primary reference to the config. The contents
of this object struct will be considered immutable once
initialized.
When an API needs to access config file data, it will obtain a
reference on the config object. Obtaining the reference
will involve acquiring & releasing the driver lock.
If the QEMU driver needs to reload the config, it will populate
a completely new virQEMUDriverConfigPtr instance, and unref
the existing one.
Thus access to data in virQEMUDriverConfigPtr can be completely
lockless once an instance has been acquired, despite the possibility
of the config being updated at an arbitrary time.
cf RCU (read-copy-update)
3. Read-write data that changes at arbitrary times.
All data that can be changed must be stored in a dedicated
virObject based instance. Each object must provide its own
internal locking mechanisms targeted to the type of data
being stored.
Some objects may need some re-architecting to allow them to
operate effectively without the protection of the long lived
QEMU driver lock. For example during domain startup, we rely
on the QEMU driver lock to protect against races between the
time we check for an existing VM with (name,uuid), and the
time we actually finish starting the new VM & store it in
the domain list. To deal with this the virDomainObjList
will need to have some concept of a 'reserved name,uuid'
so safety is ensured, despite not holding a lock for the whole
start operation.
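
The 'reserved name,uuid' idea above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not libvirt code:
the domain_list type, the slot states and all function names are
hypothetical, and only the name is tracked (the uuid is left out for
brevity). The list lock is held just long enough to claim a slot, not for
the whole start operation.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_DOMAINS 16
#define NAME_LEN 64

/* Hypothetical stand-in for virDomainObjList with its own internal lock. */
enum slot_state { SLOT_FREE = 0, SLOT_RESERVED, SLOT_ACTIVE };

struct domain_slot {
    enum slot_state state;
    char name[NAME_LEN];
};

struct domain_list {
    pthread_mutex_t lock;
    struct domain_slot slots[MAX_DOMAINS];
};

void domain_list_init(struct domain_list *l)
{
    memset(l->slots, 0, sizeof(l->slots));   /* every slot starts SLOT_FREE */
    pthread_mutex_init(&l->lock, NULL);
}

/* Claim 'name' before starting the VM. Returns a slot index, or -1 if the
 * name is already reserved/active, too long, or the list is full. */
int domain_list_reserve(struct domain_list *l, const char *name)
{
    int idx = -1;

    if (strlen(name) >= NAME_LEN)
        return -1;

    pthread_mutex_lock(&l->lock);
    for (int i = 0; i < MAX_DOMAINS; i++) {
        if (l->slots[i].state == SLOT_FREE) {
            if (idx < 0)
                idx = i;                     /* remember the first free slot */
        } else if (strcmp(l->slots[i].name, name) == 0) {
            idx = -1;                        /* somebody else owns this name */
            break;
        }
    }
    if (idx >= 0) {
        strcpy(l->slots[idx].name, name);    /* length was checked above */
        l->slots[idx].state = SLOT_RESERVED;
    }
    pthread_mutex_unlock(&l->lock);
    return idx;
}

/* The VM started successfully: turn the reservation into a real entry. */
void domain_list_commit(struct domain_list *l, int idx)
{
    pthread_mutex_lock(&l->lock);
    assert(l->slots[idx].state == SLOT_RESERVED);
    l->slots[idx].state = SLOT_ACTIVE;
    pthread_mutex_unlock(&l->lock);
}

/* Startup failed (or the VM went away): give the name back. */
void domain_list_release(struct domain_list *l, int idx)
{
    pthread_mutex_lock(&l->lock);
    l->slots[idx].state = SLOT_FREE;
    pthread_mutex_unlock(&l->lock);
}
```

A start operation would call domain_list_reserve() first, do the slow
work without any list lock held, and then call either
domain_list_commit() or domain_list_release().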
So I lied slightly when I said this was the death of the big QEMU
driver lock. The big QEMU driver lock still exists, but API calls
only ever need to have read-locks. Write-locks are only held for
libvirtd startup/shutdown, and for the tiny time window it takes
to grab a reference to a virQEMUDriverConfigPtr.
Access to config params is completely lockless, even allowing for
their live update.
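
A minimal sketch of that config handling, assuming C11 atomics for the
reference count. The struct and function names are made up for
illustration (they are not the virObject API), and a plain mutex stands
in for the driver lock that guards the pointer swap. The point is that
readers only touch the lock while taking a reference; afterwards they
read their immutable snapshot with no lock at all, even across a reload.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical immutable config object (stand-in for virQEMUDriverConfigPtr). */
struct config {
    atomic_int refs;
    char vnc_listen[64];
};

struct driver {
    pthread_mutex_t lock;     /* guards only the 'cfg' pointer itself */
    struct config *cfg;       /* the driver's primary reference */
};

struct config *config_new(const char *vnc_listen)
{
    struct config *c = calloc(1, sizeof(*c));
    if (!c)
        return NULL;
    atomic_init(&c->refs, 1);
    snprintf(c->vnc_listen, sizeof(c->vnc_listen), "%s", vnc_listen);
    return c;
}

/* Drop one reference; the last one out frees the object. */
void config_unref(struct config *c)
{
    assert(c != NULL);
    if (atomic_fetch_sub(&c->refs, 1) == 1)
        free(c);
}

void driver_init(struct driver *d, struct config *initial)
{
    pthread_mutex_init(&d->lock, NULL);
    d->cfg = initial;
}

/* Brief lock: read the current pointer and take a reference on it. */
struct config *driver_get_config(struct driver *d)
{
    pthread_mutex_lock(&d->lock);
    struct config *c = d->cfg;
    atomic_fetch_add(&c->refs, 1);
    pthread_mutex_unlock(&d->lock);
    return c;
}

/* Reload: publish a completely new instance, then drop the old primary
 * reference. Existing readers keep their old snapshot until they unref. */
void driver_reload_config(struct driver *d, struct config *fresh)
{
    pthread_mutex_lock(&d->lock);
    struct config *old = d->cfg;
    d->cfg = fresh;
    pthread_mutex_unlock(&d->lock);
    config_unref(old);
}
```

An API call would do driver_get_config() once at the start, read fields
freely, and call config_unref() when finished.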
All the remaining exclusive locks will be pushed down into individual
objects which need them, hopefully ensuring high concurrency of
operation.
Implementing this all is a non-trivial job, so I envisage the following
order of attack
1. Create the virQEMUDriverConfigPtr object & move config file
parameters into that.
2. Encapsulate all read-writable state into objects with dedicated
locking
3. Turn QEMU driver mutex into a read-write lock
4. Convert all APIs to only hold read-locks on QEMU driver.
Regards,
Daniel
Hi,
I see this has been thought through a lot more than I managed when I first came
across this issue, but I'm glad some of those thoughts were fairly
similar, so I'll share mine too, with some questions following.
I already have a working patch for virRWLock with its methods LockRead,
Lock(Write) and Unlock, that not only replaces QEMU driver's virMutex,
but is written so it can be used in more places (of course). With this,
an idea for a new "locking virObject" came to my mind. A possible
"inherited class", let's say, that would have the capabilities of
virObject with RWLock incorporated inside itself with universal
lock-related methods. Is that what you had in mind with the "object
with dedicated locking"? This would help keep the lock on the same
"level" in methods related to the object itself. For other use cases
than the QEMU driver, I made it as Win32-friendly as I could without being
able to compile on that platform.
Having each and every read-write property of the driver as an object
with its own read-write lock (again not Mutex, see below) seems a bit
cumbersome, though. I thought splitting it in a logical hierarchy would
be nicer. This, of course, might not be possible for most of those
properties (just an idea).
I understand you meant it the same way as I, but just to be sure the
idea is common, these objects should have read-write locks, not mutexes
in them, right? I'm asking, because some of the locks can be held for
procedures that are most likely to be only reading (e.g. looking up a
domain).
As you said, one of the most apparent bottlenecks is parallel starting
of QEMU domains. This is even more troublesome (and apparent) when
management applications start, for example, requesting statistics for
each of those domains. The speed can go down by orders of magnitude in
such cases.
I've got two additional questions, if I may, just out of curiosity.
Which category are you planning to keep the capabilities in: the
virQEMUDriverConfigPtr, or one of the self-locking objects?
I couldn't find out for sure, but it seems that POSIX threads'
write-locks can be starved by read-locks, which we IMHO don't want since
the majority of operations will be read-only. Do you know what the
preference is, and whether we might have to rewrite the locks to be
greedy the other way around?
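
For what it's worth, on glibc the reader/writer preference can be chosen
per lock through a non-portable attribute, so a rewrite may not be needed
on that platform. The sketch below assumes glibc; the wrapper name
rwlock_init_prefer_writer is hypothetical. On other platforms it falls
back to the default attributes, where the preference is left to the
implementation. Note the glibc restriction that with this kind a thread
must not take the read lock recursively.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes the glibc *_NP extension */
#endif
#include <assert.h>
#include <pthread.h>

/* Initialise 'rw' so that waiting writers are not starved by a steady
 * stream of readers, where the platform lets us ask for that. */
int rwlock_init_prefer_writer(pthread_rwlock_t *rw)
{
    pthread_rwlockattr_t attr;
    int rc = pthread_rwlockattr_init(&attr);
    if (rc != 0)
        return rc;

#ifdef __GLIBC__
    /* glibc only: readers must not lock recursively with this kind. */
    rc = pthread_rwlockattr_setkind_np(
        &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
#endif

    if (rc == 0)
        rc = pthread_rwlock_init(rw, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}
```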
Martin