On 11/02/2017 06:20 AM, Cedric Bosdonnat wrote:
Hi all,
I'm investigating a crash in the storage driver, and it seems the solution
is not as obvious as I had hoped.
Basically what happens is that libvirtd crashes due to the pool refresh thread
clearing the list of pool volumes while someone is adding a new volume. Here
is the valgrind log I have for that:
http://paste.opensuse.org/58733866
What version of libvirt? Your line numbers don't match up with current
HEAD.
I tried adding a call to virStoragePoolObjLock() in the
virStorageVolPoolRefreshThread(), since the storageVolCreateXML()
appears to have it, but I'm getting a deadlock. Here are the waiting
threads I got from gdb:
virStoragePoolObjFindByName returns a locked pool obj, so not sure
where/why you added this...
http://paste.opensuse.org/45961070
Any idea what would be a proper fix for that? My understanding of the storage
driver's locks is too limited, it seems ;)
From just a quick look...
The virStorageVolPoolRefreshThread will take storageDriverLock, get a
locked pool object (virStoragePoolObjFindByName), clear out the pool,
add back volumes that it knows about, unlock the pool object, and call
storageDriverUnlock.
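To make that ordering concrete, here is a minimal standalone sketch using
plain pthread mutexes. None of it is libvirt code: sketch_driver_lock,
struct sketch_pool and sketch_refresh() are hypothetical stand-ins for the
driver lock, the pool object and the refresh thread body. It only shows the
"driver lock first, pool lock second, release in reverse" pattern.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_MAX_VOLS 8

/* Hypothetical stand-in for a pool object; not a libvirt type. */
struct sketch_pool {
    pthread_mutex_t lock;        /* plays the role of the pool object lock */
    size_t nvols;                /* number of volumes currently listed */
    int vols[SKETCH_MAX_VOLS];   /* volume ids, just integers here */
};

/* Hypothetical stand-in for the driver-wide lock. */
static pthread_mutex_t sketch_driver_lock = PTHREAD_MUTEX_INITIALIZER;

/* Refresh: driver lock, then pool lock, clear the list, add back what
 * was found, then release both locks in reverse order. Returns the new
 * volume count. */
static size_t sketch_refresh(struct sketch_pool *pool,
                             const int *found, size_t nfound)
{
    size_t i;
    size_t n;

    assert(pool != NULL);

    pthread_mutex_lock(&sketch_driver_lock);
    pthread_mutex_lock(&pool->lock);

    pool->nvols = 0;                          /* clear out the pool */
    for (i = 0; i < nfound && i < SKETCH_MAX_VOLS; i++)
        pool->vols[pool->nvols++] = found[i]; /* add back known volumes */
    n = pool->nvols;

    pthread_mutex_unlock(&pool->lock);
    pthread_mutex_unlock(&sketch_driver_lock);
    return n;
}
```

Any other code path that touches the volume list while holding neither of
these two locks can observe the list in its cleared state.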
If you were adding a new pool... You'd take storageDriverLock, then
eventually virStoragePoolObjAssignDef would be called, during which the
pool's object lock is taken and unlocked, and which returns a locked pool
object that later gets unlocked in cleanup, followed by a
storageDriverUnlock.
If you're adding a new volume object, you'd get a locked pool object via
virStoragePoolObjFromStoragePool. Then, if building, that lock gets
dropped after increasing the async job count and setting the building
flag, the volume is built, and the object lock is retaken while
temporarily holding the storageDriverLock. The async job count then gets
decremented and the building flag cleared, and eventually we fall into
cleanup, which unlocks the pool again.
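The volume build sequence described here can be sketched roughly as
follows, again with plain pthread mutexes and made-up names
(sketch_create_vol(), sketch_build(), the asyncjobs and building fields).
This is an illustration of the "drop the lock during the slow build" idea
only, not the actual storageVolCreateXML() code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for illustration; these are not libvirt types. */
struct sketch_pool {
    pthread_mutex_t lock;     /* plays the role of the pool object lock */
    unsigned int asyncjobs;   /* builds currently running without the lock */
};

struct sketch_vol {
    int building;             /* set while the build runs unlocked */
    int lock_free_in_build;   /* recorded by the build step below */
};

static pthread_mutex_t sketch_driver_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the slow build. It only records whether the pool lock
 * was free, i.e. whether another thread could have taken it meanwhile. */
static void sketch_build(struct sketch_pool *pool, struct sketch_vol *vol)
{
    if (pthread_mutex_trylock(&pool->lock) == 0) {
        vol->lock_free_in_build = 1;
        pthread_mutex_unlock(&pool->lock);
    }
}

static void sketch_create_vol(struct sketch_pool *pool, struct sketch_vol *vol)
{
    /* Start with the pool locked, mark the job, then drop the lock. */
    pthread_mutex_lock(&pool->lock);
    pool->asyncjobs++;
    vol->building = 1;
    pthread_mutex_unlock(&pool->lock);

    sketch_build(pool, vol);              /* runs without the pool lock */

    /* Retake the pool lock while temporarily holding the driver lock. */
    pthread_mutex_lock(&sketch_driver_lock);
    pthread_mutex_lock(&pool->lock);
    pthread_mutex_unlock(&sketch_driver_lock);

    assert(pool->asyncjobs > 0);
    pool->asyncjobs--;
    vol->building = 0;

    pthread_mutex_unlock(&pool->lock);    /* the "cleanup" unlock */
}
```

In this sketch the async job count and the building flag are the only
things telling other threads that a build is in flight while the pool lock
is not held.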
So how to fix it? Well, it seems to me that taking the storageDriverLock
in VolCreateXML may be the way, since the refresh thread takes the driver
lock first, then the pool object second. Perhaps even like the build code,
where it takes it temporarily while getting the pool object. I'd have to
think a bit more about it though. Still, it might be something to try since
the Vol Refresh thread takes it while running...
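As a rough, untested illustration of that idea, the lookup could hold the
driver lock only for as long as it takes to get the pool object locked.
The names below (sketch_get_locked_pool(), sketch_put_pool(),
sketch_pool_obj) are hypothetical and the locks are plain pthread mutexes;
this is not a patch against libvirt.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a pool object; not a libvirt type. */
struct sketch_pool {
    pthread_mutex_t lock;   /* plays the role of the pool object lock */
};

static pthread_mutex_t sketch_driver_lock = PTHREAD_MUTEX_INITIALIZER;
static struct sketch_pool sketch_pool_obj = { PTHREAD_MUTEX_INITIALIZER };

/* Return the pool locked. The driver lock is taken first and held only
 * until the pool lock has been acquired, so the order is always
 * driver lock, then pool lock. */
static struct sketch_pool *sketch_get_locked_pool(void)
{
    struct sketch_pool *pool;

    pthread_mutex_lock(&sketch_driver_lock);
    pool = &sketch_pool_obj;              /* stands in for a pool lookup */
    pthread_mutex_lock(&pool->lock);
    pthread_mutex_unlock(&sketch_driver_lock);
    return pool;
}

/* Release a pool obtained from sketch_get_locked_pool(). */
static void sketch_put_pool(struct sketch_pool *pool)
{
    assert(pool != NULL);
    pthread_mutex_unlock(&pool->lock);
}
```

With this shape, a caller that arrives while another thread holds the
driver lock for its whole run waits on the driver lock first, so both sides
acquire the two locks in the same order.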
John
Not related to this problem per se, but what may help even more is if I
can get the patches that convert the storage driver to a common object
model completely reviewed, but that's a different problem ;-)... I'll have
to go look and see if I may have fixed this there. The newer model uses
hash tables, RW locks, and reduces the storage driver hammer lock, but
this one condition may not have been addressed.
--
Cedric
--
libvir-list mailing list
libvir-list(a)redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list