On 11/02/2017 09:10 AM, Cedric Bosdonnat wrote:
Hi John,
On Thu, 2017-11-02 at 07:03 -0400, John Ferlan wrote:
>
> On 11/02/2017 06:20 AM, Cedric Bosdonnat wrote:
>> Hi all,
>>
>> I'm investigating a crash in the storage driver, and it seems the solution
>> is not as obvious as I had hoped.
>>
>> Basically what happens is that libvirtd crashes due to the pool refresh thread
>> clearing the list of pool volumes while someone is adding a new volume. Here
>> is the valgrind log I have for that:
>>
>>
>> http://paste.opensuse.org/58733866
>>
>
> What version of libvirt? Your line numbers don't match up with current
> HEAD.
It's 3.8.0.
>> I tried adding a call to virStoragePoolObjLock() in the
>> virStorageVolPoolRefreshThread(), since the storageVolCreateXML()
>> appears to have it, but I'm getting a dead lock. Here are the waiting
>> threads I got from gdb:
>
> virStoragePoolObjFindByName returns a locked pool obj, so not sure
> where/why you added this...
Oh, I overlooked that... then there is already a lock on it; the solution
is elsewhere.
>>
>> http://paste.opensuse.org/45961070
>>
>> Any idea what would be a proper fix for that? My vision of the storage
>> driver's locks is too limited, it seems ;)
>
> From just a quick look...
>
> The virStorageVolPoolRefreshThread will take storageDriverLock, get a
> locked pool object (virStoragePoolObjFindByName), clear out the pool,
> add back volumes that it knows about, unlock the pool object, and call
> storageDriverUnlock.
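So in rough pseudo-code, if I follow, the refresh path is (simplified,
argument lists and error handling elided, not the exact code):

    /* virStorageVolPoolRefreshThread(), roughly */
    storageDriverLock();
    obj = virStoragePoolObjFindByName(&driver->pools, name); /* locked obj */
    virStoragePoolObjClearVols(obj);     /* volume list goes away here */
    backend->refreshPool(conn, obj);     /* volume list is repopulated */
    virStoragePoolObjUnlock(obj);
    storageDriverUnlock();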
>
> If you were adding a new pool... You'd take storageDriverLock, then
> eventually virStoragePoolObjAssignDef would be called and the pool's
> object lock taken and unlocked, and it returns a locked pool object which
> later gets unlocked in cleanup, followed by a storageDriverUnlock.
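And pool creation would be, roughly (same disclaimer, simplified
pseudo-code):

    /* storagePoolCreateXML(), roughly */
    storageDriverLock();
    obj = virStoragePoolObjAssignDef(&driver->pools, def); /* returns a
                                                            * locked obj */
    /* ... start the pool ... */
    virStoragePoolObjUnlock(obj);        /* in cleanup */
    storageDriverUnlock();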
>
> If you're adding a new volume object, you'd get a locked pool object via
> virStoragePoolObjFromStoragePool, then if building, that lock gets
> dropped after increasing the async job count and setting the building
> flag, the volume is built, then the object lock is retaken while
> temporarily holding the storageDriverLock, the async job count gets
> decremented and the building flag cleared, and eventually we fall into
> cleanup, which unlocks the pool again.
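If I read the volume path correctly, the interesting part is that the
pool lock is dropped for the whole duration of the build (again a
simplified pseudo-code sketch, not the exact code):

    /* storageVolCreateXML() build path, roughly */
    obj = virStoragePoolObjFromStoragePool(pool);  /* locked pool obj */
    voldef->building = true;
    obj->asyncjobs++;
    virStoragePoolObjUnlock(obj);       /* no pool lock during the build */

    buildret = backend->buildVol(conn, obj, voldef, flags); /* long running */

    storageDriverLock();                /* retake: driver lock first... */
    virStoragePoolObjLock(obj);         /* ...then the pool object lock */
    storageDriverUnlock();

    voldef->building = false;
    obj->asyncjobs--;
    /* ... and cleanup unlocks the pool again */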
Thanks for the details: this is really useful to globally understand how
locks are working in the storage driver. Maybe worth turning it into a doc
somewhere?
You mean code isn't self-documenting? Shocking ;-) [tongue firmly
planted in cheek].
Perhaps when I complete the conversion for Storage Pool/Volume and do
more Domain fix-ups I can add something somewhere that "generally"
describes internal flow. The goal I've had is to be fairly similar for
all drivers w/r/t object usage and code flow. It's taking a long time
to get there though because it's a lot of "roto-tilling" of code.
John
FWIW: I saw your query before we hopped in the car for a 7-hour drive -
figured at least providing something before I left would be helpful ;-).
> So how to fix - well it seems to me the storageDriverLock in
> storageVolCreateXML may be the way, since the refresh thread takes the
> driver lock first, then the pool object second. Perhaps even like the
> build code where it takes it temporarily while getting the pool object.
> I'd have to think a bit more about it, though. Still might be something
> to try since the Vol Refresh thread takes it while running...
Ok, I'll look into that direction.
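Something like this, I guess (untested sketch, just mirroring the refresh
thread's lock ordering so the pool lookup can't race with
virStorageVolPoolRefreshThread()):

    /* hypothetical change in storageVolCreateXML() */
    storageDriverLock();                /* driver lock first, like the
                                         * refresh thread */
    obj = virStoragePoolObjFromStoragePool(pool);  /* locked pool obj */
    storageDriverUnlock();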
> John
>
> Not related to this problem per se, but what may help even more is if I
> can get the storage driver usage of a common object model patches
> completely reviewed, but that's a different problem ;-)... I'll have to
> go look and see if I may have fixed this there. The newer model uses
> hash tables, RW locks, and reduces the storage driver hammer lock, but
> this one condition may not have been addressed.
I can have a go at it and see if I can reproduce the crash with it. Not sure
I'm skilled enough for the full review, though ;)
--
Cedric