Re: [libvirt] storage lock issues

Thursday, 2 November 2017

Hi John,

On Thu, 2017-11-02 at 07:03 -0400, John Ferlan wrote:
...

 On 11/02/2017 06:20 AM, Cedric Bosdonnat wrote:
 > Hi all,
 > 
 > I'm investigating a crash in the storage driver, and it seems the solution
 > is not as obvious as I wanted it.
 > 
 > Basically what happens is that libvirtd crashes due to the pool refresh thread
 > clearing the list of pool volumes while someone is adding a new volume. Here
 > is the valgrind log I have for that:
 > 
 > http://paste.opensuse.org/58733866
 > 

 What version of libvirt?  Your line numbers don't match up with current
 HEAD. 
It's a 3.8.0.

...
 > I tried adding a call to virStoragePoolObjLock() in the
 > virStorageVolPoolRefreshThread(), since the storageVolCreateXML()
 > appears to have it, but I'm getting a dead lock. Here are the waiting
 > threads I got from gdb:

 virStoragePoolObjFindByName returns a locked pool obj, so not sure
 where/why you added this... 
Oh I overlooked that... then there is already a lock on it, the solution
is elsewhere

...
 > 
 > http://paste.opensuse.org/45961070
 > 
 > Any idea what would be a proper fix for that? My vision of the storage
 > drivers locks is too limited it seems ;)

 From just a quick look...

 The virStorageVolPoolRefreshThread will take storageDriverLock, get a
 locked pool object (virStoragePoolObjFindByName), clear out the pool,
 add back volumes that it knows about, unlock the pool object, and call
 storageDriverUnlock.

 If you were adding a new pool... You'd take storageDriverLock, then
 eventually virStoragePoolObjAssignDef would be called and the pool's
 object lock taken and unlocked, and returns a locked pool object which
 gets later unlocked in cleanup, followd by a storageDriverUnlock.

 If you're adding a new volume object, you'd get a locked pool object via
 virStoragePoolObjFromStoragePool, then if building, that lock gets
 dropped after increasing the async job count and setting the building
 flag, the volume is built, then the object lock retaken while
 temporarily holding the storageDriverLock, the async job count gets
 decremented and the building flag cleared, eventually we fall into
 cleanup with unlocks the pool again. 
Thanks for the details: this is really useful to globally understand how
locks are working in the storage driver. Maybe worth turning it into a doc
somewhere?

...
 So how to fix - well seems to me the storageDriverLock in
VolCreateXML
 may be the way since the refresh thread takes the driver lock first,
 then the pool object second.  Perhaps even like the build code where it
 takes it temporarily while getting the pool object. I'd have to think a
 bit more about though. Still might be something to try since the Vol
 Refresh thread takes it while running... 
Ok, I'll look into that direction.

...
 John

 Not related to this problem per se, but what may help even more is if I
 can get the storage driver usage of a common object model patches
 completely reviewed, but that's a different problem ;-)... I'll have to
 go look and see if I may have fixed this there. The newer model uses
 hash tables, RW locks, and reduces the storage driver hammer lock, but
 this one condition may not have been addressed. 
I can have a go at it and see if I can reproduce the crash with it. Not sure
I'll be the one skilled enough for the full review though ;)

--
Cedric

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [libvirt] storage lock issues