[libvirt-users] error : virHashForEach:597 : Hash operation not allowed during iteration

Hey!

On many of our servers, we often have the following error:

error : virHashForEach:597 : Hash operation not allowed during iteration

When this happens while querying the list of domains, libvirt silently returns 0 domains (most uses of virHashForEach don't handle the return code).

We are using a hook, but the hook doesn't query libvirt back, so it should be safe from this point of view. However, we have several sources querying libvirt at regular intervals (once every 5 seconds from two different sources). I am unable to reproduce the problem by just running "virsh list" in parallel.

We are using libvirt 3.6. Is it an expected constraint that libvirt cannot be reliably queried from several sources? Have any fixes for this problem been pushed since 3.6?
--
A kind of Batman of contemporary letters. -- Philip Larkin on Anthony Burgess

On 04/04/2018 10:36 AM, Vincent Bernat wrote:
Hey!
On many of our servers, we often have the following error:
error : virHashForEach:597 : Hash operation not allowed during iteration
When this happens while querying the list of domains, libvirt silently returns 0 domains (most uses of virHashForEach don't handle the return code).
We are using a hook, but the hook doesn't query libvirt back, so it should be safe from this point of view. However, we have several sources querying libvirt at regular intervals (once every 5 seconds from two different sources). I am unable to reproduce the problem by just running "virsh list" in parallel.
We are using libvirt 3.6.
Is this plain 3.6.0 or do you have any patches on top of it (e.g. some in-house patches)?
Is it an expected constraint that libvirt cannot be reliably queried from several sources? Have any fixes for this problem been pushed since 3.6?
Dunno, this is the first time I hear about this issue. Maybe you can try to set a breakpoint on virHashIterationError() and, when it's hit, get a stack trace of all threads ('t a a bt'). That might shed more light on the issue. Smells like we are not locking somewhere properly.

Also, you might want to try the latest release, 4.2.0, and see if the bug still reproduces.

Michal
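(A minimal gdb session along the lines Michal suggests, assuming debug symbols for libvirtd are installed, would be:)

  # attach to the running daemon
  gdb -p $(pidof libvirtd)

  (gdb) break virHashIterationError
  (gdb) continue
  ... wait for the error to trigger the breakpoint ...
  (gdb) thread apply all bt    # 't a a bt' is the abbreviated form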

❦ 4 April 2018 11:17 +0200, Michal Privoznik <mprivozn@redhat.com>:
We are using a hook, but the hook doesn't query libvirt back, so it should be safe from this point of view. However, we have several sources querying libvirt at regular intervals (once every 5 seconds from two different sources). I am unable to reproduce the problem by just running "virsh list" in parallel.
We are using libvirt 3.6.
Is this plain 3.6.0 or do you have any patches on top of it (e.g. some in-house patches)?
It's the one from Ubuntu Artful with some additional patches to support Skylake and the IBRS CPU flag.
Is it an expected constraint that libvirt cannot be reliably queried from several sources? Have any fixes for this problem been pushed since 3.6?
Dunno, this is the first time I hear about this issue. Maybe you can try to set a breakpoint on virHashIterationError() and, when it's hit, get a stack trace of all threads ('t a a bt'). That might shed more light on the issue. Smells like we are not locking somewhere properly.
I'll try. It doesn't happen that often on a given host.
Also, you might want to try the latest release, 4.2.0, and see if the bug still reproduces.
I'll try the 4.0 version, which is the one that will be in Ubuntu Bionic (I am trying not to diverge from the distribution, to be able to get security updates for free).
--
Choose variable names that won't be confused. - The Elements of Programming Style (Kernighan & Plauger)

❦ 4 April 2018 11:17 +0200, Michal Privoznik <mprivozn@redhat.com>:
Dunno, this is the first time I hear about this issue. Maybe you can try to set a breakpoint on virHashIterationError() and, when it's hit, get a stack trace of all threads ('t a a bt'). That might shed more light on the issue. Smells like we are not locking somewhere properly.
So, we have two threads iterating. Both of them originate from virNetServerProgramDispatchCall. So, it seems there is a lock missing?

Thread 10 (Thread 0x7f931f814700 (LWP 126453)):
#0 virHashForEach (table=0x7f92fc69a480, iter=iter@entry=0x7f932ea8fbf0 <virDomainObjListCollectIterator>, data=data@entry=0x7f931f813a20) at ../../../src/util/virhash.c:597
#1 0x00007f932ea911c3 in virDomainObjListCollect (domlist=0x7f92fc82dd50, conn=conn@entry=0x7f92e0000f80, vms=vms@entry=0x7f931f813a80, nvms=nvms@entry=0x7f931f813a90, filter=0x7f932ebdc6b0 <virConnectListAllDomainsCheckACL>, flags=48) at ../../../src/conf/virdomainobjlist.c:935
#2 0x00007f932ea91482 in virDomainObjListExport (domlist=<optimized out>, conn=0x7f92e0000f80, domains=0x7f931f813af0, filter=<optimized out>, flags=<optimized out>) at ../../../src/conf/virdomainobjlist.c:1019
#3 0x00007f932eaf3b75 in virConnectListAllDomains (conn=0x7f92e0000f80, domains=0x7f931f813af0, flags=48) at ../../../src/libvirt-domain.c:6491
#4 0x0000559f219f6dca in remoteDispatchConnectListAllDomains (server=0x559f23795530, msg=0x559f237b78c0, ret=0x7f92d0000920, args=0x7f92d0000ad0, rerr=0x7f931f813bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:1244
#5 remoteDispatchConnectListAllDomainsHelper (server=0x559f23795530, client=<optimized out>, msg=0x559f237b78c0, rerr=0x7f931f813bc0, args=0x7f92d0000ad0, ret=0x7f92d0000920) at ../../../daemon/remote_dispatch.h:1220
#6 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237b78c0, client=0x559f237b7bd0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#7 virNetServerProgramDispatch (prog=0x559f237b2a50, server=server@entry=0x559f23795530, client=0x559f237b7bd0, msg=0x559f237b78c0) at ../../../src/rpc/virnetserverprogram.c:307
#8 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#9 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#10 0x00007f932ea3871b in virThreadPoolWorker (opaque=opaque@entry=0x559f2377da20) at ../../../src/util/virthreadpool.c:167
#11 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#12 0x00007f932e0ed6ba in start_thread (arg=0x7f931f814700) at pthread_create.c:333
#13 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7f9321818700 (LWP 126449)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f932e0efdbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f92ec011c80) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f932ea37c45 in virMutexLock (m=m@entry=0x7f92ec011c80) at ../../../src/util/virthread.c:89
#3 0x00007f932ea180ca in virObjectLock (anyobj=anyobj@entry=0x7f92ec011c70) at ../../../src/util/virobject.c:388
#4 0x00007f932ea8fd81 in virDomainObjListCopyActiveIDs (payload=0x7f92ec011c70, name=<optimized out>, opaque=0x7f9321817aa0) at ../../../src/conf/virdomainobjlist.c:679
#5 0x00007f932e9ee6cd in virHashForEach (table=0x7f92fc69a480, iter=iter@entry=0x7f932ea8fd70 <virDomainObjListCopyActiveIDs>, data=data@entry=0x7f9321817aa0) at ../../../src/util/virhash.c:606
#6 0x00007f932ea90fba in virDomainObjListGetActiveIDs (doms=0x7f92fc82dd50, ids=<optimized out>, maxids=<optimized out>, filter=<optimized out>, conn=<optimized out>) at ../../../src/conf/virdomainobjlist.c:701
#7 0x00007f932eae1c6e in virConnectListDomains (conn=0x7f92e40018b0, ids=0x7f92f4015ae0, maxids=14) at ../../../src/libvirt-domain.c:66
#8 0x0000559f21a0d3c1 in remoteDispatchConnectListDomains (server=<optimized out>, msg=0x559f237beea0, ret=0x7f92f4009eb0, args=0x7f92f4016290, rerr=0x7f9321817bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:2108
#9 remoteDispatchConnectListDomainsHelper (server=<optimized out>, client=<optimized out>, msg=0x559f237beea0, rerr=0x7f9321817bc0, args=0x7f92f4016290, ret=0x7f92f4009eb0) at ../../../daemon/remote_dispatch.h:2076
#10 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237beea0, client=0x559f237c0ce0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#11 virNetServerProgramDispatch (prog=0x559f237b2a50, server=server@entry=0x559f23795530, client=0x559f237c0ce0, msg=0x559f237beea0) at ../../../src/rpc/virnetserverprogram.c:307
#12 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#13 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#14 0x00007f932ea3871b in virThreadPoolWorker (opaque=opaque@entry=0x559f2377d860) at ../../../src/util/virthreadpool.c:167
#15 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#16 0x00007f932e0ed6ba in start_thread (arg=0x7f9321818700) at pthread_create.c:333
#17 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Also, you might want to try the latest release, 4.2.0, and see if the bug still reproduces.
Michal
-- Lay on, MacDuff, and curs'd be him who first cries, "Hold, enough!". -- Shakespeare

On 04/04/2018 02:59 PM, Vincent Bernat wrote:
❦ 4 April 2018 11:17 +0200, Michal Privoznik <mprivozn@redhat.com>:
Dunno, this is the first time I hear about this issue. Maybe you can try to set a breakpoint on virHashIterationError() and, when it's hit, get a stack trace of all threads ('t a a bt'). That might shed more light on the issue. Smells like we are not locking somewhere properly.
So, we have two threads iterating. Both of them originate from virNetServerProgramDispatchCall. So, it seems there is a lock missing?
No. As suspected, both threads are reading the internal list of domains (actually a hash table). They are not modifying the list, so they grab a read lock.
Thread 10 (Thread 0x7f931f814700 (LWP 126453)):
#0 virHashForEach (table=0x7f92fc69a480, iter=iter@entry=0x7f932ea8fbf0 <virDomainObjListCollectIterator>, data=data@entry=0x7f931f813a20) at ../../../src/util/virhash.c:597
#1 0x00007f932ea911c3 in virDomainObjListCollect (domlist=0x7f92fc82dd50, conn=conn@entry=0x7f92e0000f80, vms=vms@entry=0x7f931f813a80, nvms=nvms@entry=0x7f931f813a90, filter=0x7f932ebdc6b0 <virConnectListAllDomainsCheckACL>, flags=48) at ../../../src/conf/virdomainobjlist.c:935
#2 0x00007f932ea91482 in virDomainObjListExport (domlist=<optimized out>, conn=0x7f92e0000f80, domains=0x7f931f813af0, filter=<optimized out>, flags=<optimized out>) at ../../../src/conf/virdomainobjlist.c:1019
#3 0x00007f932eaf3b75 in virConnectListAllDomains (conn=0x7f92e0000f80, domains=0x7f931f813af0, flags=48) at ../../../src/libvirt-domain.c:6491
#4 0x0000559f219f6dca in remoteDispatchConnectListAllDomains (server=0x559f23795530, msg=0x559f237b78c0, ret=0x7f92d0000920, args=0x7f92d0000ad0, rerr=0x7f931f813bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:1244
#5 remoteDispatchConnectListAllDomainsHelper (server=0x559f23795530, client=<optimized out>, msg=0x559f237b78c0, rerr=0x7f931f813bc0, args=0x7f92d0000ad0, ret=0x7f92d0000920) at ../../../daemon/remote_dispatch.h:1220
#6 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237b78c0, client=0x559f237b7bd0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#7 virNetServerProgramDispatch (prog=0x559f237b2a50, server=server@entry=0x559f23795530, client=0x559f237b7bd0, msg=0x559f237b78c0) at ../../../src/rpc/virnetserverprogram.c:307
#8 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#9 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#10 0x00007f932ea3871b in virThreadPoolWorker (opaque=opaque@entry=0x559f2377da20) at ../../../src/util/virthreadpool.c:167
#11 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#12 0x00007f932e0ed6ba in start_thread (arg=0x7f931f814700) at pthread_create.c:333
#13 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7f9321818700 (LWP 126449)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f932e0efdbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f92ec011c80) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f932ea37c45 in virMutexLock (m=m@entry=0x7f92ec011c80) at ../../../src/util/virthread.c:89
#3 0x00007f932ea180ca in virObjectLock (anyobj=anyobj@entry=0x7f92ec011c70) at ../../../src/util/virobject.c:388
#4 0x00007f932ea8fd81 in virDomainObjListCopyActiveIDs (payload=0x7f92ec011c70, name=<optimized out>, opaque=0x7f9321817aa0) at ../../../src/conf/virdomainobjlist.c:679
#5 0x00007f932e9ee6cd in virHashForEach (table=0x7f92fc69a480, iter=iter@entry=0x7f932ea8fd70 <virDomainObjListCopyActiveIDs>, data=data@entry=0x7f9321817aa0) at ../../../src/util/virhash.c:606
#6 0x00007f932ea90fba in virDomainObjListGetActiveIDs (doms=0x7f92fc82dd50, ids=<optimized out>, maxids=<optimized out>, filter=<optimized out>, conn=<optimized out>) at ../../../src/conf/virdomainobjlist.c:701
#7 0x00007f932eae1c6e in virConnectListDomains (conn=0x7f92e40018b0, ids=0x7f92f4015ae0, maxids=14) at ../../../src/libvirt-domain.c:66
#8 0x0000559f21a0d3c1 in remoteDispatchConnectListDomains (server=<optimized out>, msg=0x559f237beea0, ret=0x7f92f4009eb0, args=0x7f92f4016290, rerr=0x7f9321817bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:2108
#9 remoteDispatchConnectListDomainsHelper (server=<optimized out>, client=<optimized out>, msg=0x559f237beea0, rerr=0x7f9321817bc0, args=0x7f92f4016290, ret=0x7f92f4009eb0) at ../../../daemon/remote_dispatch.h:2076
#10 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237beea0, client=0x559f237c0ce0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#11 virNetServerProgramDispatch (prog=0x559f237b2a50, server=server@entry=0x559f23795530, client=0x559f237c0ce0, msg=0x559f237beea0) at ../../../src/rpc/virnetserverprogram.c:307
#12 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#13 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#14 0x00007f932ea3871b in virThreadPoolWorker (opaque=opaque@entry=0x559f2377d860) at ../../../src/util/virthreadpool.c:167
#15 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#16 0x00007f932e0ed6ba in start_thread (arg=0x7f9321818700) at pthread_create.c:333
#17 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Both threads call virHashForEach(table=0x7f92fc69a480). Thread 6 was first, so it starts iterating and sets table->iterating; later, when thread 10 enters the function, an error is reported.

I guess we can go with what Dan suggested, and after some rework we can just drop ->iterating completely.

Michal
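(Schematically, the guard being discussed looks like the following. This is a simplified sketch, not the verbatim libvirt source; only the "iterating" field, virHashIterationError(), and the error message come from the thread.)

  #include <stdbool.h>
  #include <stdio.h>

  typedef struct {
      bool iterating;   /* set for the duration of a table walk */
      /* ... buckets ... */
  } virHashTable;

  static void virHashIterationError(void)
  {
      fprintf(stderr, "error : virHashForEach:597 : "
              "Hash operation not allowed during iteration\n");
  }

  static int virHashForEach(virHashTable *table)
  {
      if (table->iterating) {      /* a second, concurrent walker ends up here */
          virHashIterationError();
          return -1;               /* most callers ignore this, hence the
                                    * silently empty domain list */
      }
      table->iterating = true;     /* check-then-set: not atomic */
      /* ... walk every bucket, invoking the iterator callback ... */
      table->iterating = false;
      return 0;
  }

Since both callers in the backtraces hold only a read lock on the domain list, two of them can legitimately be inside virHashForEach at the same time, and the second one trips the check above.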

❦ 4 April 2018 15:19 +0200, Michal Privoznik <mprivozn@redhat.com>:
Both threads call virHashForEach(table=0x7f92fc69a480). Thread 6 was first, so it starts iterating and sets table->iterating; later, when thread 10 enters the function, an error is reported.
I guess we can go with what Dan suggested, and after some rework we can just drop ->iterating completely.
I may have missed this suggestion. Maybe Dan only sent it to you? In the meantime, could I change the locks around virHashForEach() and similar functions to read/write locks?
--
Nothing so needs reforming as other people's habits. -- Mark Twain, "Pudd'nhead Wilson's Calendar"

On 04/04/2018 03:48 PM, Vincent Bernat wrote:
❦ 4 April 2018 15:19 +0200, Michal Privoznik <mprivozn@redhat.com>:
Both threads call virHashForEach(table=0x7f92fc69a480). Thread 6 was first, so it starts iterating and sets table->iterating; later, when thread 10 enters the function, an error is reported.
I guess we can go with what Dan suggested, and after some rework we can just drop ->iterating completely.
I may have missed this suggestion. Maybe Dan only sent it to you?
No, there is another thread where this issue is discussed: https://www.redhat.com/archives/libvir-list/2018-April/msg00190.html

In the meantime, could I change the locks around virHashForEach() and similar functions to read/write locks?
You can do that locally, but as a patch it's very unlikely to be accepted upstream, because we've introduced RW locks to be able to access the domain list from multiple threads.

Michal

❦ 4 April 2018 15:59 +0200, Michal Privoznik <mprivozn@redhat.com>:
I may have missed this suggestion. Maybe Dan only sent it to you?
No, there is another thread where this issue is discussed:
https://www.redhat.com/archives/libvir-list/2018-April/msg00190.html
In the meantime, could I change the locks around virHashForEach() and similar functions to read/write locks?
You can do that locally, but as a patch it's very unlikely to be accepted upstream, because we've introduced RW locks to be able to access the domain list from multiple threads.
Looking a bit more, the whole "iterating" bit is currently unsafe. First, it is racy, since it is check-then-set: two threads may both see the value as false and both start iterating. Moreover, in some places it is not set at all, for example in virHashAddOrUpdateEntry(), so one thread could iterate over the hash while another is updating it. This seems to be prevented only by the callers' use of a RW lock. So it seems we can just remove this "iterating" bool and check that all users of these functions take the appropriate lock.
--
Make sure all variables are initialised before use. - The Elements of Programming Style (Kernighan & Plauger)
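(A standalone illustration of that check-then-set race, using hypothetical names rather than libvirt code; compile with -pthread and run it a few times, and occasionally both threads will print "iterating":)

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <unistd.h>

  static bool iterating = false;

  static void *walker(void *arg)
  {
      if (!iterating) {            /* check ... */
          usleep(1000);            /* widen the race window */
          iterating = true;        /* ... then set: both threads can get here */
          printf("thread %ld: iterating (the flag did not stop us)\n", (long)arg);
          iterating = false;
      } else {
          printf("thread %ld: rejected\n", (long)arg);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t a, b;
      pthread_create(&a, NULL, walker, (void *)1L);
      pthread_create(&b, NULL, walker, (void *)2L);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      return 0;
  }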

❦ 4 April 2018 16:21 +0200, Vincent Bernat <bernat@luffy.cx>:
You can do that locally, but as a patch it's very unlikely to be accepted upstream, because we've introduced RW locks to be able to access the domain list from multiple threads.
Looking a bit more, the whole "iterating" bit is currently unsafe. First, it is racy, since it is check-then-set: two threads may both see the value as false and both start iterating. Moreover, in some places it is not set at all, for example in virHashAddOrUpdateEntry(), so one thread could iterate over the hash while another is updating it. This seems to be prevented only by the callers' use of a RW lock. So it seems we can just remove this "iterating" bool and check that all users of these functions take the appropriate lock.
My bad, I didn't see the use of table->current...
--
The only way to keep your health is to eat what you don't want, drink what you don't like, and do what you'd rather not. -- Mark Twain

❦ 4 April 2018 17:00 +0200, Vincent Bernat <bernat@luffy.cx>:
You can do that locally, but as a patch it's very unlikely to be accepted upstream, because we've introduced RW locks to be able to access the domain list from multiple threads.
Looking a bit more, the whole "iterating" bit is currently unsafe. First, it is racy, since it is check-then-set: two threads may both see the value as false and both start iterating. Moreover, in some places it is not set at all, for example in virHashAddOrUpdateEntry(), so one thread could iterate over the hash while another is updating it. This seems to be prevented only by the callers' use of a RW lock. So it seems we can just remove this "iterating" bool and check that all users of these functions take the appropriate lock.
My bad, I didn't see the use of table->current...
It seems it can also go away if callers correctly use a RW lock. I am currently testing the attached patch on a few servers (the patch is against master, but applied to our 3.6 version). After checking many locations, I came to the conclusion that the locks are used correctly.
--
Use library functions. - The Elements of Programming Style (Kernighan & Plauger)
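(In outline, the caller-side discipline that makes the flag removable — sketched here with plain pthread rwlocks and hypothetical function names; libvirt wraps these in its own virRWLock helpers:)

  #include <pthread.h>

  static pthread_rwlock_t domlistLock = PTHREAD_RWLOCK_INITIALIZER;

  /* Readers -- anything that only walks the table -- take the lock
   * shared, so any number of them may iterate concurrently. */
  void listDomains(void)            /* e.g. the virConnectListAllDomains path */
  {
      pthread_rwlock_rdlock(&domlistLock);
      /* virHashForEach(table, collectIterator, &data); */
      pthread_rwlock_unlock(&domlistLock);
  }

  /* Writers -- add/update/remove -- take it exclusive, so no walk can
   * be in flight while the table is modified. With this discipline the
   * per-table "iterating" flag adds nothing. */
  void addDomain(void)              /* e.g. the define/create path */
  {
      pthread_rwlock_wrlock(&domlistLock);
      /* virHashAddOrUpdateEntry(table, name, payload); */
      pthread_rwlock_unlock(&domlistLock);
  }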

On 04/04/2018 08:38 PM, Vincent Bernat wrote:
0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch
From 7bf32825325b124edce58687a8cd2ceff13328b9 Mon Sep 17 00:00:00 2001
From: Vincent Bernat <vincent@bernat.im>
Date: Wed, 4 Apr 2018 16:25:37 +0200
Subject: [PATCH] util: don't check for parallel iteration in hash-related functions
Can you send this as a regular patch? I'd like to review it.

Michal

❦ 5 April 2018 17:00 +0200, Michal Privoznik <mprivozn@redhat.com>:
0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch
From 7bf32825325b124edce58687a8cd2ceff13328b9 Mon Sep 17 00:00:00 2001
From: Vincent Bernat <vincent@bernat.im>
Date: Wed, 4 Apr 2018 16:25:37 +0200
Subject: [PATCH] util: don't check for parallel iteration in hash-related functions
Can you send this as a regular patch? I'd like to review it.
What do you mean by regular? Not an attachment? Through git send-email?

FYI, I am now running this patch on 22 hosts without any detected issues.
--
Let the machine do the dirty work. - The Elements of Programming Style (Kernighan & Plauger)

On 04/05/2018 05:55 PM, Vincent Bernat wrote:
❦ 5 April 2018 17:00 +0200, Michal Privoznik <mprivozn@redhat.com>:
0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch
From 7bf32825325b124edce58687a8cd2ceff13328b9 Mon Sep 17 00:00:00 2001
From: Vincent Bernat <vincent@bernat.im>
Date: Wed, 4 Apr 2018 16:25:37 +0200
Subject: [PATCH] util: don't check for parallel iteration in hash-related functions
Can you send this as a regular patch? I'd like to review it.
What do you mean by regular? Not an attachment? Through git send-email?
Yes. git send-email.
FYI, I am now running this patch on 22 hosts without any detected issues.
Oh, that sounds good. We need to go through all callbacks and check whether they are accessing the table, though. This is going to be slightly more time-consuming, but it's feasible.

Michal
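(The concern behind that audit, sketched with invented names: an iterator callback invoked under a read lock must not modify the table it is walking, since the dropped flag was the only runtime check catching that.)

  #include <stdbool.h>

  typedef struct ExampleTable ExampleTable;       /* stands in for the hash table */
  bool exampleIsStale(void *payload);             /* hypothetical predicate */
  void exampleRemoveEntry(ExampleTable *t, const char *name);  /* hypothetical */

  /* Hypothetical iterator callback illustrating what must be reviewed:
   * the table walk runs under a shared (read) lock, so a callback that
   * mutates the table gets no writer exclusion. */
  static int purgeStaleDomains(void *payload, const char *name, void *opaque)
  {
      ExampleTable *table = opaque;
      if (exampleIsStale(payload))
          exampleRemoveEntry(table, name);  /* mutates the table mid-walk: unsafe */
      return 0;
  }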