[libvirt] [PATCH] storage_backend_rbd: always call rados_conf_read_file when connecting a rbd pool

From: Chen Hanxiao <chenhanxiao@gmail.com>

This patch fixes a deadlock when trying to read an RBD image.
When trying to connect to an RBD server (ceph-0.94.7-1.el7.centos.x86_64),
rbd_list/rbd_open enter a deadlocked state.

Backtrace:

Thread 30 (Thread 0x7fdb342d0700 (LWP 12105)):
#0  0x00007fdb40b16705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fdb294273f1 in librados::IoCtxImpl::operate_read(object_t const&, ObjectOperation*, ceph::buffer::list*, int) () from /lib64/librados.so.2
#2  0x00007fdb29429fcc in librados::IoCtxImpl::read(object_t const&, ceph::buffer::list&, unsigned long, unsigned long) () from /lib64/librados.so.2
#3  0x00007fdb293e850c in librados::IoCtx::read(std::string const&, ceph::buffer::list&, unsigned long, unsigned long) () from /lib64/librados.so.2
#4  0x00007fdb2b9dd15e in librbd::list(librados::IoCtx&, std::vector<std::string, std::allocator<std::string> >&) () from /lib64/librbd.so.1
#5  0x00007fdb2b98c089 in rbd_list () from /lib64/librbd.so.1
#6  0x00007fdb2e1a8052 in virStorageBackendRBDRefreshPool (conn=<optimized out>, pool=0x7fdafc002d50) at storage/storage_backend_rbd.c:366
#7  0x00007fdb2e193833 in storagePoolCreate (obj=0x7fdb1c1fd5a0, flags=<optimized out>) at storage/storage_driver.c:876
#8  0x00007fdb43790ea1 in virStoragePoolCreate (pool=pool@entry=0x7fdb1c1fd5a0, flags=0) at libvirt-storage.c:695
#9  0x00007fdb443becdf in remoteDispatchStoragePoolCreate (server=0x7fdb45fb2ab0, msg=0x7fdb45fb3db0, args=0x7fdb1c0037d0, rerr=0x7fdb342cfc30, client=<optimized out>) at remote_dispatch.h:14383
#10 remoteDispatchStoragePoolCreateHelper (server=0x7fdb45fb2ab0, client=<optimized out>, msg=0x7fdb45fb3db0, rerr=0x7fdb342cfc30, args=0x7fdb1c0037d0, ret=0x7fdb1c1b3260) at remote_dispatch.h:14359
#11 0x00007fdb437d9c42 in virNetServerProgramDispatchCall (msg=0x7fdb45fb3db0, client=0x7fdb45fd1a80, server=0x7fdb45fb2ab0, prog=0x7fdb45fcd670) at rpc/virnetserverprogram.c:437
#12 virNetServerProgramDispatch (prog=0x7fdb45fcd670, server=server@entry=0x7fdb45fb2ab0, client=0x7fdb45fd1a80, msg=0x7fdb45fb3db0) at rpc/virnetserverprogram.c:307
#13 0x00007fdb437d4ebd in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fdb45fb2ab0) at rpc/virnetserver.c:135
#14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fdb45fb2ab0) at rpc/virnetserver.c:156
#15 0x00007fdb436cfb35 in virThreadPoolWorker (opaque=opaque@entry=0x7fdb45fa7650) at util/virthreadpool.c:145
#16 0x00007fdb436cf058 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
#17 0x00007fdb40b12df5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007fdb408401ad in clone () from /lib64/libc.so.6

366         len = rbd_list(ptr.ioctx, names, &max_size);
(gdb) n
[New Thread 0x7fdb20758700 (LWP 22458)]
[New Thread 0x7fdb20556700 (LWP 22459)]
[Thread 0x7fdb20758700 (LWP 22458) exited]
[New Thread 0x7fdb20455700 (LWP 22460)]
[Thread 0x7fdb20556700 (LWP 22459) exited]
[New Thread 0x7fdb20556700 (LWP 22461)]

infinite loop...
Signed-off-by: Chen Hanxiao <chenhanxiao@gmail.com>
---
 src/storage/storage_backend_rbd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c
index b1c51ab..233737b 100644
--- a/src/storage/storage_backend_rbd.c
+++ b/src/storage/storage_backend_rbd.c
@@ -95,6 +95,9 @@ virStorageBackendRBDOpenRADOSConn(virStorageBackendRBDStatePtr ptr,
         goto cleanup;
     }
 
+    /* try default location, but ignore failure */
+    rados_conf_read_file(ptr->cluster, NULL);
+
     if (!conn) {
         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                        _("'ceph' authentication not supported "
@@ -124,6 +127,10 @@ virStorageBackendRBDOpenRADOSConn(virStorageBackendRBDStatePtr ptr,
                        _("failed to create the RADOS cluster"));
         goto cleanup;
     }
+
+    /* try default location, but ignore failure */
+    rados_conf_read_file(ptr->cluster, NULL);
+
     if (virStorageBackendRBDRADOSConfSet(ptr->cluster,
                                          "auth_supported", "none") < 0)
         goto cleanup;
-- 
2.7.4
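For context on what the patch changes, here is a minimal standalone sketch (not libvirt code; the function name open_rados_conn and the simplified error handling are illustrative assumptions) of the connection sequence the hunks above modify, with rados_conf_read_file(cluster, NULL) placed before the explicit rados_conf_set calls:

#include <rados/librados.h>

/* Sketch only: the real libvirt code lives in
 * virStorageBackendRBDOpenRADOSConn() and reports errors via
 * virReportError(); this just shows the ordering of the librados calls. */
static int
open_rados_conn(const char *mon_host)
{
    rados_t cluster;

    if (rados_create(&cluster, NULL) < 0)
        return -1;

    /* the patch: try the default config locations first, ignore failure */
    rados_conf_read_file(cluster, NULL);

    /* the existing code then sets individual options explicitly */
    if (rados_conf_set(cluster, "auth_supported", "none") < 0 ||
        rados_conf_set(cluster, "mon_host", mon_host) < 0)
        goto error;

    if (rados_connect(cluster) < 0)
        goto error;

    /* ... rados_ioctx_create(), rbd_list(), rbd_open(), ... */

    rados_shutdown(cluster);
    return 0;

 error:
    rados_shutdown(cluster);
    return -1;
}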

On 12/30/2016 03:39 AM, Chen Hanxiao wrote:
From: Chen Hanxiao <chenhanxiao@gmail.com>
This patch fixes a deadlock when trying to read an RBD image.
When trying to connect to an RBD server (ceph-0.94.7-1.el7.centos.x86_64),
rbd_list/rbd_open enter a deadlocked state.
[...]
Could you provide a bit more context...

Why does calling rados_conf_read_file with a NULL resolve the issue?

Is this something "new" or "expected"? And if expected, why are we only seeing it now?

What is the other thread that "has" the lock doing?
From my cursory/quick read of:

http://docs.ceph.com/docs/master/rados/api/librados/

... "Then you configure your rados_t to connect to your cluster, either by setting individual values (rados_conf_set()), using a configuration file (rados_conf_read_file()), using command line options (rados_conf_parse_argv()), or an environment variable (rados_conf_parse_env()):"

Since we use rados_conf_set, that would seem to indicate we're OK. It's not clear from just what's posted why eventually calling rbd_list causes a hang.

I don't have the cycles or environment to do the research right now, and it really isn't clear why a read_file would resolve the issue.

John
[...]
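As an aside on the documentation quoted above, the four configuration mechanisms are not mutually exclusive and can be combined on a single rados_t handle, with later calls overriding earlier ones. A hedged sketch (the function name configure_cluster and the option values are placeholders, not recommendations):

#include <rados/librados.h>

/* Illustrative only: the four librados configuration entry points the
 * Ceph docs mention, applied to one cluster handle. */
static int
configure_cluster(rados_t cluster, int argc, const char **argv)
{
    /* configuration file; NULL means "search the default locations" */
    rados_conf_read_file(cluster, NULL);

    /* environment variable, e.g. CEPH_ARGS="--mon-host 192.0.2.1:6789" */
    rados_conf_parse_env(cluster, "CEPH_ARGS");

    /* command line options, if the caller has any */
    if (argc > 0 && rados_conf_parse_argv(cluster, argc, argv) < 0)
        return -1;

    /* individual values, overriding anything read above */
    if (rados_conf_set(cluster, "client_mount_timeout", "30") < 0)
        return -1;

    return 0;
}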

At 2017-01-11 02:23:54, "John Ferlan" <jferlan@redhat.com> wrote:
On 12/30/2016 03:39 AM, Chen Hanxiao wrote:
From: Chen Hanxiao <chenhanxiao@gmail.com>
This patch fixes a deadlock when trying to read an RBD image.
When trying to connect to an RBD server (ceph-0.94.7-1.el7.centos.x86_64),
rbd_list/rbd_open enter a deadlocked state.
[...]
Could you provide a bit more context...
Why does calling rados_conf_read_file with a NULL resolve the issue?
Is this something "new" or "expected"? And if expected, why are we only seeing it now?
What is the other thread that "has" the lock doing?
It seems that the ceph server side does not respond to our request, so when libvirt calls rbd_open/rbd_list, etc., the call never returns.

But qemu works fine, so I took qemu's code as a reference:
https://github.com/qemu/qemu/blob/master/block/rbd.c#L365

rados_conf_read_file with a NULL path will try to load the ceph config file from /etc/ceph and other default paths. Although we call rados_conf_set in the following code, without rados_conf_read_file, ceph-0.94.7-1.el7 does not answer our rbd_open. Some older and newer ceph servers do not have this issue, so I think this may be a ceph server bug in ceph-0.94.7-1.el7.

Doing rados_conf_read_file(cluster, NULL) will make our code more robust.

Regards,
- Chen
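Relevant to the question of why the config file matters here, a small standalone sketch (illustrative only; assumes the librados development headers are installed, file name arbitrary) that shows which settings rados_conf_read_file(cluster, NULL) actually picked up from the default locations:

#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    char buf[1024];

    if (rados_create(&cluster, NULL) < 0)
        return 1;

    /* NULL path: search $CEPH_CONF, /etc/ceph/ceph.conf and other defaults */
    rados_conf_read_file(cluster, NULL);

    /* dump a couple of options to see what (if anything) was loaded */
    if (rados_conf_get(cluster, "mon_host", buf, sizeof(buf)) == 0)
        printf("mon_host = %s\n", buf);
    if (rados_conf_get(cluster, "keyring", buf, sizeof(buf)) == 0)
        printf("keyring = %s\n", buf);

    rados_shutdown(cluster);
    return 0;
}

Compile with something like: cc conf_check.c -lrados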
From my cursory/quick read of:
http://docs.ceph.com/docs/master/rados/api/librados/
... "Then you configure your rados_t to connect to your cluster, either by setting individual values (rados_conf_set()), using a configuration file (rados_conf_read_file()), using command line options (rados_conf_parse_argv()), or an environment variable (rados_conf_parse_env()):"
Since we use rados_conf_set, that would seem to indicate we're OK. It's not clear from just what's posted why eventually calling rbd_list causes a hang.
I don't have the cycles or environment to do the research right now, and it really isn't clear why a read_file would resolve the issue.
John
[...]

[...]
Could you provide a bit more context...
Why does calling rados_conf_read_file with a NULL resolve the issue?
Is this something "new" or "expected"? And if expected, why are we only seeing it now?
What is the other thread that "has" the lock doing?
It seems that the ceph server side does not respond to our request.
So when libvirt calls rbd_open/rbd_list, etc., the call never returns.
But qemu works fine, so I took qemu's code as a reference: https://github.com/qemu/qemu/blob/master/block/rbd.c#L365
rados_conf_read_file with a NULL path will try to load the ceph config file from /etc/ceph and other default paths.
Although we call rados_conf_set in the following code, without rados_conf_read_file, ceph-0.94.7-1.el7 does not answer our rbd_open.
Some older and newer ceph servers do not have this issue. I think this may be a ceph server bug in ceph-0.94.7-1.el7.
Thus a bug should be filed against ceph to fix their 0.94 version, rather than adding what would seemingly be an unnecessary change into libvirt to work around a problem that appears to be fixed in newer versions of ceph.

John
Doing rados_conf_read_file(cluster, NULL) will make our code more robust.
Regards,
- Chen