At 2017-01-11 02:23:54, "John Ferlan" <jferlan(a)redhat.com> wrote:
On 12/30/2016 03:39 AM, Chen Hanxiao wrote:
> From: Chen Hanxiao <chenhanxiao(a)gmail.com>
>
> This patch fixes a deadlock when trying to read an RBD image.
>
> When trying to connect to an RBD server
> (ceph-0.94.7-1.el7.centos.x86_64),
> rbd_list/rbd_open enter a deadlock state.
>
> Backtrace:
> Thread 30 (Thread 0x7fdb342d0700 (LWP 12105)):
> #0  0x00007fdb40b16705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007fdb294273f1 in librados::IoCtxImpl::operate_read(object_t const&, ObjectOperation*, ceph::buffer::list*, int) () from /lib64/librados.so.2
> #2  0x00007fdb29429fcc in librados::IoCtxImpl::read(object_t const&, ceph::buffer::list&, unsigned long, unsigned long) () from /lib64/librados.so.2
> #3  0x00007fdb293e850c in librados::IoCtx::read(std::string const&, ceph::buffer::list&, unsigned long, unsigned long) () from /lib64/librados.so.2
> #4  0x00007fdb2b9dd15e in librbd::list(librados::IoCtx&, std::vector<std::string, std::allocator<std::string> >&) () from /lib64/librbd.so.1
> #5  0x00007fdb2b98c089 in rbd_list () from /lib64/librbd.so.1
> #6  0x00007fdb2e1a8052 in virStorageBackendRBDRefreshPool (conn=<optimized out>, pool=0x7fdafc002d50) at storage/storage_backend_rbd.c:366
> #7  0x00007fdb2e193833 in storagePoolCreate (obj=0x7fdb1c1fd5a0, flags=<optimized out>) at storage/storage_driver.c:876
> #8  0x00007fdb43790ea1 in virStoragePoolCreate (pool=pool@entry=0x7fdb1c1fd5a0, flags=0) at libvirt-storage.c:695
> #9  0x00007fdb443becdf in remoteDispatchStoragePoolCreate (server=0x7fdb45fb2ab0, msg=0x7fdb45fb3db0, args=0x7fdb1c0037d0, rerr=0x7fdb342cfc30, client=<optimized out>) at remote_dispatch.h:14383
> #10 remoteDispatchStoragePoolCreateHelper (server=0x7fdb45fb2ab0, client=<optimized out>, msg=0x7fdb45fb3db0, rerr=0x7fdb342cfc30, args=0x7fdb1c0037d0, ret=0x7fdb1c1b3260) at remote_dispatch.h:14359
> #11 0x00007fdb437d9c42 in virNetServerProgramDispatchCall (msg=0x7fdb45fb3db0, client=0x7fdb45fd1a80, server=0x7fdb45fb2ab0, prog=0x7fdb45fcd670) at rpc/virnetserverprogram.c:437
> #12 virNetServerProgramDispatch (prog=0x7fdb45fcd670, server=server@entry=0x7fdb45fb2ab0, client=0x7fdb45fd1a80, msg=0x7fdb45fb3db0) at rpc/virnetserverprogram.c:307
> #13 0x00007fdb437d4ebd in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fdb45fb2ab0) at rpc/virnetserver.c:135
> #14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fdb45fb2ab0) at rpc/virnetserver.c:156
> #15 0x00007fdb436cfb35 in virThreadPoolWorker (opaque=opaque@entry=0x7fdb45fa7650) at util/virthreadpool.c:145
> #16 0x00007fdb436cf058 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
> #17 0x00007fdb40b12df5 in start_thread () from /lib64/libpthread.so.0
> #18 0x00007fdb408401ad in clone () from /lib64/libc.so.6
>
> 366 len = rbd_list(ptr.ioctx, names, &max_size);
> (gdb) n
> [New Thread 0x7fdb20758700 (LWP 22458)]
> [New Thread 0x7fdb20556700 (LWP 22459)]
> [Thread 0x7fdb20758700 (LWP 22458) exited]
> [New Thread 0x7fdb20455700 (LWP 22460)]
> [Thread 0x7fdb20556700 (LWP 22459) exited]
> [New Thread 0x7fdb20556700 (LWP 22461)]
>
> infinite loop...
>
> Signed-off-by: Chen Hanxiao <chenhanxiao(a)gmail.com>
> ---
> src/storage/storage_backend_rbd.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
Could you provide a bit more context...
Why does calling rados_conf_read_file with a NULL resolve the issue?
Is this something "new" or "expected"? And if expected, why are we only
seeing it now?
What is the other thread that "has" the lock doing?
It seems that the server side of ceph does not respond to our request,
so when libvirt calls rbd_open/rbd_list, etc., they never return.
But qemu works fine, so I took qemu's code as a reference:
https://github.com/qemu/qemu/blob/master/block/rbd.c#L365
rados_conf_read_file with a NULL path will try to read the ceph conf
file from /etc/ceph and the other default locations.
Although we call rados_conf_set in the code that follows,
ceph-0.94.7-1.el7 does not answer our rbd_open without
rados_conf_read_file.
Some older or newer ceph servers do not have this issue,
so I think this may be a server-side bug in ceph-0.94.7-1.el7.
Doing rados_conf_read_file(cluster, NULL)
will make our code more robust.
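
For reference, here is a minimal standalone sketch (not the libvirt code)
of the call order described above, loosely following qemu's block/rbd.c:
read the default config file first, then apply explicit settings. The
monitor address and pool name below are placeholders:

#include <stdio.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    char names[1024];
    size_t names_len = sizeof(names);

    if (rados_create(&cluster, NULL) < 0)
        return 1;

    /* Pick up /etc/ceph/ceph.conf (and the other default paths);
     * failure is ignored, just like in the proposed patch. */
    rados_conf_read_file(cluster, NULL);

    /* Explicit settings still override whatever the file provided. */
    rados_conf_set(cluster, "mon_host", "192.168.0.1:6789");
    rados_conf_set(cluster, "auth_supported", "none");

    if (rados_connect(cluster) < 0) {
        rados_shutdown(cluster);
        return 1;
    }

    if (rados_ioctx_create(cluster, "libvirt-pool", &ioctx) == 0) {
        /* rbd_list() is the call that never returned in the backtrace above. */
        int r = rbd_list(ioctx, names, &names_len);
        printf("rbd_list returned %d\n", r);
        rados_ioctx_destroy(ioctx);
    }

    rados_shutdown(cluster);
    return 0;
}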
Regards,
- Chen
From my cursory/quick read of:
http://docs.ceph.com/docs/master/rados/api/librados/
...
"Then you configure your rados_t to connect to your cluster, either by
setting individual values (rados_conf_set()), using a configuration file
(rados_conf_read_file()), using command line options
(rados_conf_parse_argv()), or an environment variable
(rados_conf_parse_env()):"
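
Side by side, those four entry points look roughly like this (a sketch
only; the option value and argv here are placeholders, not what libvirt
actually passes):

#include <rados/librados.h>

static void configure_cluster(rados_t cluster, int argc, const char **argv)
{
    rados_conf_set(cluster, "mon_host", "192.168.0.1:6789"); /* individual values */
    rados_conf_read_file(cluster, NULL);          /* config file; NULL = default search paths */
    rados_conf_parse_argv(cluster, argc, argv);   /* command line options */
    rados_conf_parse_env(cluster, NULL);          /* environment variable; NULL means CEPH_ARGS */
}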
Since we use rados_conf_set, that would seem to indicate we're OK. It's
not clear from just what's posted why eventually calling rbd_list causes
a hang.
I don't have the cycles or environment to do the research right now and
it really isn't clear why a read_file would resolve the issue.
John
> diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c
> index b1c51ab..233737b 100644
> --- a/src/storage/storage_backend_rbd.c
> +++ b/src/storage/storage_backend_rbd.c
> @@ -95,6 +95,9 @@ virStorageBackendRBDOpenRADOSConn(virStorageBackendRBDStatePtr ptr,
>              goto cleanup;
>          }
>
> +        /* try default location, but ignore failure */
> +        rados_conf_read_file(ptr->cluster, NULL);
> +
>          if (!conn) {
>              virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>                             _("'ceph' authentication not supported "
> @@ -124,6 +127,10 @@ virStorageBackendRBDOpenRADOSConn(virStorageBackendRBDStatePtr ptr,
>                             _("failed to create the RADOS cluster"));
>              goto cleanup;
>          }
> +
> +        /* try default location, but ignore failure */
> +        rados_conf_read_file(ptr->cluster, NULL);
> +
>          if (virStorageBackendRBDRADOSConfSet(ptr->cluster,
>                                               "auth_supported", "none") < 0)
>              goto cleanup;
>
--
libvir-list mailing list
libvir-list(a)redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list