[libvirt] [Discussion] How do we think about a timeout mechanism?

There's a situation where, when libvirtd is under a lot of pressure, for example when we start a lot of VMs at the same time, some libvirt APIs may take a very long time to return. This blocks the higher-level job from finishing. Usually we can't wait forever, so we would like a timeout mechanism to help us out: when an API call takes longer than some limit, it returns a timeout result and we can do some rolling back.

So my question is: is there a plan to provide a timeout solution, or a better solution, for this kind of problem in the future? And when?

Thanks all!

--
Best Regards
James

On Fri, Jul 25, 2014 at 04:45:55PM +0800, James wrote:
> There's a situation where, when libvirtd is under a lot of pressure, for example when we start a lot of VMs at the same time, some libvirt APIs may take a very long time to return. This blocks the higher-level job from finishing. [...]
> So my question is: is there a plan to provide a timeout solution, or a better solution, for this kind of problem in the future? And when?
Is it only because there are not enough workers available? If yes, then changing the limits in libvirtd.conf (both global and per-connection) might be the easiest way to go.

Martin
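For reference, those limits live in /etc/libvirt/libvirtd.conf. A minimal sketch with purely illustrative values (the right numbers depend entirely on the host and the workload; restart libvirtd after changing them):

# Worker-pool and client limits in /etc/libvirt/libvirtd.conf
# (example values only -- tune them for your own host)

# worker threads kept ready at startup
min_workers = 10
# maximum number of worker threads processing RPC calls
max_workers = 40
# maximum number of concurrent client connections (the global limit)
max_clients = 50
# maximum number of calls one connection may have queued or executing
# at once (the per-connection limit)
max_client_requests = 10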

On 2014/7/25 18:07, Martin Kletzander wrote:
> Is it only because there are not enough workers available? If yes, then changing the limits in libvirtd.conf (both global and per-connection) might be the easiest way to go.
It's very nice to receive your reply so quickly.

The job pressure is just one reason for wanting a timeout mechanism. If something really bad happens, for example a bug that stops a libvirt API from returning at all (which is very rare), what can we do to make sure the job is not blocked by the stuck API? That is, process A calls libvirt API B, but B never returns, so A is blocked there forever. What is the best thing for us to do?

--
Best Regards
James

On Sat, Jul 26, 2014 at 03:47:09PM +0800, James wrote:
> The job pressure is just one reason for wanting a timeout mechanism. If something really bad happens, for example a bug that stops a libvirt API from returning at all, what can we do to make sure the job is not blocked by the stuck API?
> That is, process A calls libvirt API B, but B never returns, so A is blocked there forever. What is the best thing for us to do?
Since that is a pretty rare case that cannot be dealt with inside the API (the API is the very place where it gets stuck), it has to be dealt with outside of it. I'd guess whatever you would do by hand is OK. If, for example, you usually restart libvirtd after such a block is detected, then restart it and try again. You can spawn another process to do that if you want fine-grained control, or you can use client- (and server-) side keepalive to be disconnected automatically in case the block happens inside the event loop (it won't catch a block outside of it, though). I'm not sure how to answer more precisely, since this is not libvirt-specific. If there's something libvirt-specific I missed, let me know.

Martin
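To make the keepalive suggestion concrete, here is a minimal client-side sketch. The connection URI, the 5-second interval and the count of 3 are only example values, and error handling is kept to a minimum:

/* keepalive-demo.c: let the client drop a connection whose daemon has
 * stopped responding, instead of blocking on it forever.
 * Build with:  cc keepalive-demo.c -lvirt -o keepalive-demo */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn;

    /* Keepalive messages are processed by the client event loop. */
    if (virEventRegisterDefaultImpl() < 0)
        return 1;

    conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect\n");
        return 1;
    }

    /* Ping the daemon every 5 seconds; after 3 unanswered pings the
     * connection is closed, so pending RPCs fail with an error rather
     * than hanging.  As noted above, this only catches the case where
     * the daemon's event loop itself is stuck. */
    if (virConnectSetKeepAlive(conn, 5, 3) < 0) {
        fprintf(stderr, "failed to enable keepalive\n");
        virConnectClose(conn);
        return 1;
    }

    /* In a real client this loop would run in its own thread while other
     * threads issue the actual API calls. */
    while (virConnectIsAlive(conn) == 1) {
        if (virEventRunDefaultImpl() < 0)
            break;
    }

    fprintf(stderr, "connection closed or keepalive timed out\n");
    virConnectClose(conn);
    return 0;
}

The server side has matching keepalive_interval and keepalive_count settings in libvirtd.conf for dropping unresponsive clients.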

On 2014/8/4 19:59, Martin Kletzander wrote:
> Since that is a pretty rare case that cannot be dealt with inside the API (the API is the very place where it gets stuck), it has to be dealt with outside of it. [...] You can spawn another process to do that if you want fine-grained control, or you can use client- (and server-) side keepalive to be disconnected automatically in case the block happens inside the event loop (it won't catch a block outside of it, though).
Thanks. In fact, to deal with this kind of situation, we added some timeout code to libvirtd, in the remote_dispatch path. The mechanism works like this:

1. When an API is called, we start a thread that acts as a timer. On timeout, the timer sets a timeout flag on the API call and returns a timeout result to the libvirt client.
2. When the API call returns to the remote_dispatch level, it checks the timeout flag to decide what to do next. If it timed out, we perform some rollback action; for example, detaching a device if the call had attached one.

There are some problems with this solution. First, we have to figure out suitable rollback actions. Second, I'm not sure it's the best way to solve this kind of blocking problem; it's not very elegant.

What do you think about it?

--
Best Regards
James
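Roughly, a dispatch-side watchdog of the kind described above might look like the following sketch. Every name in it (dispatch_with_timeout, do_attach_device, rollback_attach_device) is invented for illustration, and this is not the actual libvirtd patch, just one way to express the same idea with a worker thread and a timed wait:

/* watchdog-sketch.c: rough illustration of the timeout-plus-rollback idea
 * described above.  Every name here (do_attach_device, rollback_attach_device,
 * dispatch_with_timeout) is invented for illustration; this is not the real
 * remote_dispatch code.  Build with:  cc watchdog-sketch.c -pthread */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

struct call_ctx {
    pthread_mutex_t lock;
    pthread_cond_t done_cond;
    bool done;        /* the real API call finished               */
    bool timed_out;   /* the dispatcher already gave up waiting   */
    int result;
};

/* Stand-in for the real (possibly very slow) libvirt API call. */
static int do_attach_device(void)
{
    sleep(10);
    return 0;
}

/* Stand-in for the compensating action after a late completion. */
static void rollback_attach_device(void)
{
    fprintf(stderr, "call finished after timeout, rolling back\n");
}

static void *worker(void *opaque)
{
    struct call_ctx *ctx = opaque;
    int ret = do_attach_device();
    bool late;

    pthread_mutex_lock(&ctx->lock);
    ctx->done = true;
    ctx->result = ret;
    late = ctx->timed_out;
    pthread_cond_signal(&ctx->done_cond);
    pthread_mutex_unlock(&ctx->lock);

    if (late) {
        /* The client already got a timeout error; undo the side effects
         * and clean up, because the dispatcher has moved on. */
        rollback_attach_device();
        free(ctx);
    }
    return NULL;
}

/* Dispatcher side: wait up to timeout_sec, then report a timeout. */
static int dispatch_with_timeout(unsigned int timeout_sec)
{
    struct call_ctx *ctx = calloc(1, sizeof(*ctx));
    struct timespec deadline;
    pthread_t tid;
    int result;

    if (!ctx)
        return -1;
    pthread_mutex_init(&ctx->lock, NULL);
    pthread_cond_init(&ctx->done_cond, NULL);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_create(&tid, NULL, worker, ctx);
    pthread_detach(tid);

    pthread_mutex_lock(&ctx->lock);
    while (!ctx->done) {
        if (pthread_cond_timedwait(&ctx->done_cond, &ctx->lock,
                                   &deadline) == ETIMEDOUT) {
            if (!ctx->done)          /* really late, not a close race */
                ctx->timed_out = true;
            break;
        }
    }
    if (ctx->timed_out) {
        pthread_mutex_unlock(&ctx->lock);
        fprintf(stderr, "reporting timeout to the client\n");
        return -1;                   /* becomes the RPC-level timeout error */
    }
    result = ctx->result;
    pthread_mutex_unlock(&ctx->lock);
    free(ctx);                       /* worker finished in time */
    return result;
}

int main(void)
{
    return dispatch_with_timeout(3) < 0 ? 1 : 0;
}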

On Tue, Aug 05, 2014 at 03:15:18PM +0800, James wrote:
> When an API is called, we start a thread that acts as a timer. On timeout, the timer sets a timeout flag on the API call and returns a timeout result to the libvirt client. When the API call returns to the remote_dispatch level, it checks the timeout flag; if it timed out, we perform some rollback action.
> There are some problems with this solution. First, we have to figure out suitable rollback actions. Second, I'm not sure it's the best way to solve this kind of blocking problem; it's not very elegant.
> What do you think about it?
I'm not sure what you want to know. Yes, there are problems like "what rollback actions should be taken", which depends on where the call got stuck, and "what timeout should be set", which depends on thousands of factors. I can't think of any elegant solution that would properly prevent the lock-up, mainly because this is literally the halting problem [1] plus a bit more.

I'd say that whatever works for you in this situation is OK, but it will (most probably) work only for your particular scenario.

Martin

[1] https://en.wikipedia.org/wiki/Halting_problem

On 2014/8/5 17:13, Martin Kletzander wrote:
> I can't think of any elegant solution that would properly prevent the lock-up, mainly because this is literally the halting problem plus a bit more.
> I'd say that whatever works for you in this situation is OK, but it will (most probably) work only for your particular scenario.
Well, thank you very much. I'll think it over some more.

--
Best Regards
James