[libvirt] [PATCHv2] qemu: agent: Fix QEMU guest agent is not available due to an error

These days, we experienced one qemu guest agent is not available error. Then the guest agent couldn't be used forever before rebooting the libvirtd daemon or reboot the qemu-guest-agent in the VM(you can also reboot the VM) to clear the priv->agentError. This is one bug for a long time in many verisons of libvirtd: https://bugzilla.redhat.com/show_bug.cgi?id=1090551 And there is one python script to reproduce this bug from the bugzilla: https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py Just set the timeout to 0 at Line26, you can reproduce it very quickly: rsp = libvirt_qemu.qemuAgentCommand(dom, cmd, 1, 0) // 1 -> 0 The reason why this could happen is that: In case we received something like {"return": {}} for example, it is likely that somebody: 1) Started GA thread which sent one "guest-ping" command with seconds equals to VIR_DOMAIN_QEMU_AGENT_COMMAND_NOWAIT(0) makes this function return immediately without waiting, but after updating the events, it will sleep in virCondWait() or virCondWaitUntil(). 2) While in AgentIO thread the "guest-ping" command is sent successfully, with updating the events it will wait for reply. 3) Then the GA thread is woken up and the mon->msg is set to NULL and exit. 4) At last the AgentIO thread receives the reply of "guest-ping" command with the mon->msg == NULL. This case could be adapt to all the GA commands, including guest-sync, etc. This patch will check if this is the case and don't report an error but return silently. Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Signed-off-by: Zhuoyu Zhang <zhangzhuoyu@cmss.chinamobile.com> Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com> Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Qide Chen <chenqide@cmss.chinamobile.com> --- Changes in V2: - check value type instead of object type. src/qemu/qemu_agent.c | 37 ++++++++++++++++++++++++++++--------- src/util/virjson.h | 7 +++++++ 2 files changed, 35 insertions(+), 9 deletions(-) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index eeede6b..f5408cc 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -308,7 +308,6 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, { virJSONValuePtr obj = NULL; int ret = -1; - unsigned long long id; VIR_DEBUG("Line [%s]", line); @@ -333,16 +332,36 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, obj = NULL; ret = 0; } else { - /* If we've received something like: - * {"return": 1234} - * it is likely that somebody started GA - * which is now processing our previous - * guest-sync commands. Check if this is - * the case and don't report an error but + virJSONValuePtr val; + + /* In case we received something like: + * {"return": {}} for example, + * it is likely that somebody: + * + * 1) Started GA thread which sent one "guest-ping" command + * with seconds equals to VIR_DOMAIN_QEMU_AGENT_COMMAND_NOWAIT(0) + * makes this function return immediately without waiting, but + * after updating the events, it will sleep in virCondWait() or + * virCondWaitUntil(). + * + * 2) While in AgentIO thread the "guest-ping" command is sent + * successfully, with updating the events it will wait for reply. + * + * 3) Then the GA thread is woken up and the mon->msg is set + * to NULL and exit. + * + * 4) At last the AgentIO thread receives the reply of "guest-ping" + * command with the mon->msg == NULL. + * + * This case could be adapt to all the GA commands, including + * guest-sync, etc. + * + * Check if this is the case and don't report an error but * return silently. */ - if (virJSONValueObjectGetNumberUlong(obj, "return", &id) == 0) { - VIR_DEBUG("Ignoring delayed reply to guest-sync: %llu", id); + val = virJSONValueObjectGet(obj, "return"); + if (val && virJSONValueTypeIsValid(val)) { + VIR_DEBUG("Ignoring delayed command reply"); ret = 0; goto cleanup; } diff --git a/src/util/virjson.h b/src/util/virjson.h index 66ed48a..cb1d973 100644 --- a/src/util/virjson.h +++ b/src/util/virjson.h @@ -80,6 +80,13 @@ struct _virJSONValue { } data; }; +static inline bool +virJSONValueTypeIsValid(virJSONValuePtr value) +{ + return ((value->type >= VIR_JSON_TYPE_OBJECT) && + (value->type <= VIR_JSON_TYPE_BOOLEAN)); +} + void virJSONValueFree(virJSONValuePtr value); int virJSONValueObjectCreate(virJSONValuePtr *obj, ...) -- 1.8.3.1

On 07/07/2016 05:53 AM, Xiubo Li wrote:
These days, we experienced one qemu guest agent is not available error. Then the guest agent couldn't be used forever before rebooting the libvirtd daemon or reboot the qemu-guest-agent in the VM(you can also reboot the VM) to clear the priv->agentError.
This is one bug for a long time in many verisons of libvirtd: https://bugzilla.redhat.com/show_bug.cgi?id=1090551
And there is one python script to reproduce this bug from the bugzilla: https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py Just set the timeout to 0 at Line26, you can reproduce it very quickly: rsp = libvirt_qemu.qemuAgentCommand(dom, cmd, 1, 0) // 1 -> 0
The reason why this could happen is that:
In case we received something like {"return": {}} for example, it is likely that somebody:
1) Started GA thread which sent one "guest-ping" command with seconds equals to VIR_DOMAIN_QEMU_AGENT_COMMAND_NOWAIT(0) makes this function return immediately without waiting, but after updating the events, it will sleep in virCondWait() or virCondWaitUntil().
2) While in AgentIO thread the "guest-ping" command is sent successfully, with updating the events it will wait for reply.
3) Then the GA thread is woken up and the mon->msg is set to NULL and exit.
4) At last the AgentIO thread receives the reply of "guest-ping" command with the mon->msg == NULL.
This case could be adapt to all the GA commands, including guest-sync, etc.
This patch will check if this is the case and don't report an error but return silently.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Signed-off-by: Zhuoyu Zhang <zhangzhuoyu@cmss.chinamobile.com> Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com> Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Qide Chen <chenqide@cmss.chinamobile.com> ---
Changes in V2: - check value type instead of object type.
src/qemu/qemu_agent.c | 37 ++++++++++++++++++++++++++++--------- src/util/virjson.h | 7 +++++++ 2 files changed, 35 insertions(+), 9 deletions(-)
Since this has been sitting around for a long time with a response - figured I'd point out something that was pushed today: http://www.redhat.com/archives/libvir-list/2016-September/msg00570.html It addresses/resolves the issue in essentially the same manner, although it doesn't check the empty {} reply, rather it just makes the else condition be a ignored delay reply.... There's also a few more patches to the series. John
diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index eeede6b..f5408cc 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -308,7 +308,6 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, { virJSONValuePtr obj = NULL; int ret = -1; - unsigned long long id;
VIR_DEBUG("Line [%s]", line);
@@ -333,16 +332,36 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, obj = NULL; ret = 0; } else { - /* If we've received something like: - * {"return": 1234} - * it is likely that somebody started GA - * which is now processing our previous - * guest-sync commands. Check if this is - * the case and don't report an error but + virJSONValuePtr val; + + /* In case we received something like: + * {"return": {}} for example, + * it is likely that somebody: + * + * 1) Started GA thread which sent one "guest-ping" command + * with seconds equals to VIR_DOMAIN_QEMU_AGENT_COMMAND_NOWAIT(0) + * makes this function return immediately without waiting, but + * after updating the events, it will sleep in virCondWait() or + * virCondWaitUntil(). + * + * 2) While in AgentIO thread the "guest-ping" command is sent + * successfully, with updating the events it will wait for reply. + * + * 3) Then the GA thread is woken up and the mon->msg is set + * to NULL and exit. + * + * 4) At last the AgentIO thread receives the reply of "guest-ping" + * command with the mon->msg == NULL. + * + * This case could be adapt to all the GA commands, including + * guest-sync, etc. + * + * Check if this is the case and don't report an error but * return silently. */ - if (virJSONValueObjectGetNumberUlong(obj, "return", &id) == 0) { - VIR_DEBUG("Ignoring delayed reply to guest-sync: %llu", id); + val = virJSONValueObjectGet(obj, "return"); + if (val && virJSONValueTypeIsValid(val)) { + VIR_DEBUG("Ignoring delayed command reply"); ret = 0; goto cleanup; } diff --git a/src/util/virjson.h b/src/util/virjson.h index 66ed48a..cb1d973 100644 --- a/src/util/virjson.h +++ b/src/util/virjson.h @@ -80,6 +80,13 @@ struct _virJSONValue { } data; };
+static inline bool +virJSONValueTypeIsValid(virJSONValuePtr value) +{ + return ((value->type >= VIR_JSON_TYPE_OBJECT) && + (value->type <= VIR_JSON_TYPE_BOOLEAN)); +} + void virJSONValueFree(virJSONValuePtr value);
int virJSONValueObjectCreate(virJSONValuePtr *obj, ...)

Cc'ed Michal and Nikolay. On Wed, Sep 21, 2016 at 07:01:37AM -0400, John Ferlan wrote:
On 07/07/2016 05:53 AM, Xiubo Li wrote:
src/qemu/qemu_agent.c | 37 ++++++++++++++++++++++++++++--------- src/util/virjson.h | 7 +++++++ 2 files changed, 35 insertions(+), 9 deletions(-)
Since this has been sitting around for a long time with a response -
Yes, we sent the patch more than 2 months ago,exactly at 07/07/2016. And then we sent a ping mail wanting to get any response from libvirt community. But unfortunately there's none.
figured I'd point out something that was pushed today:
John, thank you for your kindly remainder. We know it just now and find out that we were even not cc'ed when it happened.
http://www.redhat.com/archives/libvir-list/2016-September/msg00570.html
It addresses/resolves the issue in essentially the same manner, although it doesn't check the empty {} reply, rather it just makes the else condition be a ignored delay reply.... There's also a few more patches to the series.
Michal, could you please explain why you merged a later-sent patch rather than the first-sent one? We worked very hard to solve this issue and contribute the patch to fix it. And we always hopefully want to exchange with you guys from the libvirt community. Our company focus on cloud computing technology and we contribute to qemu, linux kernel, kvm and openstack community. But this situation is the first time we met in our long-time contribution to these communities. This is the traditional of libvirt community or something? We strongly recommend the later patch series rebase on our patch. Thanks.
John
participants (3)
-
John Ferlan
-
Xiubo Li
-
Yaowei Bai