[libvirt] [PATCH 0/2] Adaptive timeout for connecting to qemu monitor

I am not able to test this properly (my host has "just" 32GiB of RAM), but I've patched qemu to insert some delay into its init process and it worked just fine. Michal Privoznik (2): virTimeBackOffWait: Avoid long periods of sleep qemu: Adaptive timeout for connecting to monitor src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ src/util/virtime.c | 14 ++++++++++++-- tests/qemumonitortestutils.c | 1 + 6 files changed, 54 insertions(+), 8 deletions(-) -- 2.11.0

While connecting to qemu monitor, the first thing we do is for it to show up. However, we are doing it with some timeout to avoid indefinite waits (e.g. when qemu doesn't create the monitor socket at all). After beaa447a29 we are using exponential back off timeout meaning, after the first connection attempt we wait 1ms, then 2ms, then 4 and so on. This allows us to bring down wait time for small domains where qemu initializes quickly. However, on the other end of this scale are some domains with huge amounts of guest memory. Now imagine that we've gotten up to wait time of 15 seconds. The next one is going to be 30 seconds, and the one after that whole minute. Well, okay - with current code we are not going to wait longer than 30 seconds in total, but this is going to change in the next commit. The exponential back off is usable only for first few iterations. Then it needs to be caped (one second was chosen as the limit) and switch to constant wait time. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virtime.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/src/util/virtime.c b/src/util/virtime.c index aac96918a..650b1d0f9 100644 --- a/src/util/virtime.c +++ b/src/util/virtime.c @@ -390,6 +390,9 @@ virTimeBackOffStart(virTimeBackOffVar *var, return 0; } + +#define VIR_TIME_BACK_OFF_CAP 1000 + /** * virTimeBackOffWait * @var: Timeout variable (with type virTimeBackOffVar *). @@ -410,7 +413,9 @@ virTimeBackOffStart(virTimeBackOffVar *var, * The while loop that runs the body of the code repeatedly, with an * exponential backoff. It first waits for first milliseconds, then * runs the body, then waits for 2*first ms, then runs the body again. - * Then 4*first ms, and so on. + * Then 4*first ms, and so on, up until wait time would reach + * VIR_TIME_BACK_OFF_CAP (whole second). Then it switches to constant + * waiting time of VIR_TIME_BACK_OFF_CAP. * * When timeout milliseconds is reached, the while loop ends. * @@ -429,8 +434,13 @@ virTimeBackOffWait(virTimeBackOffVar *var) if (t > var->limit_t) return 0; /* ends the while loop */ + /* Compute next wait time. Should go above VIR_TIME_BACK_OFF_CAP + * mark, cap it there to avoid long useless sleeps. */ next = var->next; - var->next *= 2; + if (var->next < VIR_TIME_BACK_OFF_CAP) + var->next *= 2; + else if (var->next > VIR_TIME_BACK_OFF_CAP) + var->next = VIR_TIME_BACK_OFF_CAP; /* If sleeping would take us beyond the limit, then shorten the * sleep. This is so we always run the body just before the final -- 2.11.0

On 03/13/2017 06:29 AM, Michal Privoznik wrote:
While connecting to qemu monitor, the first thing we do is for it
s/is for/is wait for/
to show up. However, we are doing it with some timeout to avoid indefinite waits (e.g. when qemu doesn't create the monitor socket at all). After beaa447a29 we are using exponential back off timeout meaning, after the first connection attempt we wait 1ms, then 2ms, then 4 and so on. This allows us to bring down wait time for small domains where qemu initializes quickly. However, on the other end of this scale are some domains with huge amounts of guest memory. Now imagine that we've gotten up to wait time of 15 seconds. The next one is going to be 30 seconds, and the one after that whole minute. Well, okay - with current code we are not going to wait longer than 30 seconds in total, but this is going to change in the next commit.
The exponential back off is usable only for first few iterations. Then it needs to be caped (one second was chosen as the limit) and switch to constant wait time.
Cool. Nice job remembering the subtle fix needed here with potentially large timeouts!
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virtime.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/src/util/virtime.c b/src/util/virtime.c index aac96918a..650b1d0f9 100644 --- a/src/util/virtime.c +++ b/src/util/virtime.c @@ -390,6 +390,9 @@ virTimeBackOffStart(virTimeBackOffVar *var, return 0; }
+ +#define VIR_TIME_BACK_OFF_CAP 1000
Maybe just VIT_TIME_BACKOFF_CAP?
+ /** * virTimeBackOffWait * @var: Timeout variable (with type virTimeBackOffVar *). @@ -410,7 +413,9 @@ virTimeBackOffStart(virTimeBackOffVar *var, * The while loop that runs the body of the code repeatedly, with an * exponential backoff. It first waits for first milliseconds, then * runs the body, then waits for 2*first ms, then runs the body again. - * Then 4*first ms, and so on. + * Then 4*first ms, and so on, up until wait time would reach + * VIR_TIME_BACK_OFF_CAP (whole second). Then it switches to constant + * waiting time of VIR_TIME_BACK_OFF_CAP. * * When timeout milliseconds is reached, the while loop ends. * @@ -429,8 +434,13 @@ virTimeBackOffWait(virTimeBackOffVar *var) if (t > var->limit_t) return 0; /* ends the while loop */
+ /* Compute next wait time. Should go above VIR_TIME_BACK_OFF_CAP + * mark, cap it there to avoid long useless sleeps. */
A suggestion for alternate wording of the comment: + /* Compute next wait time. Cap at VIR_TIME_BACK_OFF_CAP + * to avoid long useless sleeps. */
next = var->next; - var->next *= 2; + if (var->next < VIR_TIME_BACK_OFF_CAP) + var->next *= 2; + else if (var->next > VIR_TIME_BACK_OFF_CAP) + var->next = VIR_TIME_BACK_OFF_CAP;
/* If sleeping would take us beyond the limit, then shorten the * sleep. This is so we always run the body just before the final
ACK. Your choice on the nits. Regards, Jim

There were couple of reports on the list (e.g. [1]) that guests with huge amounts of RAM are unable to start because libvirt kills qemu in the initialization phase. The problem is that if guest is configured to use hugepages kernel has to zero them all out before handing over to qemu process. For instance, 402GiB worth of 1GiB pages took around 105 seconds (~3.8GiB/s). Since we do not want to make the timeout for connecting to monitor configurable [2], we have to teach libvirt to count with this fact. This commit implements "1s per each 1GiB of RAM" approach as suggested here [3]. 1: https://www.redhat.com/archives/libvir-list/2017-March/msg00373.html 3: https://www.redhat.com/archives/libvir-list/2017-March/msg00405.html 2: The reason is that ideally someday it will be Libvirt who creates the monitor socket and qemu will just use it. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ tests/qemumonitortestutils.c | 1 + 5 files changed, 42 insertions(+), 6 deletions(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 5a3b4ac50..54dfd22d8 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -4761,7 +4761,7 @@ virQEMUCapsInitQMPCommandRun(virQEMUCapsInitQMPCommandPtr cmd, cmd->vm->pid = cmd->pid; if (!(cmd->mon = qemuMonitorOpen(cmd->vm, &cmd->config, true, - &callbacks, NULL))) + 0, &callbacks, NULL))) goto ignore; virObjectLock(cmd->mon); diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index d71f84c80..272350bf5 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -327,11 +327,13 @@ qemuMonitorDispose(void *obj) static int -qemuMonitorOpenUnix(const char *monitor, pid_t cpid) +qemuMonitorOpenUnix(const char *monitor, + pid_t cpid, + unsigned long long timeout) { struct sockaddr_un addr; int monfd; - virTimeBackOffVar timeout; + virTimeBackOffVar timebackoff; int ret = -1; if ((monfd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { @@ -348,9 +350,9 @@ qemuMonitorOpenUnix(const char *monitor, pid_t cpid) goto error; } - if (virTimeBackOffStart(&timeout, 1, 30*1000 /* ms */) < 0) + if (virTimeBackOffStart(&timebackoff, 1, timeout * 1000) < 0) goto error; - while (virTimeBackOffWait(&timeout)) { + while (virTimeBackOffWait(&timebackoff)) { ret = connect(monfd, (struct sockaddr *) &addr, sizeof(addr)); if (ret == 0) @@ -871,10 +873,30 @@ qemuMonitorOpenInternal(virDomainObjPtr vm, } +#define QEMU_DEFAULT_MONITOR_WAIT 30 + +/** + * qemuMonitorOpen: + * @vm: domain object + * @config: monitor configuration + * @json: enable JSON on the monitor + * @timeout: how much seconds add to default timeout + * @cb: monitor event handles + * @opaque: opaque data for @cb + * + * Opens the monitor for running qemu. It may happen that it + * takes some time for qemu to create the monitor socket (e.g. + * because kernel is zeroing configured hugepages), therefore we + * wait up to default + timeout seconds for the monitor to show + * up after which a failure is claimed. + * + * Returns monitor object, NULL on error. + */ qemuMonitorPtr qemuMonitorOpen(virDomainObjPtr vm, virDomainChrSourceDefPtr config, bool json, + unsigned long long timeout, qemuMonitorCallbacksPtr cb, void *opaque) { @@ -882,10 +904,14 @@ qemuMonitorOpen(virDomainObjPtr vm, bool hasSendFD = false; qemuMonitorPtr ret; + timeout += QEMU_DEFAULT_MONITOR_WAIT; + switch (config->type) { case VIR_DOMAIN_CHR_TYPE_UNIX: hasSendFD = true; - if ((fd = qemuMonitorOpenUnix(config->data.nix.path, vm ? vm->pid : 0)) < 0) + if ((fd = qemuMonitorOpenUnix(config->data.nix.path, + vm ? vm->pid : 0, + timeout)) < 0) return NULL; break; diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 847e9458a..3c37a6ffe 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -246,6 +246,7 @@ char *qemuMonitorUnescapeArg(const char *in); qemuMonitorPtr qemuMonitorOpen(virDomainObjPtr vm, virDomainChrSourceDefPtr config, bool json, + unsigned long long timeout, qemuMonitorCallbacksPtr cb, void *opaque) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(4); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index b9c1847bb..6a9c53aea 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -1658,6 +1658,7 @@ qemuConnectMonitor(virQEMUDriverPtr driver, virDomainObjPtr vm, int asyncJob, qemuDomainObjPrivatePtr priv = vm->privateData; int ret = -1; qemuMonitorPtr mon = NULL; + unsigned long long timeout = 0; if (qemuSecuritySetDaemonSocketLabel(driver->securityManager, vm->def) < 0) { VIR_ERROR(_("Failed to set security context for monitor for %s"), @@ -1665,6 +1666,12 @@ qemuConnectMonitor(virQEMUDriverPtr driver, virDomainObjPtr vm, int asyncJob, return -1; } + /* When using hugepages, kernel zeroes them out before + * handing them over to qemu. This can be very time + * consuming. Therefore, add a seconds to timeout for each + * 1GiB of guest RAM. */ + timeout = vm->def->mem.total_memory / (1024 * 1024); + /* Hold an extra reference because we can't allow 'vm' to be * deleted until the monitor gets its own reference. */ virObjectRef(vm); @@ -1675,6 +1682,7 @@ qemuConnectMonitor(virQEMUDriverPtr driver, virDomainObjPtr vm, int asyncJob, mon = qemuMonitorOpen(vm, priv->monConfig, priv->monJSON, + timeout, &monitorCallbacks, driver); diff --git a/tests/qemumonitortestutils.c b/tests/qemumonitortestutils.c index cfd0a38cb..89857a662 100644 --- a/tests/qemumonitortestutils.c +++ b/tests/qemumonitortestutils.c @@ -1175,6 +1175,7 @@ qemuMonitorTestNew(bool json, if (!(test->mon = qemuMonitorOpen(test->vm, &src, json, + 0, &qemuMonitorTestCallbacks, driver))) goto error; -- 2.11.0

On 03/13/2017 06:29 AM, Michal Privoznik wrote:
There were couple of reports on the list (e.g. [1]) that guests with huge amounts of RAM are unable to start because libvirt kills qemu in the initialization phase. The problem is that if guest is configured to use hugepages kernel has to zero them all out before handing over to qemu process. For instance, 402GiB worth of 1GiB pages took around 105 seconds (~3.8GiB/s). Since we do not want to make the timeout for connecting to monitor configurable [2], we have to teach libvirt to count with this fact. This commit implements "1s per each 1GiB of RAM" approach as suggested here [3].
FYI, on another machine I observed ~1.5 seconds for each 5G of memory.
1: https://www.redhat.com/archives/libvir-list/2017-March/msg00373.html 3: https://www.redhat.com/archives/libvir-list/2017-March/msg00405.html
2: The reason is that ideally someday it will be Libvirt who creates the monitor socket and qemu will just use it.
This point isn't clear to me wrt the changes done by the commit.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ tests/qemumonitortestutils.c | 1 + 5 files changed, 42 insertions(+), 6 deletions(-)
diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 5a3b4ac50..54dfd22d8 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -4761,7 +4761,7 @@ virQEMUCapsInitQMPCommandRun(virQEMUCapsInitQMPCommandPtr cmd, cmd->vm->pid = cmd->pid;
if (!(cmd->mon = qemuMonitorOpen(cmd->vm, &cmd->config, true, - &callbacks, NULL))) + 0, &callbacks, NULL))) goto ignore;
virObjectLock(cmd->mon); diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index d71f84c80..272350bf5 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -327,11 +327,13 @@ qemuMonitorDispose(void *obj)
static int -qemuMonitorOpenUnix(const char *monitor, pid_t cpid) +qemuMonitorOpenUnix(const char *monitor, + pid_t cpid, + unsigned long long timeout) { struct sockaddr_un addr; int monfd; - virTimeBackOffVar timeout; + virTimeBackOffVar timebackoff; int ret = -1;
if ((monfd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { @@ -348,9 +350,9 @@ qemuMonitorOpenUnix(const char *monitor, pid_t cpid) goto error; }
- if (virTimeBackOffStart(&timeout, 1, 30*1000 /* ms */) < 0) + if (virTimeBackOffStart(&timebackoff, 1, timeout * 1000) < 0) goto error; - while (virTimeBackOffWait(&timeout)) { + while (virTimeBackOffWait(&timebackoff)) { ret = connect(monfd, (struct sockaddr *) &addr, sizeof(addr));
if (ret == 0) @@ -871,10 +873,30 @@ qemuMonitorOpenInternal(virDomainObjPtr vm, }
+#define QEMU_DEFAULT_MONITOR_WAIT 30 + +/** + * qemuMonitorOpen: + * @vm: domain object + * @config: monitor configuration + * @json: enable JSON on the monitor + * @timeout: how much seconds add to default timeout
s/how much seconds/number of seconds to/
+ * @cb: monitor event handles + * @opaque: opaque data for @cb + * + * Opens the monitor for running qemu. It may happen that it + * takes some time for qemu to create the monitor socket (e.g. + * because kernel is zeroing configured hugepages), therefore we + * wait up to default + timeout seconds for the monitor to show + * up after which a failure is claimed. + * + * Returns monitor object, NULL on error. + */ qemuMonitorPtr qemuMonitorOpen(virDomainObjPtr vm, virDomainChrSourceDefPtr config, bool json, + unsigned long long timeout, qemuMonitorCallbacksPtr cb, void *opaque) { @@ -882,10 +904,14 @@ qemuMonitorOpen(virDomainObjPtr vm, bool hasSendFD = false; qemuMonitorPtr ret;
+ timeout += QEMU_DEFAULT_MONITOR_WAIT; + switch (config->type) { case VIR_DOMAIN_CHR_TYPE_UNIX: hasSendFD = true; - if ((fd = qemuMonitorOpenUnix(config->data.nix.path, vm ? vm->pid : 0)) < 0) + if ((fd = qemuMonitorOpenUnix(config->data.nix.path, + vm ? vm->pid : 0, + timeout)) < 0) return NULL; break;
diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 847e9458a..3c37a6ffe 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -246,6 +246,7 @@ char *qemuMonitorUnescapeArg(const char *in); qemuMonitorPtr qemuMonitorOpen(virDomainObjPtr vm, virDomainChrSourceDefPtr config, bool json, + unsigned long long timeout, qemuMonitorCallbacksPtr cb, void *opaque) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(4); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index b9c1847bb..6a9c53aea 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -1658,6 +1658,7 @@ qemuConnectMonitor(virQEMUDriverPtr driver, virDomainObjPtr vm, int asyncJob, qemuDomainObjPrivatePtr priv = vm->privateData; int ret = -1; qemuMonitorPtr mon = NULL; + unsigned long long timeout = 0;
if (qemuSecuritySetDaemonSocketLabel(driver->securityManager, vm->def) < 0) { VIR_ERROR(_("Failed to set security context for monitor for %s"), @@ -1665,6 +1666,12 @@ qemuConnectMonitor(virQEMUDriverPtr driver, virDomainObjPtr vm, int asyncJob, return -1; }
+ /* When using hugepages, kernel zeroes them out before + * handing them over to qemu. This can be very time + * consuming. Therefore, add a seconds to timeout for each
s/seconds/second/
+ * 1GiB of guest RAM. */ + timeout = vm->def->mem.total_memory / (1024 * 1024); + /* Hold an extra reference because we can't allow 'vm' to be * deleted until the monitor gets its own reference. */ virObjectRef(vm); @@ -1675,6 +1682,7 @@ qemuConnectMonitor(virQEMUDriverPtr driver, virDomainObjPtr vm, int asyncJob, mon = qemuMonitorOpen(vm, priv->monConfig, priv->monJSON, + timeout, &monitorCallbacks, driver);
diff --git a/tests/qemumonitortestutils.c b/tests/qemumonitortestutils.c index cfd0a38cb..89857a662 100644 --- a/tests/qemumonitortestutils.c +++ b/tests/qemumonitortestutils.c @@ -1175,6 +1175,7 @@ qemuMonitorTestNew(bool json, if (!(test->mon = qemuMonitorOpen(test->vm, &src, json, + 0, &qemuMonitorTestCallbacks, driver))) goto error;
Looks good. You can decide about the nits. ACK in principle, but I'm not able to test this ATM. The host I was using is now running a VM that uses most of its memory. Matt might be running a test in that VM so I don't want to kill it :-). I can test the patches tomorrow and report back if you want to wait a bit before pushing. Regards, Jim

On 03/15/2017 02:16 AM, Jim Fehlig wrote:
On 03/13/2017 06:29 AM, Michal Privoznik wrote:
There were couple of reports on the list (e.g. [1]) that guests with huge amounts of RAM are unable to start because libvirt kills qemu in the initialization phase. The problem is that if guest is configured to use hugepages kernel has to zero them all out before handing over to qemu process. For instance, 402GiB worth of 1GiB pages took around 105 seconds (~3.8GiB/s). Since we do not want to make the timeout for connecting to monitor configurable [2], we have to teach libvirt to count with this fact. This commit implements "1s per each 1GiB of RAM" approach as suggested here [3].
FYI, on another machine I observed ~1.5 seconds for each 5G of memory.
1: https://www.redhat.com/archives/libvir-list/2017-March/msg00373.html 3: https://www.redhat.com/archives/libvir-list/2017-March/msg00405.html
2: The reason is that ideally someday it will be Libvirt who creates the monitor socket and qemu will just use it.
This point isn't clear to me wrt the changes done by the commit.
Okay, I can remove it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ tests/qemumonitortestutils.c | 1 + 5 files changed, 42 insertions(+), 6 deletions(-)
Looks good. You can decide about the nits. ACK in principle, but I'm not able to test this ATM. The host I was using is now running a VM that uses most of its memory. Matt might be running a test in that VM so I don't want to kill it :-). I can test the patches tomorrow and report back if you want to wait a bit before pushing.
Sure. I will wait for your report and thus full ACK. Michal

Michal Privoznik wrote:
On 03/15/2017 02:16 AM, Jim Fehlig wrote:
On 03/13/2017 06:29 AM, Michal Privoznik wrote:
There were couple of reports on the list (e.g. [1]) that guests with huge amounts of RAM are unable to start because libvirt kills qemu in the initialization phase. The problem is that if guest is configured to use hugepages kernel has to zero them all out before handing over to qemu process. For instance, 402GiB worth of 1GiB pages took around 105 seconds (~3.8GiB/s). Since we do not want to make the timeout for connecting to monitor configurable [2], we have to teach libvirt to count with this fact. This commit implements "1s per each 1GiB of RAM" approach as suggested here [3]. FYI, on another machine I observed ~1.5 seconds for each 5G of memory.
1: https://www.redhat.com/archives/libvir-list/2017-March/msg00373.html 3: https://www.redhat.com/archives/libvir-list/2017-March/msg00405.html
2: The reason is that ideally someday it will be Libvirt who creates the monitor socket and qemu will just use it. This point isn't clear to me wrt the changes done by the commit.
Okay, I can remove it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ tests/qemumonitortestutils.c | 1 + 5 files changed, 42 insertions(+), 6 deletions(-)
Looks good. You can decide about the nits. ACK in principle, but I'm not able to test this ATM. The host I was using is now running a VM that uses most of its memory. Matt might be running a test in that VM so I don't want to kill it :-). I can test the patches tomorrow and report back if you want to wait a bit before pushing.
Sure. I will wait for your report and thus full ACK.
Test results were positive. ACK series. Regards, Jim

On 03/15/2017 03:28 PM, Jim Fehlig wrote:
Michal Privoznik wrote:
On 03/15/2017 02:16 AM, Jim Fehlig wrote:
On 03/13/2017 06:29 AM, Michal Privoznik wrote:
There were couple of reports on the list (e.g. [1]) that guests with huge amounts of RAM are unable to start because libvirt kills qemu in the initialization phase. The problem is that if guest is configured to use hugepages kernel has to zero them all out before handing over to qemu process. For instance, 402GiB worth of 1GiB pages took around 105 seconds (~3.8GiB/s). Since we do not want to make the timeout for connecting to monitor configurable [2], we have to teach libvirt to count with this fact. This commit implements "1s per each 1GiB of RAM" approach as suggested here [3]. FYI, on another machine I observed ~1.5 seconds for each 5G of memory.
1: https://www.redhat.com/archives/libvir-list/2017-March/msg00373.html 3: https://www.redhat.com/archives/libvir-list/2017-March/msg00405.html
2: The reason is that ideally someday it will be Libvirt who creates the monitor socket and qemu will just use it. This point isn't clear to me wrt the changes done by the commit.
Okay, I can remove it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ tests/qemumonitortestutils.c | 1 + 5 files changed, 42 insertions(+), 6 deletions(-)
Looks good. You can decide about the nits. ACK in principle, but I'm not able to test this ATM. The host I was using is now running a VM that uses most of its memory. Matt might be running a test in that VM so I don't want to kill it :-). I can test the patches tomorrow and report back if you want to wait a bit before pushing.
Sure. I will wait for your report and thus full ACK.
Test results were positive. ACK series.
Thank you. I've pushed all the patches. If we ever find that our guessing algorithm doesn't work we can revisit it. BTW: I've run couple of tests myself too and am quite surprised to learn that it takes ages to zero out memory. On my system it's about ~3.6GiB/s. And this is DDR4@2133 MHz. Michal

On 03/13/2017 06:29 AM, Michal Privoznik wrote:
I am not able to test this properly (my host has "just" 32GiB of RAM), but I've patched qemu to insert some delay into its init process and it worked just fine.
Damn, I should have scanned the list today before also working on this problem. Attached is a patch I used to test the concept. It's a hack, but in the end calculates the same timeout as your series. It worked well in my testing.
Michal Privoznik (2): virTimeBackOffWait: Avoid long periods of sleep qemu: Adaptive timeout for connecting to monitor
Yep, I like these changes over the hack :-). Regards, Jim
src/qemu/qemu_capabilities.c | 2 +- src/qemu/qemu_monitor.c | 36 +++++++++++++++++++++++++++++++----- src/qemu/qemu_monitor.h | 1 + src/qemu/qemu_process.c | 8 ++++++++ src/util/virtime.c | 14 ++++++++++++-- tests/qemumonitortestutils.c | 1 + 6 files changed, 54 insertions(+), 8 deletions(-)

Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/news.xml | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/docs/news.xml b/docs/news.xml index 04783aa5e..6ce6ab362 100644 --- a/docs/news.xml +++ b/docs/news.xml @@ -43,7 +43,20 @@ </change> </section> <section title="Bug fixes"> - <change/> + <change> + <summary> + qemu: Adaptive timeout for connecting to monitor + </summary> + <description> + When starting qemu, libvirt waits for qemu to create the monitor + socket which libvirt connects to. Historically, there was sharp 30 + seconds timeout after which the qemu process was killed. This + approach is suboptimal as in some scenarios with huge amounts of + guest RAM it can take a minute or more for kernel to allocate and + zero out pages for qemu. The timeout is now flexible and computed by + libvirt at domain startup. + </description> + </change> </section> </release> <release version="v3.1.0" date="2017-03-03"> -- 2.11.0

On Wed, Mar 15, 2017 at 01:06:12PM +0100, Michal Privoznik wrote:
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/news.xml | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/docs/news.xml b/docs/news.xml index 04783aa5e..6ce6ab362 100644 --- a/docs/news.xml +++ b/docs/news.xml @@ -43,7 +43,20 @@ </change> </section> <section title="Bug fixes"> - <change/> + <change> + <summary> + qemu: Adaptive timeout for connecting to monitor + </summary> + <description> + When starting qemu, libvirt waits for qemu to create the monitor + socket which libvirt connects to. Historically, there was sharp 30 + seconds timeout after which the qemu process was killed. This
s/seconds/second/ (as in 30-second timeout) feels more English, I guess.
+ approach is suboptimal as in some scenarios with huge amounts of + guest RAM it can take a minute or more for kernel to allocate and + zero out pages for qemu. The timeout is now flexible and computed by + libvirt at domain startup. + </description> + </change> </section>
I'll take the opportunity to repeat here what I said in another news-related thread. The good thing about the new news file layout is that we can express freely what is the feature that was added and we don't have to copy-paste the restricted commit messages. So I, personally, like to have nicely formatted sentences in the summary (well, I like that even in the commit messages sometimes, but that's another story). I would go with: "When connecting to qemu monitor, the timeout is now adaptive" or even: "Better (or dynamic determination) strategy is now used for qemu monitor connection timeout" for the summary. Or something along the lines. What I say is (as before) pretty subjective, so I'll leave the final decision up to you, just wanted to put it out there (yet again). I'm trying to imagine the user going through these and immediately having an idea of the list of things being done. Commit names are more for developers. ACK P.S.: Isn't it nice when you can just copy-paste reviews? =D P.P.S.: Would it be to much if I also said that shortening the description to something like the following would be enough? "When launching QEMU, libvirt now adjusts the timeout for the initial connection dynamically as oposed to the static 30 seconds before" Still, your decision on that.
</release> <release version="v3.1.0" date="2017-03-03"> -- 2.11.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Wed, 2017-03-15 at 14:10 +0100, Martin Kletzander wrote:
I'll take the opportunity to repeat here what I said in another news-related thread. The good thing about the new news file layout is that we can express freely what is the feature that was added and we don't have to copy-paste the restricted commit messages. So I, personally, like to have nicely formatted sentences in the summary (well, I like that even in the commit messages sometimes, but that's another story). I would go with: "When connecting to qemu monitor, the timeout is now adaptive" or even: "Better (or dynamic determination) strategy is now used for qemu monitor connection timeout" for the summary. Or something along the lines. What I say is (as before) pretty subjective, so I'll leave the final decision up to you, just wanted to put it out there (yet again). I'm trying to imagine the user going through these and immediately having an idea of the list of things being done. Commit names are more for developers.
I think having a prefix, such a "qemu:" in this case, helps you scanning the release notes and very quickly realize whether or not any of the changes are relevant to you. I also believe that, while we don't necessarily have to artificially limit ourselves, having a fairly short summary is usually good for the same reasons explained above. So basically I like Michal's original summary more than I like yours ;) I would s/Adaptive/Use adaptive/ though. -- Andrea Bolognani / Red Hat / Virtualization

On Wed, 2017-03-15 at 14:10 +0100, Martin Kletzander wrote: I would s/Adaptive/Use adaptive/ though. Oh, and it should be "QEMU" rather than "qemu" in the description ;) -- Andrea Bolognani / Red Hat / Virtualization

On Wed, Mar 15, 2017 at 03:40:04PM +0100, Andrea Bolognani wrote:
On Wed, 2017-03-15 at 14:10 +0100, Martin Kletzander wrote:
I'll take the opportunity to repeat here what I said in another news-related thread. The good thing about the new news file layout is that we can express freely what is the feature that was added and we don't have to copy-paste the restricted commit messages. So I, personally, like to have nicely formatted sentences in the summary (well, I like that even in the commit messages sometimes, but that's another story). I would go with: "When connecting to qemu monitor, the timeout is now adaptive" or even: "Better (or dynamic determination) strategy is now used for qemu monitor connection timeout" for the summary. Or something along the lines. What I say is (as before) pretty subjective, so I'll leave the final decision up to you, just wanted to put it out there (yet again). I'm trying to imagine the user going through these and immediately having an idea of the list of things being done. Commit names are more for developers.
I think having a prefix, such a "qemu:" in this case, helps you scanning the release notes and very quickly realize whether or not any of the changes are relevant to you.
OK, fair enough, in this case (some other changes have non-related prefix every now and then).
I also believe that, while we don't necessarily have to artificially limit ourselves, having a fairly short summary is usually good for the same reasons explained above.
So basically I like Michal's original summary more than I like yours ;)
I would s/Adaptive/Use adaptive/ though.
Sure, that makes it a sentence as well ;) So it fits my requirement as well =)
-- Andrea Bolognani / Red Hat / Virtualization
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
participants (4)
-
Andrea Bolognani
-
Jim Fehlig
-
Martin Kletzander
-
Michal Privoznik