[libvirt] [PATCH 0/3] qemu: Don't get stuck in bulk-stats API on broken NFS

Accessing VM disk files while we weren't able to get the stats from qemu directly as it was most probably stuck on missing storage will make libvrtd's thread handling the stats get stuck. Avoid touching files in such case. Peter Krempa (3): qemu: driver: Remove unnecessary flag in qemuDomainGetStatsBlock qemu: driver: Separate bulk stats worker for block devices qemu: bulk stats: Don't access possibly blocked storage src/qemu/qemu_driver.c | 78 +++++++++++++++++++++++++++++++------------------- 1 file changed, 49 insertions(+), 29 deletions(-) -- 2.8.2

'abbreviated' was true if 'stats' were NULL --- src/qemu/qemu_driver.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 00daf72..5917bc2 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -19335,7 +19335,6 @@ qemuDomainGetStatsOneBlock(virQEMUDriverPtr driver, virStorageSourcePtr src, size_t block_idx, unsigned int backing_idx, - bool abbreviated, virHashTablePtr stats) { qemuBlockStats *entry; @@ -19354,7 +19353,7 @@ qemuDomainGetStatsOneBlock(virQEMUDriverPtr driver, QEMU_ADD_BLOCK_PARAM_UI(record, maxparams, block_idx, "backingIndex", backing_idx); - if (abbreviated || !alias || !(entry = virHashLookup(stats, alias))) { + if (!stats || !alias || !(entry = virHashLookup(stats, alias))) { if (virStorageSourceIsEmpty(src)) { ret = 0; goto cleanup; @@ -19431,15 +19430,12 @@ qemuDomainGetStatsBlock(virQEMUDriverPtr driver, int rc; virHashTablePtr stats = NULL; qemuDomainObjPrivatePtr priv = dom->privateData; - bool abbreviated = false; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); int count_index = -1; size_t visited = 0; bool visitBacking = !!(privflags & QEMU_DOMAIN_STATS_BACKING); - if (!HAVE_JOB(privflags) || !virDomainObjIsActive(dom)) { - abbreviated = true; /* it's ok, just go ahead silently */ - } else { + if (HAVE_JOB(privflags) && virDomainObjIsActive(dom)) { qemuDomainObjEnterMonitor(driver, dom); rc = qemuMonitorGetAllBlockStatsInfo(priv->mon, &stats, visitBacking); @@ -19449,10 +19445,9 @@ qemuDomainGetStatsBlock(virQEMUDriverPtr driver, if (qemuDomainObjExitMonitor(driver, dom) < 0) goto cleanup; - if (rc < 0) { + /* failure to retrieve stats is fine at this point */ + if (rc < 0) virResetLastError(); - abbreviated = true; /* still ok, again go ahead silently */ - } } /* When listing backing chains, it's easier to fix up the count @@ -19469,7 +19464,7 @@ qemuDomainGetStatsBlock(virQEMUDriverPtr driver, while (src && (backing_idx == 0 || visitBacking)) { if (qemuDomainGetStatsOneBlock(driver, cfg, dom, record, maxparams, disk, src, visited, backing_idx, - abbreviated, stats) < 0) + stats) < 0) goto cleanup; visited++; backing_idx++; -- 2.8.2

On Wed, May 18, 2016 at 15:14:27 +0200, Peter Krempa wrote:
'abbreviated' was true if 'stats' were NULL --- src/qemu/qemu_driver.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-)
ACK Jirka

Extract the fallback path that reloads the stats from disk into a separate function. --- src/qemu/qemu_driver.c | 58 ++++++++++++++++++++++++++++++++------------------ 1 file changed, 37 insertions(+), 21 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 5917bc2..ee50577 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -19324,6 +19324,40 @@ do { \ goto cleanup; \ } while (0) +/* refresh information by opening images on the disk */ +static int +qemuDomainGetStatsOneBlockFallback(virQEMUDriverPtr driver, + virQEMUDriverConfigPtr cfg, + virDomainObjPtr dom, + virDomainStatsRecordPtr record, + int *maxparams, + virStorageSourcePtr src, + size_t block_idx) +{ + int ret = -1; + + if (virStorageSourceIsEmpty(src)) + return 0; + + if (qemuStorageLimitsRefresh(driver, cfg, dom, src) < 0) { + virResetLastError(); + return 0; + } + + if (src->allocation) + QEMU_ADD_BLOCK_PARAM_ULL(record, maxparams, block_idx, + "allocation", src->allocation); + if (src->capacity) + QEMU_ADD_BLOCK_PARAM_ULL(record, maxparams, block_idx, + "capacity", src->capacity); + if (src->physical) + QEMU_ADD_BLOCK_PARAM_ULL(record, maxparams, block_idx, + "physical", src->physical); + ret = 0; + cleanup: + return ret; +} + static int qemuDomainGetStatsOneBlock(virQEMUDriverPtr driver, @@ -19353,28 +19387,10 @@ qemuDomainGetStatsOneBlock(virQEMUDriverPtr driver, QEMU_ADD_BLOCK_PARAM_UI(record, maxparams, block_idx, "backingIndex", backing_idx); + /* use fallback path if data is not available */ if (!stats || !alias || !(entry = virHashLookup(stats, alias))) { - if (virStorageSourceIsEmpty(src)) { - ret = 0; - goto cleanup; - } - - if (qemuStorageLimitsRefresh(driver, cfg, dom, src) < 0) { - virResetLastError(); - ret = 0; - goto cleanup; - } - - if (src->allocation) - QEMU_ADD_BLOCK_PARAM_ULL(record, maxparams, block_idx, - "allocation", src->allocation); - if (src->capacity) - QEMU_ADD_BLOCK_PARAM_ULL(record, maxparams, block_idx, - "capacity", src->capacity); - if (src->physical) - QEMU_ADD_BLOCK_PARAM_ULL(record, maxparams, block_idx, - "physical", src->physical); - ret = 0; + ret = qemuDomainGetStatsOneBlockFallback(driver, cfg, dom, record, + maxparams, src, block_idx); goto cleanup; } -- 2.8.2

On Wed, May 18, 2016 at 15:14:28 +0200, Peter Krempa wrote:
Extract the fallback path that reloads the stats from disk into a separate function. --- src/qemu/qemu_driver.c | 58 ++++++++++++++++++++++++++++++++------------------ 1 file changed, 37 insertions(+), 21 deletions(-)
ACK Jirka

When libvirt is gathering stats for block devices in the bulk stats API it would use the fallback code that accesses the files directly in libvirt both if the VM was offline and if qemu didn't return the stats at all. If qemu is not cooperating due to being stuck on an inaccessible NFS share we would then attempt to read the files and get stuck too with the VM object locked. All other APIs would get eventually get stuck waiting on the VM lock. Avoid this problem by skipping the block stats if the VM is online but the monitor did not provide any stats. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1337073 --- src/qemu/qemu_driver.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index ee50577..d244fd3 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -19387,13 +19387,22 @@ qemuDomainGetStatsOneBlock(virQEMUDriverPtr driver, QEMU_ADD_BLOCK_PARAM_UI(record, maxparams, block_idx, "backingIndex", backing_idx); - /* use fallback path if data is not available */ - if (!stats || !alias || !(entry = virHashLookup(stats, alias))) { + /* the VM is offline so we have to go and load the stast from the disk by + * ourselves */ + if (!virDomainObjIsActive(dom)) { ret = qemuDomainGetStatsOneBlockFallback(driver, cfg, dom, record, maxparams, src, block_idx); goto cleanup; } + /* In case where qemu didn't provide the stats we stop here rather than + * trying to refresh the stats from the disk. Inability to provide stats is + * usually caused by blocked storage so this would make libvirtd hang */ + if (!stats || !alias || !(entry = virHashLookup(stats, alias))) { + ret = 0; + goto cleanup; + } + QEMU_ADD_BLOCK_PARAM_LL(record, maxparams, block_idx, "rd.reqs", entry->rd_req); QEMU_ADD_BLOCK_PARAM_LL(record, maxparams, block_idx, -- 2.8.2

On Wed, May 18, 2016 at 15:14:29 +0200, Peter Krempa wrote:
When libvirt is gathering stats for block devices in the bulk stats API it would use the fallback code that accesses the files directly in libvirt both if the VM was offline and if qemu didn't return the stats at all.
It took me a while to understand what you wanted to say here. It would be nice if you could reword it :-)
If qemu is not cooperating due to being stuck on an inaccessible NFS share we would then attempt to read the files and get stuck too with the VM object locked. All other APIs would get eventually get stuck
s/ get / /
waiting on the VM lock.
Avoid this problem by skipping the block stats if the VM is online but the monitor did not provide any stats.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1337073 --- src/qemu/qemu_driver.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-)
ACK Jirka
participants (2)
-
Jiri Denemark
-
Peter Krempa