Re: [PATCH] qemu: Don't report error from domblkinfo if vm is inactive

6 Sep 2022

On Tue, Sep 06, 2022 at 10:29:01 +0800, huangy81@chinatelecom.cn wrote:
...
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Libvirt logs and reports error when executing domblkinfo if vm
configured rbd storage and in inactive state.
The steps to reproduce the problem:
1. define and start domain with its storage configured rbd disk,
   and corresponding disk label is 'vda'.
2. set vm in inactive state.
3. call 'virsh domblklinfo' as the following and problem reproduced.
$ virsh domblkinfo vm1 vda
error: internal error: missing storage backend for network files using
rbd protocol
Meanwhile, libvirtd log message also report the same error.
To fix this, validate the disk type if vm is inactive before
refreshing capacity and allocation limits of a given storage source
in qemuDomainGetBlockInfo in advance, if storage source type is
VIR_STORAGE_TYPE_NETWORK and net protocol is
VIR_STORAGE_NET_PROTOCAOL_RBD, set info to 0 like 'domblkinfo --all'
command does and return directly.
This problem is not specific for RBD, but for everything where the
backend is not loaded or doesn't exist.
...
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1724808
This BZ is not appropriate, this was reported for log entries in the
bulk stats code and more importantly it's actually closed as resolved,
so pointing to it makes no sense.
...
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Signed-off-by: Pengcheng Deng <dengpc12@chinatelecom.cn>
---
 src/qemu/qemu_driver.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index c7cca64..bfe1fa2 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -11118,6 +11118,19 @@ qemuDomainGetBlockInfo(virDomainPtr dom,
/* for inactive domains we have to peek into the files */
     if (!virDomainObjIsActive(vm)) {
+        /* for rbd protocol, virStorageFileBackend not loaded if vm is inactive,
+         * so generate 0 based info like 'domblkinfo --all' does and return directly
+         * */
+        if (virStorageSourceGetActualType(disk->src) == VIR_STORAGE_TYPE_NETWORK &&
+            disk->src->protocol == VIR_STORAGE_NET_PROTOCOL_RBD) {
So with this piece of code you plaster over a very specific sub-part of
the problem noted above, but leave other bits unadressed. What's worse
that the behaviour will differ between protocols and states which is
unacceptable.

If you look at the BZ you've linked above you'll find there patches
which fix the issue in the bulk stats code. Apart from commits for
adding the infrastructure there is the following commit:

  commit 60b862cf9d6a335db65bbf2b011499dfa729ca2e
  Author: Peter Krempa <pkrempa@redhat.com>
  Date:   Wed Aug 14 18:46:09 2019 +0200
  
      qemu: Don't report some ignored errors in qemuDomainGetStatsOneBlockFallback
  
      The function ignores all errors from qemuStorageLimitsRefresh by calling
      virResetLastError. This still logs them. Since qemuStorageLimitsRefresh
      allows suppressing some, do so.
  
      Signed-off-by: Peter Krempa <pkrempa@redhat.com>
      Reviewed-by: Ján Tomko <jtomko@redhat.com>
  
  diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
  index 243a8ac4cf..f44d134b92 100644
  --- a/src/qemu/qemu_driver.c
  +++ b/src/qemu/qemu_driver.c
  @@ -21301,7 +21301,7 @@ qemuDomainGetStatsOneBlockFallback(virQEMUDriverPtr driver,
       if (virStorageSourceIsEmpty(src))
           return 0;
  
  -    if (qemuStorageLimitsRefresh(driver, cfg, dom, src, false) < 0) {
  +    if (qemuStorageLimitsRefresh(driver, cfg, dom, src, true) <= 0) {
           virResetLastError();
           return 0;
  

So at the very least to properly address all instances this patch would
look similarly to the above.
...
+            info->capacity = 0;
+            info->allocation = 0;
+            info->physical = 0;
+
+            ret = 0;
+            goto endjob;
+        }
+
         if ((qemuStorageLimitsRefresh(driver, cfg, vm, disk->src, false)) < 0)
             goto endjob;
Now for this very specific instance 'qemuDomainGetBlockInfo' is querying
data for only one disk (in contrast to the bulk stats API). As of such
it's okay to report an error if the required data can't be obtained.
Additionally that is what this code was doing for a long time.

In case of individual query APIs it's usually better if the user gets an
error if the required data can't be fetched.

Based on the above, my stance is that the behaviour of reporting an
error is correct here. If you need to collect stats without reporting
error I strongly suggest using the bulk stats API as that is
specifically designed to omit information it can't gather. Here where
the query is specific, failure to obtain the information should produce
an error.