[PATCH 0/7] qemu: Fix broken qemu caps cache invalidation

Bugs in the code meant to invalidate the qemu capabilities cache resulted in us always re-probing capabilities on startup of libvirtd/virtqemud. This got extremely annoying with the modern deployments using --timeout. Apart from issues in inserting the detected features, it also turns out that AMD boxes report data which changes based on which CPU the instruction happened to execute, so we need to mask that out. Peter Krempa (7): virCPUx86DataIsIdentical: Add debug output virCPUx86DataGetHost: Fix construction of the returned data virHostCPUGetCPUID: Add comment on how KVM_GET_SUPPORTED_CPUID works util: virhostcpu: Extract filtering of the returned data from virHostCPUGetCPUID virHostCPUGetCPUID: Fix possible allocation of huge amount of memory virHostCPUGetCPUID: Limit the buffer size ranges virHostCPUGetCPUIDFilterVolatile: Filter out topology data on AMD src/cpu/cpu_x86.c | 31 ++++++++++--- src/util/virhostcpu.c | 104 ++++++++++++++++++++++++++++++++++++------ 2 files changed, 116 insertions(+), 19 deletions(-) -- 2.35.1

Without this it's impossible to debug scenarios when this function returns a mismatch but the formatted data looks identical. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/cpu/cpu_x86.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/src/cpu/cpu_x86.c b/src/cpu/cpu_x86.c index 9826ec6190..18e9cacfd0 100644 --- a/src/cpu/cpu_x86.c +++ b/src/cpu/cpu_x86.c @@ -3309,14 +3309,20 @@ virCPUx86DataIsIdentical(const virCPUData *a, if (!a || !b) return VIR_CPU_COMPARE_ERROR; - if (a->arch != b->arch) + if (a->arch != b->arch) { + VIR_DEBUG("incompatible architecture a:%u b:%u", a->arch, b->arch); return VIR_CPU_COMPARE_INCOMPATIBLE; + } - if (!((adata = &a->data.x86) && (bdata = &b->data.x86))) + if (!((adata = &a->data.x86) && (bdata = &b->data.x86))) { + VIR_DEBUG("missing x86 data: a:%p b:%p", adata, bdata); return VIR_CPU_COMPARE_ERROR; + } - if (adata->len != bdata->len) + if (adata->len != bdata->len) { + VIR_DEBUG("unequal length a:%zu b:%zu", adata->len, bdata->len); return VIR_CPU_COMPARE_INCOMPATIBLE; + } for (i = 0; i < adata->len; ++i) { bool found = false; @@ -3330,8 +3336,10 @@ virCPUx86DataIsIdentical(const virCPUData *a, break; } - if (!found) + if (!found) { + VIR_DEBUG("mismatched data"); return VIR_CPU_COMPARE_INCOMPATIBLE; + } } return VIR_CPU_COMPARE_IDENTICAL; -- 2.35.1

The function returns 'virCPUData' but doesn't do two important steps which other code takes: 1) leaves with all-zero data is stripped from the XML output 2) the data is expected to be sorted in the array Now the 'virHostCPUGetCPUID' helper returns both all 0 leaves and doesn't order them as we expect. If this is then used in conjunction with 'virCPUx86DataIsIdentical' together with data which made a roundtrip to XML and back the result will be always false even if the data itself is identical. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/cpu/cpu_x86.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/src/cpu/cpu_x86.c b/src/cpu/cpu_x86.c index 18e9cacfd0..7e9d1cea47 100644 --- a/src/cpu/cpu_x86.c +++ b/src/cpu/cpu_x86.c @@ -3354,16 +3354,17 @@ virCPUx86DataGetHost(void) size_t i; virCPUData *cpuid; g_autofree struct kvm_cpuid2 *kvm_cpuid = NULL; + virCPUx86DataItem zero = { 0 }; if ((kvm_cpuid = virHostCPUGetCPUID()) == NULL) return NULL; cpuid = virCPUDataNew(virArchFromHost()); - cpuid->data.x86.len = kvm_cpuid->nent; + cpuid->data.x86.len = 0; cpuid->data.x86.items = g_new0(virCPUx86DataItem, kvm_cpuid->nent); for (i = 0; i < kvm_cpuid->nent; ++i) { - virCPUx86DataItem *item = &cpuid->data.x86.items[i]; + virCPUx86DataItem *item = &cpuid->data.x86.items[cpuid->data.x86.len]; item->type = VIR_CPU_X86_DATA_CPUID; item->data.cpuid.eax_in = kvm_cpuid->entries[i].function; item->data.cpuid.ecx_in = kvm_cpuid->entries[i].index; @@ -3371,8 +3372,18 @@ virCPUx86DataGetHost(void) item->data.cpuid.ebx = kvm_cpuid->entries[i].ebx; item->data.cpuid.ecx = kvm_cpuid->entries[i].ecx; item->data.cpuid.edx = kvm_cpuid->entries[i].edx; + + /* skip all-zero leaves same as we do in the XML formatter */ + if (virCPUx86DataItemMatch(item, &zero)) + continue; + + cpuid->data.x86.len++; } + /* the rest of the code expects the function to be in order */ + qsort(cpuid->data.x86.items, cpuid->data.x86.len, + sizeof(virCPUx86DataItem), virCPUx86DataSorter); + return cpuid; } #endif -- 2.35.1

The commit adding the code fetching host CPU flags via the KVM_GET_SUPPORTED_CPUID didn't describe at all why such an alghorithm is needed. Add a comment from the documentation outlining how the userspace function is expected to allocate memory here. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostcpu.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/util/virhostcpu.c b/src/util/virhostcpu.c index 011ef8a153..6be00a5b76 100644 --- a/src/util/virhostcpu.c +++ b/src/util/virhostcpu.c @@ -1327,6 +1327,13 @@ virHostCPUGetCPUID(void) return NULL; } + /* Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure + * with the 'nent' field indicating the number of entries in the variable-size + * array 'entries'. If the number of entries is too low to describe the cpu + * capabilities, an error (E2BIG) is returned. If the number is too high, + * the 'nent' field is adjusted and an error (ENOMEM) is returned. If the + * number is just right, the 'nent' field is adjusted to the number of valid + * entries in the 'entries' array, which is then filled. */ for (i = 1; i < INT32_MAX; i *= 2) { g_autofree struct kvm_cpuid2 *kvm_cpuid = NULL; kvm_cpuid = g_malloc0(sizeof(struct kvm_cpuid2) + -- 2.35.1

Move the filtering code into virHostCPUGetCPUIDFilterVolatile. This also removes a safe but very questionable reuse of 'i' iterator in the both the top level and nested loop. It's safe for now as the to level loop will not iterate any more in the current state. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostcpu.c | 36 +++++++++++++++++++++++++++--------- 1 file changed, 27 insertions(+), 9 deletions(-) diff --git a/src/util/virhostcpu.c b/src/util/virhostcpu.c index 6be00a5b76..f07514a11b 100644 --- a/src/util/virhostcpu.c +++ b/src/util/virhostcpu.c @@ -1316,6 +1316,32 @@ virHostCPUGetMSR(unsigned long index, } +/** + * virHostCPUGetCPUIDFilterVolatile: + * + * Filters the 'kvm_cpuid2' struct and removes data which may change depending + * on the CPU core this was run on. + * + * Currently filtered fields: + * - local APIC ID + */ +static void +virHostCPUGetCPUIDFilterVolatile(struct kvm_cpuid2 *kvm_cpuid) +{ + size_t i; + + for (i = 0; i < kvm_cpuid->nent; ++i) { + struct kvm_cpuid_entry2 *entry = &kvm_cpuid->entries[i]; + + /* filter out local apic id */ + if (entry->function == 0x01 && entry->index == 0x00) + entry->ebx &= 0x00ffffff; + if (entry->function == 0x0b) + entry->edx &= 0xffffff00; + } +} + + struct kvm_cpuid2 * virHostCPUGetCPUID(void) { @@ -1341,15 +1367,7 @@ virHostCPUGetCPUID(void) kvm_cpuid->nent = i; if (ioctl(fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) == 0) { - /* filter out local apic id */ - for (i = 0; i < kvm_cpuid->nent; ++i) { - struct kvm_cpuid_entry2 *entry = &kvm_cpuid->entries[i]; - if (entry->function == 0x01 && entry->index == 0x00) - entry->ebx &= 0x00ffffff; - if (entry->function == 0x0b) - entry->edx &= 0xffffff00; - } - + virHostCPUGetCPUIDFilterVolatile(kvm_cpuid); return g_steal_pointer(&kvm_cpuid); } } -- 2.35.1

In case when the 'KVM_GET_SUPPORTED_CPUID' ioctl on /dev/kvm would fail for other reason than the documented E2BIG, our code would continue looping and calling it while always increasing the memory buffer even when that will not help. Rewrite the function to allow another iteration only with the correct errno. Additionally rename the 'i' variable to 'alloc_size' as it's not a pure iterator. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostcpu.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/src/util/virhostcpu.c b/src/util/virhostcpu.c index f07514a11b..aa21c567be 100644 --- a/src/util/virhostcpu.c +++ b/src/util/virhostcpu.c @@ -1345,7 +1345,7 @@ virHostCPUGetCPUIDFilterVolatile(struct kvm_cpuid2 *kvm_cpuid) struct kvm_cpuid2 * virHostCPUGetCPUID(void) { - size_t i; + size_t alloc_size; VIR_AUTOCLOSE fd = open(KVM_DEVICE, O_RDONLY); if (fd < 0) { @@ -1360,16 +1360,26 @@ virHostCPUGetCPUID(void) * the 'nent' field is adjusted and an error (ENOMEM) is returned. If the * number is just right, the 'nent' field is adjusted to the number of valid * entries in the 'entries' array, which is then filled. */ - for (i = 1; i < INT32_MAX; i *= 2) { + for (alloc_size = 1; alloc_size < INT32_MAX; alloc_size *= 2) { g_autofree struct kvm_cpuid2 *kvm_cpuid = NULL; + kvm_cpuid = g_malloc0(sizeof(struct kvm_cpuid2) + - sizeof(struct kvm_cpuid_entry2) * i); - kvm_cpuid->nent = i; + sizeof(struct kvm_cpuid_entry2) * alloc_size); + kvm_cpuid->nent = alloc_size; if (ioctl(fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) == 0) { virHostCPUGetCPUIDFilterVolatile(kvm_cpuid); return g_steal_pointer(&kvm_cpuid); } + + /* enlarge the buffer and try again */ + if (errno == E2BIG) { + VIR_DEBUG("looping %zu", alloc_size); + continue; + } + + /* we fail on any other error code to prevent pointless looping */ + break; } virReportSystemError(errno, "%s", _("Cannot read host CPUID")); -- 2.35.1

Raise the lower bound to '64' as that seems to currently be the first size that can fit the CPU data for a modern cpu. Lower the upper bound to an arbitrary 65536. So many cpu features ougth to be enough for everyone. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostcpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/util/virhostcpu.c b/src/util/virhostcpu.c index aa21c567be..9a9fcf11c0 100644 --- a/src/util/virhostcpu.c +++ b/src/util/virhostcpu.c @@ -1360,7 +1360,7 @@ virHostCPUGetCPUID(void) * the 'nent' field is adjusted and an error (ENOMEM) is returned. If the * number is just right, the 'nent' field is adjusted to the number of valid * entries in the 'entries' array, which is then filled. */ - for (alloc_size = 1; alloc_size < INT32_MAX; alloc_size *= 2) { + for (alloc_size = 64; alloc_size <= 65536; alloc_size *= 2) { g_autofree struct kvm_cpuid2 *kvm_cpuid = NULL; kvm_cpuid = g_malloc0(sizeof(struct kvm_cpuid2) + -- 2.35.1

AMD cpus report Core (compute unit) identifiers of the cpu running the instruction under leaf 0x8000001e. This data is not needed for libvirt and actually breaks caching of the qemu capabilities where we check that all of the CPU flags to be identical. Mask out all of leaf 0x8000001e. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostcpu.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/src/util/virhostcpu.c b/src/util/virhostcpu.c index 9a9fcf11c0..e3f2d6d4b7 100644 --- a/src/util/virhostcpu.c +++ b/src/util/virhostcpu.c @@ -1324,11 +1324,13 @@ virHostCPUGetMSR(unsigned long index, * * Currently filtered fields: * - local APIC ID + * - topology ids and information on AMD cpus */ static void virHostCPUGetCPUIDFilterVolatile(struct kvm_cpuid2 *kvm_cpuid) { size_t i; + bool isAMD = false; for (i = 0; i < kvm_cpuid->nent; ++i) { struct kvm_cpuid_entry2 *entry = &kvm_cpuid->entries[i]; @@ -1338,6 +1340,47 @@ virHostCPUGetCPUIDFilterVolatile(struct kvm_cpuid2 *kvm_cpuid) entry->ebx &= 0x00ffffff; if (entry->function == 0x0b) entry->edx &= 0xffffff00; + + /* Match AMD hosts */ + if (entry->function == 0x00 && entry->index == 0x00 && + entry->ebx == 0x68747541 && /* Auth */ + entry->edx == 0x69746e65 && /* enti */ + entry->ecx == 0x444d4163) /* cAMD */ + isAMD = true; + + /* AMD APIC ID and topology information: + * + * Leaf 0x8000001e + * + * CPUID Fn8000_001E_EAX Extended APIC ID + * 31:0 ExtendedApicId: extended APIC ID. + * + * CPUID Fn8000_001E_EBX Compute Unit Identifiers + * 31:10 Reserved. + * 9:8 CoresPerComputeUnit: cores per compute unit. + * The number of cores per compute unit is CoresPerComputeUnit+1. + * 7:0 ComputeUnitId: compute unit ID. Identifies the processor compute unit ID. + * + * CPUID Fn8000_001E_ECX Node Identifiers + * 31:11 Reserved. + * 10:8 NodesPerProcessor. Specifies the number of nodes per processor. + * 000b 1 node per processor + * 001b 2 nodes per processor + * 111b-010b Reserved + * 7:0 NodeId. Specifies the node ID. + * + * CPUID Fn8000_001E_EDX Reserved + * 31:0 Reserved. + * + * For libvirt none of this information seems to be interesting, thus + * we clear all of it including reserved bits for future-proofing. + */ + if (isAMD && entry->function == 0x8000001e) { + entry->eax = 0x00; + entry->ebx = 0x00; + entry->ecx = 0x00; + entry->edx = 0x00; + } } } -- 2.35.1

On Mon, Apr 25, 2022 at 15:28:24 +0200, Peter Krempa wrote:
Bugs in the code meant to invalidate the qemu capabilities cache resulted in us always re-probing capabilities on startup of libvirtd/virtqemud. This got extremely annoying with the modern deployments using --timeout.
Apart from issues in inserting the detected features, it also turns out that AMD boxes report data which changes based on which CPU the instruction happened to execute, so we need to mask that out.
Peter Krempa (7): virCPUx86DataIsIdentical: Add debug output virCPUx86DataGetHost: Fix construction of the returned data virHostCPUGetCPUID: Add comment on how KVM_GET_SUPPORTED_CPUID works util: virhostcpu: Extract filtering of the returned data from virHostCPUGetCPUID virHostCPUGetCPUID: Fix possible allocation of huge amount of memory virHostCPUGetCPUID: Limit the buffer size ranges virHostCPUGetCPUIDFilterVolatile: Filter out topology data on AMD
src/cpu/cpu_x86.c | 31 ++++++++++--- src/util/virhostcpu.c | 104 ++++++++++++++++++++++++++++++++++++------ 2 files changed, 116 insertions(+), 19 deletions(-)
Wow, KVM_GET_SUPPORTED_CPUID has a very unfriendly semantics. Reviewed-by: Jiri Denemark <jdenemar@redhat.com>

On 4/25/22 15:28, Peter Krempa wrote:
Bugs in the code meant to invalidate the qemu capabilities cache resulted in us always re-probing capabilities on startup of libvirtd/virtqemud. This got extremely annoying with the modern deployments using --timeout.
Apart from issues in inserting the detected features, it also turns out that AMD boxes report data which changes based on which CPU the instruction happened to execute, so we need to mask that out.
Peter Krempa (7): virCPUx86DataIsIdentical: Add debug output virCPUx86DataGetHost: Fix construction of the returned data virHostCPUGetCPUID: Add comment on how KVM_GET_SUPPORTED_CPUID works util: virhostcpu: Extract filtering of the returned data from virHostCPUGetCPUID virHostCPUGetCPUID: Fix possible allocation of huge amount of memory virHostCPUGetCPUID: Limit the buffer size ranges virHostCPUGetCPUIDFilterVolatile: Filter out topology data on AMD
src/cpu/cpu_x86.c | 31 ++++++++++--- src/util/virhostcpu.c | 104 ++++++++++++++++++++++++++++++++++++------ 2 files changed, 116 insertions(+), 19 deletions(-)
Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Michal
participants (3)
-
Jiri Denemark
-
Michal Prívozník
-
Peter Krempa