[PATCH v2 0/4] qemu: Add support for free-page-reporting

gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79 The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs. This kernel feature allows for more stable and resource friendly systems. This feature is available in QEMU and is enabled with free-page-reporting=on default virt setting should be off but the user should be able to enable it. Nico Pache (4): Document and parser support for the Virtio free page reporting feature. QEMU: declare qemu capabilities for the Virtio Free page reporting feature QEMU: introduce Virtio free page reporting feature provide testing for free-page-reporting feature in QEMU docs/formatdomain.rst | 6 ++++ docs/schemas/domaincommon.rng | 5 +++ src/conf/domain_conf.c | 21 +++++++++++ src/conf/domain_conf.h | 1 + src/qemu/qemu_capabilities.c | 2 ++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 5 +++ src/qemu/qemu_validate.c | 7 ++++ .../caps_5.1.0.x86_64.xml | 1 + .../caps_5.2.0.x86_64.xml | 1 + ...-options-memballoon-freepage-reporting.err | 1 + ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ tests/qemuxml2argvtest.c | 2 ++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ 15 files changed, 133 insertions(+) create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml -- 2.18.4

This will add the proper documentation and parser support for the free page reporting feature that is introduced in QEMU 5.1. Signed-off-by: Nico Pache <npache@redhat.com> --- docs/formatdomain.rst | 6 ++++++ docs/schemas/domaincommon.rng | 5 +++++ src/conf/domain_conf.c | 21 +++++++++++++++++++++ src/conf/domain_conf.h | 1 + 4 files changed, 33 insertions(+) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index df5ac28028..e75b360b32 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -6755,6 +6755,12 @@ Example: manually added device with static PCI slot 2 requested release some memory at the last moment before a guest's process get killed by Out of Memory killer. :since:`Since 1.3.1, QEMU and KVM only` +``free-page-reporting`` + The optional ``free-page-reporting`` attribute allows to enable/disable + ("on"/"off", respectively) the ability of the QEMU virtio memory balloon to + return unused pages back to the hypervisor to be used by other guests or + processes. :since:`Since 5.1, QEMU and KVM only` + ``period`` The optional ``period`` allows the QEMU virtio memory balloon driver to provide statistics through the ``virsh dommemstat [domain]`` diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 7d4b105981..0a0f0ed8a8 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -4539,6 +4539,11 @@ <ref name="virOnOff"/> </attribute> </optional> + <optional> + <attribute name="free-page-reporting"> + <ref name="virOnOff"/> + </attribute> + </optional> <interleave> <optional> <ref name="alias"/> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 51efeb0e42..883f7081c6 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -15336,6 +15336,7 @@ virDomainMemballoonDefParseXML(virDomainXMLOptionPtr xmlopt, VIR_XPATH_NODE_AUTORESTORE(ctxt) unsigned int period = 0; g_autofree char *model = NULL; + g_autofree char *freepage_reporting = NULL; g_autofree char *deflate = NULL; def = g_new0(virDomainMemballoonDef, 1); @@ -15360,6 +15361,13 @@ virDomainMemballoonDefParseXML(virDomainXMLOptionPtr xmlopt, goto error; } + if ((freepage_reporting = virXMLPropString(node, "free-page-reporting")) && + (def->free_page_reporting = virTristateSwitchTypeFromString(freepage_reporting)) <= 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("invalid free-page-reporting attribute value '%s'"), freepage_reporting); + goto error; + } + ctxt->node = node; if (virXPathUInt("string(./stats/@period)", ctxt, &period) < -1) { virReportError(VIR_ERR_XML_ERROR, "%s", @@ -23538,6 +23546,15 @@ virDomainMemballoonDefCheckABIStability(virDomainMemballoonDefPtr src, return false; } + if (src->free_page_reporting != dst->free_page_reporting) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target balloon free-page-reporting attribute value " + "'%s' does not match source '%s'"), + virTristateSwitchTypeToString(dst->free_page_reporting), + virTristateSwitchTypeToString(src->free_page_reporting)); + return false; + } + if (src->virtio && dst->virtio && !virDomainVirtioOptionsCheckABIStability(src->virtio, dst->virtio)) return false; @@ -27644,6 +27661,10 @@ virDomainMemballoonDefFormat(virBufferPtr buf, virBufferAsprintf(&attrBuf, " autodeflate='%s'", virTristateSwitchTypeToString(def->autodeflate)); + if (def->free_page_reporting != VIR_TRISTATE_SWITCH_ABSENT) + virBufferAsprintf(&attrBuf, " free-page-reporting='%s'", + virTristateSwitchTypeToString(def->free_page_reporting)); + if (def->period) virBufferAsprintf(&childrenBuf, "<stats period='%i'/>\n", def->period); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 450686dfb5..902dd58112 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1790,6 +1790,7 @@ struct _virDomainMemballoonDef { virDomainDeviceInfo info; int period; /* seconds between collections */ int autodeflate; /* enum virTristateSwitch */ + int free_page_reporting; /* enum virTristateSwitch */ virDomainVirtioOptionsPtr virtio; }; -- 2.18.4

On 10/13/20 1:35 AM, Nico Pache wrote:
This will add the proper documentation and parser support for the free page reporting feature that is introduced in QEMU 5.1.
Signed-off-by: Nico Pache <npache@redhat.com> --- docs/formatdomain.rst | 6 ++++++ docs/schemas/domaincommon.rng | 5 +++++ src/conf/domain_conf.c | 21 +++++++++++++++++++++ src/conf/domain_conf.h | 1 + 4 files changed, 33 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index df5ac28028..e75b360b32 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -6755,6 +6755,12 @@ Example: manually added device with static PCI slot 2 requested release some memory at the last moment before a guest's process get killed by Out of Memory killer. :since:`Since 1.3.1, QEMU and KVM only`
+``free-page-reporting`` + The optional ``free-page-reporting`` attribute allows to enable/disable + ("on"/"off", respectively) the ability of the QEMU virtio memory balloon to + return unused pages back to the hypervisor to be used by other guests or + processes. :since:`Since 5.1, QEMU and KVM only`
This "Since" attribute should refer to Libvirt version and not qemu. The idea is that when I am reading libvirt docs, how to enable something in domain XML I want to know whether my libvirt supports it. We could also record QEMU version but in general we don't do that. Michal

On Mon, Oct 12, 2020 at 07:35:36PM -0400, Nico Pache wrote:
This will add the proper documentation and parser support for the free page reporting feature that is introduced in QEMU 5.1.
Signed-off-by: Nico Pache <npache@redhat.com> --- docs/formatdomain.rst | 6 ++++++ docs/schemas/domaincommon.rng | 5 +++++ src/conf/domain_conf.c | 21 +++++++++++++++++++++ src/conf/domain_conf.h | 1 + 4 files changed, 33 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index df5ac28028..e75b360b32 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -6755,6 +6755,12 @@ Example: manually added device with static PCI slot 2 requested release some memory at the last moment before a guest's process get killed by Out of Memory killer. :since:`Since 1.3.1, QEMU and KVM only`
+``free-page-reporting`` + The optional ``free-page-reporting`` attribute allows to enable/disable + ("on"/"off", respectively) the ability of the QEMU virtio memory balloon to + return unused pages back to the hypervisor to be used by other guests or + processes. :since:`Since 5.1, QEMU and KVM only`
This has already been merged, but using hyphens in XML attribute names is not common practice in libvirt. We usually use underscore, or initialCaps styles. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, Oct 14, 2020 at 12:26:57PM +0100, Daniel P. Berrangé wrote:
On Mon, Oct 12, 2020 at 07:35:36PM -0400, Nico Pache wrote:
This will add the proper documentation and parser support for the free page reporting feature that is introduced in QEMU 5.1.
Signed-off-by: Nico Pache <npache@redhat.com> --- docs/formatdomain.rst | 6 ++++++ docs/schemas/domaincommon.rng | 5 +++++ src/conf/domain_conf.c | 21 +++++++++++++++++++++ src/conf/domain_conf.h | 1 + 4 files changed, 33 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index df5ac28028..e75b360b32 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -6755,6 +6755,12 @@ Example: manually added device with static PCI slot 2 requested release some memory at the last moment before a guest's process get killed by Out of Memory killer. :since:`Since 1.3.1, QEMU and KVM only`
+``free-page-reporting`` + The optional ``free-page-reporting`` attribute allows to enable/disable + ("on"/"off", respectively) the ability of the QEMU virtio memory balloon to + return unused pages back to the hypervisor to be used by other guests or + processes. :since:`Since 5.1, QEMU and KVM only`
This has already been merged, but using hyphens in XML attribute names is not common practice in libvirt. We usually use underscore, or initialCaps styles.
I would say we can still change the naming as it was not released. Pavel

This patch will introduce the free-page-reporting feature capabilities that are in qemu 5.1 Signed-off-by: Nico Pache <npache@redhat.com> --- src/qemu/qemu_capabilities.c | 2 ++ src/qemu/qemu_capabilities.h | 1 + tests/qemucapabilitiesdata/caps_5.1.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_5.2.0.x86_64.xml | 1 + 4 files changed, 5 insertions(+) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 02136fd3b3..81d9ecd886 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -600,6 +600,7 @@ VIR_ENUM_IMPL(virQEMUCaps, /* 380 */ "usb-host.hostdevice", + "virtio-balloon.free-page-reporting", ); @@ -1341,6 +1342,7 @@ static struct virQEMUCapsDevicePropsFlags virQEMUCapsDevicePropsVirtioBalloon[] { "iommu_platform", QEMU_CAPS_VIRTIO_PCI_IOMMU_PLATFORM, NULL }, { "ats", QEMU_CAPS_VIRTIO_PCI_ATS, NULL }, { "packed", QEMU_CAPS_VIRTIO_PACKED_QUEUES, NULL }, + { "free-page-reporting", QEMU_CAPS_VIRTIO_BALLOON_FREE_PAGE_REPORTING, NULL }, }; diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 107056ba17..44c45589f0 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -580,6 +580,7 @@ typedef enum { /* virQEMUCapsFlags grouping marker for syntax-check */ /* 380 */ QEMU_CAPS_USB_HOST_HOSTDEVICE, /* -device usb-host.hostdevice */ + QEMU_CAPS_VIRTIO_BALLOON_FREE_PAGE_REPORTING, /*virtio balloon free-page-reporting */ QEMU_CAPS_LAST /* this must always be the last item */ } virQEMUCapsFlags; diff --git a/tests/qemucapabilitiesdata/caps_5.1.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_5.1.0.x86_64.xml index 987beb965e..9ebd7ea582 100644 --- a/tests/qemucapabilitiesdata/caps_5.1.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_5.1.0.x86_64.xml @@ -243,6 +243,7 @@ <flag name='numa.hmat'/> <flag name='blockdev-hostdev-scsi'/> <flag name='usb-host.hostdevice'/> + <flag name='virtio-balloon.free-page-reporting'/> <version>5001000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100242</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_5.2.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_5.2.0.x86_64.xml index 3ee678ef8f..52b6a47004 100644 --- a/tests/qemucapabilitiesdata/caps_5.2.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_5.2.0.x86_64.xml @@ -243,6 +243,7 @@ <flag name='numa.hmat'/> <flag name='blockdev-hostdev-scsi'/> <flag name='usb-host.hostdevice'/> + <flag name='virtio-balloon.free-page-reporting'/> <version>5001050</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100243</microcodeVersion> -- 2.18.4

This patch enables the free-page-reporting in qemu. Signed-off-by: Nico Pache <npache@redhat.com> --- src/qemu/qemu_command.c | 5 +++++ src/qemu/qemu_validate.c | 7 +++++++ 2 files changed, 12 insertions(+) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9519861e92..6466efc756 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3830,6 +3830,11 @@ qemuBuildMemballoonCommandLine(virCommandPtr cmd, virTristateSwitchTypeToString(def->memballoon->autodeflate)); } + if (def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ABSENT) { + virBufferAsprintf(&buf, ",free-page-reporting=%s", + virTristateSwitchTypeToString(def->memballoon->free_page_reporting)); + } + qemuBuildVirtioOptionsStr(&buf, def->memballoon->virtio); if (qemuCommandAddExtDevice(cmd, &def->memballoon->info) < 0) diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c index a212605579..cfeb1e67c4 100644 --- a/src/qemu/qemu_validate.c +++ b/src/qemu/qemu_validate.c @@ -3926,6 +3926,13 @@ qemuValidateDomainDeviceDefMemballoon(const virDomainMemballoonDef *memballoon, return -1; } + if (memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ABSENT && + !virQEMUCapsGet(qemuCaps, QEMU_CAPS_VIRTIO_BALLOON_FREE_PAGE_REPORTING)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("free-page-reporting is not supported by this QEMU binary")); + return -1; + } + if (qemuValidateDomainVirtioOptions(memballoon->virtio, qemuCaps) < 0) return -1; -- 2.18.4

This provides basic testing for the free-page-reporting feature that is introduced in qemu 5.1. Signed-off-by: Nico Pache <npache@redhat.com> --- ...-options-memballoon-freepage-reporting.err | 1 + ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ tests/qemuxml2argvtest.c | 2 ++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ 5 files changed, 83 insertions(+) create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml diff --git a/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err b/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err new file mode 100644 index 0000000000..de06de0a85 --- /dev/null +++ b/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err @@ -0,0 +1 @@ +unsupported configuration: free-page-reporting is not supported by this QEMU binary diff --git a/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args b/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args new file mode 100644 index 0000000000..6ebc45d931 --- /dev/null +++ b/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args @@ -0,0 +1,36 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-QEMUGuest1 \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-QEMUGuest1/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-QEMUGuest1/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-x86_64 \ +-name guest=QEMUGuest1,debug-threads=on \ +-S \ +-object secret,id=masterKey0,format=raw,\ +file=/tmp/lib/domain--1-QEMUGuest1/master-key.aes \ +-machine pc,accel=tcg,usb=off,dump-guest-core=off,memory-backend=pc.ram \ +-cpu qemu64 \ +-m 214 \ +-object memory-backend-ram,id=pc.ram,size=224395264 \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-boot strict=on \ +-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x2,\ +free-page-reporting=on \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,\ +resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml b/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml new file mode 100644 index 0000000000..ff7549a789 --- /dev/null +++ b/tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml @@ -0,0 +1,22 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio' free-page-reporting='on'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 49567326d4..8aa791d9f7 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -3195,6 +3195,7 @@ mymain(void) DO_TEST_CAPS_LATEST("virtio-options-fs-packed"); DO_TEST_CAPS_LATEST("virtio-options-input-packed"); DO_TEST_CAPS_LATEST("virtio-options-memballoon-packed"); + DO_TEST_CAPS_LATEST("virtio-options-memballoon-freepage-reporting"); DO_TEST_CAPS_LATEST("virtio-options-net-packed"); DO_TEST_CAPS_LATEST("virtio-options-rng-packed"); DO_TEST_CAPS_LATEST("virtio-options-video-packed"); @@ -3232,6 +3233,7 @@ mymain(void) DO_TEST_PARSE_ERROR("virtio-options-input-packed", QEMU_CAPS_VIRTIO_MOUSE, QEMU_CAPS_VIRTIO_KEYBOARD); DO_TEST_PARSE_ERROR("virtio-options-memballoon-packed", NONE); + DO_TEST_PARSE_ERROR("virtio-options-memballoon-freepage-reporting", NONE); DO_TEST_PARSE_ERROR("virtio-options-net-packed", NONE); DO_TEST_PARSE_ERROR("virtio-options-rng-packed", QEMU_CAPS_DEVICE_VIRTIO_RNG, QEMU_CAPS_OBJECT_RNG_RANDOM); diff --git a/tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml b/tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml new file mode 100644 index 0000000000..ff7549a789 --- /dev/null +++ b/tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml @@ -0,0 +1,22 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio' free-page-reporting='on'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </memballoon> + </devices> +</domain> -- 2.18.4

On 10/12/20 8:35 PM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
This kernel feature allows for more stable and resource friendly systems.
This feature is available in QEMU and is enabled with free-page-reporting=on default virt setting should be off but the user should be able to enable it.
Nico Pache (4): Document and parser support for the Virtio free page reporting feature. QEMU: declare qemu capabilities for the Virtio Free page reporting feature QEMU: introduce Virtio free page reporting feature provide testing for free-page-reporting feature in QEMU
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
docs/formatdomain.rst | 6 ++++ docs/schemas/domaincommon.rng | 5 +++ src/conf/domain_conf.c | 21 +++++++++++ src/conf/domain_conf.h | 1 + src/qemu/qemu_capabilities.c | 2 ++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 5 +++ src/qemu/qemu_validate.c | 7 ++++ .../caps_5.1.0.x86_64.xml | 1 + .../caps_5.2.0.x86_64.xml | 1 + ...-options-memballoon-freepage-reporting.err | 1 + ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ tests/qemuxml2argvtest.c | 2 ++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ 15 files changed, 133 insertions(+) create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml

On 10/13/20 1:35 AM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
This kernel feature allows for more stable and resource friendly systems.
This feature is available in QEMU and is enabled with free-page-reporting=on default virt setting should be off but the user should be able to enable it.
Nico Pache (4): Document and parser support for the Virtio free page reporting feature. QEMU: declare qemu capabilities for the Virtio Free page reporting feature QEMU: introduce Virtio free page reporting feature provide testing for free-page-reporting feature in QEMU
docs/formatdomain.rst | 6 ++++ docs/schemas/domaincommon.rng | 5 +++ src/conf/domain_conf.c | 21 +++++++++++ src/conf/domain_conf.h | 1 + src/qemu/qemu_capabilities.c | 2 ++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 5 +++ src/qemu/qemu_validate.c | 7 ++++ .../caps_5.1.0.x86_64.xml | 1 + .../caps_5.2.0.x86_64.xml | 1 + ...-options-memballoon-freepage-reporting.err | 1 + ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ tests/qemuxml2argvtest.c | 2 ++ ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++ 15 files changed, 133 insertions(+) create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml
Couple of white space problems, but nothing I couldn't fix before pushing. Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Congratulations on your first libvirt contribution! Michal

On 10/13/20 5:10 PM, Michal Privoznik wrote:
On 10/13/20 1:35 AM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
Is this something that we might want to report? I mean, we have 'virsh dommemstat $dom' which under the hood calls: {"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"} Michal

I don't believe so, but David might have some more insight into that question. Once the feature is enabled the guest will report pages on its freelist to the hypervisor, and the host/hypervisor will then release them. I'm not exactly sure what information we can or would want to 'stat' out of this process... A counter for number of pages returned? Cheers, Nico On Tue, Oct 13, 2020, 12:47 PM Michal Privoznik <mprivozn@redhat.com> wrote:
On 10/13/20 5:10 PM, Michal Privoznik wrote:
On 10/13/20 1:35 AM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
Is this something that we might want to report? I mean, we have 'virsh dommemstat $dom' which under the hood calls:
{"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}
Michal

On 13.10.20 19:00, Nico Pache wrote:
I don't believe so, but David might have some more insight into that question.
Once the feature is enabled the guest will report pages on its freelist to the hypervisor, and the host/hypervisor will then release them.
We report (some, not all) free memory of our guest to the hypervisor. So whatever the guest reported shows up in "stat-free-memory" in the stats already. The hypervisor will "discard" backing storage of reported free memory, resulting in your hypervisor having more free memory available (and the QEMU process consuming less memory).
I'm not exactly sure what information we can or would want to 'stat' out of this process... A counter for number of pages returned?
As the guest can reuse memory any time without coordination with the hypervisor that wouldn't be of too much help. A counter would only tell you if free page reporting was once performed successfully. You would really have to compare the QEMU process memory footprint with the guest stats to figure out if free page reporting is doing what it's supposed to do. -- Thanks, David / dhildenb

On Tue, Oct 13, 2020 at 18:47:39 +0200, Michal Privoznik wrote:
On 10/13/20 5:10 PM, Michal Privoznik wrote:
On 10/13/20 1:35 AM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
Is this something that we might want to report? I mean, we have 'virsh dommemstat $dom' which under the hood calls:
{"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}
As I've pointed out in earlier review, I think that the feature name is a bit misleading. It sounds like a statistic, not something that actually returns memory to the host. The docs are now better but still leave a lot of room for imagination. IMO, if the feature is mainly for returning memory to the host, the 'reporting' word should not have been used.

On 10/13/20 7:57 PM, Peter Krempa wrote:
On Tue, Oct 13, 2020 at 18:47:39 +0200, Michal Privoznik wrote:
On 10/13/20 5:10 PM, Michal Privoznik wrote:
On 10/13/20 1:35 AM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
Is this something that we might want to report? I mean, we have 'virsh dommemstat $dom' which under the hood calls:
{"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}
As I've pointed out in earlier review, I think that the feature name is a bit misleading. It sounds like a statistic, not something that actually returns memory to the host. The docs are now better but still leave a lot of room for imagination.
IMO, if the feature is mainly for returning memory to the host, the 'reporting' word should not have been used.
Oh sorry I missed that. I a penance I will post a cleanup patch. How does "free-pages" or "return-pages" or even "discard-pages" sound? Michal

IMO "return-pages" sounds the best out of those and stays relatively consistent with the kernel and qemu terminology for this feature. I personally don't see a huge problem with the current name, but I've also been staring at the words "free page reporting" for too long. Cheers! -- Nico On Wed, Oct 14, 2020 at 1:56 AM Michal Privoznik <mprivozn@redhat.com> wrote:
On 10/13/20 7:57 PM, Peter Krempa wrote:
On Tue, Oct 13, 2020 at 18:47:39 +0200, Michal Privoznik wrote:
On 10/13/20 5:10 PM, Michal Privoznik wrote:
On 10/13/20 1:35 AM, Nico Pache wrote:
gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
The virtio-balloon device now has the ability to report free pages back to the hypervisor for reuse by other programs.
Is this something that we might want to report? I mean, we have 'virsh dommemstat $dom' which under the hood calls:
{"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}
As I've pointed out in earlier review, I think that the feature name is a bit misleading. It sounds like a statistic, not something that actually returns memory to the host. The docs are now better but still leave a lot of room for imagination.
IMO, if the feature is mainly for returning memory to the host, the 'reporting' word should not have been used.
Oh sorry I missed that. I a penance I will post a cleanup patch. How does "free-pages" or "return-pages" or even "discard-pages" sound?
Michal

On 10/14/20 8:07 AM, Nico Pache wrote:
IMO "return-pages" sounds the best out of those and stays relatively consistent with the kernel and qemu terminology for this feature.
I personally don't see a huge problem with the current name, but I've also been staring at the words "free page reporting" for too long.
Right, I can see both reasons. But now that Peter raised it (again, sorry) at libvirt level reporting usually means "to report something to user". QEMU <-> KVM communication is too low level and since this is not being reported to user directly (even though it affects "stat-free-memory" attribute of the balloon) reporting does sound a bit weird. Let's wait for Peter's opinion. Michal

On 14.10.20 08:30, Michal Privoznik wrote:
On 10/14/20 8:07 AM, Nico Pache wrote:
IMO "return-pages" sounds the best out of those and stays relatively consistent with the kernel and qemu terminology for this feature.
I personally don't see a huge problem with the current name, but I've also been staring at the words "free page reporting" for too long.
Right, I can see both reasons. But now that Peter raised it (again, sorry) at libvirt level reporting usually means "to report something to user". QEMU <-> KVM communication is too low level and since this is not being reported to user directly (even though it affects "stat-free-memory" attribute of the balloon) reporting does sound a bit weird. Let's wait for Peter's opinion.
We originally wanted to use "free page hinting", but as that name is already taken for another virtio-balloon feature to improve live migration speed (one time hint of free pages just before migration), we used "reporting" instead. On a QEMU level, this feature name makes perfect sense. The guest (continuously) reports free pages to the hypervisor. The hypervisor tries to use the reports to free up some memory (which might not always be possible). Now, all I can say is that 1. "free page reporting" is the official name of the feature. Using a different name in libvirt will crate more confusion than it might actually help. Just saying. 2. The proposed alternatives ""free-pages" or "return-pages" or even "discard-pages" don't match what's actually going on. -- Thanks, David / dhildenb

On 10/14/20 10:26 AM, David Hildenbrand wrote:
On 14.10.20 08:30, Michal Privoznik wrote:
Sorry for hijacking this thread, but I need to report it somewhere (what is the best place?) When I want to start a guest with this feature turned on + memfd + hugepages, the host kernel prints this warning into dmesg and hugepages stop working from then on (meaning, even if the pool of allocated HPs is large enough I can't start any guest with HPs): [ 139.434748] ------------[ cut here ]------------ [ 139.434754] WARNING: CPU: 2 PID: 6280 at mm/page_counter.c:57 page_counter_uncharge+0x33/0x40 [ 139.434754] Modules linked in: kvm_amd amdgpu kvm btusb btrtl btbcm btintel sp5100_tco watchdog k10temp mfd_core gpu_sched ttm [ 139.434759] CPU: 2 PID: 6280 Comm: CPU 1/KVM Not tainted 5.8.13-gentoo-x86_64 #2 [ 139.434759] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1005 08/01/2019 [ 139.434760] RIP: 0010:page_counter_uncharge+0x33/0x40 [ 139.434762] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 [ 139.434762] RSP: 0018:ffffc9000355fb38 EFLAGS: 00010286 [ 139.434763] RAX: fffffffffffb4000 RBX: ffff888fc267e900 RCX: fffffffffffb4000 [ 139.434763] RDX: 0000000000000402 RSI: fffffffffffb4000 RDI: ffff888fd8411dd0 [ 139.434764] RBP: ffff888fcba983c0 R08: 0000000000080400 R09: fffffffffff7fc00 [ 139.434764] R10: ffffc9000355fb40 R11: 000000000000000a R12: 0000000000000001 [ 139.434765] R13: ffff888fc3d89140 R14: 00000000000001b2 R15: 00000000000001b1 [ 139.434765] FS: 00007fc9d4c35700(0000) GS:ffff888fde880000(0000) knlGS:0000000000000000 [ 139.434766] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 139.434766] CR2: 00007f09a4003000 CR3: 0000000fc06fe000 CR4: 0000000000340ee0 [ 139.434767] Call Trace: [ 139.434769] hugetlb_cgroup_uncharge_file_region+0x46/0x70 [ 139.434772] region_del+0x1a0/0x260 [ 139.434773] hugetlb_unreserve_pages+0x32/0xa0 [ 139.434775] remove_inode_hugepages+0x19d/0x3a0 [ 139.434776] hugetlbfs_fallocate+0x3f2/0x4a0 [ 139.434778] ? __seccomp_filter+0x75/0x6a0 [ 139.434779] vfs_fallocate+0x124/0x260 [ 139.434780] __x64_sys_fallocate+0x39/0x60 [ 139.434783] do_syscall_64+0x38/0x60 [ 139.434784] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 139.434785] RIP: 0033:0x7fc9e0994de7 [ 139.434786] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44 [ 139.434787] RSP: 002b:00007fc9d4c337a0 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 139.434787] RAX: ffffffffffffffda RBX: 0000000036400000 RCX: 00007fc9e0994de7 [ 139.434788] RDX: 0000000036200000 RSI: 0000000000000003 RDI: 000000000000001d [ 139.434788] RBP: 00007fc9d4c33800 R08: 0000000000000000 R09: 0000000000000000 [ 139.434789] R10: 0000000000200000 R11: 0000000000000293 R12: 00007fff9a75c3fe [ 139.434789] R13: 00007fff9a75c3ff R14: 00007fc9d4c35700 R15: 00007fc9d4c33dc0 [ 139.434790] ---[ end trace fb9808303959fc01 ]--- Is this known problem? Michal

On 14.10.20 13:53, Michal Privoznik wrote:
On 10/14/20 10:26 AM, David Hildenbrand wrote:
On 14.10.20 08:30, Michal Privoznik wrote:
Sorry for hijacking this thread, but I need to report it somewhere (what is the best place?)
When I want to start a guest with this feature turned on + memfd + hugepages, the host kernel prints this warning into dmesg and hugepages stop working from then on (meaning, even if the pool of allocated HPs is large enough I can't start any guest with HPs):
[ 139.434748] ------------[ cut here ]------------ [ 139.434754] WARNING: CPU: 2 PID: 6280 at mm/page_counter.c:57 page_counter_uncharge+0x33/0x40 [ 139.434754] Modules linked in: kvm_amd amdgpu kvm btusb btrtl btbcm btintel sp5100_tco watchdog k10temp mfd_core gpu_sched ttm [ 139.434759] CPU: 2 PID: 6280 Comm: CPU 1/KVM Not tainted 5.8.13-gentoo-x86_64 #2 [ 139.434759] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1005 08/01/2019 [ 139.434760] RIP: 0010:page_counter_uncharge+0x33/0x40 [ 139.434762] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 [ 139.434762] RSP: 0018:ffffc9000355fb38 EFLAGS: 00010286 [ 139.434763] RAX: fffffffffffb4000 RBX: ffff888fc267e900 RCX: fffffffffffb4000 [ 139.434763] RDX: 0000000000000402 RSI: fffffffffffb4000 RDI: ffff888fd8411dd0 [ 139.434764] RBP: ffff888fcba983c0 R08: 0000000000080400 R09: fffffffffff7fc00 [ 139.434764] R10: ffffc9000355fb40 R11: 000000000000000a R12: 0000000000000001 [ 139.434765] R13: ffff888fc3d89140 R14: 00000000000001b2 R15: 00000000000001b1 [ 139.434765] FS: 00007fc9d4c35700(0000) GS:ffff888fde880000(0000) knlGS:0000000000000000 [ 139.434766] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 139.434766] CR2: 00007f09a4003000 CR3: 0000000fc06fe000 CR4: 0000000000340ee0 [ 139.434767] Call Trace: [ 139.434769] hugetlb_cgroup_uncharge_file_region+0x46/0x70 [ 139.434772] region_del+0x1a0/0x260 [ 139.434773] hugetlb_unreserve_pages+0x32/0xa0 [ 139.434775] remove_inode_hugepages+0x19d/0x3a0 [ 139.434776] hugetlbfs_fallocate+0x3f2/0x4a0 [ 139.434778] ? __seccomp_filter+0x75/0x6a0 [ 139.434779] vfs_fallocate+0x124/0x260 [ 139.434780] __x64_sys_fallocate+0x39/0x60 [ 139.434783] do_syscall_64+0x38/0x60 [ 139.434784] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 139.434785] RIP: 0033:0x7fc9e0994de7 [ 139.434786] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44 [ 139.434787] RSP: 002b:00007fc9d4c337a0 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 139.434787] RAX: ffffffffffffffda RBX: 0000000036400000 RCX: 00007fc9e0994de7 [ 139.434788] RDX: 0000000036200000 RSI: 0000000000000003 RDI: 000000000000001d [ 139.434788] RBP: 00007fc9d4c33800 R08: 0000000000000000 R09: 0000000000000000 [ 139.434789] R10: 0000000000200000 R11: 0000000000000293 R12: 00007fff9a75c3fe [ 139.434789] R13: 00007fff9a75c3ff R14: 00007fc9d4c35700 R15: 00007fc9d4c33dc0 [ 139.434790] ---[ end trace fb9808303959fc01 ]---
Is this known problem?
No, not at all. Thanks for reporting! And the "bad" thing is, that QEMU doesn't do anything too fancy. All it does is "fallocate(FALLOC_FL_PUNCH_HOLE)" on hugetlbfs when trying to zap reported pages. The same mechanism is also used for postcopy live migration and virtio-mem with hugetlbfs. Which kernel are you running? 1. Is it an upstream kernel, lkml + -mm lists are the right place (please cc me, or I can try to reproduce and report it). 2. Is it a distro kernel? Then create a BUG there. I was just recently testing virtio-mem with hugetlbfs and it worked on decent upstream Fedora. But maybe I was not able to trigger it.
Michal
-- Thanks, David / dhildenb

On 10/14/20 2:06 PM, David Hildenbrand wrote:
On 14.10.20 13:53, Michal Privoznik wrote:
On 10/14/20 10:26 AM, David Hildenbrand wrote:
On 14.10.20 08:30, Michal Privoznik wrote:
<snip/>
No, not at all. Thanks for reporting!
And the "bad" thing is, that QEMU doesn't do anything too fancy. All it does is "fallocate(FALLOC_FL_PUNCH_HOLE)" on hugetlbfs when trying to zap reported pages. The same mechanism is also used for postcopy live migration and virtio-mem with hugetlbfs.
Which kernel are you running?
1. Is it an upstream kernel, lkml + -mm lists are the right place (please cc me, or I can try to reproduce and report it).
2. Is it a distro kernel? Then create a BUG there.
I was just recently testing virtio-mem with hugetlbfs and it worked on decent upstream Fedora. But maybe I was not able to trigger it.
Okay, I've upgraded to 5.9.0-gentoo, but the problem persists. Gentoo puts only a very few patches on top of vanilla kernel neither of which touches that area of the code: https://dev.gentoo.org/~mpagano/genpatches/trunk/5.9/ So I think this is reproducible on vanilla too. BTW: Have you tried placing the qemu inside v1 cgroups? Libvirt does that so maybe that's the problem. Anyway, here's the cmd line: /home/zippy/work/qemu/qemu.git/build/qemu-system-x86_64 \ -name guest=fedora,debug-threads=on \ -S \ -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-fedora/master-key.aes \ -machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \ -cpu host,migratable=on \ -m 4096 \ -object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,prealloc=yes,size=4294967296,host-nodes=0,policy=bind \ -overcommit mem-lock=off \ -smp 4,sockets=1,dies=1,cores=2,threads=2 \ -object iothread,id=iothread1 \ -object iothread,id=iothread2 \ -object iothread,id=iothread3 \ -object iothread,id=iothread4 \ -uuid 63840878-0deb-4095-97e6-fc444d9bc9fa \ -no-user-config \ -nodefaults \ -device sga \ -chardev socket,id=charmonitor,fd=33,server,nowait \ -mon chardev=charmonitor,id=monitor,mode=control \ -rtc base=utc \ -no-shutdown \ -global PIIX4_PM.disable_s3=0 \ -global PIIX4_PM.disable_s4=0 \ -boot menu=on,strict=on \ -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \ -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \ -blockdev '{"driver":"file","filename":"/var/lib/libvirt/images/fedora.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \ -blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \ -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \ -netdev tap,fd=35,id=hostnet0 \ -device virtio-net-pci,host_mtu=9000,netdev=hostnet0,id=net0,mac=52:54:00:a4:6f:91,bus=pci.0,addr=0x3 \ -chardev pty,id=charserial0 \ -device isa-serial,chardev=charserial0,id=serial0 \ -chardev socket,id=charchannel0,fd=36,server,nowait \ -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \ -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on \ -device virtio-vga,id=video0,virgl=on,max_outputs=1,bus=pci.0,addr=0x2 \ -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on \ -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ -msg timestamp=on Michal

On 14.10.20 15:46, Michal Privoznik wrote:
On 10/14/20 2:06 PM, David Hildenbrand wrote:
On 14.10.20 13:53, Michal Privoznik wrote:
On 10/14/20 10:26 AM, David Hildenbrand wrote:
On 14.10.20 08:30, Michal Privoznik wrote:
<snip/>
No, not at all. Thanks for reporting!
And the "bad" thing is, that QEMU doesn't do anything too fancy. All it does is "fallocate(FALLOC_FL_PUNCH_HOLE)" on hugetlbfs when trying to zap reported pages. The same mechanism is also used for postcopy live migration and virtio-mem with hugetlbfs.
Which kernel are you running?
1. Is it an upstream kernel, lkml + -mm lists are the right place (please cc me, or I can try to reproduce and report it).
2. Is it a distro kernel? Then create a BUG there.
I was just recently testing virtio-mem with hugetlbfs and it worked on decent upstream Fedora. But maybe I was not able to trigger it.
Okay, I've upgraded to 5.9.0-gentoo, but the problem persists. Gentoo puts only a very few patches on top of vanilla kernel neither of which touches that area of the code:
https://dev.gentoo.org/~mpagano/genpatches/trunk/5.9/
So I think this is reproducible on vanilla too.
BTW: Have you tried placing the qemu inside v1 cgroups? Libvirt does that so maybe that's the problem. Anyway, here's the cmd line:
/home/zippy/work/qemu/qemu.git/build/qemu-system-x86_64 \ -name guest=fedora,debug-threads=on \ -S \ -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-fedora/master-key.aes \ -machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \ -cpu host,migratable=on \ -m 4096 \ -object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,prealloc=yes,size=4294967296,host-nodes=0,policy=bind \ -overcommit mem-lock=off \ -smp 4,sockets=1,dies=1,cores=2,threads=2 \ -object iothread,id=iothread1 \ -object iothread,id=iothread2 \ -object iothread,id=iothread3 \ -object iothread,id=iothread4 \ -uuid 63840878-0deb-4095-97e6-fc444d9bc9fa \ -no-user-config \ -nodefaults \ -device sga \ -chardev socket,id=charmonitor,fd=33,server,nowait \ -mon chardev=charmonitor,id=monitor,mode=control \ -rtc base=utc \ -no-shutdown \ -global PIIX4_PM.disable_s3=0 \ -global PIIX4_PM.disable_s4=0 \ -boot menu=on,strict=on \ -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \ -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \ -blockdev '{"driver":"file","filename":"/var/lib/libvirt/images/fedora.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \ -blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \ -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \ -netdev tap,fd=35,id=hostnet0 \ -device virtio-net-pci,host_mtu=9000,netdev=hostnet0,id=net0,mac=52:54:00:a4:6f:91,bus=pci.0,addr=0x3 \ -chardev pty,id=charserial0 \ -device isa-serial,chardev=charserial0,id=serial0 \ -chardev socket,id=charchannel0,fd=36,server,nowait \ -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \ -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on \ -device virtio-vga,id=video0,virgl=on,max_outputs=1,bus=pci.0,addr=0x2 \ -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on \ -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ -msg timestamp=on
Thanks! Reproduced easily on Fedora 32 (5.7.16-200.fc32.x86_64). [ 70.641802] CPU: 3 PID: 2178 Comm: qemu-system-x86 Not tainted 5.7.16-200.fc32.x86_64 #1 [ 70.641802] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020 [ 70.641803] RIP: 0010:page_counter_uncharge+0x4b/0x50 [ 70.641804] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 20 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 0f 1f 44 00 00 48 8b 17 48 39 d6 72 41 41 54 49 89 [ 70.641804] RSP: 0018:ffffb4044139bb18 EFLAGS: 00010286 [ 70.641805] RAX: fffffffffff94600 RBX: fffffffffff94600 RCX: ffff8da63d007000 [ 70.641805] RDX: 000000000000046e RSI: fffffffffff94600 RDI: ffff8da678412e28 [ 70.641806] RBP: ffff8da678412e28 R08: ffff8da678412e28 R09: ffff8da63d0078c0 [ 70.641806] R10: ffff8da634173000 R11: 0000000000000007 R12: 000000000008dc00 [ 70.641806] R13: fffffffffff72400 R14: ffff8da63d007000 R15: 0000000000000391 [ 70.641807] FS: 00007fe7ab5fe700(0000) GS:ffff8da67eac0000(0000) knlGS:0000000000000000 [ 70.641808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 70.641808] CR2: 000055568796a468 CR3: 0000000fe9860000 CR4: 0000000000340ee0 [ 70.641808] Call Trace: [ 70.641813] hugetlb_cgroup_uncharge_file_region+0x4b/0x80 [ 70.641815] region_del+0x1d3/0x300 [ 70.641816] hugetlb_unreserve_pages+0x39/0xb0 [ 70.641818] remove_inode_hugepages+0x1a8/0x3d0 [ 70.641831] ? kvm_mmu_notifier_invalidate_range+0x38/0x60 [kvm] [ 70.641832] ? tlb_finish_mmu+0x7a/0x1d0 [ 70.641833] hugetlbfs_fallocate+0x3ac/0x5e0 [ 70.641835] ? avc_has_perm+0x3b/0x160 [ 70.641836] ? file_has_perm+0xa2/0xb0 [ 70.641837] ? selinux_inode_follow_link+0x4c/0xb0 [ 70.641838] ? selinux_file_permission+0x4e/0x120 [ 70.641839] ? security_file_permission+0x2e/0x160 [ 70.641840] vfs_fallocate+0x146/0x280 [ 70.641841] __x64_sys_fallocate+0x3e/0x70 [ 70.641843] do_syscall_64+0x5b/0xf0 [ 70.641846] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Note: prealloc=yes is a bad choice in this environment. It contradicts memory overcommit - what we want to optimize with free page reporting. You allocate all VM memory to throw it away once the guest is up. Is there an option to turn this of with hugetlbfs? I hope so. I'll try reproducing upstream and send a BUG report upstream, ccing you. Thanks!
Michal
-- Thanks, David / dhildenb

On 10/14/20 4:07 PM, David Hildenbrand wrote:
Note: prealloc=yes is a bad choice in this environment. It contradicts memory overcommit - what we want to optimize with free page reporting. You allocate all VM memory to throw it away once the guest is up. Is there an option to turn this of with hugetlbfs? I hope so.
Is this something that should be addressed on qemu side? Since I'm writing a follow up patches for libvirt I can forbid this combination in libvirt.
I'll try reproducing upstream and send a BUG report upstream, ccing you. Thanks!
Thank you very much! Michal

On 14.10.20 16:15, Michal Privoznik wrote:
On 10/14/20 4:07 PM, David Hildenbrand wrote:
Note: prealloc=yes is a bad choice in this environment. It contradicts memory overcommit - what we want to optimize with free page reporting. You allocate all VM memory to throw it away once the guest is up. Is there an option to turn this of with hugetlbfs? I hope so.
Is this something that should be addressed on qemu side? Since I'm writing a follow up patches for libvirt I can forbid this combination in libvirt.
You mean, bailing out in QEMU if "prealloc=yes" is used along free page reporting? Hmm, not sure if that's the right approach ... we would also have to check for any DIMMs we hotplug and things like that. It's definitely a setup issue to me (where we could try to warn the user, but QEMU warnings tend to get ignored by everybody ...). It's also going to be "not what you want" once we have virtio-mem support in libvirt. (https://virtio-mem.gitlab.io/) But there, it's easier to check+warn. You only want "prealloc=no" for the memory device assigned to a virtio-mem device, not for all system RAM. Thanks! -- Thanks, David / dhildenb
participants (7)
-
Daniel Henrique Barboza
-
Daniel P. Berrangé
-
David Hildenbrand
-
Michal Privoznik
-
Nico Pache
-
Pavel Hrdina
-
Peter Krempa