[libvirt] [PATCH v2] qemu: Add option to enable/disable IOEventFD feature

This feature allows QEMU to achieve higher throughput, but is available
only in recent versions. It is accessible via the ioeventfd attribute,
which accepts the values 'on' and 'off'. Only experienced users need to
set this, because QEMU defaults to 'on', meaning higher performance.
Translates into the virtio-{blk|net}-pci.ioeventfd option.
---
diff to v1:
- rebase to current HEAD

 docs/formatdomain.html.in    | 15 ++++++++++++-
 docs/schemas/domain.rng      | 14 +++++++++++++
 src/conf/domain_conf.c       | 44 +++++++++++++++++++++++++++++++++++++++++-
 src/conf/domain_conf.h       | 12 +++++++++++
 src/libvirt_private.syms     |  2 +
 src/qemu/qemu_capabilities.c |  2 +
 src/qemu/qemu_capabilities.h |  1 +
 src/qemu/qemu_command.c      | 23 +++++++++++++++++++++
 tests/qemuhelptest.c         |  3 +-
 9 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 989dcf6..a4d7cbd 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -767,7 +767,7 @@
     </disk>
     ...
     <disk type='network'>
-      <driver name="qemu" type="raw" io="threads"/>
+      <driver name="qemu" type="raw" io="threads" ioeventfd="on"/>
       <source protocol="sheepdog" name="image_name">
         <host name="hostname" port="7000"/>
       </source>
@@ -851,6 +851,11 @@
         policies on I/O; qemu guests support "threads" and "native".
         <span class="since">Since 0.8.8</span>
       </li>
+      <li>
+        The optional <code>ioeventfd</code> attribute enables or disables
+        IOEventFD feature for virtqueue notify. The value can be either
+        'on' or 'off'.
+        <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
     </ul>
   </dd>
   <dt><code>boot</code></dt>
@@ -1585,7 +1590,7 @@ qemu-kvm -net nic,model=? /dev/null
       <source network='default'/>
       <target dev='vnet1'/>
       <model type='virtio'/>
-      <b><driver name='vhost' txmode='iothread'/></b>
+      <b><driver name='vhost' txmode='iothread' ioeventfd='on'/></b>
     </interface>
   </devices>
   ...</pre>
@@ -1636,6 +1641,12 @@ qemu-kvm -net nic,model=? /dev/null
     <b>In general you should leave this option alone, unless you
     are very certain you know what you are doing.</b>
   </dd>
+  <dt><code>ioeventfd</code></dt>
+  <dd>
+    This optional attribute allows users to enable or disable IOeventFD
+    feature for virtqueue notify. The value can be either 'on' or 'off'.
+    <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
+  </dd>
 </dl>

 <h5><a name="elementsNICSTargetOverride">Overriding the target element</a></h5>
diff --git a/docs/schemas/domain.rng b/docs/schemas/domain.rng
index 7163c6e..c3e8e15 100644
--- a/docs/schemas/domain.rng
+++ b/docs/schemas/domain.rng
@@ -735,6 +735,9 @@
       <optional>
         <ref name="driverIO"/>
       </optional>
+      <optional>
+        <ref name="ioeventfd"/>
+      </optional>
       <empty/>
     </element>
   </define>
@@ -774,6 +777,14 @@
       </choice>
     </attribute>
   </define>
+  <define name="ioeventfd">
+    <attribute name="ioeventfd">
+      <choice>
+        <value>on</value>
+        <value>off</value>
+      </choice>
+    </attribute>
+  </define>
   <define name="controller">
     <element name="controller">
       <choice>
@@ -1074,6 +1085,9 @@
             </choice>
           </attribute>
         </optional>
+        <optional>
+          <ref name="ioeventfd"/>
+        </optional>
         <empty/>
       </element>
     </optional>
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 498438a..2cb8d66 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -162,6 +162,11 @@ VIR_ENUM_IMPL(virDomainDiskIo, VIR_DOMAIN_DISK_IO_LAST,
               "native",
               "threads")

+VIR_ENUM_IMPL(virDomainVirtioEventFd, VIR_DOMAIN_VIRTIO_EVENT_FD_LAST,
+              "default",
+              "on",
+              "off")
+
 VIR_ENUM_IMPL(virDomainController, VIR_DOMAIN_CONTROLLER_TYPE_LAST,
               "ide",
               "fdc",
@@ -1901,6 +1906,7 @@ virDomainDiskDefParseXML(virCapsPtr caps,
     char *cachetag = NULL;
     char *error_policy = NULL;
     char *iotag = NULL;
+    char *ioeventfd = NULL;
     char *devaddr = NULL;
     virStorageEncryptionPtr encryption = NULL;
     char *serial = NULL;
@@ -2016,6 +2022,7 @@ virDomainDiskDefParseXML(virCapsPtr caps,
                 cachetag = virXMLPropString(cur, "cache");
                 error_policy = virXMLPropString(cur, "error_policy");
                 iotag = virXMLPropString(cur, "io");
+                ioeventfd = virXMLPropString(cur, "ioeventfd");
             } else if (xmlStrEqual(cur->name, BAD_CAST "readonly")) {
                 def->readonly = 1;
             } else if (xmlStrEqual(cur->name, BAD_CAST "shareable")) {
@@ -2152,6 +2159,18 @@ virDomainDiskDefParseXML(virCapsPtr caps,
         }
     }

+    if (ioeventfd) {
+        int i;
+        if ((i = virDomainVirtioEventFdTypeFromString(ioeventfd)) <= 0) {
+            virDomainReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                                 _("unknown disk ioeventfd mode '%s'"),
+                                 ioeventfd);
+            goto error;
+        }
+        def->ioeventfd = i;
+    }
+
+
     if (devaddr) {
         if (virDomainParseLegacyDeviceAddress(devaddr,
                                               &def->info.addr.pci) < 0) {
@@ -2214,6 +2233,7 @@ cleanup:
     VIR_FREE(cachetag);
     VIR_FREE(error_policy);
     VIR_FREE(iotag);
+    VIR_FREE(ioeventfd);
     VIR_FREE(devaddr);
     VIR_FREE(serial);
     virStorageEncryptionFree(encryption);
@@ -2601,6 +2621,7 @@ virDomainNetDefParseXML(virCapsPtr caps,
     char *model = NULL;
     char *backend = NULL;
     char *txmode = NULL;
+    char *ioeventfd = NULL;
     char *filter = NULL;
     char *internal = NULL;
     char *devaddr = NULL;
@@ -2690,6 +2711,7 @@ virDomainNetDefParseXML(virCapsPtr caps,
             } else if (xmlStrEqual (cur->name, BAD_CAST "driver")) {
                 backend = virXMLPropString(cur, "name");
                 txmode = virXMLPropString(cur, "txmode");
+                ioeventfd = virXMLPropString(cur, "ioeventfd");
             } else if (xmlStrEqual (cur->name, BAD_CAST "filterref")) {
                 filter = virXMLPropString(cur, "filter");
                 VIR_FREE(filterparams);
@@ -2906,6 +2928,17 @@ virDomainNetDefParseXML(virCapsPtr caps,
             }
             def->driver.virtio.txmode = m;
         }
+        if (ioeventfd != NULL) {
+            int i;
+            if ((i = virDomainVirtioEventFdTypeFromString(ioeventfd)) <= 0) {
+                virDomainReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                                     _("unknown interface <driver "
+                                       "ioeventfd='%s'> has been specified"),
+                                     ioeventfd);
+                goto error;
+            }
+            def->driver.virtio.ioeventfd = i;
+        }
     }

     if (filter != NULL) {
@@ -2945,6 +2978,7 @@ cleanup:
     VIR_FREE(model);
     VIR_FREE(backend);
     VIR_FREE(txmode);
+    VIR_FREE(ioeventfd);
     VIR_FREE(filter);
     VIR_FREE(type);
     VIR_FREE(internal);
@@ -7006,6 +7040,7 @@ virDomainDiskDefFormat(virBufferPtr buf,
     const char *cachemode = virDomainDiskCacheTypeToString(def->cachemode);
     const char *error_policy = virDomainDiskErrorPolicyTypeToString(def->error_policy);
     const char *iomode = virDomainDiskIoTypeToString(def->iomode);
+    const char *ioeventfd = virDomainVirtioEventFdTypeToString(def->ioeventfd);

     if (!type) {
         virDomainReportError(VIR_ERR_INTERNAL_ERROR,
@@ -7037,7 +7072,8 @@ virDomainDiskDefFormat(virBufferPtr buf,
                       "    <disk type='%s' device='%s'>\n",
                       type, device);

-    if (def->driverName || def->driverType || def->cachemode) {
+    if (def->driverName || def->driverType || def->cachemode ||
+        def->ioeventfd) {
         virBufferAsprintf(buf, "      <driver");
         if (def->driverName)
             virBufferAsprintf(buf, " name='%s'", def->driverName);
@@ -7049,6 +7085,8 @@ virDomainDiskDefFormat(virBufferPtr buf,
             virBufferAsprintf(buf, " error_policy='%s'", error_policy);
         if (def->iomode)
             virBufferAsprintf(buf, " io='%s'", iomode);
+        if (def->ioeventfd)
+            virBufferAsprintf(buf, " ioeventfd='%s'", ioeventfd);
         virBufferAsprintf(buf, "/>\n");
     }

@@ -7339,6 +7377,10 @@ virDomainNetDefFormat(virBufferPtr buf,
             virBufferAsprintf(buf, " txmode='%s'",
                               virDomainNetVirtioTxModeTypeToString(def->driver.virtio.txmode));
         }
+        if (def->driver.virtio.ioeventfd) {
+            virBufferAsprintf(buf, " ioeventfd='%s'",
+                              virDomainVirtioEventFdTypeToString(def->driver.virtio.ioeventfd));
+        }
         virBufferAddLit(buf, "/>\n");
     }
 }
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index fe42f21..d6b4002 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -197,6 +197,15 @@ enum virDomainDiskIo {
     VIR_DOMAIN_DISK_IO_LAST
 };

+enum virDomainVirtioEventFd {
+    VIR_DOMAIN_VIRTIO_EVENT_FD_DEFAULT = 0,
+    VIR_DOMAIN_VIRTIO_EVENT_FD_ON,
+    VIR_DOMAIN_VIRTIO_EVENT_FD_OFF,
+
+    VIR_DOMAIN_VIRTIO_EVENT_FD_LAST
+};
+
+
 /* Stores the virtual disk configuration */
 typedef struct _virDomainDiskDef virDomainDiskDef;
 typedef virDomainDiskDef *virDomainDiskDefPtr;
@@ -216,6 +225,7 @@ struct _virDomainDiskDef {
     int error_policy;
     int bootIndex;
     int iomode;
+    enum virDomainVirtioEventFd ioeventfd;
     unsigned int readonly : 1;
     unsigned int shared : 1;
     virDomainDeviceInfo info;
@@ -351,6 +361,7 @@ struct _virDomainNetDef {
         struct {
             enum virDomainNetBackendType name; /* which driver backend to use */
             enum virDomainNetVirtioTxModeType txmode;
+            enum virDomainVirtioEventFd ioeventfd;
         } virtio;
     } driver;
     union {
@@ -1473,6 +1484,7 @@ VIR_ENUM_DECL(virDomainDiskCache)
 VIR_ENUM_DECL(virDomainDiskErrorPolicy)
 VIR_ENUM_DECL(virDomainDiskProtocol)
 VIR_ENUM_DECL(virDomainDiskIo)
+VIR_ENUM_DECL(virDomainVirtioEventFd)
 VIR_ENUM_DECL(virDomainController)
 VIR_ENUM_DECL(virDomainControllerModel)
 VIR_ENUM_DECL(virDomainFS)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index acd5af3..cdf1ccb 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -351,6 +351,8 @@ virDomainVideoDefaultType;
 virDomainVideoTypeFromString;
 virDomainVideoTypeToString;
 virDomainVirtTypeToString;
+virDomainVirtioEventFdTypeFromString;
+virDomainVirtioEventFdTypeToString;
 virDomainWatchdogActionTypeFromString;
 virDomainWatchdogActionTypeToString;
 virDomainWatchdogModelTypeFromString;
diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index ea55df5..1cd7aa7 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -1202,6 +1202,8 @@ qemuCapsParseDeviceStr(const char *str, virBitmapPtr flags)
         qemuCapsSet(flags, QEMU_CAPS_VIRTIO_TX_ALG);
     if (strstr(str, "name \"qxl-vga\""))
         qemuCapsSet(flags, QEMU_CAPS_DEVICE_QXL_VGA);
+    if (strstr(str, "virtio-blk-pci.ioeventfd"))
+        qemuCapsSet(flags, QEMU_CAPS_VIRTIO_IOEVENTFD);

     return 0;
 }
diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h
index ab47f22..56f3998 100644
--- a/src/qemu/qemu_capabilities.h
+++ b/src/qemu/qemu_capabilities.h
@@ -95,6 +95,7 @@ enum qemuCapsFlags {
     QEMU_CAPS_DEVICE_SPICEVMC    = 57, /* older -device spicevmc*/
     QEMU_CAPS_VIRTIO_TX_ALG      = 58, /* -device virtio-net-pci,tx=string */
     QEMU_CAPS_DEVICE_QXL_VGA     = 59, /* Is the primary and vga campatible qxl device named qxl-vga? */
+    QEMU_CAPS_VIRTIO_IOEVENTFD   = 60, /* IOEventFD feature: virtio-{net|blk}-pci.ioeventfd=on/off */

     QEMU_CAPS_LAST,                   /* this must always be the last item */
 };
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2828823..0729664 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1169,6 +1169,26 @@ qemuBuildDeviceAddressStr(virBufferPtr buf,
     return 0;
 }

+static int
+qemuBuildIoEventFdStr(virBufferPtr buf,
+                      enum virDomainVirtioEventFd use,
+                      virBitmapPtr qemuCaps)
+{
+    if (qemuCapsGet(qemuCaps, QEMU_CAPS_VIRTIO_IOEVENTFD)) {
+        switch (use) {
+        case VIR_DOMAIN_VIRTIO_EVENT_FD_ON:
+        case VIR_DOMAIN_VIRTIO_EVENT_FD_OFF:
+            virBufferAsprintf(buf, ",ioeventfd=%s",
+                              virDomainVirtioEventFdTypeToString(use));
+            break;
+        default:
+            /* In other cases (_DEFAULT, _LAST) we don't
+             * want to add anything */
+            break;
+        }
+    }
+    return 0;
+}

 #define QEMU_SERIAL_PARAM_ACCEPTED_CHARS \
   "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_"
@@ -1434,6 +1454,7 @@ qemuBuildDriveDevStr(virDomainDiskDefPtr disk,
         break;
     case VIR_DOMAIN_DISK_BUS_VIRTIO:
         virBufferAddLit(&opt, "virtio-blk-pci");
+        qemuBuildIoEventFdStr(&opt, disk->ioeventfd, qemuCaps);
         qemuBuildDeviceAddressStr(&opt, &disk->info, qemuCaps);
         break;
     case VIR_DOMAIN_DISK_BUS_USB:
@@ -1656,6 +1677,8 @@ qemuBuildNicDevStr(virDomainNetDefPtr net,
             goto error;
         }
     }
+    if (usingVirtio)
+        qemuBuildIoEventFdStr(&buf, net->driver.virtio.ioeventfd, qemuCaps);
     if (vlan == -1)
         virBufferAsprintf(&buf, ",netdev=host%s", net->info.alias);
     else
diff --git a/tests/qemuhelptest.c b/tests/qemuhelptest.c
index 2522396..387f46c 100644
--- a/tests/qemuhelptest.c
+++ b/tests/qemuhelptest.c
@@ -474,7 +474,8 @@ mymain(void)
             QEMU_CAPS_CCID_PASSTHRU,
             QEMU_CAPS_CHARDEV_SPICEVMC,
             QEMU_CAPS_DEVICE_QXL_VGA,
-            QEMU_CAPS_VIRTIO_TX_ALG);
+            QEMU_CAPS_VIRTIO_TX_ALG,
+            QEMU_CAPS_VIRTIO_IOEVENTFD);

     return ret == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
 }
-- 
1.7.5.rc3
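As a minimal usage sketch of the proposed attribute (the image path and target
device below are illustrative assumptions, not taken from the patch), it would
sit on the existing <driver> element:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' ioeventfd='on'/>
      <source file='/var/lib/libvirt/images/guest.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>

    <!-- when QEMU_CAPS_VIRTIO_IOEVENTFD is detected, the patch is expected
         to append ",ioeventfd=on" to the generated virtio-blk-pci -device
         arguments -->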

On Tue, May 17, 2011 at 04:49:17PM +0200, Michal Privoznik wrote:
This feature allows QEMU to achieve higher throughput, but is available only in recent versions. It is accessible via ioeventfd attribute with accepting values 'on', 'off'. Only experienced users needs to set this, because QEMU defaults to 'on', meaning higher performance. Translates into virtio-{blk|net}-pci.ioeventfd option. --- diff to v1: - rebase to current HEAD
docs/formatdomain.html.in | 15 ++++++++++++- docs/schemas/domain.rng | 14 +++++++++++++ src/conf/domain_conf.c | 44 +++++++++++++++++++++++++++++++++++++++++- src/conf/domain_conf.h | 12 +++++++++++ src/libvirt_private.syms | 2 + src/qemu/qemu_capabilities.c | 2 + src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 23 +++++++++++++++++++++ tests/qemuhelptest.c | 3 +- 9 files changed, 112 insertions(+), 4 deletions(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 989dcf6..a4d7cbd 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -767,7 +767,7 @@ </disk> ... <disk type='network'> - <driver name="qemu" type="raw" io="threads"/> + <driver name="qemu" type="raw" io="threads" ioeventfd="on"/> <source protocol="sheepdog" name="image_name"> <host name="hostname" port="7000"/> </source> @@ -851,6 +851,11 @@ policies on I/O; qemu guests support "threads" and "native". <span class="since">Since 0.8.8</span> </li> + <li> + The optional <code>ioeventfd</code> attribute enables or disables + IOEventFD feature for virtqueue notify. The value can be either + 'on' or 'off'. + <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
This is a qemu specific attribute name & description. IMHO we shouldn't be
exposing that directly. Who even knows what effect it actually has on the
guests...

Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Tue, May 17, 2011 at 03:56:11PM +0100, Daniel P. Berrange wrote:
On Tue, May 17, 2011 at 04:49:17PM +0200, Michal Privoznik wrote:
This feature allows QEMU to achieve higher throughput, but is available only in recent versions. It is accessible via ioeventfd attribute with accepting values 'on', 'off'. Only experienced users needs to set this, because QEMU defaults to 'on', meaning higher performance. Translates into virtio-{blk|net}-pci.ioeventfd option. [...] + <li> + The optional <code>ioeventfd</code> attribute enables or disables + IOEventFD feature for virtqueue notify. The value can be either + 'on' or 'off'. + <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
This is a qemu specific attribute name & description. IMHO we shouldn't be exposing that directly. Who even knows what effect it actually has on the guests...
Agreed, what is the semantic of this flag, besides allowing us to switch
something in qemu?

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel@veillard.com  | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

On Wed, May 18, 2011 at 04:07:30PM +0800, Daniel Veillard wrote:
On Tue, May 17, 2011 at 03:56:11PM +0100, Daniel P. Berrange wrote:
On Tue, May 17, 2011 at 04:49:17PM +0200, Michal Privoznik wrote:
This feature allows QEMU to achieve higher throughput, but is available only in recent versions. It is accessible via ioeventfd attribute with accepting values 'on', 'off'. Only experienced users needs to set this, because QEMU defaults to 'on', meaning higher performance. Translates into virtio-{blk|net}-pci.ioeventfd option. [...] + <li> + The optional <code>ioeventfd</code> attribute enables or disables + IOEventFD feature for virtqueue notify. The value can be either + 'on' or 'off'. + <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
This is a qemu specific attribute name & description. IMHO we shouldn't be exposing that directly. Who even knows what effect it actually has on the guests...
Agreed, what is the semantic of this flag, beside allowing to switch something in qemu ?
Just to clarify my answer a bit: the problem here is that the patch does not
explain what the ioeventfd qemu flag does in practice and how it influences
the virtualization.

To be able to provide a good API and maintain it long term, we need to be able
to explain the semantics of the API (be it a function of the library or part
of the XML being used). Only then can we guarantee that there is no
misunderstanding about what it does, and also reuse it in case the same
functionality is provided by another hypervisor.

So instead of explaining the option using terms from QEmu, let's explain what
it does in general terms and use those general terms to model the API.

I hope it's clearer :-) thanks!

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel@veillard.com  | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

On Thu, May 19, 2011 at 8:26 AM, Daniel Veillard <veillard@redhat.com> wrote:
On Wed, May 18, 2011 at 04:07:30PM +0800, Daniel Veillard wrote:
On Tue, May 17, 2011 at 03:56:11PM +0100, Daniel P. Berrange wrote:
On Tue, May 17, 2011 at 04:49:17PM +0200, Michal Privoznik wrote:
This feature allows QEMU to achieve higher throughput, but is available only in recent versions. It is accessible via ioeventfd attribute with accepting values 'on', 'off'. Only experienced users needs to set this, because QEMU defaults to 'on', meaning higher performance. Translates into virtio-{blk|net}-pci.ioeventfd option. [...] + <li> + The optional <code>ioeventfd</code> attribute enables or disables + IOEventFD feature for virtqueue notify. The value can be either + 'on' or 'off'. + <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
This is a qemu specific attribute name & description. IMHO we shouldn't be exposing that directly. Who even knows what effect it actually has on the guests...
Agreed, what is the semantic of this flag, beside allowing to switch something in qemu ?
Just to clarify my answer a bit, the problem here is that the patch does not explain what the ioeventfd qemu flag does in practice and how it influence the virtualization. To be able to provide a good API and maintain it long term we need to be able to explain the semantic of the API (be it a function of the library or part of the XML being used), only then we can guarantee that there is no misunderstanding about what it does, and also allow us to reuse it in case the same functionality is provided by another hypervisor. So instead of explaining the option using terms from QEmu, let's explain what it does in general terms and use those general terms to model the API,
I don't think there is a general API here; ioeventfd is specific to QEMU's
architecture. It allows you to switch between two internal threading models
for handling I/O emulation. It could change in the future if QEMU's
architecture changes. This is not an end-user feature, it's more of an
internal performance tunable.

Stefan

On Thu, May 19, 2011 at 09:44:35AM +0100, Stefan Hajnoczi wrote:
On Thu, May 19, 2011 at 8:26 AM, Daniel Veillard <veillard@redhat.com> wrote:
On Wed, May 18, 2011 at 04:07:30PM +0800, Daniel Veillard wrote:
On Tue, May 17, 2011 at 03:56:11PM +0100, Daniel P. Berrange wrote:
On Tue, May 17, 2011 at 04:49:17PM +0200, Michal Privoznik wrote:
This feature allows QEMU to achieve higher throughput, but is available only in recent versions. It is accessible via ioeventfd attribute with accepting values 'on', 'off'. Only experienced users needs to set this, because QEMU defaults to 'on', meaning higher performance. Translates into virtio-{blk|net}-pci.ioeventfd option. [...] + <li> + The optional <code>ioeventfd</code> attribute enables or disables + IOEventFD feature for virtqueue notify. The value can be either + 'on' or 'off'. + <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
This is a qemu specific attribute name & description. IMHO we shouldn't be exposing that directly. Who even knows what effect it actually has on the guests...
Agreed, what is the semantic of this flag, beside allowing to switch something in qemu ?
Just to clarify my answer a bit, the problem here is that the patch does not explain what the ioeventfd qemu flag does in practice and how it influence the virtualization. To be able to provide a good API and maintain it long term we need to be able to explain the semantic of the API (be it a function of the library or part of the XML being used), only then we can guarantee that there is no misunderstanding about what it does, and also allow us to reuse it in case the same functionality is provided by another hypervisor. So instead of explaining the option using terms from QEmu, let's explain what it does in general terms and use those general terms to model the API,
I don't think there is a general API here, ioeventfd is specific to QEMU's architecture. It allows you to switch between two internal threading models for handling I/O emulation. It could change in the future if QEMU's architecture changes. This is not an end-user feature, it's more an internal performance tunable.
Actually, reading about it at https://patchwork.kernel.org/patch/43390/ it
seems it can be described as "domain I/O asynchronous handling". That is a
shortcut, because it covers only a part of the I/O rather than the whole of
it, but it is in itself sufficiently generic to be potentially useful for
something else.

I would just suggest renaming the attribute to "asyncio" with value on or
off, documenting the fact that it allows forcing on or off some of the
asynchronous I/O handling for the device, and that the default is left to
the discretion of the hypervisor.

In case we need to refine later, we can still provide a larger set of
accepted values for that attribute, assuming people really want to make more
distinctive tuning.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel@veillard.com  | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

On Thu, May 19, 2011 at 11:20 AM, Daniel Veillard <veillard@redhat.com> wrote:
On Thu, May 19, 2011 at 09:44:35AM +0100, Stefan Hajnoczi wrote:
On Thu, May 19, 2011 at 8:26 AM, Daniel Veillard <veillard@redhat.com> wrote:
On Wed, May 18, 2011 at 04:07:30PM +0800, Daniel Veillard wrote:
On Tue, May 17, 2011 at 03:56:11PM +0100, Daniel P. Berrange wrote:
On Tue, May 17, 2011 at 04:49:17PM +0200, Michal Privoznik wrote:
This feature allows QEMU to achieve higher throughput, but is available only in recent versions. It is accessible via ioeventfd attribute with accepting values 'on', 'off'. Only experienced users needs to set this, because QEMU defaults to 'on', meaning higher performance. Translates into virtio-{blk|net}-pci.ioeventfd option. [...] + <li> + The optional <code>ioeventfd</code> attribute enables or disables + IOEventFD feature for virtqueue notify. The value can be either + 'on' or 'off'. + <span class="since">Since 0.9.2 (QEMU and KVM only)</span>
This is a qemu specific attribute name & description. IMHO we shouldn't be exposing that directly. Who even knows what effect it actually has on the guests...
Agreed, what is the semantic of this flag, beside allowing to switch something in qemu ?
Just to clarify my answer a bit, the problem here is that the patch does not explain what the ioeventfd qemu flag does in practice and how it influence the virtualization. To be able to provide a good API and maintain it long term we need to be able to explain the semantic of the API (be it a function of the library or part of the XML being used), only then we can guarantee that there is no misunderstanding about what it does, and also allow us to reuse it in case the same functionality is provided by another hypervisor. So instead of explaining the option using terms from QEmu, let's explain what it does in general terms and use those general terms to model the API,
I don't think there is a general API here, ioeventfd is specific to QEMU's architecture. It allows you to switch between two internal threading models for handling I/O emulation. It could change in the future if QEMU's architecture changes. This is not an end-user feature, it's more an internal performance tunable.
Actually reading about it at https://patchwork.kernel.org/patch/43390/
it seems that can be described as "domain I/O asynchronous handling", it's a shortcut because it's not for the whole I/O only a part of it but that in itself is sufficiently generic to be potentially useful for something else.
I would just suggest to rename the attribute "asyncio" with value on or off, document the fact that it allows to force on or off some of the asynchronous I/O handling for the device, and that the default is left to the discretion of the hypervisor.
In case we need to refine later, we can still provide a larger set of accepted values for that attribute, assuming people really want to make more distinctive tuning,
Inventing a different name makes life harder for everyone. There is a need
for a generic API/notation that covers all virtualization software, but this
is a hypervisor-specific performance tunable that does not benefit from
abstraction.

When I ask a user to try disabling ioeventfd I need to first search through
libvirt documentation and/or source code to reverse-engineer this artificial
mapping. This creates an extra source of errors for people who are trying to
configure or troubleshoot their systems. The "I know what the
hypervisor-specific setting is but have no idea how to express it in libvirt
domain XML" problem is really common and creates a gap between the hypervisor
and libvirt communities.

The next time an optimization is added to QEMU you'll have to pick a new
name; "asyncio" (already overloaded terminology today) won't be available
anymore. We're going to end up with increasingly contrived or off-base
naming.

Regarding semantics:

Ioeventfd decouples vcpu execution from I/O emulation, allowing the VM to
execute guest code while a separate thread handles I/O. This results in
reduced steal time and lowers spinlock contention inside the guest.
Typically guests that are experiencing high system cpu utilization during
I/O will benefit from ioeventfd. On an overcommitted host it could increase
guest I/O latency though. The ioeventfd option is currently only supported
on virtio-blk (default: on) and virtio-net (default: off) devices.

Please call it ioeventfd. Also, it can always be toggled using the
<qemu:commandline> tag if you don't want to expose it natively in domain XML.

Stefan
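For reference, the <qemu:commandline> escape hatch mentioned above would look
roughly like this (the device alias 'virtio-disk0' and the use of QEMU's -set
option are assumptions about one particular guest, shown only as a sketch):

    <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
      ...
      <qemu:commandline>
        <qemu:arg value='-set'/>
        <qemu:arg value='device.virtio-disk0.ioeventfd=off'/>
      </qemu:commandline>
    </domain>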

On 05/19/2011 04:09 PM, Stefan Hajnoczi wrote:
In case we need to refine later, we can still provide a larger set of accepted values for that attribute, assuming people really want to make more distinctive tuning,
Inventing a different name makes life harder for everyone. There is a need for a generic API/notation that covers all virtualization software but this is a hypervisor-specific performance tunable that does not benefit from abstraction.
That's true if we narrow-mindedly think that qemu will be the _only_ hypervisor that will ever use this feature. But libvirt has the stated goal of accommodating multiple hypervisors, so we'd rather go with the generic name even if it requires more mapping.
When I ask a user to try disabling ioeventfd I need to first search through libvirt documentation and/or source code to reverse-engineer this artificial mapping. This creates an extra source of errors for people who are trying to configure or troubleshoot their systems. The "I know what the hypervisor-specific setting is but have no idea how to express it in libvirt domain XML" problem is really common and creates a gap between the hypervisor and libvirt communities.
Yes, good documentation that mentions the specific qemu option it maps to would be helpful. Also, libvirt has the domxml-from-native command, and if we do this correctly, then you should be able to pass an arbitrary qemu command line to libvirt, and ask what the corresponding xml would be. That would make your reverse mapping less difficult.
The next time an optimization is added to QEMU you'll have to pick a new name, "asyncio" (already overloaded terminology today) won't be available anymore. We're going to end up with increasingly contrived or off-base naming.
Then please help us, by suggesting a generic name more suited to this particular qemu tuning, but which can still be used for other hypervisors if they can also tune the same semantics.
Regarding semantics:
Ioeventfd decouples vcpu execution from I/O emulation, allowing the VM to execute guest code while a separate thread handles I/O. This results in reduced steal time and lowers spinlock contention inside the guest. Typically guests that are experiencing high system cpu utilization during I/O will benefit from ioeventfd. On an overcommitted host it could increase guest I/O latency though. The ioeventfd option is currently only supported on virtio-blk (default: on) and virtio-net (default: off) devices.
Please call it ioeventfd. Also, it can always be toggled using the <qemu:commandline> tag if you don't want to expose it natively in domain XML.
However, use of <qemu:commandline> results in an unsupported configuration;
we need to expose it natively in domain XML under some name or another.

We already have the XML attribute io="threads|native" in the same <driver>
element, for selecting whether to use the aio flag for a given disk driver,
so this would not be the first case of selecting a different name in the XML.

-- 
Eric Blake   eblake@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org
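To make that precedent concrete, the existing io attribute already renames a
QEMU-level flag: a <driver> line such as the one below (cache value shown
only as an example) maps to aio=native on the generated -drive option, so a
renamed ioeventfd attribute would follow the same pattern.

    <driver name='qemu' type='raw' cache='none' io='native'/>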

On Thu, May 19, 2011 at 11:24 PM, Eric Blake <eblake@redhat.com> wrote:
On 05/19/2011 04:09 PM, Stefan Hajnoczi wrote:
In case we need to refine later, we can still provide a larger set of accepted values for that attribute, assuming people really want to make more distinctive tuning,
Inventing a different name makes life harder for everyone. There is a need for a generic API/notation that covers all virtualization software but this is a hypervisor-specific performance tunable that does not benefit from abstraction.
That's true if we narrow-mindedly think that qemu will be the _only_ hypervisor that will ever use this feature. But libvirt has the stated goal of accommodating multiple hypervisors, so we'd rather go with the generic name even if it requires more mapping.
That approach is an extreme one. I'm not suggesting making everything
hypervisor-specific; I'm asking for a practical middle ground: abstract what
is common, leave what is not.

There are many common concepts: how much RAM should the VM have, what CPU
topology, network cards, storage devices. Those should be abstracted because
they are genuinely common concepts. However, for hypervisor-specific features
inventing names does no good. The feature is not relevant on other
hypervisors and abstracting it removes its semantics. It becomes purely an
exercise in obfuscation.

When I browse the Python documentation, many modules are platform-independent
and can be used everywhere. Some functions or modules are specifically marked
as platform-dependent (and the version they were introduced in):

  "os.initgroups(username, gid)
   Call the system initgroups() to initialize the group access list with all
   of the groups of which the specified username is a member, plus the
   specified group id. Availability: Unix. New in version 2.7."

Doing the same in libvirt allows common concepts to be abstracted while
exposing unique ones in a clearly documented way.
When I ask a user to try disabling ioeventfd I need to first search through libvirt documentation and/or source code to reverse-engineer this artificial mapping. This creates an extra source of errors for people who are trying to configure or troubleshoot their systems. The "I know what the hypervisor-specific setting is but have no idea how to express it in libvirt domain XML" problem is really common and creates a gap between the hypervisor and libvirt communities.
Yes, good documentation that mentions the specific qemu option it maps to would be helpful.
Also, libvirt has the domxml-from-native command, and if we do this correctly, then you should be able to pass an arbitrary qemu command line to libvirt, and ask what the corresponding xml would be. That would make your reverse mapping less difficult.
Thanks for mentioning this command, I'll try it next time.
The next time an optimization is added to QEMU you'll have to pick a new name, "asyncio" (already overloaded terminology today) won't be available anymore. We're going to end up with increasingly contrived or off-base naming.
Then please help us, by suggesting a generic name more suited to this particular qemu tuning, but which can still be used for other hypervisors if they can also tune the same semantics.
As I've said, inventing generic names only serves to confuse when the concept is hypervisor-specific and not a common feature.
Regarding semantics:
Ioeventfd decouples vcpu execution from I/O emulation, allowing the VM to execute guest code while a separate thread handles I/O. This results in reduced steal time and lowers spinlock contention inside the guest. Typically guests that are experiencing high system cpu utilization during I/O will benefit from ioeventfd. On an overcommitted host it could increase guest I/O latency though. The ioeventfd option is currently only supported on virtio-blk (default: on) and virtio-net (default: off) devices.
Please call it ioeventfd. Also, it can always be toggled using the <qemu:commandline> tag if you don't want to expose it natively in domain XML.
However, use of <qemu:commandline> results in an unsupported configuration; we need to expose it natively in domain XML under some name or another.
We already have the XML attribute io="threads|native" in the same <driver> element, for selecting whether to use the aio flag for a given disk driver, so this would not be the first case of selecting a different name in the XML.
Just because it has been done before doesn't mean it is a good idea.

Stefan

On Thu, May 19, 2011 at 11:09:54PM +0100, Stefan Hajnoczi wrote:
Please call it ioeventfd. Also, it can always be toggled using the <qemu:commandline> tag if you don't want to expose it natively in domain XML.
A general comment on the XML ... I think at least this should be discoverable
through the libvirt capabilities (see "virsh capabilities"). At the moment
there is no section there which relates to tunables of the hypervisor, nor
any for *disk* tunables of the HV. Perhaps there should be both.

Is ioeventfd always a win? If so, turn it on by default.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
New in Fedora 11: Fedora Windows cross-compiler. Compile Windows programs,
test, and build Windows installers. Over 70 libraries supprt'd
http://fedoraproject.org/wiki/MinGW http://www.annexia.org/fedora_mingw
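A purely hypothetical sketch of what such a capabilities section could look
like (none of these element or attribute names exist in the capabilities XML
today; they are invented only for illustration):

    <guest>
      <arch name='x86_64'>
        ...
      </arch>
      <disktunables>
        <tunable name='ioeventfd' values='on off' default='hypervisor'/>
      </disktunables>
    </guest>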
participants (6):
- Daniel P. Berrange
- Daniel Veillard
- Eric Blake
- Michal Privoznik
- Richard W.M. Jones
- Stefan Hajnoczi