[libvirt] [PATCH 0/7] QEMU network/bridge backed NIC hotplug

Hey, So, here are the patches needed to complete the QEMU NIC hotplug work by supporting network and bridge backed interfaces. We support this by passing a tap file descriptor to qemu over the monitor socket using SCM_RIGHTS. The monitor commands required for this (getfd and closefd) are queued up in Anthony's queue and should be in qemu-0.11-rc1 which is due to be released very soon: http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=4edd6c2645 http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=f4069061e3 Note: I just remembered I'm not handling missing support for these commands very well. I'll make it handle an 'unknown command' reply gracefully by failing the attach tommorrow. Cheers, Mark.

* src/qemu_conf.h: export qemudNetworkIfaceConnect() * src/qemu_conf.c: move vnet_hdr logic into qemudNetworkIfaceConnect() since we need it for hotplug too --- src/qemu_conf.c | 17 ++++++++--------- src/qemu_conf.h | 5 +++++ 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/src/qemu_conf.c b/src/qemu_conf.c index 96f83cb..ed77153 100644 --- a/src/qemu_conf.c +++ b/src/qemu_conf.c @@ -743,15 +743,16 @@ int qemudExtractVersion(virConnectPtr conn, } -static int +int qemudNetworkIfaceConnect(virConnectPtr conn, struct qemud_driver *driver, virDomainNetDefPtr net, - int vnet_hdr) + int qemuCmdFlags) { char *brname; int err; int tapfd = -1; + int vnet_hdr = 0; if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK) { virNetworkPtr network = virNetworkLookupByName(conn, @@ -791,6 +792,10 @@ qemudNetworkIfaceConnect(virConnectPtr conn, return -1; } + if (qemuCmdFlags & QEMUD_CMD_FLAG_VNET_HDR && + net->model && STREQ(net->model, "virtio")) + vnet_hdr = 1; + if ((err = brAddTap(driver->brctl, brname, &net->ifname, vnet_hdr, &tapfd))) { if (errno == ENOTSUP) { @@ -1545,13 +1550,7 @@ int qemudBuildCommandLine(virConnectPtr conn, if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK || net->type == VIR_DOMAIN_NET_TYPE_BRIDGE) { - int vnet_hdr = 0; - - if (qemuCmdFlags & QEMUD_CMD_FLAG_VNET_HDR && - net->model && STREQ(net->model, "virtio")) - vnet_hdr = 1; - - tapfd = qemudNetworkIfaceConnect(conn, driver, net, vnet_hdr); + tapfd = qemudNetworkIfaceConnect(conn, driver, net, qemuCmdFlags); if (tapfd < 0) goto error; diff --git a/src/qemu_conf.h b/src/qemu_conf.h index 50d7c0a..8ad5f94 100644 --- a/src/qemu_conf.h +++ b/src/qemu_conf.h @@ -160,6 +160,11 @@ int qemuBuildNicStr (virConnectPtr conn, int vlan, char **str); +int qemudNetworkIfaceConnect (virConnectPtr conn, + struct qemud_driver *driver, + virDomainNetDefPtr net, + int qemuCmdFlags); + int qemuAssignNetNames (virDomainDefPtr def, virDomainNetDefPtr net); -- 1.6.2.5

This patch allows passing of a "-mem-path <arg>" flag to qemu for support of huge page backed guests. A guest may request this option via specifying: <hugepage>on</hugepage> in its domain definition xml file. The request for huge page backing will be attempted within libvirt if the host system has indicated a hugetlbfs mount point in qemu.conf, for example: hugepage_mount = "/hugetlbfs" _and_ the target qemu executable is aware of the "-mem-path" flag. Otherwise this request by a guest will result in an error. This patch does not address setup of the required host hugetlbfs mount point, verifying the mount point is correct/usable, nor assure sufficient free huge pages are available; which are assumed to be addressed by other means. Signed-off-by: john cooper <john.cooper@redhat.com> --- diff --git a/src/domain_conf.c b/src/domain_conf.c index f3e4c6c..04d6911 100644 --- a/src/domain_conf.c +++ b/src/domain_conf.c @@ -2369,6 +2369,17 @@ static virDomainDefPtr virDomainDefParseXML(virConnectPtr conn, if (virXPathULong(conn, "string(./currentMemory[1])", ctxt, &def->memory) < 0) def->memory = def->maxmem; + tmp = virXPathString(conn, "string(./hugepage[1])", ctxt); + if (!tmp || STREQ(tmp, "off")) + def->hugepage_backed = 0; + else if (STREQ(tmp, "on")) + def->hugepage_backed = 1; + else { + virDomainReportError(conn, VIR_ERR_INTERNAL_ERROR, + _("invalid hugepage mode %s"), tmp); + goto error; + } + if (virXPathULong(conn, "string(./vcpu[1])", ctxt, &def->vcpus) < 0) def->vcpus = 1; @@ -3933,6 +3944,8 @@ char *virDomainDefFormat(virConnectPtr conn, virBufferVSprintf(&buf, " <memory>%lu</memory>\n", def->maxmem); virBufferVSprintf(&buf, " <currentMemory>%lu</currentMemory>\n", def->memory); + if (def->hugepage_backed) + virBufferVSprintf(&buf, " <hugepage>%s</hugepage>\n", "on"); for (n = 0 ; n < def->cpumasklen ; n++) if (def->cpumask[n] != 1) diff --git a/src/domain_conf.h b/src/domain_conf.h index 6e111fa..d6bdcdb 100644 --- a/src/domain_conf.h +++ b/src/domain_conf.h @@ -481,6 +481,7 @@ struct _virDomainDef { unsigned long memory; unsigned long maxmem; + unsigned char hugepage_backed; unsigned long vcpus; int cpumasklen; char *cpumask; diff --git a/src/qemu.conf b/src/qemu.conf index 3009725..a3387f1 100644 --- a/src/qemu.conf +++ b/src/qemu.conf @@ -95,3 +95,10 @@ # The group ID for QEMU processes run by the system instance #group = "root" + +# If provided by the host and this hugetlbfs mount point is +# configured, a guest may request huge page backing. When this +# mount point is undefined, huge page backing is disabled. + +hugepage_mount = "/hugetlbfs" + diff --git a/src/qemu_conf.c b/src/qemu_conf.c index 4043d70..632b784 100644 --- a/src/qemu_conf.c +++ b/src/qemu_conf.c @@ -218,6 +218,17 @@ int qemudLoadDriverConfig(struct qemud_driver *driver, } VIR_FREE(group); + p = virConfGetValue (conf, "hugepage_mount"); + CHECK_TYPE ("hugepage_mount", VIR_CONF_STRING); + if (p && p->str) { + VIR_FREE(driver->hugepage_mount); + if (!(driver->hugepage_mount = strdup(p->str))) { + virReportOOMError(NULL); + virConfFree(conf); + return -1; + } + } + virConfFree (conf); return 0; } @@ -500,6 +511,8 @@ static unsigned int qemudComputeCmdFlags(const char *help, flags |= QEMUD_CMD_FLAG_VGA; if (strstr(help, "boot=on")) flags |= QEMUD_CMD_FLAG_DRIVE_BOOT; + if (strstr(help, "-mem-path")) + flags |= QEMUD_CMD_FLAG_MEM_PATH; if (version >= 9000) flags |= QEMUD_CMD_FLAG_VNC_COLON; @@ -1125,6 +1138,15 @@ int qemudBuildCommandLine(virConnectPtr conn, ADD_ARG_LIT("-no-kvm"); ADD_ARG_LIT("-m"); ADD_ARG_LIT(memory); + if (def->hugepage_backed) { + if (!driver->hugepage_mount || !(qemuCmdFlags & QEMUD_CMD_FLAG_MEM_PATH)) { + qemudReportError(conn, NULL, NULL, VIR_ERR_NO_SUPPORT, + "%s", _("hugepage backing not supported")); + goto error; + } + ADD_ARG_LIT("-mem-path"); + ADD_ARG_LIT(driver->hugepage_mount); + } ADD_ARG_LIT("-smp"); ADD_ARG_LIT(vcpus); diff --git a/src/qemu_conf.h b/src/qemu_conf.h index fbf2ab9..847597f 100644 --- a/src/qemu_conf.h +++ b/src/qemu_conf.h @@ -58,6 +58,7 @@ enum qemud_cmd_flags { QEMUD_CMD_FLAG_KVM = (1 << 13), /* Whether KVM is compiled in */ QEMUD_CMD_FLAG_DRIVE_FORMAT = (1 << 14), /* Is -drive format= avail */ QEMUD_CMD_FLAG_VGA = (1 << 15), /* Is -vga avail */ + QEMUD_CMD_FLAG_MEM_PATH = (1 << 16), /* mmap'ped guest backing supported */ }; /* Main driver state */ @@ -86,6 +87,7 @@ struct qemud_driver { char *vncListen; char *vncPassword; char *vncSASLdir; + char *hugepage_mount; virCapsPtr caps; diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 00dc6e5..bdecf5a 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -638,6 +638,7 @@ qemudShutdown(void) { VIR_FREE(qemu_driver->vncListen); VIR_FREE(qemu_driver->vncPassword); VIR_FREE(qemu_driver->vncSASLdir); + VIR_FREE(qemu_driver->hugepage_mount); /* Free domain callback list */ virDomainEventCallbackListFree(qemu_driver->domainEventCallbacks); -- john.cooper@redhat.com

Hey, For anyone interested, there is more background here: https://fedoraproject.org/wiki/Features/KVM_Huge_Page_Backed_Memory On Wed, 2009-07-22 at 21:25 -0400, john cooper wrote:
This patch allows passing of a "-mem-path <arg>" flag to qemu for support of huge page backed guests. A guest may request this option via specifying:
<hugepage>on</hugepage>
Other options include: - <hugepages/> - <memory hugepages="yes">XXXXX</memory>
in its domain definition xml file. The request for huge page backing will be attempted within libvirt if the host system has indicated a hugetlbfs mount point in qemu.conf, for example:
hugepage_mount = "/hugetlbfs"
I'd suggest /dev/hugepages as the default - /hugetlbfs has an seriously unstandard whiff about it Apart from that, the patch looks good to me - it needs to be re-based onto latest master, but the conflict is no big deal. Cheers, Mark.

Mark McLoughlin wrote:
Other options include:
- <hugepages/>
- <memory hugepages="yes">XXXXX</memory>
Yes, I'd expect additional options will need to be addressed. Currently the only additional qemu-resident knob is the -mem-prealloc flag which is enabled by default. I've removed support for it here in the interest of simplicity but it will fall into the existing scheme.
hugepage_mount = "/hugetlbfs"
I'd suggest /dev/hugepages as the default - /hugetlbfs has an seriously unstandard whiff about it
That was intended as an example for no other reason than it is what I happen to use. I suspected it wouldn't be the mount point's final resting place. Thanks, -john -- john.cooper@redhat.com

* Mark McLoughlin (markmc@redhat.com) wrote:
I'd suggest /dev/hugepages as the default - /hugetlbfs has an seriously unstandard whiff about it
What about /var/lib/libvirt/qemu/hugetlb and having the whole thing under libvirt's control? It can allow for better security I think. thanks, -chris

On Thu, Jul 23, 2009 at 11:35:17AM -0700, Chris Wright wrote:
* Mark McLoughlin (markmc@redhat.com) wrote:
I'd suggest /dev/hugepages as the default - /hugetlbfs has an seriously unstandard whiff about it
What about /var/lib/libvirt/qemu/hugetlb and having the whole thing under libvirt's control? It can allow for better security I think.
Does hugetlbfs support extended attributes ? If so, the sVirt will automatically ensure isolation of each VM's backing file. If it doesn't supported extended attrs, then using hugetlbs at all would seem to blow a huge hole in our security model ... Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

* Daniel P. Berrange (berrange@redhat.com) wrote:
On Thu, Jul 23, 2009 at 11:35:17AM -0700, Chris Wright wrote:
* Mark McLoughlin (markmc@redhat.com) wrote:
I'd suggest /dev/hugepages as the default - /hugetlbfs has an seriously unstandard whiff about it
What about /var/lib/libvirt/qemu/hugetlb and having the whole thing under libvirt's control? It can allow for better security I think.
Does hugetlbfs support extended attributes ? If so, the sVirt will automatically ensure isolation of each VM's backing file. If it doesn't supported extended attrs, then using hugetlbs at all would seem to blow a huge hole in our security model ...
You should be able to specify a per mount point security context. There is no xattr support for hugetlbfs. thanks, -chris

On Wed, Jul 22, 2009 at 09:25:02PM -0400, john cooper wrote:
This patch allows passing of a "-mem-path <arg>" flag to qemu for support of huge page backed guests. A guest may request this option via specifying:
<hugepage>on</hugepage>
in its domain definition xml file.
This really opens a can of worms. While obviously this maps very simply onto KVM's -mem-path argument, I can't help thinking things are going to get much more advanced later. For example, I don't think a boolean on/off is sufficient for this, since Xen already has a 3rd option of 'best effort' it uses by default where it'll try to allocate hugepages and fallback to normal pages - in fact you can't tell Xen not to use hugepages AFAIK. I'm also wondering whether we need to be concerned about different hugepage sizes for guest configs eg 2M vs 1 GB, vs a mix of both - who decides? KVM also seems to have ability to request that huge pages are pre-allocated upfront, vs on demand, though I'm not sure what happens to a VM if it doesn't pre-allocate and it later can't be satisfied.
The request for huge page backing will be attempted within libvirt if the host system has indicated a hugetlbfs mount point in qemu.conf, for example:
hugepage_mount = "/hugetlbfs"
Seems like it would be simpler to just open /proc/mounts and scan it to find whether/where hugetlbfs is mounted, so it would 'just work' if the user had mounted it.
_and_ the target qemu executable is aware of the "-mem-path" flag. Otherwise this request by a guest will result in an error.
It looks like argument is not available in upstream QEMU, only part of the KVM fork ? ANy idea why it hasn't been sent upstream, and/or whether it will be soon. I'm loathe to add more KVM specific options since we've been burnt everytime we've done this in the past with its semantics changing when merged to QEMU :-(
This patch does not address setup of the required host hugetlbfs mount point, verifying the mount point is correct/usable, nor assure sufficient free huge pages are available; which are assumed to be addressed by other means.
I agree that setting up hugetlbfs is out of scope for libvirt. We should just probe to see whether its available or not. We ought to have some way of reporting available hugepages though, both at a host level, and likely per NUMA node too. Without this a mgmt app using libvirt has no clue whether they'll be able to actually use hugepages successfully or not. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Daniel P. Berrange wrote:
On Wed, Jul 22, 2009 at 09:25:02PM -0400, john cooper wrote:
This patch allows passing of a "-mem-path <arg>" flag to qemu for support of huge page backed guests. A guest may request this option via specifying:
<hugepage>on</hugepage>
in its domain definition xml file.
This really opens a can of worms. While obviously this maps very simply onto KVM's -mem-path argument, I can't help thinking things are going to get much more advanced later. For example, I don't think a boolean on/off is sufficient for this, since Xen already has a 3rd option of 'best effort' it uses by default where it'll try to allocate hugepages and fallback to normal pages - in fact you can't tell Xen not to use hugepages AFAIK.
I agree growing beyond a simple on/off switch is likely. The patch originally had a "prealloc" option (since removed) to accommodate passing of the existing "-mem-prealloc" flag. That option defaults to enabled, which is desired in general and therefore was dropped for the sake of patch simplicity.
I'm also wondering whether we need to be concerned about different hugepage sizes for guest configs eg 2M vs 1 GB, vs a mix of both - who decides?
Not a consideration currently, but it does underscore the argument for a more extensible option syntax.
KVM also seems to have ability to request that huge pages are pre-allocated upfront, vs on demand, though I'm not sure what happens to a VM if it doesn't pre-allocate and it later can't be satisfied.
That was the motivation for "-mem-prealloc", prior to which a VM would SIGSEGV terminate if a huge page fault couldn't be satisfied during runtime. The (now default) preallocation behavior guarantees the guest has its working set present at startup but at the potential cost of overly pessimistic memory allocation.
The request for huge page backing will be attempted within libvirt if the host system has indicated a hugetlbfs mount point in qemu.conf, for example:
hugepage_mount = "/hugetlbfs"
Seems like it would be simpler to just open /proc/mounts and scan it to find whether/where hugetlbfs is mounted, so it would 'just work' if the user had mounted it.
Checking /proc/mounts solely seemed a bit too speculative which is where the qemu.conf option arose. But I can see both being useful as in checking whether the qemu.conf mount point exists, otherwise attempting to glean the information from /proc/mounts, and if neither are satisfied flagging the error.
It looks like argument is not available in upstream QEMU, only part of the KVM fork ? ANy idea why it hasn't been sent upstream, and/or whether it will be soon.
I'd hazard due to the existing huge page support being closely tied to kvm and no motivation as of yet to reconcile this upstream.
I'm loathe to add more KVM specific options since we've been burnt everytime we've done this in the past with its semantics changing when merged to QEMU :-(
Quite understandable. It is also why I was attempting to be as generic (and simple) as possible here and not excessively cast the existing kvm implementation specifics into the exported libvirt option.
I agree that setting up hugetlbfs is out of scope for libvirt. We should just probe to see whether its available or not. We ought to have some way of reporting available hugepages though, both at a host level, and likely per NUMA node too. Without this a mgmt app using libvirt has no clue whether they'll be able to actually use hugepages successfully or not.
Agree. Extracting this host information will be needed when more comprehensive management of the same exists. Still it would seem a case of best-effort enforcement without some sort of additional coordination. The process of gleaning the number of free huge pages from the host and launching of the guest currently has an inherent race. Thanks, -john -- john.cooper@redhat.com

I've incorporated feedback received on the prior version into the patch below. The host mount point for hugetlbfs is queried by default from /proc/mounts unless overridden in qemu.conf via: hugepage_mount = "<path_to_mount_point>" This should make the concern of establishing a mount point path convention a non-issue for the general case while still allowing the same to be deterministically set if needed. Complete disable of hugepage backing can be accomplished by setting the mount point to "". A guest now requests hugepage backing via specifying: <hugepage enable='on'/> in its domain definition xml file. A 'hugepage' element along with an enable attribute is added to the definition, but the element content is open for future use. As before if a guest requests hugepage backing and either the host mount point can't be determined, or the target qemu doesn't support -mem-path, an error is flagged. Signed-off-by: john cooper <john.cooper@redhat.com> --- diff --git a/src/domain_conf.c b/src/domain_conf.c index f3e4c6c..ac328ea 100644 --- a/src/domain_conf.c +++ b/src/domain_conf.c @@ -2369,6 +2369,18 @@ static virDomainDefPtr virDomainDefParseXML(virConnectPtr conn, if (virXPathULong(conn, "string(./currentMemory[1])", ctxt, &def->memory) < 0) def->memory = def->maxmem; + tmp = virXPathString(conn, "string(./hugepage/@enable)", ctxt); + if (!tmp || STREQ(tmp, "off")) + def->hugepage_backed = 0; + else if (STREQ(tmp, "on")) + def->hugepage_backed = 1; + else { + virDomainReportError(conn, VIR_ERR_INTERNAL_ERROR, + _("invalid hugepage mode %s"), tmp); + goto error; + } + VIR_FREE(tmp); + if (virXPathULong(conn, "string(./vcpu[1])", ctxt, &def->vcpus) < 0) def->vcpus = 1; @@ -3933,6 +3945,8 @@ char *virDomainDefFormat(virConnectPtr conn, virBufferVSprintf(&buf, " <memory>%lu</memory>\n", def->maxmem); virBufferVSprintf(&buf, " <currentMemory>%lu</currentMemory>\n", def->memory); + virBufferVSprintf(&buf, " <hugepage enable='%s'/>\n", + def->hugepage_backed ? "on" : "off"); for (n = 0 ; n < def->cpumasklen ; n++) if (def->cpumask[n] != 1) diff --git a/src/domain_conf.h b/src/domain_conf.h index 6e111fa..d6bdcdb 100644 --- a/src/domain_conf.h +++ b/src/domain_conf.h @@ -481,6 +481,7 @@ struct _virDomainDef { unsigned long memory; unsigned long maxmem; + unsigned char hugepage_backed; unsigned long vcpus; int cpumasklen; char *cpumask; diff --git a/src/qemu.conf b/src/qemu.conf index 3009725..9ecac23 100644 --- a/src/qemu.conf +++ b/src/qemu.conf @@ -95,3 +95,13 @@ # The group ID for QEMU processes run by the system instance #group = "root" + +# If provided by the host and a hugetlbfs mount point is configured, +# a guest may request huge page backing. When this mount point is +# unspecified here, determination of a host mount point in /proc/mounts +# will be attempted. Specifying an explicit mount overrides detection +# of the same in /proc/mounts. Setting the mount point to "" will +# disable guest hugepage backing. + +# hugepage_mount = "<path_to_mount_point>" + diff --git a/src/qemu_conf.c b/src/qemu_conf.c index 4043d70..5a3d70c 100644 --- a/src/qemu_conf.c +++ b/src/qemu_conf.c @@ -35,6 +35,7 @@ #include <sys/wait.h> #include <arpa/inet.h> #include <sys/utsname.h> +#include <mntent.h> #include "c-ctype.h" #include "virterror_internal.h" @@ -50,6 +51,8 @@ #include "nodeinfo.h" #include "logging.h" +unsigned char *find_mnt(unsigned char *type); + #define VIR_FROM_THIS VIR_FROM_QEMU VIR_ENUM_DECL(virDomainDiskQEMUBus) @@ -87,6 +90,7 @@ VIR_ENUM_IMPL(qemuVideo, VIR_DOMAIN_VIDEO_TYPE_LAST, NULL, /* no arg needed for xen */ NULL /* don't support vbox */); +#define PROC_MOUNT_BUF_LEN 255 int qemudLoadDriverConfig(struct qemud_driver *driver, const char *filename) { @@ -218,10 +222,41 @@ int qemudLoadDriverConfig(struct qemud_driver *driver, } VIR_FREE(group); + p = virConfGetValue (conf, "hugepage_mount"); + CHECK_TYPE ("hugepage_mount", VIR_CONF_STRING); + if (!p) + driver->hugepage_mount = find_mnt("hugetlbfs"); + else if (!p->str) + driver->hugepage_mount = NULL; + else { + VIR_FREE(driver->hugepage_mount); + if (!(driver->hugepage_mount = strdup(p->str))) { + virReportOOMError(NULL); + virConfFree(conf); + return -1; + } + } + virConfFree (conf); return 0; } +/* search /proc/mounts for mount point of *type; return pointer to + * malloc'ed string of the path if found, otherwise return NULL + */ +unsigned char *find_mnt(unsigned char *type) +{ + FILE *f = setmntent("/proc/mounts", "r"); + unsigned char buf[PROC_MOUNT_BUF_LEN]; + struct mntent mb; + + if (f) + while (getmntent_r(f, &mb, buf, PROC_MOUNT_BUF_LEN)) + if (STREQ(mb.mnt_type, type)) + return (strdup(mb.mnt_dir)); + return (NULL); +} + /* The list of possible machine types for various architectures, as supported by QEMU - taken from 'qemu -M ?' for each arch */ static const char *const arch_info_hvm_x86_machines[] = { @@ -500,6 +535,8 @@ static unsigned int qemudComputeCmdFlags(const char *help, flags |= QEMUD_CMD_FLAG_VGA; if (strstr(help, "boot=on")) flags |= QEMUD_CMD_FLAG_DRIVE_BOOT; + if (strstr(help, "-mem-path")) + flags |= QEMUD_CMD_FLAG_MEM_PATH; if (version >= 9000) flags |= QEMUD_CMD_FLAG_VNC_COLON; @@ -1125,6 +1162,15 @@ int qemudBuildCommandLine(virConnectPtr conn, ADD_ARG_LIT("-no-kvm"); ADD_ARG_LIT("-m"); ADD_ARG_LIT(memory); + if (def->hugepage_backed) { + if (!driver->hugepage_mount || !(qemuCmdFlags & QEMUD_CMD_FLAG_MEM_PATH)) { + qemudReportError(conn, NULL, NULL, VIR_ERR_NO_SUPPORT, + "%s", _("hugepage backing not supported")); + goto error; + } + ADD_ARG_LIT("-mem-path"); + ADD_ARG_LIT(driver->hugepage_mount); + } ADD_ARG_LIT("-smp"); ADD_ARG_LIT(vcpus); diff --git a/src/qemu_conf.h b/src/qemu_conf.h index fbf2ab9..847597f 100644 --- a/src/qemu_conf.h +++ b/src/qemu_conf.h @@ -58,6 +58,7 @@ enum qemud_cmd_flags { QEMUD_CMD_FLAG_KVM = (1 << 13), /* Whether KVM is compiled in */ QEMUD_CMD_FLAG_DRIVE_FORMAT = (1 << 14), /* Is -drive format= avail */ QEMUD_CMD_FLAG_VGA = (1 << 15), /* Is -vga avail */ + QEMUD_CMD_FLAG_MEM_PATH = (1 << 16), /* mmap'ped guest backing supported */ }; /* Main driver state */ @@ -86,6 +87,7 @@ struct qemud_driver { char *vncListen; char *vncPassword; char *vncSASLdir; + char *hugepage_mount; virCapsPtr caps; diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 00dc6e5..bdecf5a 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -638,6 +638,7 @@ qemudShutdown(void) { VIR_FREE(qemu_driver->vncListen); VIR_FREE(qemu_driver->vncPassword); VIR_FREE(qemu_driver->vncSASLdir); + VIR_FREE(qemu_driver->hugepage_mount); /* Free domain callback list */ virDomainEventCallbackListFree(qemu_driver->domainEventCallbacks); -- john.cooper@redhat.com

On Thu, Jul 23, 2009 at 09:00:18PM -0400, john cooper wrote:
I've incorporated feedback received on the prior version into the patch below.
The host mount point for hugetlbfs is queried by default from /proc/mounts unless overridden in qemu.conf via:
hugepage_mount = "<path_to_mount_point>"
This should make the concern of establishing a mount point path convention a non-issue for the general case while still allowing the same to be deterministically set if needed.
In light of what Chris said about extended attribute support for SELinux I think we, sadly, have no choice by to mount a new instance of hugetlbfs per VM, labelled with the context of that VM. The problem is that this doesn't really fit into the internal architecture we have in the slightest. The SELinux support we have is focused around re-labelling existing resources. This hugetlbfs support implies that the SELinux driver is altering our command line arg generator, which is not an easy thing for us to support, given the code flow here. We might have to resort to sick gross hacks.... unless the kernel guys think its easy to add extended attribute support to hugetlbfs in no time at all.
Complete disable of hugepage backing can be accomplished by setting the mount point to "".
A guest now requests hugepage backing via specifying:
<hugepage enable='on'/>
Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, 2009-07-27 at 22:55 +0100, Daniel P. Berrange wrote:
On Thu, Jul 23, 2009 at 09:00:18PM -0400, john cooper wrote:
I've incorporated feedback received on the prior version into the patch below.
The host mount point for hugetlbfs is queried by default from /proc/mounts unless overridden in qemu.conf via:
hugepage_mount = "<path_to_mount_point>"
This should make the concern of establishing a mount point path convention a non-issue for the general case while still allowing the same to be deterministically set if needed.
In light of what Chris said about extended attribute support for SELinux I think we, sadly, have no choice by to mount a new instance of hugetlbfs per VM, labelled with the context of that VM.
I haven't played with hugetlbfs much, so let me make sure I understand the issue: -) With --mempath, qemu creates a file under the supplied directory and mmap()s it, using the mapped region for guest memory -) Because the directory is under hugetlbfs, that memory is huge page backed -) Guest memory can be read by reading this file -) The file is owned by the uid of the qemu process, so any other processes with that uid can access guest memory -) If we could label the file, we could use policy to prevent qemu processes from accessing each other's files -) We can't, so the alternative is to use a separate hugetlbfs mount for each guest, each with a different label - each mount will only contain the file for that guest, and only be accessible by that guest Cheers, Mark.

On Tue, Jul 28, 2009 at 10:55:58AM +0100, Mark McLoughlin wrote:
On Mon, 2009-07-27 at 22:55 +0100, Daniel P. Berrange wrote:
On Thu, Jul 23, 2009 at 09:00:18PM -0400, john cooper wrote:
I've incorporated feedback received on the prior version into the patch below.
The host mount point for hugetlbfs is queried by default from /proc/mounts unless overridden in qemu.conf via:
hugepage_mount = "<path_to_mount_point>"
This should make the concern of establishing a mount point path convention a non-issue for the general case while still allowing the same to be deterministically set if needed.
In light of what Chris said about extended attribute support for SELinux I think we, sadly, have no choice by to mount a new instance of hugetlbfs per VM, labelled with the context of that VM.
I haven't played with hugetlbfs much, so let me make sure I understand the issue:
-) With --mempath, qemu creates a file under the supplied directory and mmap()s it, using the mapped region for guest memory
-) Because the directory is under hugetlbfs, that memory is huge page backed
-) Guest memory can be read by reading this file
-) The file is owned by the uid of the qemu process, so any other processes with that uid can access guest memory
-) If we could label the file, we could use policy to prevent qemu processes from accessing each other's files
-) We can't, so the alternative is to use a separate hugetlbfs mount for each guest, each with a different label - each mount will only contain the file for that guest, and only be accessible by that guest
Yes, that's pretty much the size of it. NB, the file QEMU creates in hugetlbfs is immediately deleted, but you can still access it via /proc/$PID/fd eg $ mount -t hugetlbfs none /mnt/ $ qemu-kvm -m 60 --mem-prealloc --mem-path /mnt/ -cdrom ~berrange/boot.iso $ pgrep qemu-kvm 12441 $ lsof -p 12441 | grep /mnt qemu-kvm 12441 root DEL REG 0,19 1756707 /mnt/kvm.izG73m qemu-kvm 12441 root 8u REG 0,19 85983232 1756707 /mnt/kvm.izG73m (deleted) $ cat > filecon.c <<EOF #include <selinux/selinux.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> int main(int argc, char **argv) { int fd; security_context_t con; fd = open (argv[1], O_RDONLY); fgetfilecon(fd, &con); fprintf(stderr, "Context: %s\n", (char *)con); close(fd); return 0; } $ gcc -lselinux -o filecon filecon.c $ ./filecon /proc/12441/fd/8 Context: system_u:object_r:hugetlbfs_t:s0 We need the latest 's0' be to be different for every VM. Normally this would 'just work' since, if qemu is running under ':s75', then any fle it create would be labelled ':s75' , but hugetlbfs doesn't support setting attributes # touch /mnt/foo # ls -lZ /mnt/foo -rw-r--r--. root root system_u:object_r:hugetlbfs_t:s0 /mnt/foo # chcon system_u:object_r:hugetlbfs_t:s75 /mnt/foo chcon: failed to change context of `/mnt/foo' to `system_u:object_r:hugetlbfs_t:s75': Invalid argument But we can change the context per mount # mount -o remount,context=system_u:object_r:hugetlbfs_t:s75 /mnt/ # ls -lZ /root/foo -rw-r--r--. root root system_u:object_r:hugetlbfs_t:s75 /root/foo Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, 2009-07-27 at 22:55 +0100, Daniel P. Berrange wrote:
On Thu, Jul 23, 2009 at 09:00:18PM -0400, john cooper wrote:
I've incorporated feedback received on the prior version into the patch below.
The host mount point for hugetlbfs is queried by default from /proc/mounts unless overridden in qemu.conf via:
hugepage_mount = "<path_to_mount_point>"
This should make the concern of establishing a mount point path convention a non-issue for the general case while still allowing the same to be deterministically set if needed.
In light of what Chris said about extended attribute support for SELinux I think we, sadly, have no choice by to mount a new instance of hugetlbfs per VM, labelled with the context of that VM. The problem is that this doesn't really fit into the internal architecture we have in the slightest. The SELinux support we have is focused around re-labelling existing resources.
This hugetlbfs support implies that the SELinux driver is altering our command line arg generator, which is not an easy thing for us to support, given the code flow here. We might have to resort to sick gross hacks.... unless the kernel guys think its easy to add extended attribute support to hugetlbfs in no time at all.
There is a vfs fallback for setxattr of the security.* namespace to the security module, which would work for hugetlbfs if not for the fact that policy defines it as a genfscon-labeled filesystem. We only started prohibiting setxattr on genfscon-labeled filesystems in 2.6.30; prior to that we only did that for mountpoint-labeled filesystems. I can actually chcon a file in a hugetlbfs mount on 2.6.29. To convert hugetlbfs to fully support labeling we'd need hugetlbfs_mknod() to call security_inode_init_security() to set up new inode security labels, just like shmem_mknod() does for tmpfs. And then we'd need to switch over the policy from genfscon to fs_use_trans. -- Stephen Smalley National Security Agency

On Tue, Jul 28, 2009 at 08:04:31AM -0400, Stephen Smalley wrote:
On Mon, 2009-07-27 at 22:55 +0100, Daniel P. Berrange wrote:
In light of what Chris said about extended attribute support for SELinux I think we, sadly, have no choice by to mount a new instance of hugetlbfs per VM, labelled with the context of that VM. The problem is that this doesn't really fit into the internal architecture we have in the slightest. The SELinux support we have is focused around re-labelling existing resources.
This hugetlbfs support implies that the SELinux driver is altering our command line arg generator, which is not an easy thing for us to support, given the code flow here. We might have to resort to sick gross hacks.... unless the kernel guys think its easy to add extended attribute support to hugetlbfs in no time at all.
There is a vfs fallback for setxattr of the security.* namespace to the security module, which would work for hugetlbfs if not for the fact that policy defines it as a genfscon-labeled filesystem. We only started prohibiting setxattr on genfscon-labeled filesystems in 2.6.30; prior to that we only did that for mountpoint-labeled filesystems. I can actually chcon a file in a hugetlbfs mount on 2.6.29.
Ahh, I can get that to work too on 2.6.29, I had previously been testing 2.6.30 :-)
To convert hugetlbfs to fully support labeling we'd need hugetlbfs_mknod() to call security_inode_init_security() to set up new inode security labels, just like shmem_mknod() does for tmpfs. And then we'd need to switch over the policy from genfscon to fs_use_trans.
This sounds like a preferrable plan to me - avoids having to have 100s, if not 1000s, of isntances of hugetlbfs mounted on large machines, then John's latest patch for libvirt would pretty much be sufficient. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, Aug 05, 2009 at 03:50:56PM +0100, Daniel P. Berrange wrote:
On Tue, Jul 28, 2009 at 08:04:31AM -0400, Stephen Smalley wrote:
On Mon, 2009-07-27 at 22:55 +0100, Daniel P. Berrange wrote:
In light of what Chris said about extended attribute support for SELinux I think we, sadly, have no choice by to mount a new instance of hugetlbfs per VM, labelled with the context of that VM. The problem is that this doesn't really fit into the internal architecture we have in the slightest. The SELinux support we have is focused around re-labelling existing resources.
This hugetlbfs support implies that the SELinux driver is altering our command line arg generator, which is not an easy thing for us to support, given the code flow here. We might have to resort to sick gross hacks.... unless the kernel guys think its easy to add extended attribute support to hugetlbfs in no time at all.
There is a vfs fallback for setxattr of the security.* namespace to the security module, which would work for hugetlbfs if not for the fact that policy defines it as a genfscon-labeled filesystem. We only started prohibiting setxattr on genfscon-labeled filesystems in 2.6.30; prior to that we only did that for mountpoint-labeled filesystems. I can actually chcon a file in a hugetlbfs mount on 2.6.29.
Ahh, I can get that to work too on 2.6.29, I had previously been testing 2.6.30 :-)
To convert hugetlbfs to fully support labeling we'd need hugetlbfs_mknod() to call security_inode_init_security() to set up new inode security labels, just like shmem_mknod() does for tmpfs. And then we'd need to switch over the policy from genfscon to fs_use_trans.
This sounds like a preferrable plan to me - avoids having to have 100s, if not 1000s, of isntances of hugetlbfs mounted on large machines, then John's latest patch for libvirt would pretty much be sufficient.
FYI, for those interested, we're going to try & fix the kernel support as part of Fedora 12 work. If you want to track progress, the BZ is here https://bugzilla.redhat.com/show_bug.cgi?id=515741 Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, Jul 22, 2009 at 10:57:30PM +0100, Mark McLoughlin wrote:
* src/qemu_conf.h: export qemudNetworkIfaceConnect()
* src/qemu_conf.c: move vnet_hdr logic into qemudNetworkIfaceConnect() since we need it for hotplug too
Okay just some refactoring, ACK -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

With hotplug, we're going to want to pass a tapfd name rather than an actual file descriptor, so prepare the way by passing a string tapfd to qemuBuildHostNetStr(). * src/qemu_conf.h: qemuBuildHostNetStr() takes a string tapfd now * src/qemu_conf.c: pass qemuBuildHostNetStr() a string rather than an actual file descriptor * src/qemu_driver.c: update qemudDomainAttachNetDevice() for change --- src/qemu_conf.c | 17 ++++++++++++----- src/qemu_conf.h | 2 +- src/qemu_driver.c | 2 +- 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/src/qemu_conf.c b/src/qemu_conf.c index ed77153..e875435 100644 --- a/src/qemu_conf.c +++ b/src/qemu_conf.c @@ -910,13 +910,13 @@ qemuBuildHostNetStr(virConnectPtr conn, const char *prefix, char type_sep, int vlan, - int tapfd, + const char *tapfd, char **str) { switch (net->type) { case VIR_DOMAIN_NET_TYPE_NETWORK: case VIR_DOMAIN_NET_TYPE_BRIDGE: - if (virAsprintf(str, "%stap%cfd=%d,vlan=%d%s%s", + if (virAsprintf(str, "%stap%cfd=%s,vlan=%d%s%s", prefix ? prefix : "", type_sep, tapfd, vlan, (net->hostnet_name ? ",name=" : ""), @@ -1534,7 +1534,7 @@ int qemudBuildCommandLine(virConnectPtr conn, for (i = 0 ; i < def->nnets ; i++) { virDomainNetDefPtr net = def->nets[i]; char *nic, *host; - int tapfd = -1; + char *tapfd_name = NULL; net->vlan = i; @@ -1550,7 +1550,7 @@ int qemudBuildCommandLine(virConnectPtr conn, if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK || net->type == VIR_DOMAIN_NET_TYPE_BRIDGE) { - tapfd = qemudNetworkIfaceConnect(conn, driver, net, qemuCmdFlags); + int tapfd = qemudNetworkIfaceConnect(conn, driver, net, qemuCmdFlags); if (tapfd < 0) goto error; @@ -1560,14 +1560,21 @@ int qemudBuildCommandLine(virConnectPtr conn, } (*tapfds)[(*ntapfds)++] = tapfd; + + if (virAsprintf(&tapfd_name, "%d", tapfd) < 0) + goto no_memory; } if (qemuBuildHostNetStr(conn, net, NULL, ',', - net->vlan, tapfd, &host) < 0) + net->vlan, tapfd_name, &host) < 0) { + VIR_FREE(tapfd_name); goto error; + } ADD_ARG_LIT("-net"); ADD_ARG(host); + + VIR_FREE(tapfd_name); } } diff --git a/src/qemu_conf.h b/src/qemu_conf.h index 8ad5f94..8150a54 100644 --- a/src/qemu_conf.h +++ b/src/qemu_conf.h @@ -150,7 +150,7 @@ int qemuBuildHostNetStr (virConnectPtr conn, const char *prefix, char type_sep, int vlan, - int tapfd, + const char *tapfd, char **str); int qemuBuildNicStr (virConnectPtr conn, diff --git a/src/qemu_driver.c b/src/qemu_driver.c index c4683ae..7f4e2a1 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -4587,7 +4587,7 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, net->vlan = vm->def->nets[i]->vlan; if (qemuBuildHostNetStr(conn, net, - "host_net_add ", ' ', net->vlan, -1, &cmd) < 0) + "host_net_add ", ' ', net->vlan, NULL, &cmd) < 0) return -1; remove_cmd = NULL; -- 1.6.2.5

On Wed, Jul 22, 2009 at 10:57:31PM +0100, Mark McLoughlin wrote:
With hotplug, we're going to want to pass a tapfd name rather than an actual file descriptor, so prepare the way by passing a string tapfd to qemuBuildHostNetStr().
* src/qemu_conf.h: qemuBuildHostNetStr() takes a string tapfd now
* src/qemu_conf.c: pass qemuBuildHostNetStr() a string rather than an actual file descriptor
* src/qemu_driver.c: update qemudDomainAttachNetDevice() for change
Okay looks simple, ACK Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

In subsequent patches we're going to have a file descriptor to close too, so centralize the error handling cleanups to make things easier. * src/qemu_conf.c: in qemudDomainAttachNetDevice() consolidate the error handling cleanups together --- src/qemu_driver.c | 37 +++++++++++++++---------------------- 1 files changed, 15 insertions(+), 22 deletions(-) diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 7f4e2a1..4f0c60e 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -4549,7 +4549,7 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, unsigned int qemuCmdFlags) { virDomainNetDefPtr net = dev->data.net; - char *cmd, *reply, *remove_cmd; + char *cmd = NULL, *reply = NULL, *remove_cmd = NULL; int i; unsigned domain, bus, slot; @@ -4567,16 +4567,12 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, return -1; } - if (VIR_REALLOC_N(vm->def->nets, vm->def->nnets+1) < 0) { - virReportOOMError(conn); - return -1; - } + if (VIR_REALLOC_N(vm->def->nets, vm->def->nnets+1) < 0) + goto no_memory; if ((qemuCmdFlags & QEMUD_CMD_FLAG_NET_NAME) && - qemuAssignNetNames(vm->def, net) < 0) { - virReportOOMError(conn); - return -1; - } + qemuAssignNetNames(vm->def, net) < 0) + goto no_memory; /* Choose a vlan value greater than all other values since * older versions did not store the value in the state file. @@ -4588,23 +4584,19 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, if (qemuBuildHostNetStr(conn, net, "host_net_add ", ' ', net->vlan, NULL, &cmd) < 0) - return -1; + goto cleanup; remove_cmd = NULL; if (net->vlan >= 0 && net->hostnet_name && virAsprintf(&remove_cmd, "host_net_remove %d %s", net->vlan, net->hostnet_name) < 0) { - VIR_FREE(cmd); - virReportOOMError(conn); - return -1; + goto no_memory; } if (qemudMonitorCommand(vm, cmd, &reply) < 0) { qemudReportError(conn, dom, NULL, VIR_ERR_OPERATION_FAILED, _("failed to add network backend with '%s'"), cmd); - VIR_FREE(remove_cmd); - VIR_FREE(cmd); - return -1; + goto cleanup; } DEBUG("%s: host_net_add reply: %s", vm->def->name, reply); @@ -4619,19 +4611,16 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, if (qemudMonitorCommand(vm, cmd, &reply) < 0) { qemudReportError(conn, dom, NULL, VIR_ERR_OPERATION_FAILED, _("failed to add NIC with '%s'"), cmd); - VIR_FREE(cmd); goto try_remove; } - VIR_FREE(cmd); - if (qemudParsePciAddReply(vm, reply, &domain, &bus, &slot) < 0) { qemudReportError(conn, dom, NULL, VIR_ERR_OPERATION_FAILED, _("parsing pci_add reply failed: %s"), reply); - VIR_FREE(reply); goto try_remove; } + VIR_FREE(cmd); VIR_FREE(reply); VIR_FREE(remove_cmd); @@ -4644,7 +4633,7 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, return 0; try_remove: - reply = NULL; + VIR_FREE(reply); if (!remove_cmd) VIR_WARN0(_("Unable to remove network backend\n")); @@ -4652,10 +4641,14 @@ try_remove: VIR_WARN(_("Failed to remove network backend with '%s'\n"), remove_cmd); else VIR_DEBUG("%s: host_net_remove reply: %s\n", vm->def->name, reply); + goto cleanup; +no_memory: + virReportOOMError(conn); +cleanup: + VIR_FREE(cmd); VIR_FREE(reply); VIR_FREE(remove_cmd); - return -1; } -- 1.6.2.5

On Wed, Jul 22, 2009 at 10:57:32PM +0100, Mark McLoughlin wrote:
In subsequent patches we're going to have a file descriptor to close too, so centralize the error handling cleanups to make things easier.
* src/qemu_conf.c: in qemudDomainAttachNetDevice() consolidate the error handling cleanups together
Okay, having a more standard error handling mechanism with centralized exits looks better, ACK Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Add a little helper function to write the monitor command followed by carriage return in a single write. This doesn't make any real difference, but allows us to more easily switch to using sendmsg() when using the monitor over a unix socket. * src/qemu_conf.c: split qemudMonitorSend() out --- src/qemu_driver.c | 31 +++++++++++++++++++++++-------- 1 files changed, 23 insertions(+), 8 deletions(-) diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 4f0c60e..7d4de9e 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -2012,6 +2012,27 @@ qemuMonitorDiscardPendingData(virDomainObjPtr vm) { } while (ret > 0); } +static int +qemudMonitorSend(const virDomainObjPtr vm, + const char *cmd) +{ + char *full; + size_t len; + int ret = -1; + + if (virAsprintf(&full, "%s\r", cmd) < 0) + return -1; + + len = strlen(full); + + if (safewrite(vm->monitor, full, len) != len) + goto out; + + ret = 0; +out: + VIR_FREE(full); + return ret; +} static int qemudMonitorCommandExtra(const virDomainObjPtr vm, @@ -2021,14 +2042,10 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, char **reply) { int size = 0; char *buf = NULL; - size_t cmdlen = strlen(cmd); - size_t extralen = extra ? strlen(extra) : 0; qemuMonitorDiscardPendingData(vm); - if (safewrite(vm->monitor, cmd, cmdlen) != cmdlen) - return -1; - if (safewrite(vm->monitor, "\r", 1) != 1) + if (qemudMonitorSend(vm, cmd) < 0) return -1; *reply = NULL; @@ -2063,9 +2080,7 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, if (buf) { if (extra) { if (strstr(buf, extraPrompt) != NULL) { - if (safewrite(vm->monitor, extra, extralen) != extralen) - return -1; - if (safewrite(vm->monitor, "\r", 1) != 1) + if (qemudMonitorSend(vm, extra) < 0) return -1; extra = NULL; } -- 1.6.2.5

On Wed, Jul 22, 2009 at 10:57:33PM +0100, Mark McLoughlin wrote:
Add a little helper function to write the monitor command followed by carriage return in a single write.
This doesn't make any real difference, but allows us to more easily switch to using sendmsg() when using the monitor over a unix socket. [...] +static int +qemudMonitorSend(const virDomainObjPtr vm, + const char *cmd) +{ + char *full; + size_t len; + int ret = -1; + + if (virAsprintf(&full, "%s\r", cmd) < 0) + return -1;
Hum, we introduce an allocation here, and if it fails we really should call the OOM errror centralized routine Once done looks fine to me, ACK, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Mon, 2009-07-27 at 10:50 +0200, Daniel Veillard wrote:
On Wed, Jul 22, 2009 at 10:57:33PM +0100, Mark McLoughlin wrote:
Add a little helper function to write the monitor command followed by carriage return in a single write.
This doesn't make any real difference, but allows us to more easily switch to using sendmsg() when using the monitor over a unix socket. [...] +static int +qemudMonitorSend(const virDomainObjPtr vm, + const char *cmd) +{ + char *full; + size_t len; + int ret = -1; + + if (virAsprintf(&full, "%s\r", cmd) < 0) + return -1;
Hum, we introduce an allocation here, and if it fails we really should call the OOM errror centralized routine
The other error cases in this function don't set an error and every call to qemudMonitorCommand() sets an error if it fails. Cheers, Mark.

On Wed, Jul 22, 2009 at 10:57:33PM +0100, Mark McLoughlin wrote:
Add a little helper function to write the monitor command followed by carriage return in a single write.
This doesn't make any real difference, but allows us to more easily switch to using sendmsg() when using the monitor over a unix socket.
* src/qemu_conf.c: split qemudMonitorSend() out --- src/qemu_driver.c | 31 +++++++++++++++++++++++-------- 1 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 4f0c60e..7d4de9e 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -2012,6 +2012,27 @@ qemuMonitorDiscardPendingData(virDomainObjPtr vm) { } while (ret > 0); }
+static int +qemudMonitorSend(const virDomainObjPtr vm, + const char *cmd) +{ + char *full; + size_t len; + int ret = -1; + + if (virAsprintf(&full, "%s\r", cmd) < 0) + return -1; + + len = strlen(full); + + if (safewrite(vm->monitor, full, len) != len) + goto out; + + ret = 0; +out: + VIR_FREE(full); + return ret; +}
It would seem easier to just keep the original model of two safewrite() calls, than to introduce an extra allocation & free here. Not a mustfix for comitting, but it'd be nicer.
static int qemudMonitorCommandExtra(const virDomainObjPtr vm, @@ -2021,14 +2042,10 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, char **reply) { int size = 0; char *buf = NULL; - size_t cmdlen = strlen(cmd); - size_t extralen = extra ? strlen(extra) : 0;
qemuMonitorDiscardPendingData(vm);
- if (safewrite(vm->monitor, cmd, cmdlen) != cmdlen) - return -1; - if (safewrite(vm->monitor, "\r", 1) != 1) + if (qemudMonitorSend(vm, cmd) < 0) return -1;
*reply = NULL; @@ -2063,9 +2080,7 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, if (buf) { if (extra) { if (strstr(buf, extraPrompt) != NULL) { - if (safewrite(vm->monitor, extra, extralen) != extralen) - return -1; - if (safewrite(vm->monitor, "\r", 1) != 1) + if (qemudMonitorSend(vm, extra) < 0) return -1; extra = NULL; }
ACK, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Switch from using write() to using sendmsg() on QEMU's monitor socket so that we can add support for SCM_RIGHTS. * src/qemu_driver.c: add sendmsg() based qemudMonitorSendUnix() and use it when the monitor fd is a unix socket --- src/qemu_driver.c | 37 +++++++++++++++++++++++++++++++++++-- 1 files changed, 35 insertions(+), 2 deletions(-) diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 7d4de9e..e6e6786 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -2013,6 +2013,30 @@ qemuMonitorDiscardPendingData(virDomainObjPtr vm) { } static int +qemudMonitorSendUnix(const virDomainObjPtr vm, + const char *cmd, + size_t cmdlen) +{ + struct msghdr msg; + struct iovec iov[1]; + ssize_t ret; + + memset(&msg, 0, sizeof(msg)); + + iov[0].iov_base = (void *)cmd; + iov[0].iov_len = cmdlen; + + msg.msg_iov = iov; + msg.msg_iovlen = 1; + + do { + ret = sendmsg(vm->monitor, &msg, 0); + } while (ret < 0 && errno == EINTR); + + return ret == cmdlen ? 0 : -1; +} + +static int qemudMonitorSend(const virDomainObjPtr vm, const char *cmd) { @@ -2025,8 +2049,17 @@ qemudMonitorSend(const virDomainObjPtr vm, len = strlen(full); - if (safewrite(vm->monitor, full, len) != len) - goto out; + switch (vm->monitor_chr->type) { + case VIR_DOMAIN_CHR_TYPE_UNIX: + if (qemudMonitorSendUnix(vm, full, len) < 0) + goto out; + break; + default: + case VIR_DOMAIN_CHR_TYPE_PTY: + if (safewrite(vm->monitor, full, len) != len) + goto out; + break; + } ret = 0; out: -- 1.6.2.5

On Wed, Jul 22, 2009 at 10:57:34PM +0100, Mark McLoughlin wrote:
Switch from using write() to using sendmsg() on QEMU's monitor socket so that we can add support for SCM_RIGHTS.
* src/qemu_driver.c: add sendmsg() based qemudMonitorSendUnix() and use it when the monitor fd is a unix socket
Okay this makes sense only once looking at the next patch where the extra information can only be passed as part of a struct iovec ACK, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Wed, Jul 22, 2009 at 10:57:34PM +0100, Mark McLoughlin wrote:
Switch from using write() to using sendmsg() on QEMU's monitor socket so that we can add support for SCM_RIGHTS.
* src/qemu_driver.c: add sendmsg() based qemudMonitorSendUnix() and use it when the monitor fd is a unix socket --- src/qemu_driver.c | 37 +++++++++++++++++++++++++++++++++++-- 1 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 7d4de9e..e6e6786 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -2013,6 +2013,30 @@ qemuMonitorDiscardPendingData(virDomainObjPtr vm) { }
static int +qemudMonitorSendUnix(const virDomainObjPtr vm, + const char *cmd, + size_t cmdlen) +{ + struct msghdr msg; + struct iovec iov[1]; + ssize_t ret; + + memset(&msg, 0, sizeof(msg)); + + iov[0].iov_base = (void *)cmd; + iov[0].iov_len = cmdlen; + + msg.msg_iov = iov; + msg.msg_iovlen = 1; + + do { + ret = sendmsg(vm->monitor, &msg, 0); + } while (ret < 0 && errno == EINTR); + + return ret == cmdlen ? 0 : -1; +} + +static int qemudMonitorSend(const virDomainObjPtr vm, const char *cmd) { @@ -2025,8 +2049,17 @@ qemudMonitorSend(const virDomainObjPtr vm,
len = strlen(full);
- if (safewrite(vm->monitor, full, len) != len) - goto out; + switch (vm->monitor_chr->type) { + case VIR_DOMAIN_CHR_TYPE_UNIX: + if (qemudMonitorSendUnix(vm, full, len) < 0) + goto out; + break; + default: + case VIR_DOMAIN_CHR_TYPE_PTY: + if (safewrite(vm->monitor, full, len) != len) + goto out; + break; + }
Oooooh, of course this is why you added the allocation instead of doing two safewrite() calls. In which case, ACK to this & previous patch Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Add qemudMonitorCommandWithFd() which allows a file descriptor to be sent to qemu over a unix monitor socket using SCM_RIGHTS. See the unix(7) and cmsg(3) man pages. * src/qemu_conf.c: add a scm_fd param to qemudMonitorCommandExtra(), add qemudMonitorCommandWithFd(), implement SCM_RIGHTS support in qemudMonitorSendUnix() --- src/qemu_driver.c | 46 +++++++++++++++++++++++++++++++++++++--------- 1 files changed, 37 insertions(+), 9 deletions(-) diff --git a/src/qemu_driver.c b/src/qemu_driver.c index e6e6786..a7861be 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -111,10 +111,15 @@ static int qemudDomainGetMaxVcpus(virDomainPtr dom); static int qemudMonitorCommand(const virDomainObjPtr vm, const char *cmd, char **reply); +static int qemudMonitorCommandWithFd(const virDomainObjPtr vm, + const char *cmd, + int scm_fd, + char **reply); static int qemudMonitorCommandExtra(const virDomainObjPtr vm, const char *cmd, const char *extra, const char *extraPrompt, + int scm_fd, char **reply); static int qemudDomainSetMemoryBalloon(virConnectPtr conn, virDomainObjPtr vm, @@ -1256,7 +1261,7 @@ qemudInitPasswords(virConnectPtr conn, vm->def->graphics[0]->data.vnc.passwd : driver->vncPassword, QEMU_PASSWD_PROMPT, - &info) < 0) { + -1, &info) < 0) { qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR, "%s", _("setting VNC password failed")); return -1; @@ -2015,7 +2020,8 @@ qemuMonitorDiscardPendingData(virDomainObjPtr vm) { static int qemudMonitorSendUnix(const virDomainObjPtr vm, const char *cmd, - size_t cmdlen) + size_t cmdlen, + int scm_fd) { struct msghdr msg; struct iovec iov[1]; @@ -2029,6 +2035,20 @@ qemudMonitorSendUnix(const virDomainObjPtr vm, msg.msg_iov = iov; msg.msg_iovlen = 1; + if (scm_fd != -1) { + char control[CMSG_SPACE(sizeof(int))]; + struct cmsghdr *cmsg; + + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_len = CMSG_LEN(sizeof(int)); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + memcpy(CMSG_DATA(cmsg), &scm_fd, sizeof(int)); + } + do { ret = sendmsg(vm->monitor, &msg, 0); } while (ret < 0 && errno == EINTR); @@ -2038,7 +2058,8 @@ qemudMonitorSendUnix(const virDomainObjPtr vm, static int qemudMonitorSend(const virDomainObjPtr vm, - const char *cmd) + const char *cmd, + int scm_fd) { char *full; size_t len; @@ -2051,7 +2072,7 @@ qemudMonitorSend(const virDomainObjPtr vm, switch (vm->monitor_chr->type) { case VIR_DOMAIN_CHR_TYPE_UNIX: - if (qemudMonitorSendUnix(vm, full, len) < 0) + if (qemudMonitorSendUnix(vm, full, len, scm_fd) < 0) goto out; break; default: @@ -2072,13 +2093,14 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, const char *cmd, const char *extra, const char *extraPrompt, + int scm_fd, char **reply) { int size = 0; char *buf = NULL; qemuMonitorDiscardPendingData(vm); - if (qemudMonitorSend(vm, cmd) < 0) + if (qemudMonitorSend(vm, cmd, scm_fd) < 0) return -1; *reply = NULL; @@ -2113,7 +2135,7 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, if (buf) { if (extra) { if (strstr(buf, extraPrompt) != NULL) { - if (qemudMonitorSend(vm, extra) < 0) + if (qemudMonitorSend(vm, extra, -1) < 0) return -1; extra = NULL; } @@ -2154,14 +2176,20 @@ qemudMonitorCommandExtra(const virDomainObjPtr vm, } static int +qemudMonitorCommandWithFd(const virDomainObjPtr vm, + const char *cmd, + int scm_fd, + char **reply) { + return qemudMonitorCommandExtra(vm, cmd, NULL, NULL, scm_fd, reply); +} + +static int qemudMonitorCommand(const virDomainObjPtr vm, const char *cmd, char **reply) { - return qemudMonitorCommandExtra(vm, cmd, NULL, NULL, reply); + return qemudMonitorCommandWithFd(vm, cmd, -1, reply); } - - static virDrvOpenStatus qemudOpen(virConnectPtr conn, virConnectAuthPtr auth ATTRIBUTE_UNUSED, int flags ATTRIBUTE_UNUSED) { -- 1.6.2.5

On Wed, Jul 22, 2009 at 10:57:35PM +0100, Mark McLoughlin wrote:
Add qemudMonitorCommandWithFd() which allows a file descriptor to be sent to qemu over a unix monitor socket using SCM_RIGHTS. See the unix(7) and cmsg(3) man pages.
* src/qemu_conf.c: add a scm_fd param to qemudMonitorCommandExtra(), add qemudMonitorCommandWithFd(), implement SCM_RIGHTS support in qemudMonitorSendUnix()
ACK, that's the logical next step from previous patch, apparently SCM_RIGHTS has been around forever, so this should not break on older kernel I assume. Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Mon, Jul 27, 2009 at 10:59:47AM +0200, Daniel Veillard wrote:
On Wed, Jul 22, 2009 at 10:57:35PM +0100, Mark McLoughlin wrote:
Add qemudMonitorCommandWithFd() which allows a file descriptor to be sent to qemu over a unix monitor socket using SCM_RIGHTS. See the unix(7) and cmsg(3) man pages.
* src/qemu_conf.c: add a scm_fd param to qemudMonitorCommandExtra(), add qemudMonitorCommandWithFd(), implement SCM_RIGHTS support in qemudMonitorSendUnix()
ACK, that's the logical next step from previous patch, apparently SCM_RIGHTS has been around forever, so this should not break on older kernel I assume.
It won't break on any kernel we care about in the slightest ;-P I'm sure other things will break first ! Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

In order to hotplug a network/bridge backed NIC, we need to first create the tap file descriptor, add the tap interface to the bridge and then pass the file descriptor to the qemu process using the 'getfd' monitor command. Once the tapfd has been accepted, we create the network backend using host_net_add, supplying the name assigned to the tapfd. If this fails, we need to close the tapfd in qemu using the 'closefd' monitor command. * src/qemu_driver.c: add support for tapfd based hotplug in qemudDomainAttachNetDevice() --- src/qemu_driver.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-------- 1 files changed, 62 insertions(+), 11 deletions(-) diff --git a/src/qemu_driver.c b/src/qemu_driver.c index a7861be..23cbd8f 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -4620,13 +4620,15 @@ static int qemudDomainAttachUsbMassstorageDevice(virConnectPtr conn, } static int qemudDomainAttachNetDevice(virConnectPtr conn, + struct qemud_driver *driver, virDomainObjPtr vm, virDomainDeviceDefPtr dev, unsigned int qemuCmdFlags) { virDomainNetDefPtr net = dev->data.net; char *cmd = NULL, *reply = NULL, *remove_cmd = NULL; - int i; + char *tapfd_name = NULL, *tapfd_close = NULL; + int i, tapfd = -1; unsigned domain, bus, slot; if (!(qemuCmdFlags & QEMUD_CMD_FLAG_HOST_NET_ADD)) { @@ -4637,10 +4639,16 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, if (net->type == VIR_DOMAIN_NET_TYPE_BRIDGE || net->type == VIR_DOMAIN_NET_TYPE_NETWORK) { - qemudReportError(conn, dom, NULL, VIR_ERR_NO_SUPPORT, - _("network device type '%s' cannot be attached"), - virDomainNetTypeToString(net->type)); - return -1; + if (vm->monitor_chr->type != VIR_DOMAIN_CHR_TYPE_UNIX) { + qemudReportError(conn, dom, NULL, VIR_ERR_NO_SUPPORT, + _("network device type '%s' cannot be attached: " + "qemu is not using a unix socket monitor"), + virDomainNetTypeToString(net->type)); + return -1; + } + + if ((tapfd = qemudNetworkIfaceConnect(conn, driver, net, qemuCmdFlags)) < 0) + return -1; } if (VIR_REALLOC_N(vm->def->nets, vm->def->nnets+1) < 0) @@ -4658,27 +4666,55 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, if (vm->def->nets[i]->vlan >= net->vlan) net->vlan = vm->def->nets[i]->vlan; - if (qemuBuildHostNetStr(conn, net, - "host_net_add ", ' ', net->vlan, NULL, &cmd) < 0) - goto cleanup; + if (tapfd != -1) { + if (virAsprintf(&tapfd_name, "fd-%s", net->hostnet_name) < 0) + goto no_memory; + + if (virAsprintf(&tapfd_close, "closefd %s", tapfd_name) < 0) + goto no_memory; + + if (virAsprintf(&cmd, "getfd %s", tapfd_name) < 0) + goto no_memory; + + if (qemudMonitorCommandWithFd(vm, cmd, tapfd, &reply) < 0) { + qemudReportError(conn, dom, NULL, VIR_ERR_OPERATION_FAILED, + _("failed to pass fd to qemu with '%s'"), cmd); + goto cleanup; + } + + DEBUG("%s: getfd reply: %s", vm->def->name, reply); + + VIR_FREE(reply); + VIR_FREE(cmd); + } + + if (qemuBuildHostNetStr(conn, net, "host_net_add ", ' ', + net->vlan, tapfd_name, &cmd) < 0) + goto try_tapfd_close; remove_cmd = NULL; if (net->vlan >= 0 && net->hostnet_name && virAsprintf(&remove_cmd, "host_net_remove %d %s", net->vlan, net->hostnet_name) < 0) { - goto no_memory; + virReportOOMError(conn); + goto try_tapfd_close; } if (qemudMonitorCommand(vm, cmd, &reply) < 0) { qemudReportError(conn, dom, NULL, VIR_ERR_OPERATION_FAILED, _("failed to add network backend with '%s'"), cmd); - goto cleanup; + goto try_tapfd_close; } DEBUG("%s: host_net_add reply: %s", vm->def->name, reply); VIR_FREE(reply); VIR_FREE(cmd); + VIR_FREE(tapfd_name); + VIR_FREE(tapfd_close); + if (tapfd != -1) + close(tapfd); + tapfd = -1; if (qemuBuildNicStr(conn, net, "pci_add pci_addr=auto ", ' ', net->vlan, &cmd) < 0) @@ -4719,12 +4755,27 @@ try_remove: VIR_DEBUG("%s: host_net_remove reply: %s\n", vm->def->name, reply); goto cleanup; +try_tapfd_close: + VIR_FREE(reply); + + if (tapfd_close) { + if (qemudMonitorCommand(vm, tapfd_close, &reply) < 0) + VIR_WARN(_("Failed to close tapfd with '%s'\n"), tapfd_close); + else + VIR_DEBUG("%s: closefd: %s\n", vm->def->name, reply); + } + goto cleanup; + no_memory: virReportOOMError(conn); cleanup: VIR_FREE(cmd); VIR_FREE(reply); VIR_FREE(remove_cmd); + VIR_FREE(tapfd_close); + VIR_FREE(tapfd_name); + if (tapfd != -1) + close(tapfd); return -1; } @@ -4854,7 +4905,7 @@ static int qemudDomainAttachDevice(virDomainPtr dom, goto cleanup; } } else if (dev->type == VIR_DOMAIN_DEVICE_NET) { - ret = qemudDomainAttachNetDevice(dom->conn, vm, dev, qemuCmdFlags); + ret = qemudDomainAttachNetDevice(dom->conn, driver, vm, dev, qemuCmdFlags); } else if (dev->type == VIR_DOMAIN_DEVICE_HOSTDEV && dev->data.hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && dev->data.hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB) { -- 1.6.2.5

On Wed, Jul 22, 2009 at 10:57:36PM +0100, Mark McLoughlin wrote:
In order to hotplug a network/bridge backed NIC, we need to first create the tap file descriptor, add the tap interface to the bridge and then pass the file descriptor to the qemu process using the 'getfd' monitor command.
Once the tapfd has been accepted, we create the network backend using host_net_add, supplying the name assigned to the tapfd. If this fails, we need to close the tapfd in qemu using the 'closefd' monitor command.
* src/qemu_driver.c: add support for tapfd based hotplug in qemudDomainAttachNetDevice() [...] + if (tapfd != -1) { + if (virAsprintf(&tapfd_name, "fd-%s", net->hostnet_name) < 0) + goto no_memory; + + if (virAsprintf(&tapfd_close, "closefd %s", tapfd_name) < 0) + goto no_memory;
[...]
+try_tapfd_close: + VIR_FREE(reply); + + if (tapfd_close) { + if (qemudMonitorCommand(vm, tapfd_close, &reply) < 0) + VIR_WARN(_("Failed to close tapfd with '%s'\n"), tapfd_close); + else + VIR_DEBUG("%s: closefd: %s\n", vm->def->name, reply); + } + goto cleanup;
This makes sense except I didn't expect tapfd_close which seems used only in an error path to be allocated on the main branch (assuming a tapfd). Otherwise looks clean, ACK Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Mon, 2009-07-27 at 11:04 +0200, Daniel Veillard wrote:
On Wed, Jul 22, 2009 at 10:57:36PM +0100, Mark McLoughlin wrote:
In order to hotplug a network/bridge backed NIC, we need to first create the tap file descriptor, add the tap interface to the bridge and then pass the file descriptor to the qemu process using the 'getfd' monitor command.
Once the tapfd has been accepted, we create the network backend using host_net_add, supplying the name assigned to the tapfd. If this fails, we need to close the tapfd in qemu using the 'closefd' monitor command.
* src/qemu_driver.c: add support for tapfd based hotplug in qemudDomainAttachNetDevice() [...] + if (tapfd != -1) { + if (virAsprintf(&tapfd_name, "fd-%s", net->hostnet_name) < 0) + goto no_memory; + + if (virAsprintf(&tapfd_close, "closefd %s", tapfd_name) < 0) + goto no_memory;
[...]
+try_tapfd_close: + VIR_FREE(reply); + + if (tapfd_close) { + if (qemudMonitorCommand(vm, tapfd_close, &reply) < 0) + VIR_WARN(_("Failed to close tapfd with '%s'\n"), tapfd_close); + else + VIR_DEBUG("%s: closefd: %s\n", vm->def->name, reply); + } + goto cleanup;
This makes sense except I didn't expect tapfd_close which seems used only in an error path to be allocated on the main branch (assuming a tapfd).
Right, it's the same idea as the way we pre-allocate the "host_net_remove" monitor command - by allocating upfront, you avoid the potential for OOM failure in the error path. e.g. if the pci_add monitor command fails because of OOM, we're unlikely to be then able to format the closefd monitor command to clean up after ourselves. cheers, Mark.

On Wed, Jul 22, 2009 at 10:57:36PM +0100, Mark McLoughlin wrote:
@@ -4637,10 +4639,16 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn,
if (net->type == VIR_DOMAIN_NET_TYPE_BRIDGE || net->type == VIR_DOMAIN_NET_TYPE_NETWORK) { - qemudReportError(conn, dom, NULL, VIR_ERR_NO_SUPPORT, - _("network device type '%s' cannot be attached"), - virDomainNetTypeToString(net->type)); - return -1; + if (vm->monitor_chr->type != VIR_DOMAIN_CHR_TYPE_UNIX) { + qemudReportError(conn, dom, NULL, VIR_ERR_NO_SUPPORT, + _("network device type '%s' cannot be attached: " + "qemu is not using a unix socket monitor"), + virDomainNetTypeToString(net->type)); + return -1; + } + + if ((tapfd = qemudNetworkIfaceConnect(conn, driver, net, qemuCmdFlags)) < 0) + return -1; }
Obviously this code is supporting upgrades frm 0.6.5 -> 0.7.0 where libvirtd restarts with VMs already running using a PTY. We should decide on how many versions we wish to support upgrades with live VMs for. ie, delete all this PTY compatability code in something like the 0.8.0 release ? It is also probably worth formalizing how long we wish to support a particular hypervisor release series for. ie how long do we wish to support Xen 3.0.3 for (for benefit of RHEL-5) before we declare only Xen 3.1.0 or later & thus drop all the crufty xm_internal.c code. As a starting point I'd say we should support a particular HV release for as long as it is shipped in the current release of a major distro. ie we'd support Xen 3.0.3 for as long as RHEL-5 was most recent, and once RHEL-6 came out we'd drop Xen 3.0.3 support, picking the next oldest Xen version that's used in a current major distro whether that's 3.1.0 or 3.3.0 or something else. This would allow distro maintainers reasonably easy backports and/or rebases to newer libvirt versions for some amount of time, without burdening us (upstream) to support ancient versions of Xen / QEMU forever. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, Jul 27, 2009 at 10:18:18PM +0100, Daniel P. Berrange wrote:
It is also probably worth formalizing how long we wish to support a particular hypervisor release series for.
ie how long do we wish to support Xen 3.0.3 for (for benefit of RHEL-5) before we declare only Xen 3.1.0 or later & thus drop all the crufty xm_internal.c code.
As a starting point I'd say we should support a particular HV release for as long as it is shipped in the current release of a major distro. ie we'd support Xen 3.0.3 for as long as RHEL-5 was most recent, and once RHEL-6 came out we'd drop Xen 3.0.3 support, picking the next oldest Xen version that's used in a current major distro whether that's 3.1.0 or 3.3.0 or something else. This would allow distro maintainers reasonably easy backports and/or rebases to newer libvirt versions for some amount of time, without burdening us (upstream) to support ancient versions of Xen / QEMU forever.
This thread is probably not the best place to discuss that policy, and should be started separately. I tend to think that it depends how intrusive the specific bits are to manage an old hypervisor version. If it doesn't hurt, except for the size IMHO there is no reason to drop support, it's only if the associated code makes maintainance harder than necessary that it should be dropped. For example in xm_internal.c case as long as it's cleanly separated and doesn't make patches for new versions more complex, I would keep it. Another example is old Xen hypervisor code, we have that big detection code and switch but beside that it's not making other things any more complex. The goal of libvirt being really to decouple from low level changes and simplify long term maintainance, I would be very conservative and drop ancien versions support only if it really gets painful and we are pretty sure there is no real users left. I consider the project users the final users, not the distro maintainers ! Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Wed, Jul 22, 2009 at 10:57:29PM +0100, Mark McLoughlin wrote:
Hey, So, here are the patches needed to complete the QEMU NIC hotplug work by supporting network and bridge backed interfaces.
We support this by passing a tap file descriptor to qemu over the monitor socket using SCM_RIGHTS.
The monitor commands required for this (getfd and closefd) are queued up in Anthony's queue and should be in qemu-0.11-rc1 which is due to be released very soon:
http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=4edd6c2645 http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=f4069061e3
Note: I just remembered I'm not handling missing support for these commands very well. I'll make it handle an 'unknown command' reply gracefully by failing the attach tommorrow.
Yes that's my main uncertainty about this patch set, we assume the support is present and I don't see any detection code, can you send this quickly to finish the serie of patches ? Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Mon, 2009-07-27 at 11:06 +0200, Daniel Veillard wrote:
On Wed, Jul 22, 2009 at 10:57:29PM +0100, Mark McLoughlin wrote:
Hey, So, here are the patches needed to complete the QEMU NIC hotplug work by supporting network and bridge backed interfaces.
We support this by passing a tap file descriptor to qemu over the monitor socket using SCM_RIGHTS.
The monitor commands required for this (getfd and closefd) are queued up in Anthony's queue and should be in qemu-0.11-rc1 which is due to be released very soon:
http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=4edd6c2645 http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=f4069061e3
Note: I just remembered I'm not handling missing support for these commands very well. I'll make it handle an 'unknown command' reply gracefully by failing the attach tommorrow.
Yes that's my main uncertainty about this patch set, we assume the support is present and I don't see any detection code, can you send this quickly to finish the serie of patches ?
Yep, the below hunk is all we need - I'll squash it into 7/7 Thanks, Mark. diff --git a/src/qemu_driver.c b/src/qemu_driver.c index 23cbd8f..d65a2a3 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -4684,6 +4684,16 @@ static int qemudDomainAttachNetDevice(virConnectPtr conn, DEBUG("%s: getfd reply: %s", vm->def->name, reply); + /* If the command isn't supported then qemu prints: + * unknown command: getfd" */ + if (strstr(reply, "unknown command:")) { + qemudReportError(conn, dom, NULL, VIR_ERR_NO_SUPPORT, + "%s", + _("bridge/network interface attach not supported: " + "qemu 'getfd' monitor command available")); + goto cleanup; + } + VIR_FREE(reply); VIR_FREE(cmd); }

On Mon, Jul 27, 2009 at 10:32:42AM +0100, Mark McLoughlin wrote:
On Mon, 2009-07-27 at 11:06 +0200, Daniel Veillard wrote:
On Wed, Jul 22, 2009 at 10:57:29PM +0100, Mark McLoughlin wrote:
Hey, So, here are the patches needed to complete the QEMU NIC hotplug work by supporting network and bridge backed interfaces.
We support this by passing a tap file descriptor to qemu over the monitor socket using SCM_RIGHTS.
The monitor commands required for this (getfd and closefd) are queued up in Anthony's queue and should be in qemu-0.11-rc1 which is due to be released very soon:
http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=4edd6c2645 http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=f4069061e3
Note: I just remembered I'm not handling missing support for these commands very well. I'll make it handle an 'unknown command' reply gracefully by failing the attach tommorrow.
Yes that's my main uncertainty about this patch set, we assume the support is present and I don't see any detection code, can you send this quickly to finish the serie of patches ?
Yep, the below hunk is all we need - I'll squash it into 7/7
Okay, thanks ! Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Wed, Jul 22, 2009 at 10:57:29PM +0100, Mark McLoughlin wrote:
Hey, So, here are the patches needed to complete the QEMU NIC hotplug work by supporting network and bridge backed interfaces.
We support this by passing a tap file descriptor to qemu over the monitor socket using SCM_RIGHTS.
The monitor commands required for this (getfd and closefd) are queued up in Anthony's queue and should be in qemu-0.11-rc1 which is due to be released very soon:
http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=4edd6c2645 http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=f4069061e3
Note: I just remembered I'm not handling missing support for these commands very well. I'll make it handle an 'unknown command' reply gracefully by failing the attach tommorrow.
ACK to this patch series since its almost certainly going into QEMU upstream any minute. It doesn't have any public API impact, so worst case we later change / remove it. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, 2009-07-27 at 22:19 +0100, Daniel P. Berrange wrote:
On Wed, Jul 22, 2009 at 10:57:29PM +0100, Mark McLoughlin wrote:
Hey, So, here are the patches needed to complete the QEMU NIC hotplug work by supporting network and bridge backed interfaces.
We support this by passing a tap file descriptor to qemu over the monitor socket using SCM_RIGHTS.
The monitor commands required for this (getfd and closefd) are queued up in Anthony's queue and should be in qemu-0.11-rc1 which is due to be released very soon:
http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=4edd6c2645 http://repo.or.cz/w/qemu/aliguori-queue.git?a=commitdiff;h=f4069061e3
Note: I just remembered I'm not handling missing support for these commands very well. I'll make it handle an 'unknown command' reply gracefully by failing the attach tommorrow.
ACK to this patch series since its almost certainly going into QEMU upstream any minute. It doesn't have any public API impact, so worst case we later change / remove it.
Thanks, it's upstream now: http://git.savannah.gnu.org/cgit/qemu.git/commit/?id=c1d6eed7e8 as is the machine type aliases: http://git.savannah.gnu.org/cgit/qemu.git/commit/?id=9574758134 This is on the 0.12 branch, but I've asked for them to be pulled into the stable-0.11 branch ... we'll see. Cheers, Mark.
participants (6)
-
Chris Wright
-
Daniel P. Berrange
-
Daniel Veillard
-
john cooper
-
Mark McLoughlin
-
Stephen Smalley