[libvirt] [PATCH] docs: formatdomain: unify naming for CPUs/vCPUs
by Katerina Koukiou
CPU is an acronym and should be written in uppercase
when it is part of plain text and not referring to an element.
Signed-off-by: Katerina Koukiou <kkoukiou(a)redhat.com>
---
As requested in the review here:
https://www.redhat.com/archives/libvir-list/2018-July/msg01093.html
docs/formatdomain.html.in | 84 +++++++++++++++++++--------------------
1 file changed, 42 insertions(+), 42 deletions(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index b00971a945..d08ede9ab5 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -631,45 +631,45 @@
</dd>
<dt><code>vcpus</code></dt>
<dd>
- The vcpus element allows to control state of individual vcpus.
+ The vcpus element allows to control state of individual vCPUs.
The <code>id</code> attribute specifies the vCPU id as used by libvirt
- in other places such as vcpu pinning, scheduler information and NUMA
- assignment. Note that the vcpu ID as seen in the guest may differ from
- libvirt ID in certain cases. Valid IDs are from 0 to the maximum vcpu
+ in other places such as vCPU pinning, scheduler information and NUMA
+ assignment. Note that the vCPU ID as seen in the guest may differ from
+ libvirt ID in certain cases. Valid IDs are from 0 to the maximum vCPU
count as set by the <code>vcpu</code> element minus 1.
The <code>enabled</code> attribute allows to control the state of the
- vcpu. Valid values are <code>yes</code> and <code>no</code>.
+ vCPU. Valid values are <code>yes</code> and <code>no</code>.
- <code>hotpluggable</code> controls whether given vcpu can be hotplugged
- and hotunplugged in cases when the cpu is enabled at boot. Note that
- all disabled vcpus must be hotpluggable. Valid values are
+ <code>hotpluggable</code> controls whether given vCPU can be hotplugged
+ and hotunplugged in cases when the CPU is enabled at boot. Note that
+ all disabled vCPUs must be hotpluggable. Valid values are
<code>yes</code> and <code>no</code>.
- <code>order</code> allows to specify the order to add the online vcpus.
- For hypervisors/platforms that require to insert multiple vcpus at once
- the order may be duplicated across all vcpus that need to be
- enabled at once. Specifying order is not necessary, vcpus are then
+ <code>order</code> allows to specify the order to add the online vCPUs.
+ For hypervisors/platforms that require to insert multiple vCPUs at once
+ the order may be duplicated across all vCPUs that need to be
+ enabled at once. Specifying order is not necessary, vCPUs are then
added in an arbitrary order. If order info is used, it must be used for
- all online vcpus. Hypervisors may clear or update ordering information
+ all online vCPUs. Hypervisors may clear or update ordering information
during certain operations to assure valid configuration.
- Note that hypervisors may create hotpluggable vcpus differently from
- boot vcpus thus special initialization may be necessary.
+ Note that hypervisors may create hotpluggable vCPUs differently from
+ boot vCPUs thus special initialization may be necessary.
- Hypervisors may require that vcpus enabled on boot which are not
+ Hypervisors may require that vCPUs enabled on boot which are not
hotpluggable are clustered at the beginning starting with ID 0. It may
- be also required that vcpu 0 is always present and non-hotpluggable.
+ be also required that vCPU 0 is always present and non-hotpluggable.
- Note that providing state for individual cpus may be necessary to enable
+ Note that providing state for individual CPUs may be necessary to enable
support of addressable vCPU hotplug and this feature may not be
supported by all hypervisors.
- For QEMU the following conditions are required. Vcpu 0 needs to be
- enabled and non-hotpluggable. On PPC64 along with it vcpus that are in
- the same core need to be enabled as well. All non-hotpluggable cpus
- present at boot need to be grouped after vcpu 0.
+ For QEMU the following conditions are required. vCPU 0 needs to be
+ enabled and non-hotpluggable. On PPC64 along with it vCPUs that are in
+ the same core need to be enabled as well. All non-hotpluggable CPUs
+ present at boot need to be grouped after vCPU 0.
<span class="since">Since 2.2.0 (QEMU only)</span>
</dd>
</dl>
@@ -774,11 +774,11 @@
<dt><code>vcpupin</code></dt>
<dd>
The optional <code>vcpupin</code> element specifies which of host's
- physical CPUs the domain VCPU will be pinned to. If this is omitted,
+ physical CPUs the domain vCPU will be pinned to. If this is omitted,
and attribute <code>cpuset</code> of element <code>vcpu</code> is
not specified, the vCPU is pinned to all the physical CPUs by default.
It contains two required attributes, the attribute <code>vcpu</code>
- specifies vcpu id, and the attribute <code>cpuset</code> is same as
+ specifies vCPU id, and the attribute <code>cpuset</code> is same as
attribute <code>cpuset</code> of element <code>vcpu</code>.
(NB: Only qemu driver support)
<span class="since">Since 0.9.0</span>
@@ -786,7 +786,7 @@
<dt><code>emulatorpin</code></dt>
<dd>
The optional <code>emulatorpin</code> element specifies which of host
- physical CPUs the "emulator", a subset of a domain not including vcpu
+ physical CPUs the "emulator", a subset of a domain not including vCPU
or iothreads will be pinned to. If this is omitted, and attribute
<code>cpuset</code> of element <code>vcpu</code> is not specified,
"emulator" is pinned to all the physical CPUs by default. It contains
@@ -820,7 +820,7 @@
<dt><code>period</code></dt>
<dd>
The optional <code>period</code> element specifies the enforcement
- interval(unit: microseconds). Within <code>period</code>, each vcpu of
+ interval(unit: microseconds). Within <code>period</code>, each vCPU of
the domain will not be allowed to consume more than <code>quota</code>
worth of runtime. The value should be in range [1000, 1000000]. A period
with value 0 means no value.
@@ -835,7 +835,7 @@
vCPU threads, which means that it is not bandwidth controlled. The value
should be in range [1000, 18446744073709551] or less than 0. A quota
with value 0 means no value. You can use this feature to ensure that all
- vcpus run at the same speed.
+ vCPUs run at the same speed.
<span class="since">Only QEMU driver support since 0.9.4, LXC since
0.9.10</span>
</dd>
@@ -864,7 +864,7 @@
<dd>
The optional <code>emulator_period</code> element specifies the enforcement
interval(unit: microseconds). Within <code>emulator_period</code>, emulator
- threads(those excluding vcpus) of the domain will not be allowed to consume
+ threads(those excluding vCPUs) of the domain will not be allowed to consume
more than <code>emulator_quota</code> worth of runtime. The value should be
in range [1000, 1000000]. A period with value 0 means no value.
<span class="since">Only QEMU driver support since 0.10.0</span>
@@ -873,9 +873,9 @@
<dd>
The optional <code>emulator_quota</code> element specifies the maximum
allowed bandwidth(unit: microseconds) for domain's emulator threads(those
- excluding vcpus). A domain with <code>emulator_quota</code> as any negative
+ excluding vCPUs). A domain with <code>emulator_quota</code> as any negative
value indicates that the domain has infinite bandwidth for emulator threads
- (those excluding vcpus), which means that it is not bandwidth controlled.
+ (those excluding vCPUs), which means that it is not bandwidth controlled.
The value should be in range [1000, 18446744073709551] or less than 0. A
quota with value 0 means no value.
<span class="since">Only QEMU driver support since 0.10.0</span>
@@ -2131,13 +2131,13 @@
QEMU, the user-configurable extended TSEG feature was unavailable up
to and including <code>pc-q35-2.9</code>. Starting with
<code>pc-q35-2.10</code> the feature is available, with default size
- 16 MiB. That should suffice for up to roughly 272 VCPUs, 5 GiB guest
+ 16 MiB. That should suffice for up to roughly 272 vCPUs, 5 GiB guest
RAM in total, no hotplug memory range, and 32 GiB of 64-bit PCI MMIO
- aperture. Or for 48 VCPUs, with 1TB of guest RAM, no hotplug DIMM
+ aperture. Or for 48 vCPUs, with 1TB of guest RAM, no hotplug DIMM
range, and 32GB of 64-bit PCI MMIO aperture. The values may also vary
based on the loader the VM is using.
</p><p>
- Additional size might be needed for significantly higher VCPU counts
+ Additional size might be needed for significantly higher vCPU counts
or increased address space (that can be memory, maxMemory, 64-bit PCI
MMIO aperture size; roughly 8 MiB of TSEG per 1 TiB of address space)
which can also be rounded up.
@@ -2147,7 +2147,7 @@
documentation of the guest OS or loader (if there is any), or test
this by trial-and-error changing the value until the VM boots
successfully. Yet another guiding value for users might be the fact
- that 48 MiB should be enough for pretty large guests (240 VCPUs and
+ that 48 MiB should be enough for pretty large guests (240 vCPUs and
4TB guest RAM), but it is on purpose not set as default as 48 MiB of
unavailable RAM might be too much for small guests (e.g. with 512 MiB
of RAM).
@@ -2425,7 +2425,7 @@
</tr>
<tr>
<td><code>cpu_cycles</code></td>
- <td>the count of cpu cycles (total/elapsed)</td>
+ <td>the count of CPU cycles (total/elapsed)</td>
<td><code>perf.cpu_cycles</code></td>
</tr>
<tr>
@@ -2460,25 +2460,25 @@
</tr>
<tr>
<td><code>stalled_cycles_frontend</code></td>
- <td>the count of stalled cpu cycles in the frontend of the instruction
+ <td>the count of stalled CPU cycles in the frontend of the instruction
processor pipeline by applications running on the platform</td>
<td><code>perf.stalled_cycles_frontend</code></td>
</tr>
<tr>
<td><code>stalled_cycles_backend</code></td>
- <td>the count of stalled cpu cycles in the backend of the instruction
+ <td>the count of stalled CPU cycles in the backend of the instruction
processor pipeline by applications running on the platform</td>
<td><code>perf.stalled_cycles_backend</code></td>
</tr>
<tr>
<td><code>ref_cpu_cycles</code></td>
- <td>the count of total cpu cycles not affected by CPU frequency scaling
+ <td>the count of total CPU cycles not affected by CPU frequency scaling
by applications running on the platform</td>
<td><code>perf.ref_cpu_cycles</code></td>
</tr>
<tr>
<td><code>cpu_clock</code></td>
- <td>the count of cpu clock time, as measured by a monotonic
+ <td>the count of CPU clock time, as measured by a monotonic
high-resolution per-CPU timer, by applications running on
the platform</td>
<td><code>perf.cpu_clock</code></td>
@@ -2505,7 +2505,7 @@
</tr>
<tr>
<td><code>cpu_migrations</code></td>
- <td>the count of cpu migrations, that is, where the process
+ <td>the count of CPU migrations, that is, where the process
moved from one logical processor to another, by
applications running on the platform</td>
<td><code>perf.cpu_migrations</code></td>
@@ -5621,8 +5621,8 @@ qemu-kvm -net nic,model=? /dev/null
The resulting difference, according to the qemu developer who
added the option is: "bh makes tx more asynchronous and reduces
latency, but potentially causes more processor bandwidth
- contention since the cpu doing the tx isn't necessarily the
- cpu where the guest generated the packets."<br/><br/>
+ contention since the CPU doing the tx isn't necessarily the
+ CPU where the guest generated the packets."<br/><br/>
<b>In general you should leave this option alone, unless you
are very certain you know what you are doing.</b>
--
2.17.1
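
For readers following along, a minimal domain XML fragment exercising the
elements documented in the hunks above might look like this (illustrative
values only, not part of the patch):

  <vcpu placement='static' current='2'>4</vcpu>
  <vcpus>
    <!-- vCPU 0 enabled and non-hotpluggable, as QEMU requires -->
    <vcpu id='0' enabled='yes' hotpluggable='no' order='1'/>
    <vcpu id='1' enabled='yes' hotpluggable='yes' order='2'/>
    <!-- disabled vCPUs must be hotpluggable -->
    <vcpu id='2' enabled='no' hotpluggable='yes'/>
    <vcpu id='3' enabled='no' hotpluggable='yes'/>
  </vcpus>
  <cputune>
    <vcpupin vcpu='0' cpuset='0-1'/>
    <emulatorpin cpuset='2-3'/>
    <!-- quota/period = 50000/100000: each vCPU may use at most
         half of one host CPU -->
    <period>100000</period>
    <quota>50000</quota>
  </cputune>
  <perf>
    <event name='cpu_cycles' enabled='yes'/>
    <event name='cpu_clock' enabled='yes'/>
  </perf>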
[libvirt] [dbus PATCH] tests: Explicitly spawn a session libvirt-dbus instance
by Andrea Bolognani
Tests are performed using the session D-Bus instance, so we
should launch libvirt-dbus in session mode as well.
This was working fine when running the tests as a regular
user, because in that case libvirt-dbus would default to
session mode, but fail when running them as root because
libvirt-dbus would run in system mode and consequently not
show up on the session bus.
Of course building and running the test suite as root is
a pretty bad idea in general, but a lot of distributions
run at least part of their package build steps with pretend
root privileges (e.g. fakeroot), so we have to make sure it
works in that scenario too.
Signed-off-by: Andrea Bolognani <abologna(a)redhat.com>
---
tests/libvirttest.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tests/libvirttest.py b/tests/libvirttest.py
index 3741abd..4653055 100644
--- a/tests/libvirttest.py
+++ b/tests/libvirttest.py
@@ -33,7 +33,7 @@ class BaseTestClass():
"""Start libvirt-dbus for each test function
"""
os.environ['LIBVIRT_DEBUG'] = '3'
- self.libvirt_dbus = subprocess.Popen([exe])
+ self.libvirt_dbus = subprocess.Popen([exe, '--session'])
self.bus = dbus.SessionBus()
for i in range(10):
--
2.17.1
[libvirt] [jenkins-ci PATCH v2 00/12] lcitool: Rewrite in Python (and add Dockerfile generator)
by Andrea Bolognani
Read the initial cover letter for background and motivations.
Changes from [v1]:
* add Dockerfile generator;
* rename the 'list' action to 'hosts' to better fit along with
the additional 'projects' action;
* always list items in alphabetical order;
* move some generic functions to a Util class.
[v1] https://www.redhat.com/archives/libvir-list/2018-July/msg00717.html
Andrea Bolognani (12):
lcitool: Drop shell implementation
lcitool: Stub out Python implementation
lcitool: Add tool configuration handling
lcitool: Add inventory handling
lcitool: Implement the 'hosts' action
lcitool: Implement the 'install' action
lcitool: Implement the 'update' action
guests: Update documentation
guests: Add Docker-related information to the inventory
lcitool: Add projects information handling
lcitool: Implement the 'projects' action
lcitool: Implement the 'dockerfile' action
guests/README.markdown | 8 +-
guests/host_vars/libvirt-centos-7/docker.yml | 2 +
guests/host_vars/libvirt-debian-8/docker.yml | 2 +
guests/host_vars/libvirt-debian-9/docker.yml | 2 +
.../host_vars/libvirt-debian-sid/docker.yml | 2 +
guests/host_vars/libvirt-fedora-27/docker.yml | 2 +
guests/host_vars/libvirt-fedora-28/docker.yml | 2 +
.../libvirt-fedora-rawhide/docker.yml | 2 +
guests/host_vars/libvirt-ubuntu-16/docker.yml | 2 +
guests/host_vars/libvirt-ubuntu-18/docker.yml | 2 +
guests/lcitool | 722 ++++++++++++------
11 files changed, 521 insertions(+), 227 deletions(-)
create mode 100644 guests/host_vars/libvirt-centos-7/docker.yml
create mode 100644 guests/host_vars/libvirt-debian-8/docker.yml
create mode 100644 guests/host_vars/libvirt-debian-9/docker.yml
create mode 100644 guests/host_vars/libvirt-debian-sid/docker.yml
create mode 100644 guests/host_vars/libvirt-fedora-27/docker.yml
create mode 100644 guests/host_vars/libvirt-fedora-28/docker.yml
create mode 100644 guests/host_vars/libvirt-fedora-rawhide/docker.yml
create mode 100644 guests/host_vars/libvirt-ubuntu-16/docker.yml
create mode 100644 guests/host_vars/libvirt-ubuntu-18/docker.yml
--
2.17.1
[libvirt] [PATCH 0/2] Some additions/fixes in formatdomain docs
by Katerina Koukiou
Katerina Koukiou (2):
docs: formatdomain: add info about global_period and global_quota for
cputune
docs: formatdomain: clarify period cputune subelement
docs/formatdomain.html.in | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
--
2.17.1
[libvirt] [jenkins-ci PATCH v3 00/12] lcitool: Rewrite in Python (and add Dockerfile generator)
by Andrea Bolognani
pylint is reasonably happy with the script now:
$ pylint lcitool
No config file found, using default configuration
************* Module lcitool
C: 1, 0: Missing module docstring (missing-docstring)
C: 37, 0: Missing class docstring (missing-docstring)
C: 47, 4: Missing method docstring (missing-docstring)
W:108,15: Catching too general exception Exception (broad-except)
R:427, 4: Too many branches (14/12) (too-many-branches)
R:289, 0: Too few public methods (1/2) (too-few-public-methods)
-------------------------------------------------------------------
Your code has been rated at 9.21/10 (previous run: 9.21/10, +0.00)
The remaining issues are not considered blockers and will be
addressed, if at all, later down the line.
Changes from [v2]:
* address review comments;
* improve pycodestyle and pylint compliance;
* replace FSF address with an URL.
Changes from [v1]:
* add Dockerfile generator;
* rename the 'list' action to 'hosts' to better fit along with
the additional 'projects' action;
* always list items in alphabetical order;
* move some generic functions to a Util class.
[v2] https://www.redhat.com/archives/libvir-list/2018-July/msg00795.html
[v1] https://www.redhat.com/archives/libvir-list/2018-July/msg00717.html
Andrea Bolognani (12):
lcitool: Drop shell implementation
lcitool: Stub out Python implementation
lcitool: Add tool configuration handling
lcitool: Add inventory handling
lcitool: Implement the 'hosts' action
lcitool: Implement the 'install' action
lcitool: Implement the 'update' action
guests: Update documentation
guests: Add Docker-related information to the inventory
lcitool: Add projects information handling
lcitool: Implement the 'projects' action
lcitool: Implement the 'dockerfile' action
guests/README.markdown | 8 +-
guests/host_vars/libvirt-centos-7/docker.yml | 2 +
guests/host_vars/libvirt-debian-8/docker.yml | 2 +
guests/host_vars/libvirt-debian-9/docker.yml | 2 +
.../host_vars/libvirt-debian-sid/docker.yml | 2 +
guests/host_vars/libvirt-fedora-27/docker.yml | 2 +
guests/host_vars/libvirt-fedora-28/docker.yml | 2 +
.../libvirt-fedora-rawhide/docker.yml | 2 +
guests/host_vars/libvirt-ubuntu-16/docker.yml | 2 +
guests/host_vars/libvirt-ubuntu-18/docker.yml | 2 +
guests/lcitool | 730 ++++++++++++------
11 files changed, 529 insertions(+), 227 deletions(-)
create mode 100644 guests/host_vars/libvirt-centos-7/docker.yml
create mode 100644 guests/host_vars/libvirt-debian-8/docker.yml
create mode 100644 guests/host_vars/libvirt-debian-9/docker.yml
create mode 100644 guests/host_vars/libvirt-debian-sid/docker.yml
create mode 100644 guests/host_vars/libvirt-fedora-27/docker.yml
create mode 100644 guests/host_vars/libvirt-fedora-28/docker.yml
create mode 100644 guests/host_vars/libvirt-fedora-rawhide/docker.yml
create mode 100644 guests/host_vars/libvirt-ubuntu-16/docker.yml
create mode 100644 guests/host_vars/libvirt-ubuntu-18/docker.yml
--
2.17.1
[libvirt] [PATCH] qemu: stop qemu process when restore fails
by Jie Wang
From 29482622218f525f0133be0b7db74835174035d9 Mon Sep 17 00:00:00 2001
From: Jie Wang <wangjie88(a)huawei.com>
Date: Thu, 5 Jul 2018 09:52:03 +0800
Subject: [PATCH] qemu: stop qemu process when restore fails
If qemuProcessStartCPUs fails in qemuDomainSaveImageStartVM,
we need to stop the QEMU process, otherwise a stray VM remains
which can't be managed by libvirt.
Signed-off-by: Jie Wang <wangjie88.huawei.com>
---
src/qemu/qemu_driver.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 9a35e04a85..639b57316d 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -6621,6 +6621,8 @@ qemuDomainSaveImageStartVM(virConnectPtr conn,
if (virGetLastErrorCode() == VIR_ERR_OK)
virReportError(VIR_ERR_OPERATION_FAILED,
"%s", _("failed to resume domain"));
+
+ qemuProcessStop(driver, vm, VIR_DOMAIN_SHUTOFF_FAILED, asyncJob, 0);
goto cleanup;
}
if (virDomainSaveStatus(driver->xmlopt, cfg->stateDir, vm, driver->caps) < 0) {
--
2.15.0.windows.1
[libvirt] [PATCH v4 0/6] Support network stats for VF representor interface
by Jai Singh Rana
With the availability of the switchdev model in Linux, it is possible to
capture stats for an SR-IOV device with interface_type 'hostdev', provided
the device supports a VF representor in switchdev mode on the host.
These stats are supported by adding helper APIs that get/verify the VF
representor name based on the PCI Bus:Device:Function information in the
domain's 'hostdev' structure and query the required net sysfs directory and
file entries on the host according to the switchdev model. These helper APIs
are then used in qemu/conf to get the interface stats for the VF representor
of a PCI SR-IOV device.
V4 includes changes based on feedback received on the v3 patchset. A new
generic API for Linux is added to fetch stats from /proc/net/dev, which will
be used by tap and hostdev devices. Also introduced is a new API to retrieve
the net def from a given domain based on the given hostdev which supports
networking.
[1] https://www.kernel.org/doc/Documentation/networking/switchdev.txt
V3: https://www.redhat.com/archives/libvir-list/2018-April/msg00306.html
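
As a rough illustration of the generic /proc/net/dev scraping mentioned in
the cover letter (this is not the code from the series; the helper name and
struct are made up for the example), reading one interface's counters on
Linux boils down to:

  #include <stdio.h>
  #include <string.h>

  struct ifstats {
      unsigned long long rx_bytes, rx_packets, rx_errs, rx_drop;
      unsigned long long tx_bytes, tx_packets, tx_errs, tx_drop;
  };

  /* Hypothetical helper: look up @ifname in /proc/net/dev and fill @st.
   * Returns 0 on success, -1 if the interface is not found. */
  static int
  get_proc_netdev_stats(const char *ifname, struct ifstats *st)
  {
      char line[512];
      FILE *fp = fopen("/proc/net/dev", "r");

      if (!fp)
          return -1;

      while (fgets(line, sizeof(line), fp)) {
          char name[64];
          unsigned long long rx[8], tx[8];

          /* data lines look like:
           *   "  eth0: 1234 10 0 0 0 0 0 0   5678 20 0 0 0 0 0 0" */
          if (sscanf(line,
                     " %63[^:]:"
                     " %llu %llu %llu %llu %llu %llu %llu %llu"
                     " %llu %llu %llu %llu %llu %llu %llu %llu",
                     name,
                     &rx[0], &rx[1], &rx[2], &rx[3],
                     &rx[4], &rx[5], &rx[6], &rx[7],
                     &tx[0], &tx[1], &tx[2], &tx[3],
                     &tx[4], &tx[5], &tx[6], &tx[7]) != 17)
              continue;                 /* skip the two header lines */

          if (strcmp(name, ifname) != 0)
              continue;

          st->rx_bytes = rx[0]; st->rx_packets = rx[1];
          st->rx_errs  = rx[2]; st->rx_drop    = rx[3];
          st->tx_bytes = tx[0]; st->tx_packets = tx[1];
          st->tx_errs  = tx[2]; st->tx_drop    = tx[3];
          fclose(fp);
          return 0;
      }

      fclose(fp);
      return -1;
  }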
Jai Singh Rana (6):
util: Add helper function to clean extra spaces in string
util: Add generic API to fetch network stats from procfs
util: Add helper APIs to get/verify VF Representor name
conf: util: Add API to find net def given its domain's hostdev
qemu: Network stats support for VF Representor
docs: Update news about Network stats support for VF Representor
docs/news.xml | 9 ++
po/POTFILES | 1 +
src/conf/domain_conf.c | 43 +++++++
src/conf/domain_conf.h | 2 +
src/libvirt_private.syms | 11 ++
src/qemu/qemu_driver.c | 34 ++++-
src/util/Makefile.inc.am | 2 +
src/util/virhostdev.c | 4 +-
src/util/virhostdev.h | 11 ++
src/util/virnetdev.c | 202 ++++++++++++++++++++++++++++-
src/util/virnetdev.h | 5 +
src/util/virnetdevhostdev.c | 300 ++++++++++++++++++++++++++++++++++++++++++++
src/util/virnetdevhostdev.h | 34 +++++
src/util/virnetdevtap.c | 71 +----------
src/util/virstring.c | 36 ++++++
src/util/virstring.h | 3 +
16 files changed, 691 insertions(+), 77 deletions(-)
create mode 100644 src/util/virnetdevhostdev.c
create mode 100644 src/util/virnetdevhostdev.h
--
2.13.7
[libvirt] [PATCH 0/3] virStr*cpy*() related fixes and cleanups
by Andrea Bolognani
1/3 contains bug fixes, 2/3 a bunch of cleanups and 3/3 makes
libvirt build again on MinGW.
Andrea Bolognani (3):
src: Use virStrcpyStatic() to avoid truncation
src: Use virStrcpyStatic() wherever possible
m4: Work around MinGW detection of strncpy() usage
cfg.mk | 2 +-
m4/virt-compile-warnings.m4 | 5 +++++
src/conf/nwfilter_conf.c | 3 +--
src/esx/esx_driver.c | 4 +---
src/hyperv/hyperv_driver.c | 3 +--
src/util/virfdstream.c | 2 +-
src/util/virlog.c | 5 ++---
src/util/virnetdev.c | 3 +--
src/xenconfig/xen_xl.c | 17 ++++-------------
9 files changed, 17 insertions(+), 27 deletions(-)
--
2.17.1
Re: [libvirt] opening tap devices that are created in a container
by Roman Mohr
On Wed, Jul 11, 2018 at 12:10 PM <nert@wheatley> wrote:
> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron(a)akamai.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers is
> >>>> problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>>> its not namespace aware. It is possible to do a mknod() in the
> >>>> container, once the tap devices are created, however, since the tap
> >>>> devices are created dynamically its not possible to apriori allow access
> >>>> to certain major/minor numbers, since we don't know what these are going
> >>>> to be. In addition, its desirable to not allow the mknod capability in
> >>>> containers. This behavior, I think is somewhat inconsistent with the
> >>>> tuntap driver where one can create tuntap devices inside a container by
> >>>> first opening /dev/net/tun and then using them by supplying the tuntap
> >>>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> >>>> network namespace, one is limited to opening network devices that belong
> >>>> to your current network namespace.
> >>>>
> >>>> Here are some options to this issue, that I wanted to get feedback
> >>>> about, and just wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap in the container. Instead,
> >>>> create the tap device outside of the container and then move it into the
> >>>> desired container network namespace. In addition, do a mknod() for the
> >>>> corresponding /dev/tapNN device from outside the container before doing
> >>>> chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones). This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the
> >>> container, I
> >>> would want to pursue a way where all network preparations which require
> >>> privileges happens from a privileged process *outside* of the container.
> >>> Like CNI solutions do it. They run outside, have privileges and then
> >>> create
> >>> devices in the right network/mount namespace or move them there. The
> >>> final
> >>> goal for KubeVirt is that our pod with the qemu process is completely
> >>> unprivileged and privileged setup happens from outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part
> >>> outside of containers creates the devices by entering the right
> >>> namespaces,
> >>> or that libvirt in the container can consume pre-created tun/tap
> devices,
> >>> like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a
> >> need for
> >> some privileges if you want to use a tap device. It's nice that CNI
> >> does that
> >> and all the containers can run unprivileged, but that's because they do
> >> not open
> >> the tap device and they do not do any privileged operations on it. But
> >> QEMU
> >> needs to. So the only way would be passing an opened fd to the
> >> container or
> >> opening the tap device there and making the fd usable for one process in
> >> the
> >> container. Is this already supported for some type of containers in
> >> some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here call it #3 is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
>
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got
> confused by the fact that anyone can open and do a R/W on a tap device.
> But it
> looks like that's on purpose. No capabilities are needed for opening
> /dev/net/tun and calling ioctl(TUNSETIFF) with existing name and then
> doing R/W
> operations on it. It just works.
>
> Correct me if I'm wrong, but to sum it all up, the only things that we
> need to
> figure out (which might possibly be solved by ideas in the other thread)
> are:
>
> tap:
> - Existence of /dev/net/tun
> - Having permissions to open it (0666 by default, shouldn't be a big deal)
> - Knowing the device name
>
> macvtap:
> - Existence of /dev/tapXX
> - Having permissions to open /dev/tapXX
> - One of the following:
> - Knowing the device name (and being able to translate it using a
> netlink socket)
> - Knowing the device index
>
> The rest should be an implementation detail.
>
> Am I right? Did I miss anything?
At least from the KubeVirt use-case, those sound like the things we would
need in order to solve the networking setup in a similar way to how the
Container Network Interface implementations solve the setup in k8s.
Best Regards,
Roman
[libvirt] opening tap devices that are created in a container
by Jason Baron
Hi,
Opening tap devices, such as macvtap, that are created in containers is
problematic because the interface for opening tap devices is via
/dev/tapNN and devtmpfs is not typically mounted inside a container as
it's not namespace-aware. It is possible to do a mknod() in the
container, once the tap devices are created; however, since the tap
devices are created dynamically it's not possible to a priori allow access
to certain major/minor numbers, since we don't know what these are going
to be. In addition, it's desirable to not allow the mknod capability in
containers. This behavior, I think, is somewhat inconsistent with the
tuntap driver where one can create tuntap devices inside a container by
first opening /dev/net/tun and then using them by supplying the tuntap
device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
network namespace, one is limited to opening network devices that belong
to your current network namespace.
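
For comparison, the tuntap flow described above (attaching to an existing
tap device by name through /dev/net/tun) is only a few calls. A minimal
sketch, with error handling trimmed and not taken from libvirt:

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/if.h>
  #include <linux/if_tun.h>

  /* Attach to an already-existing tap device in the current network
   * namespace; returns an fd usable for read()/write(), or -1 on error. */
  static int tap_attach(const char *ifname)
  {
      struct ifreq ifr;
      int fd = open("/dev/net/tun", O_RDWR);

      if (fd < 0)
          return -1;

      memset(&ifr, 0, sizeof(ifr));
      ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
      strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

      /* TUNSETIFF resolves the name in the caller's network namespace,
       * which is the validation step mentioned above. */
      if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
          close(fd);
          return -1;
      }
      return fd;
  }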
Here are some options for this issue that I wanted to get feedback
about; I'm also wondering if anybody else has run into this.
1)
Don't create the tap device, such as macvtap in the container. Instead,
create the tap device outside of the container and then move it into the
desired container network namespace. In addition, do a mknod() for the
corresponding /dev/tapNN device from outside the container before doing
chroot().
This solution still doesn't allow tap devices to be created inside the
container. Thus, in the case of kubevirt, which runs libvirtd inside of
a container, it would mean changing libvirtd to open existing tap
devices (as opposed to the current behavior of creating new ones). This
would not require any kernel changes, but as mentioned seems
inconsistent with the tuntap interface.
2)
Add a new kernel interface for tap devices similar to how /dev/net/tun
currently works. It might be nice to use TUNSETIFF for tap devices, but
because tap devices have different fops they can't be easily switched
after open(). So the suggestion is a new ioctl (TUNGETFDBYNAME?), where
the tap device name is supplied and a new fd (distinct from the fd
returned by the open of /dev/net/tun) is returned as an output field as
part of the new ioctl parameter.
It may not make sense to have this new ioctl call for /dev/net/tun since
it's really about opening a tap device, so it may make sense to introduce
it as part of a new device, such as /dev/net/tap. This new ioctl could
be used for macvtap and ipvtap (or any tap device). I think it might
also improve performance for tuntap devices themselves, if they are
opened this way since currently all tun operations such as read() and
write() take a reference count on the underlying tuntap device, since it
can be changed via TUNSETIFF. I tested this interface out, so I can
provide the kernel changes if that's helpful for clarification.
Thanks,
-Jason