[libvirt] [PATCH] storageVolCreateXMLFrom: Allow multiple accesses to origvol
by Michal Privoznik
When creating a new volume, it is possible to copy data into it from
another already existing volume (referred to as @origvol). Obviously,
only read-only access to @origvol is required, which is thread safe
(though probably not optimal performance-wise). However, with the
current code both @newvol and @origvol are marked as building for the
duration of the copy from @origvol to @newvol. The rationale behind this
is to disallow some operations on both @origvol and @newvol, e.g.
vol-wipe, vol-delete, vol-download. While it makes sense to disallow
such operations on a partly copied mirror, it doesn't make sense to
disallow them on the source (@origvol).
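For reference, the operations listed above refuse to touch a volume
that is marked as building, via a guard of roughly this shape (a sketch
of the existing check in storage_driver.c, shown only for context):

    if (vol->building) {
        virReportError(VIR_ERR_OPERATION_INVALID,
                       _("volume '%s' is still being allocated."),
                       vol->name);
        goto cleanup;
    }

With @origvol no longer marked as building, that guard no longer blocks
operations on the source while the copy is running.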
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
src/storage/storage_driver.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 2cb8347..a3f398f 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1912,7 +1912,6 @@ storageVolCreateXMLFrom(virStoragePoolPtr obj,
/* Drop the pool lock during volume allocation */
pool->asyncjobs++;
- origvol->building = 1;
newvol->building = 1;
virStoragePoolObjUnlock(pool);
@@ -1929,7 +1928,6 @@ storageVolCreateXMLFrom(virStoragePoolPtr obj,
virStoragePoolObjLock(origpool);
storageDriverUnlock(driver);
- origvol->building = 0;
newvol->building = 0;
allocation = newvol->target.allocation;
pool->asyncjobs--;
--
1.9.0
[libvirt] [PATCH] Shorten the udevadm settle timeout for refresh/start pool cases
by John Ferlan
https://bugzilla.redhat.com/show_bug.cgi?id=789766
This patch addresses a possible misperception that libvirt is hung
during pool startup/refresh, as well as during virt-manager startup,
since it needs to start/refresh the storage pools it finds.
The problem is that the 'udevadm settle' command will wait for a
default of two minutes before returning a failure. For code paths
such as pool starts or refreshes the timeout doesn't necessarily cause
failures - it may just not find anything. Additionally, since no error
message or any other indication is provided, blame is placed on libvirt
for the problem; however, use of debug tools shows libvirt waiting in
udevadm. NB: This timeout may be longer for iSCSI and SCSI pools since
both seem to run through the settle paths twice under certain conditions.
Based on feedback described here:
http://lists.freedesktop.org/archives/systemd-devel/2013-July/011845.html
this timeout is part of the design of udevadm settle: even when the
two minute timeout expires, the status often goes unchecked, and callers
can decide to continue regardless of whether settle is still handling
events. For paths waiting for something specific, failure will be
inevitable; however, for paths such as pool refreshes that merely wait
for "anything" to be present within the directory space udevadm manages,
a shorter timeout may allow them to gather some data sooner.
It's always possible to refresh the data again at some later point.
For paths that are waiting for something specific, such as finding a
node device, creating a backing disk volume, or removing a logical
volume, waiting longer for settle to work through its events is more
important.
So given this, I modified the calls to virFileWaitForDevices() to pass a
boolean 'quiesce' indicating whether the caller should wait for the
default timeout or for a shorter time. I chose 5 seconds for the
shorter time, although there was no real logic/reason behind that value.
Additionally, in either case use VIR_WARN to indicate that the command
to settle the database returned a failure.
---
Note that I did try a timeout value of 0, which as I read things would
somehow indicate whether something still needed to be flushed; however,
for the veth issue described here:
http://lists.freedesktop.org/archives/systemd-devel/2013-July/011829.html
the return value of 'udevadm settle --timeout 0' was always 0, whereas
with --timeout 1 (or more) the return value was 1.
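In virCommand terms, that observation can be reproduced with something
along these lines (illustrative only; exact exit codes depend on the
udev version in use):

    virCommandPtr cmd = virCommandNewArgList(UDEVADM, "settle",
                                             "--timeout", "0", NULL);
    int status = 0;
    /* inspect the exit status instead of treating it as fatal */
    ignore_value(virCommandRun(cmd, &status));
    /* observed: status == 0 here, yet 1 with --timeout 1 (or more)
     * while events were still pending */
    virCommandFree(cmd);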
Callers to virFileWaitForDevices() and my reasoning behind why I chose
to quiesce or not...
node_device_driver
-> Is trying to find node devices - wait long time
storage_backend_*.c: (virStorageBackend*() APIs)
disk: DiskRefreshPool
-> Will read directory path - wait short time
disk: DiskCreateVol
-> Requires new disk to be present - wait long time
iscsi: ISCSIGetHostNumber
-> called by ISCSIFindLUs which is called by RefreshPool before
opening path and scanning for devices - wait short time
logical: LogicalRefreshPool
-> Runs /usr/sbin/vgs to get report of volume groups - wait short time
logical: LogicalDeleteVol
-> Requires synchronization prior to deletion of logical volume -
wait long time
mpath: MpathRefreshPool
-> Use DM_DEVICE_LIST to get list of devices - wait short time
scsi: SCSIFindLUs
-> called by SCSIRefreshPool and ISCSIFindLUs (not sure why) prior
to opening path and scanning for available devices - wait short time
scsi: createVport
-> called by SCSIStartPool after the port is managed - there's no
search for any devices and nothing depends on or checks if the
port was created (although next call in sequence would be to
refresh the pool, which would wait again) - wait short time
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
src/node_device/node_device_driver.c | 4 ++--
src/storage/storage_backend_disk.c | 4 ++--
src/storage/storage_backend_iscsi.c | 2 +-
src/storage/storage_backend_logical.c | 4 ++--
src/storage/storage_backend_mpath.c | 2 +-
src/storage/storage_backend_scsi.c | 4 ++--
src/util/virfile.h | 2 +-
src/util/virutil.c | 45 +++++++++++++++++++++++++++--------
8 files changed, 46 insertions(+), 21 deletions(-)
diff --git a/src/node_device/node_device_driver.c b/src/node_device/node_device_driver.c
index 6906463..fda4104 100644
--- a/src/node_device/node_device_driver.c
+++ b/src/node_device/node_device_driver.c
@@ -458,7 +458,7 @@ get_time(time_t *t)
* possible for udev not to realize that it has work to do before we
* get here. We thus keep trying to find the new device we just
* created for up to LINUX_NEW_DEVICE_WAIT_TIME. Note that udev's
- * default settle time is 180 seconds, so once udev realizes that it
+ * default settle time is 120 seconds, so once udev realizes that it
* has work to do, it might take that long for the udev wait to
* return. Thus the total maximum time for this function to return is
* the udev settle time plus LINUX_NEW_DEVICE_WAIT_TIME.
@@ -485,7 +485,7 @@ find_new_device(virConnectPtr conn, const char *wwnn, const char *wwpn)
while ((now - start) < LINUX_NEW_DEVICE_WAIT_TIME) {
- virFileWaitForDevices();
+ virFileWaitForDevices(true);
dev = nodeDeviceLookupSCSIHostByWWN(conn, wwnn, wwpn, 0);
diff --git a/src/storage/storage_backend_disk.c b/src/storage/storage_backend_disk.c
index 9cebcca..8678a24 100644
--- a/src/storage/storage_backend_disk.c
+++ b/src/storage/storage_backend_disk.c
@@ -323,7 +323,7 @@ virStorageBackendDiskRefreshPool(virConnectPtr conn ATTRIBUTE_UNUSED,
VIR_FREE(pool->def->source.devices[0].freeExtents);
pool->def->source.devices[0].nfreeExtent = 0;
- virFileWaitForDevices();
+ virFileWaitForDevices(false);
if (!virFileExists(pool->def->source.devices[0].path)) {
virReportError(VIR_ERR_INVALID_ARG,
@@ -660,7 +660,7 @@ virStorageBackendDiskCreateVol(virConnectPtr conn ATTRIBUTE_UNUSED,
goto cleanup;
/* wait for device node to show up */
- virFileWaitForDevices();
+ virFileWaitForDevices(true);
/* Blow away free extent info, as we're about to re-populate it */
VIR_FREE(pool->def->source.devices[0].freeExtents);
diff --git a/src/storage/storage_backend_iscsi.c b/src/storage/storage_backend_iscsi.c
index 881159b..5680bcb 100644
--- a/src/storage/storage_backend_iscsi.c
+++ b/src/storage/storage_backend_iscsi.c
@@ -96,7 +96,7 @@ virStorageBackendISCSIGetHostNumber(const char *sysfs_path,
VIR_DEBUG("Finding host number from '%s'", sysfs_path);
- virFileWaitForDevices();
+ virFileWaitForDevices(false);
sysdir = opendir(sysfs_path);
diff --git a/src/storage/storage_backend_logical.c b/src/storage/storage_backend_logical.c
index ed3a012..868fdce 100644
--- a/src/storage/storage_backend_logical.c
+++ b/src/storage/storage_backend_logical.c
@@ -586,7 +586,7 @@ virStorageBackendLogicalRefreshPool(virConnectPtr conn ATTRIBUTE_UNUSED,
virCommandPtr cmd = NULL;
int ret = -1;
- virFileWaitForDevices();
+ virFileWaitForDevices(false);
/* Get list of all logical volumes */
if (virStorageBackendLogicalFindLVs(pool, NULL) < 0)
@@ -689,7 +689,7 @@ virStorageBackendLogicalDeleteVol(virConnectPtr conn ATTRIBUTE_UNUSED,
virCheckFlags(0, -1);
- virFileWaitForDevices();
+ virFileWaitForDevices(true);
lvchange_cmd = virCommandNewArgList(LVCHANGE, "-aln", vol->target.path, NULL);
lvremove_cmd = virCommandNewArgList(LVREMOVE, "-f", vol->target.path, NULL);
diff --git a/src/storage/storage_backend_mpath.c b/src/storage/storage_backend_mpath.c
index f0ed189..216863b 100644
--- a/src/storage/storage_backend_mpath.c
+++ b/src/storage/storage_backend_mpath.c
@@ -274,7 +274,7 @@ virStorageBackendMpathRefreshPool(virConnectPtr conn ATTRIBUTE_UNUSED,
pool->def->allocation = pool->def->capacity = pool->def->available = 0;
- virFileWaitForDevices();
+ virFileWaitForDevices(false);
virStorageBackendGetMaps(pool);
diff --git a/src/storage/storage_backend_scsi.c b/src/storage/storage_backend_scsi.c
index c448d7f..7335433 100644
--- a/src/storage/storage_backend_scsi.c
+++ b/src/storage/storage_backend_scsi.c
@@ -430,7 +430,7 @@ virStorageBackendSCSIFindLUs(virStoragePoolObjPtr pool,
VIR_DEBUG("Discovering LUs on host %u", scanhost);
- virFileWaitForDevices();
+ virFileWaitForDevices(false);
devicedir = opendir(device_path);
@@ -591,7 +591,7 @@ createVport(virStoragePoolSourceAdapter adapter)
adapter.data.fchost.wwnn, VPORT_CREATE) < 0)
return -1;
- virFileWaitForDevices();
+ virFileWaitForDevices(false);
return 0;
}
diff --git a/src/util/virfile.h b/src/util/virfile.h
index 46ef781..63d3f05 100644
--- a/src/util/virfile.h
+++ b/src/util/virfile.h
@@ -253,7 +253,7 @@ int virFileOpenTty(int *ttymaster,
char *virFileFindMountPoint(const char *type);
-void virFileWaitForDevices(void);
+void virFileWaitForDevices(bool quiesce);
/* NB: this should be combined with virFileBuildPath */
# define virBuildPath(path, ...) \
diff --git a/src/util/virutil.c b/src/util/virutil.c
index 9be1590..030e5ec 100644
--- a/src/util/virutil.c
+++ b/src/util/virutil.c
@@ -1447,29 +1447,54 @@ virSetUIDGIDWithCaps(uid_t uid, gid_t gid, gid_t *groups, int ngroups,
#if defined(UDEVADM) || defined(UDEVSETTLE)
-void virFileWaitForDevices(void)
+void virFileWaitForDevices(bool quiesce)
{
# ifdef UDEVADM
- const char *const settleprog[] = { UDEVADM, "settle", NULL };
+ const char *const settleprog = UDEVADM;
# else
- const char *const settleprog[] = { UDEVSETTLE, NULL };
+ const char *const settleprog = UDEVSETTLE;
# endif
- int exitstatus;
+ int exitstatus = 0;
+ virCommandPtr settlecmd = NULL;
- if (access(settleprog[0], X_OK) != 0)
+ if (access(settleprog, X_OK) != 0)
return;
+ /* In less critical paths, rather than possibly wait for a two
+ * minute timeout for settle to just return failure because of
+ * some issue in the udevadm database, let's wait for 5 seconds.
+ * If there's an issue with the udevadm database, the timeout
+ * will cause a non zero return which we can check and issue a
+ * warning message.
+ */
+ settlecmd = virCommandNew(settleprog);
+# ifdef UDEVADM
+ virCommandAddArg(settlecmd, "settle");
+# endif
+ if (!quiesce) {
+ virCommandAddArg(settlecmd, "--timeout");
+ virCommandAddArg(settlecmd, "5");
+ }
+
/*
- * NOTE: we ignore errors here; this is just to make sure that any device
+ * NOTE: we mostly ignore errors; this is just to make sure that any device
* nodes that are being created finish before we try to scan them.
- * If this fails for any reason, we still have the backup of polling for
- * 5 seconds for device nodes.
*/
- if (virRun(settleprog, &exitstatus) < 0)
+ if (virCommandRun(settlecmd, &exitstatus) < 0)
{}
+
+ if (exitstatus) {
+ char *cmd_str = virCommandToString(settlecmd);
+ VIR_WARN("Timed out running '%s' indicating a possible issue "
+ "with udev event queue.",
+ cmd_str ? cmd_str : settleprog);
+ VIR_FREE(cmd_str);
+ }
+
+ virCommandFree(settlecmd);
}
#else
-void virFileWaitForDevices(void)
+void virFileWaitForDevices(bool quiesce ATTRIBUTE_UNUSED)
{}
#endif
--
1.9.0
[libvirt] [PATCH v2] storage: netfs: Handle backend errors
by John Ferlan
Commit id '18642d10' caused a virt-test regression for NFS backend
storage error path checks when running the command:
'virsh find-storage-pool-sources-as netfs Unknown '
when the host did not have Gluster installed. Prior to the commit,
the test would fail with the error:
error: internal error: Child process (/usr/sbin/showmount --no-headers
--exports Unknown) unexpected exit status 1: clnt_create: RPC: Unknown host
After the commit, the error would be ignored, the call would succeed,
and an empty list of pool sources returned. This was tucked into the
commit message as an expected outcome.
When the target host does not have the Gluster CLI, this is a regression
from the previous release. Furthermore, even if the Gluster CLI was
present but failed to get devices, the API would return failure even
though the NFS backend had found devices.
Modify the logic to return failure when the NFS backend check fails and
there's no GLUSTER_CLI, or when both backend checks fail.
If either check succeeds, then fetch and return the list of source
devices, even if it's empty.
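From a client's perspective the difference is whether the API reports
an error or hands back an empty source list; a minimal sketch of the
calling pattern (conn is an open virConnectPtr; the srcSpec host
'Unknown' matches the virt-test case above):

    char *srcs =
        virConnectFindStoragePoolSources(conn, "netfs",
                                         "<source><host name='Unknown'/></source>",
                                         0);
    if (!srcs) {
        /* with this patch: both backend checks failed, error is set */
    } else {
        /* at least one backend succeeded; the list may be empty */
    }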
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
src/storage/storage_backend_fs.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/src/storage/storage_backend_fs.c b/src/storage/storage_backend_fs.c
index be963d4..fa5eba1 100644
--- a/src/storage/storage_backend_fs.c
+++ b/src/storage/storage_backend_fs.c
@@ -234,9 +234,11 @@ virStorageBackendFileSystemNetFindPoolSourcesFunc(char **const groups,
}
-static void
+static int
virStorageBackendFileSystemNetFindNFSPoolSources(virNetfsDiscoverState *state)
{
+ int ret = -1;
+
/*
* # showmount --no-headers -e HOSTNAME
* /tmp *
@@ -263,9 +265,13 @@ virStorageBackendFileSystemNetFindNFSPoolSources(virNetfsDiscoverState *state)
if (virCommandRunRegex(cmd, 1, regexes, vars,
virStorageBackendFileSystemNetFindPoolSourcesFunc,
state, NULL) < 0)
- virResetLastError();
+ goto cleanup;
+ ret = 0;
+
+ cleanup:
virCommandFree(cmd);
+ return ret;
}
@@ -285,6 +291,7 @@ virStorageBackendFileSystemNetFindPoolSources(virConnectPtr conn ATTRIBUTE_UNUSE
virStoragePoolSourcePtr source = NULL;
char *ret = NULL;
size_t i;
+ int retNFS = -1, retGluster = -1;
virCheckFlags(0, NULL);
@@ -306,11 +313,16 @@ virStorageBackendFileSystemNetFindPoolSources(virConnectPtr conn ATTRIBUTE_UNUSE
state.host = source->hosts[0].name;
- virStorageBackendFileSystemNetFindNFSPoolSources(&state);
+ retNFS = virStorageBackendFileSystemNetFindNFSPoolSources(&state);
- if (virStorageBackendFindGlusterPoolSources(state.host,
+# ifdef GLUSTER_CLI
+ retGluster =
+ virStorageBackendFindGlusterPoolSources(state.host,
VIR_STORAGE_POOL_NETFS_GLUSTERFS,
- &state.list) < 0)
+ &state.list);
+# endif
+ /* If both fail, then we won't return an empty list - return an error */
+ if (retNFS < 0 && retGluster < 0)
goto cleanup;
if (!(ret = virStoragePoolSourceListFormat(&state.list)))
--
1.9.0
[libvirt] [PATCH 0/5] network: fix active status & usage of inactive networks
by Laine Stump
All 5 of these patches are required to address
https://bugzilla.redhat.com/show_bug.cgi?id=880483
The reported problem is that an interface using a currently-inactive
macvtap/hostdev network can be attached to a domain (addressed by
5/5). In order to make fixing that problem less painful, the
automatic inactivation of all macvtap/hostdev networks any time
libvirtd is restarted must be addressed, and that is done in 1/5-4/5.
Part of fixing the latter problem is changing the network driver to
use /var/run for its status XML rather than /var/lib (2/5), which
causes problems during upgrade (addressed in 3/5) and, in a more
limited sense, downgrade (see the comments in 3/5 for why I haven't
addressed those problems).
[libvirt] how to use libvirt hook scripts to modify domain XML when a guest starts
by cokabug
I want to add a sound card to the domain XML when a guest starts. After
some googling, I found that libvirt provides hook functionality.
I followed the instructions to write a Python hook and booted a guest,
but inspecting the domain with "virsh edit" shows it still has no sound
card config.
/etc/libvirt/hooks/qemu:

#!/usr/bin/python
import sys
import re
import os

hooklog = '/tmp/hook.log'
log = open(hooklog, 'w')
stdinxml = sys.stdin.readlines()

if sys.argv[2] == 'start':
    log.write("hook start,domain name: %s \n" % sys.argv[1])
    maxslot = -1
    for line in stdinxml:
        slotm = re.search("slot='(?P<slotnum>0x[0-9a-fA-F]{2})'", line)
        if slotm:
            slotnum = int(slotm.group('slotnum'), 0)
            if slotnum > maxslot:
                maxslot = slotnum
        if '</devices>' in line:
            log.write("insert sound card config \n")
            slotnum = maxslot + 1
            line = " <sound model='ich6'>\n"
            line = line + " <address type='pci' domain='0x0000' bus='0x00' slot='0x%0.2X' function='0x0'/>\n" % slotnum
            line = line + " </sound>\n"
            line = line + " </devices>\n"
        stdoutxml = ''.join(line)
        sys.stdout.write(stdoutxml)
tail -f /tmp/hook.log:
hook start,domain name: instance-00000222
insert sound card config
tail /var/log/libvirt/libvirtd.log:
2014-04-16 07:13:39.159+0000: 52199: warning : qemuDomainObjTaint:1377 :
Domain id=81 name='instance-00000222'
uuid=974d62b7-f316-4f20-a91c-d11cb85980fe is tainted: high-privileges
From these logs the hook script seems to be executed, but it has no
effect. Can anyone tell me where it goes wrong?
[libvirt] VNC sharePolicy not working as expected
by Kekane, Abhishek
Hi All,
Greetings!!!
We are using the KVM hypervisor driver to run an OpenStack IaaS. A couple of months back we reported a security issue [1] in OpenStack.
Basically we want to limit the number of VNC client connections that users can open for a given VM.
From libvirt 1.0.6 onwards, the share policy feature is supported to control the way consoles are accessed by users.
Presently it is possible to configure the VNC share policy in 3 different ways:
1. allow-exclusive: allows clients to ask for exclusive access by dropping other connections
2. force-shared: the default value; allows multiple clients to connect to the console in parallel, sharing the same session
3. ignore: welcomes every connection unconditionally
In OpenStack Nova's libvirt driver I am able to set the sharePolicy value on the graphics element of the domain's XML:
<graphics type="vnc" autoport="yes" keymap="en-us" listen="127.0.0.1" sharePolicy="force-shared">
<listen type='address' address='127.0.0.1'/>
</graphics>
<graphics type="vnc" autoport="yes" keymap="en-us" listen="127.0.0.1" sharePolicy="allow-exclusive">
<listen type='address' address='127.0.0.1'/>
</graphics>
<graphics type="vnc" autoport="yes" keymap="en-us" listen="127.0.0.1" sharePolicy="ignore">
<listen type='address' address='127.0.0.1'/>
</graphics>
But while testing I am not able to get the expected results for the allow-exclusive and ignore share policies.
With allow-exclusive, previous connections are not dropped and the console contents are shared among all open consoles.
With ignore, the contents are likewise shared among all open consoles.
I am using libvirt version 1.1.1 and qemu version 1.5.0.
We want to allow only a single authorized user to connect to the console, dropping previously connected users automatically, by using the allow-exclusive sharePolicy.
Please let me know what else is required to get this working successfully.
[1] : https://bugs.launchpad.net/nova/+bug/1227575
Thanks,
Abhishek
[libvirt] libvirt Hyper-v 2012 r2 fix
by vikhyath reddy
Hello everyone,
Thanks for libvirt. I did see on the libvirt page that Hyper-V 2008 is
supported (and it does work). So I tried running it against Hyper-V 2012
R2, but virsh fails to connect with an error that says
*"error: internal error: SOAP fault during enumeration: code 's:Sender',
subcode 'n:CannotProcessFilter', reason 'The data source could not process
the filter. The filter might be missing or it might be invalid. Change the
filter and try the request again. ', detail '500The specified class does
not exist in the given namespace. HRESULT 0x8033801a0052150858778HRESULTThe
specified class does not exist in the given namespace. ' "*
Upon looking into the Hyper-V 2012 R2 server event logs, I found the
following:
[ Source: WMI-Activity
Event ID: 5898
Microsoft-Windows-WMI-Activity/Operational ]
Id = {62D480B2-58EF-0000-E580-D462EF58CF01}; ClientMachine = VIKHYPERV;
User = VIKHYPERV\Administrator; ClientProcessId = 884; Component = Unknown;
Operation = Start IWbemServices::ExecQuery - root\virtualization : select *
from Msvm_ComputerSystem where Description = "Microsoft Hosting Computer
System" ; ResultCode = 0x80041010; PossibleCause = Unknown
Note that Msvm_ComputerSystem is missing in the namespace
root\virtualization. Upon further investigation, I found that the new
namespace where Msvm_ComputerSystem is located is root\virtualization\v2.
Is it possible to find out where in the source code libvirt specifies
the namespace, so that I can try patching it up and see if that fixes
things?
Thanks for all your help,
Vik.
[libvirt] why doesn't libvirt let qemu autostart live-migrated VMs?
by Chris Friesen
Hi,
I've been digging through the libvirt code, and something that struck me
is that when using qemu, libvirt migrates the instance with autostart
disabled, then sits on the source host periodically polling for migration
completion; once the source detects that migration is complete, it tells
the destination to start up the VM.
Why don't we let the destination autostart the VM once migration is
complete?
Chris
[libvirt] RFC: Exposing backing chains in <domain> XML
by Eric Blake
tl;dr:
I am working on a series of patches to expose backing chain information
in <domain> XML. Comments are welcome, to make sure my XML design is on
the right track.
Purpose
=======
Among other things, this will help us support Peter's proposal of
enhancing the block-pull and block-commit actions to specify a
destination by relative depth in the backing chain (where "vda[0]"
represents the active image, "vda[1]" represents the backing file of the
active image, and so on).
It will also help debug situations where libvirt and qemu disagree on
what constitutes a backing chain, which therefore cause sVirt labeling
discrepancies or prohibit block-pull/block-commit actions. For
example, given the chain "base <- mid <- top", if top forgot the
backing_fmt attribute, and /etc/libvirt/qemu.conf has
allow_disk_format_probing=0 (which it does by default, for security
reasons), libvirt treats 'mid' as a raw file and refuses to acknowledge
that 'base' is part of the chain, while qemu would happily treat 'mid'
as qcow2 and therefore use 'base' if permissions allow it. I have
helped debug this scenario several times on IRC and in bugzilla reports.
This feature is being driven in part by
https://bugzilla.redhat.com/show_bug.cgi?id=1069407
Existing design
===============
Note that libvirt can already expose backing file details (but only one
layer; it is not recursive) when using virStorageVolGetXMLDesc(); for
example:
# virsh vol-dumpxml --pool gluster img3
<volume type='network'>
<name>img3</name>
<key>vol1/img3</key>
...
<target>
<path>gluster://localhost/vol1/img3</path>
<format type='qcow2'/>
...
</target>
<backingStore>
<path>gluster://localhost/vol1/img2</path>
<format type='qcow2'/>
<permissions>
<mode>00</mode>
<owner>0</owner>
<group>0</group>
</permissions>
</backingStore>
</volume>
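That XML is also reachable programmatically; a minimal client sketch,
assuming the pool and volume names from the example above:

    virConnectPtr conn = virConnectOpen(NULL);
    virStoragePoolPtr pool = virStoragePoolLookupByName(conn, "gluster");
    virStorageVolPtr vol = virStorageVolLookupByName(pool, "img3");
    /* returns the <volume> document shown above; caller frees it */
    char *xml = virStorageVolGetXMLDesc(vol, 0);

(Error checking and the matching free/unref calls are omitted for
brevity.)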
In the current volume representation, if a <backingStore> element is
present, it gives the <path> to the backing file. But this
representation is a bit limited: it is rather hard-coded to the
assumption that there is only one backing file, and does not do a good
job when the backing image is not in the same storage pool as the volume
it is describing. Some of the enhancements I'm proposing for <domain>
should also be applied to the information output by <volume> XML, which
means I have to be careful that the design I'm proposing will mesh well
with the storage xml to maximize code reuse.
The volume approach is a bit painful to users trying to track the
backing chain of a disk tied to a <domain> because it necessitates
creating a storage pool and making multiple calls to follow the chain,
so we need to expose the backing chain directly in the <disk> element of
a domain, and recursively show the entire chain. Furthermore, there are
some formats that require multiple resources: for example, both qemu
2.0's new quorum driver and HyperV VHDX images can have multiple backing
files, and where these files can in turn have more backing images.
Thus, any proper representation of disk resources needs to show a full
tree of relationships. Thankfully, circular references in backing files
would form an invalid image (all known virtual disk image formats
require a DAG of relationships).
With existing API, we still have not fully implemented 'virsh
snapshot-delete' of external snapshots. So our current advice is for
people to manually use qemu-img to alter backing chains, then update
libvirt to match. Once libvirt starts tracking backing chains, it
becomes all the more important to provide two new actions in libvirt: we
need a validation mode (check that what is recorded on disk matches what
is recorded in XML and flag an error if they differ) and a correction
mode (ignore what is recorded in XML and regenerate it to match what is
actually on disk).
Proposal
========
For each <disk> of a domain, I will be adding a new <backingStore>
element. The element is optional on input, which allows libvirt to
continue to understand input from older versions, but will always be
present on output, to show what libvirt is tracking as the backing chain.
For a file with no backing store (including raw file format), the usage
is simple:
<disk type='file' device='disk'>
<driver name='qemu' type='raw'/>
<source file='/path/to/somewhere'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
</disk>
The new explicit <backingStore/> makes it clear that there is no backing
chain.
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/top.qcow2'/>
<backingStore type='file'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/mid.qcow2'/>
<backingStore type='file'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/base.qcow2'/>
<backingStore/>
</backingStore>
</backingStore>
<target dev='vda' bus='virtio'/>
</disk>
Note that this is intentionally nested, so that for file formats that
support more than one backing resource, it can list parallel
<backingStore> as siblings to describe those related resources (thus
leaving the door open to expose a qemu quorum as a <disk type='quorum'>
with no direct <source> but instead with three <backingStore> sibling
elements for each member of the quorum, and where each member of the
quorum can further have its own backing chain).
Design wise, the <backingStore> element is either completely empty
(end-of-chain), or has a mandatory type='...' attribute that mirrors the
same type attribute of a <disk>. Then, within the backingStore element,
there is a <source> or other appropriate sub-elements similar to what
<disk> already uses for describing a single host resource. So, for an
example, here would be the output for a 2-element chain on gluster:
<disk type='network' device='disk'>
<driver name='qemu' type='qcow2'/>
<source protocol='gluster' name='vol1/img2'>
<host name='red'/>
</source>
<backingStore type='network'>
<driver name='qemu' type='qcow2'/>
<source protocol='gluster' name='vol1/img1'>
<host name='red'/>
</source>
<backingStore/>
</backingStore>
<target dev='vdb' bus='virtio'/>
</disk>
Or again, but this time using volume references to a storage pool
(assuming 'glusterVol1' is the storage pool wrapping gluster://red/vol1):
<disk type='volume' device='disk'>
<driver name='qemu' type='qcow2'/>
<source pool='glusterVol1' volume='img2'/>
<backingStore type='volume'>
<driver name='qemu' type='qcow2'/>
<source pool='glusterVol1' volume='img1'/>
<backingStore/>
</backingStore>
<target dev='vdb' bus='virtio'/>
</disk>
As can be seen, this design heavily reuses existing <disk type='...'>
handling, which should make it easier to reuse blocks of code both in
libvirt to handle the backing chains, and in clients when processing
backing chains to hand to libvirt up front or in inspecting the dumpxml
results. Management apps like vdsm that use transient domains should
start supplying <backingStore> elements to fully describe chains.
Implementation
==============
The following APIs will be affected:
defining domain XML (whether via define for persistent domains, or
create for transient domains): parse the new element. If the element is
already present, default to trusting the backing chain in that element
instead of reading from the disk files. If the element is absent, read
the disk files and populate the element. It is probably also worth
adding a flag to trigger validation mode: read the disk files to ensure
they match the XML, and refuse the operation if there is a mismatch (as
for updating the XML to match reality, the simplest is to edit the XML,
delete the <backingStore> element, and then try the define again, so I
don't see the need for a flag for that action).
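To make the intent concrete, validation mode might look to callers
something like the following sketch (the flag name is hypothetical, not
an existing constant):

    /* hypothetical flag: refuse to create the domain if the on-disk
     * backing chains disagree with the <backingStore> XML */
    virDomainPtr dom = virDomainCreateXML(conn, xml,
                                          VIR_DOMAIN_START_VALIDATE_CHAIN);
    if (!dom) {
        /* mismatch (or other failure) was reported as an error */
    }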
I may also need to figure out whether it is worth tainting a domain any
time libvirt detects that the backing chain recorded in the XML and the
chain read from the disk files have diverged.
Note that defining domain XML includes loading from saved state or from
incoming migration.
dumping domain XML: always output the new element, by default without
consulting disk files. By tracking the chain in memory ever since the
guest is defined, it should already be available for output. I'm
debating whether we need a flag (similar to virsh dumpxml --update-cpu)
that can force libvirt to re-read the disk files at the time of the dump
and regenerate the chain to reflect any changes made behind libvirt's
back.
creating external snapshots: the <domainsnapshot> XML will continue to
be the picture of the domain prior to the creation of the snapshot (but
this picture will now include any <backingStore> elements already
present in the chain), but after the snapshot is taken, the <domain> XML
will also be modified to record the updated chain (the old disk source
is now the <backingStore> of the new disk source).
deleting external snapshots is not yet implemented, but the
implementation will have to shrink the backingStore chain to match reality.
block-pull (block-rebase in pull mode), block-commit: at the completion
of the pull, the <backingStore> needs to be updated to reflect the new
shorter state of the chain
block-copy (block-rebase in copy mode): the operation starts out by
creating a mirror, but during the first phase, the mirror is not usable
as an accurate copy of what the guest sees. Right now we fudge by
saying that block copy can only be done on transient domains; but even
with that, we still track a <mirror> element in the <disk> XML to track
that a block copy is underway (so that the operation survives a libvirtd
restart). The <mirror> element will now need to be taught a
<backingStore>, particularly if the user passes in a pre-existing file
to be reused as the copy destination. Then, when the second phase is
complete and the mirroring is ended, the <disk> will need another update
to select which side of the backing chain is now in force
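Tying that to the existing calls, the two phases look roughly like this
today (a sketch; the destination path is made up):

    /* phase 1: start mirroring vda into the copy destination */
    virDomainBlockRebase(dom, "vda",
                         "/var/lib/libvirt/images/copy.qcow2", 0,
                         VIR_DOMAIN_BLOCK_REBASE_COPY);
    /* ... wait for the job to report the mirror as ready ... */
    /* phase 2: pivot to the copy (or plain abort to keep the original) */
    virDomainBlockJobAbort(dom, "vda", VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);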
virsh domblklist: should be taught a new flag to show the backing chain
in a tree format, since the command already exists to extract <disk>
information from a domain into a nicer human format
sVirt security labeling: right now, we read the disk files to both
apply and remove labels on a backing chain - obviously, once the chain
is tracked natively as part of the <disk>, we should be able to label
without having to read the disk files
storage volumes: investigate how much of the backing chain code can be
reused in enhancing storage volume XML output
anything else you can think of in the code base that will be impacted?
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org