[libvirt] [PATCH] Don't allow two or more disks to be mapped to the same image file

If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file. --- src/conf/domain_conf.c | 22 ++++++++++++++++++++++ src/conf/domain_conf.h | 2 ++ src/libvirt_private.syms | 1 + src/qemu/qemu_driver.c | 6 ++++++ 4 files changed, 31 insertions(+), 0 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 1b02c25..93a081d 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5455,6 +5455,14 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps, if (!disk) goto error; + if (virDomainDiskConflict(disk, def)) { + virDomainReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), + disk->src, + _("is already mapped to another device, " + "skip this device.")); + continue; + } def->disks[def->ndisks++] = disk; } VIR_FREE(nodes); @@ -9088,3 +9096,17 @@ cleanup: return ret; } + +bool virDomainDiskConflict(virDomainDiskDefPtr disk, virDomainDefPtr def) +{ + int i; + + if (!disk->src) + return false; + + for (i = 0; i < def->ndisks; i++) { + if (STREQ(disk->src, def->disks[i]->src)) + return true; + } + return false; +} diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 9f595d6..78b2f95 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1350,6 +1350,8 @@ int virDomainDiskDefForeachPath(virDomainDiskDefPtr disk, bool ignoreOpenFailure, virDomainDiskDefPathIterator iter, void *opaque); +bool virDomainDiskConflict(virDomainDiskDefPtr disk, + virDomainDefPtr def); typedef const char* (*virLifecycleToStringFunc)(int type); typedef int (*virLifecycleFromStringFunc)(const char *type); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index b4b6c63..17e2ec4 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -237,6 +237,7 @@ virDomainDeviceTypeToString; virDomainDiskBusTypeToString; virDomainDiskCacheTypeFromString; virDomainDiskCacheTypeToString; +virDomainDiskConflict; virDomainDiskDefAssignAddress; virDomainDiskDefForeachPath; virDomainDiskDefFree; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6f296c9..6de08d3 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4058,6 +4058,12 @@ static int qemudDomainAttachDevice(virDomainPtr dom, break; case VIR_DOMAIN_DISK_DEVICE_DISK: + if (virDomainDiskConflict(dev->data.disk, vm->def)) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), dev->data.disk->src, + _("is already mapped to another device.")); + break; + } if (dev->data.disk->bus == VIR_DOMAIN_DISK_BUS_USB) { ret = qemuDomainAttachUsbMassstorageDevice(driver, vm, dev->data.disk, qemuCaps); -- 1.7.3.1

At 03/24/2011 03:02 PM, Hu Tao Write:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
If two disks are mapped to the same image and these two disks are read only, operating on these two disks at the same is OK. Should we allow this behavior?
--- src/conf/domain_conf.c | 22 ++++++++++++++++++++++ src/conf/domain_conf.h | 2 ++ src/libvirt_private.syms | 1 + src/qemu/qemu_driver.c | 6 ++++++ 4 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 1b02c25..93a081d 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5455,6 +5455,14 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps, if (!disk) goto error;
+ if (virDomainDiskConflict(disk, def)) { + virDomainReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), + disk->src, + _("is already mapped to another device, " + "skip this device.")); + continue; + } def->disks[def->ndisks++] = disk; } VIR_FREE(nodes); @@ -9088,3 +9096,17 @@ cleanup:
return ret; } + +bool virDomainDiskConflict(virDomainDiskDefPtr disk, virDomainDefPtr def) +{ + int i; + + if (!disk->src) + return false; + + for (i = 0; i < def->ndisks; i++) { + if (STREQ(disk->src, def->disks[i]->src))
If disk->src is '/var/lib/libvirt/images/test.img' and def->disk[i]->src is '/var/libv//libvirt/images/test.img', they are the same images, but STREQ can not work. Comparing device id and inode number is a better way to check whether two images are the same.
+ return true; + } + return false; +} diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 9f595d6..78b2f95 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1350,6 +1350,8 @@ int virDomainDiskDefForeachPath(virDomainDiskDefPtr disk, bool ignoreOpenFailure, virDomainDiskDefPathIterator iter, void *opaque); +bool virDomainDiskConflict(virDomainDiskDefPtr disk, + virDomainDefPtr def);
typedef const char* (*virLifecycleToStringFunc)(int type); typedef int (*virLifecycleFromStringFunc)(const char *type); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index b4b6c63..17e2ec4 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -237,6 +237,7 @@ virDomainDeviceTypeToString; virDomainDiskBusTypeToString; virDomainDiskCacheTypeFromString; virDomainDiskCacheTypeToString; +virDomainDiskConflict; virDomainDiskDefAssignAddress; virDomainDiskDefForeachPath; virDomainDiskDefFree; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6f296c9..6de08d3 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4058,6 +4058,12 @@ static int qemudDomainAttachDevice(virDomainPtr dom, break;
case VIR_DOMAIN_DISK_DEVICE_DISK: + if (virDomainDiskConflict(dev->data.disk, vm->def)) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), dev->data.disk->src, + _("is already mapped to another device.")); + break; + } if (dev->data.disk->bus == VIR_DOMAIN_DISK_BUS_USB) { ret = qemuDomainAttachUsbMassstorageDevice(driver, vm, dev->data.disk, qemuCaps);

If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file. changes: v2: - allow it for read-only disks - compare source files by inode number --- src/conf/domain_conf.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++ src/conf/domain_conf.h | 2 + src/libvirt_private.syms | 1 + src/qemu/qemu_driver.c | 6 +++++ 4 files changed, 63 insertions(+), 0 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 1b02c25..b43dc4a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5455,6 +5455,14 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps, if (!disk) goto error; + if (virDomainDiskConflict(disk, def)) { + virDomainReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), + disk->src, + _("is already mapped to another device, " + "skip this device.")); + continue; + } def->disks[def->ndisks++] = disk; } VIR_FREE(nodes); @@ -9088,3 +9096,49 @@ cleanup: return ret; } + +bool virDomainDiskConflict(virDomainDiskDefPtr disk, virDomainDefPtr def) +{ + struct stat stat1, stat2; + int i; + + if (!disk->src) + return false; + + if (stat(disk->src, &stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */ + return true; + } + } + + for (i = 0; i < def->ndisks; i++) { + if (disk->readonly && def->disks[i]->readonly) + continue; + + if (stat(def->disks[i]->src, &stat2)) { + if (errno != ENOENT) { + /* Can't stat file, shouldn't happen but for safety treate + * it as conflicted */ + return true; + } + } + + if (S_ISREG(stat1.st_mode) && S_ISREG(stat2.st_mode) + && (stat1.st_ino == stat2.st_ino) + && (stat1.st_dev == stat2.st_dev)) { + return true; + } + + if (S_ISCHR(stat1.st_mode) && S_ISCHR(stat2.st_mode) + && (stat1.st_rdev == stat2.st_rdev)) { + return true; + } + + if (S_ISBLK(stat1.st_mode) && S_ISBLK(stat2.st_mode) + && (stat1.st_rdev == stat2.st_rdev)) { + return true; + } + } + return false; +} diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 9f595d6..78b2f95 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1350,6 +1350,8 @@ int virDomainDiskDefForeachPath(virDomainDiskDefPtr disk, bool ignoreOpenFailure, virDomainDiskDefPathIterator iter, void *opaque); +bool virDomainDiskConflict(virDomainDiskDefPtr disk, + virDomainDefPtr def); typedef const char* (*virLifecycleToStringFunc)(int type); typedef int (*virLifecycleFromStringFunc)(const char *type); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index b4b6c63..17e2ec4 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -237,6 +237,7 @@ virDomainDeviceTypeToString; virDomainDiskBusTypeToString; virDomainDiskCacheTypeFromString; virDomainDiskCacheTypeToString; +virDomainDiskConflict; virDomainDiskDefAssignAddress; virDomainDiskDefForeachPath; virDomainDiskDefFree; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6f296c9..6de08d3 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4058,6 +4058,12 @@ static int qemudDomainAttachDevice(virDomainPtr dom, break; case VIR_DOMAIN_DISK_DEVICE_DISK: + if (virDomainDiskConflict(dev->data.disk, vm->def)) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), dev->data.disk->src, + _("is already mapped to another device.")); + break; + } if (dev->data.disk->bus == VIR_DOMAIN_DISK_BUS_USB) { ret = qemuDomainAttachUsbMassstorageDevice(driver, vm, dev->data.disk, qemuCaps); -- 1.7.3.1 -- Thanks, Hu Tao 2011/03/24

On Thu, 24 Mar 2011 16:46:14 +0800 Hu Tao <hutao@cn.fujitsu.com> wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
It doesn't seem that virDomainDiskConflict observes the "shared" property of a disk definition. If all the disks are marked as shared, we should say there is no conflict.
--- src/conf/domain_conf.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++ src/conf/domain_conf.h | 2 + src/libvirt_private.syms | 1 + src/qemu/qemu_driver.c | 6 +++++ 4 files changed, 63 insertions(+), 0 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 1b02c25..b43dc4a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5455,6 +5455,14 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps, if (!disk) goto error;
+ if (virDomainDiskConflict(disk, def)) { + virDomainReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), + disk->src, + _("is already mapped to another device, " + "skip this device.")); + continue; + } def->disks[def->ndisks++] = disk; } VIR_FREE(nodes); @@ -9088,3 +9096,49 @@ cleanup:
return ret; } + +bool virDomainDiskConflict(virDomainDiskDefPtr disk, virDomainDefPtr def) +{ + struct stat stat1, stat2; + int i; + + if (!disk->src) + return false; + + if (stat(disk->src, &stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */ + return true; + } + } + + for (i = 0; i < def->ndisks; i++) { + if (disk->readonly && def->disks[i]->readonly) + continue; + + if (stat(def->disks[i]->src, &stat2)) { + if (errno != ENOENT) { + /* Can't stat file, shouldn't happen but for safety treate + * it as conflicted */ + return true; + } + } + + if (S_ISREG(stat1.st_mode) && S_ISREG(stat2.st_mode) + && (stat1.st_ino == stat2.st_ino) + && (stat1.st_dev == stat2.st_dev)) { + return true; + } + + if (S_ISCHR(stat1.st_mode) && S_ISCHR(stat2.st_mode) + && (stat1.st_rdev == stat2.st_rdev)) { + return true; + } + + if (S_ISBLK(stat1.st_mode) && S_ISBLK(stat2.st_mode) + && (stat1.st_rdev == stat2.st_rdev)) { + return true; + } + } + return false; +} diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 9f595d6..78b2f95 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1350,6 +1350,8 @@ int virDomainDiskDefForeachPath(virDomainDiskDefPtr disk, bool ignoreOpenFailure, virDomainDiskDefPathIterator iter, void *opaque); +bool virDomainDiskConflict(virDomainDiskDefPtr disk, + virDomainDefPtr def);
typedef const char* (*virLifecycleToStringFunc)(int type); typedef int (*virLifecycleFromStringFunc)(const char *type); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index b4b6c63..17e2ec4 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -237,6 +237,7 @@ virDomainDeviceTypeToString; virDomainDiskBusTypeToString; virDomainDiskCacheTypeFromString; virDomainDiskCacheTypeToString; +virDomainDiskConflict; virDomainDiskDefAssignAddress; virDomainDiskDefForeachPath; virDomainDiskDefFree; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6f296c9..6de08d3 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4058,6 +4058,12 @@ static int qemudDomainAttachDevice(virDomainPtr dom, break;
case VIR_DOMAIN_DISK_DEVICE_DISK: + if (virDomainDiskConflict(dev->data.disk, vm->def)) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s %s %s", + _("source"), dev->data.disk->src, + _("is already mapped to another device.")); + break; + } if (dev->data.disk->bus == VIR_DOMAIN_DISK_BUS_USB) { ret = qemuDomainAttachUsbMassstorageDevice(driver, vm, dev->data.disk, qemuCaps);

On Thu, Mar 24, 2011 at 11:05:08AM +0000, Nathan Baum wrote:
On Thu, 24 Mar 2011 16:46:14 +0800 Hu Tao <hutao@cn.fujitsu.com> wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
It doesn't seem that virDomainDiskConflict observes the "shared" property of a disk definition.
If all the disks are marked as shared, we should say there is no conflict.
Does the "shared" property mean virtual disks can be shared between domains or the source of virtual disk can be shared between different virtual disks? If it is the former, then the operation of sharing is based on virtual disks, thus no need to care about this case. But if it is the latter, we indeed have to check for the "shared" property. -- Thanks, Hu Tao

On Fri, 25 Mar 2011 14:26:16 +0800 Hu Tao <hutao@cn.fujitsu.com> wrote:
On Thu, Mar 24, 2011 at 11:05:08AM +0000, Nathan Baum wrote:
On Thu, 24 Mar 2011 16:46:14 +0800 Hu Tao <hutao@cn.fujitsu.com> wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
It doesn't seem that virDomainDiskConflict observes the "shared" property of a disk definition.
If all the disks are marked as shared, we should say there is no conflict.
Does the "shared" property mean virtual disks can be shared between domains or the source of virtual disk can be shared between different virtual disks? If it is the former, then the operation of sharing is based on virtual disks, thus no need to care about this case. But if it is the latter, we indeed have to check for the "shared" property.
I don't believe libvirt exposes a mechanism for sharing virtual devices between domains. The only functions I know for adding devices accept an XML description of the device to add, and I don't believe there is a way to reference a disk that has been defined on another domain. Assuming that's true, I don't think the distinction you're asking about is meaningful. If disks can't be shared, then sharedness must apply to sources to be at all meaningful. But I'm happy to be proven wrong. :) My current workflow is that I create domain XML for each domain, which, when the storage is shared, contains identical <disk> stanza with <shareable/> elements. Naturally, I'd like to continue using that workflow.

On 03/24/2011 02:46 AM, Hu Tao wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
+ + if (stat(disk->src, &stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */
s/treate/treat/ Won't this will fail on root-squash NFS from qemu:///system? (Or does root-squash meant that root can still stat() but just not open() a file?) Overall, I'm worried that this patch is repeating some of danpb's bigger efforts to integrate a sanlock disk contention avoidance [1]. If a resource manager is properly hooked to all disks, then we can prevent contention between domains (and not limit ourself to just single-domain contention, as in this patch). On the other hand, this seems like an easy enough check to do for a single domain whether or not we get the sanlock code completed (that is, timing wise this looks like it could be ready prior to 0.9.0 while Dan's work is bigger in scope and probably missed the feature freeze for this month's release). So I'm not sure whether to ack this. [1] https://www.redhat.com/archives/libvir-list/2011-January/msg00963.html -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 03/24/2011 02:45 PM, Eric Blake wrote:
On 03/24/2011 02:46 AM, Hu Tao wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
+ + if (stat(disk->src, &stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */
s/treate/treat/
Won't this will fail on root-squash NFS from qemu:///system? (Or does root-squash meant that root can still stat() but just not open() a file?)
This won't work for network disks, which aren't files. To check for network disk conflicts, you'd need to check that whether any host and port are the same as well. This won't be perfect, since hosts and ports can be implicit or referred to by different names, but it won't have false positives.

On Thu, Mar 24, 2011 at 04:22:38PM -0700, Josh Durgin wrote:
On 03/24/2011 02:45 PM, Eric Blake wrote:
On 03/24/2011 02:46 AM, Hu Tao wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
+ + if (stat(disk->src, &stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */
s/treate/treat/
Won't this will fail on root-squash NFS from qemu:///system? (Or does root-squash meant that root can still stat() but just not open() a file?)
This won't work for network disks, which aren't files. To check for network disk conflicts, you'd need to check that whether any host and port are the same as well. This won't be perfect, since hosts and ports can be implicit or referred to by different names, but it won't have false positives.
Is there a perfect way to solve this problem? However I will try your way first.

On 03/25/2011 12:06 AM, Hu Tao wrote:
On Thu, Mar 24, 2011 at 04:22:38PM -0700, Josh Durgin wrote:
This won't work for network disks, which aren't files. To check for network disk conflicts, you'd need to check that whether any host and port are the same as well. This won't be perfect, since hosts and ports can be implicit or referred to by different names, but it won't have false positives.
Is there a perfect way to solve this problem? However I will try your way first.
I'm not sure there is a perfect way for all types of network disks using the information libvirt has, since multiple hostnames/ip addresses may refer to the same storage location. A more robust solution would use a lower level interface to implement locking. I'm not sure how this would work for nbd or sheepdog, but for rbd the watch/notify mechanism in librados can be used to do this. It sounds like this would be easier to do later within Dan's lock manager framework. It doesn't help much to only stop duplicate mappings on one domain (or host) when your disks aren't local. If libvirt ends up with a check that only works on one host or domain, skipping network disks might be better than providing a false sense of security.

On Thu, Mar 24, 2011 at 03:45:26PM -0600, Eric Blake wrote:
On 03/24/2011 02:46 AM, Hu Tao wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
+ + if (stat(disk->src, &stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */
s/treate/treat/
Won't this will fail on root-squash NFS from qemu:///system? (Or does root-squash meant that root can still stat() but just not open() a file?)
I think it won't. man stat says: No permissions are required on the file itself (to stat it) But needs 'r' bit to open().
Overall, I'm worried that this patch is repeating some of danpb's bigger efforts to integrate a sanlock disk contention avoidance [1]. If a resource manager is properly hooked to all disks, then we can prevent contention between domains (and not limit ourself to just single-domain contention, as in this patch). On the other hand, this seems like an easy enough check to do for a single domain whether or not we get the sanlock code completed (that is, timing wise this looks like it could be ready prior to 0.9.0 while Dan's work is bigger in scope and probably missed the feature freeze for this month's release). So I'm not sure whether to ack this.
Once danpb's sanlock patch is ready we can easily migrate from the method of this patch to the sanlock. But what is more important is to protect users' data. Although the chance is rare that a user attaches a disk using an image file that is already in use, but if it happens and important data is lost, it will become a disaster (especially for a productive environment)
[1] https://www.redhat.com/archives/libvir-list/2011-January/msg00963.html
-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 03/25/2011 03:00 AM, Hu Tao wrote:
On Thu, Mar 24, 2011 at 03:45:26PM -0600, Eric Blake wrote:
On 03/24/2011 02:46 AM, Hu Tao wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
+ + if (stat(disk->src,&stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */ s/treate/treat/
Won't this will fail on root-squash NFS from qemu:///system? (Or does root-squash meant that root can still stat() but just not open() a file?) I think it won't. man stat says:
No permissions are required on the file itself (to stat it)
But needs 'r' bit to open().
root-squash means that root can only do what user "nobody" could do with the filesystem. Although it doesn't need read access on the file itself, stat() *will* fail if the current user isn't able to read the directory containing the file being stat'ed. So, if user nobody doesn't have access to the parent directory (ie if permissions are xx0, which is very common), then root will not be able to stat the file - it will just return -1. To make this work properly for root-squash, you'll need to either 1) fork a child that does setuid to the qemu user:group and does the stat() (a real pain, since you'll need to pass the results back to the parent, and can't reasonably log errors while in the child == not recommended), or 2) use virFileOpenAs() to open the file as the qemu user (virFileOpenAs() uses SCM_RIGHTS and a child process to do this), then do fstat() of the file to get the info you need, and close it. Keeping libvirt working properly with save images, disk images, and pools on root-squash NFS is a never-ending battle (and can lead to an intense hatred of NFS!) Basically any time new code is added that needs to access any of these things, operation on root-squash is broken. Now that we have virFileOpenAs(), it should be much more straightforward to code so that root-squash scenarios keep working. If anyone is adding some code that does anything with files on disk, they don't have a root-squash NFS setup, and they want some testing to make sure they haven't broken it, please point out the patch to me, and I'll make the time to apply it locally and try it on my NFS setup. (I know I should automatically just see it, but the volume on libvir-list has increased *a lot* lately, and I frequently find myself several days behind).
Overall, I'm worried that this patch is repeating some of danpb's bigger efforts to integrate a sanlock disk contention avoidance [1]. If a resource manager is properly hooked to all disks, then we can prevent contention between domains (and not limit ourself to just single-domain contention, as in this patch). On the other hand, this seems like an easy enough check to do for a single domain
My opinion is that it's probably much more likely that someone would mistakenly use the same disk in two different domains (due to cut-paste of XML) rather than using it twice in the same domain (that's much easier to notice since the definitions are close to each other in the same file). So beyond the fact that this patch can't eliminate all erroneous duplicate usage, it's not looking for the category that is most likely, and thus it will probably have more an effect of providing a false sense of security, rather than of catching any duplicate usage. That makes me think this may actually do more harm than good.
whether or not we get the sanlock code completed (that is, timing wise this looks like it could be ready prior to 0.9.0 while Dan's work is bigger in scope and probably missed the feature freeze for this month's release). So I'm not sure whether to ack this.
Assuming this patch (or something similar) was accepted, would anybody (aside from those of us using upstream builds directly) ever get a release that contained this patch, but didn't contain Dan's? If the answer is no, then I think that's another reason to NACK this and wait for the Dan's patch.

On Fri, Mar 25, 2011 at 10:35:13AM -0400, Laine Stump wrote:
On 03/25/2011 03:00 AM, Hu Tao wrote:
On Thu, Mar 24, 2011 at 03:45:26PM -0600, Eric Blake wrote:
On 03/24/2011 02:46 AM, Hu Tao wrote:
If two or more disks are mapped to the same image file, operating on these disks at the same time may corrupt data stored in the image file.
changes:
v2:
- allow it for read-only disks - compare source files by inode number
+ + if (stat(disk->src,&stat1)) { + if (errno != ENOENT) { + /* Can't stat file, for safety treate it as conflicted */ s/treate/treat/
Won't this will fail on root-squash NFS from qemu:///system? (Or does root-squash meant that root can still stat() but just not open() a file?) I think it won't. man stat says:
No permissions are required on the file itself (to stat it)
But needs 'r' bit to open().
root-squash means that root can only do what user "nobody" could do with the filesystem. Although it doesn't need read access on the file itself, stat() *will* fail if the current user isn't able to read the directory containing the file being stat'ed. So, if user nobody doesn't have access to the parent directory (ie if permissions are xx0, which is very common), then root will not be able to stat the file - it will just return -1.
To make this work properly for root-squash, you'll need to either 1) fork a child that does setuid to the qemu user:group and does the stat() (a real pain, since you'll need to pass the results back to the parent, and can't reasonably log errors while in the child == not recommended), or 2) use virFileOpenAs() to open the file as the qemu user (virFileOpenAs() uses SCM_RIGHTS and a child process to do this), then do fstat() of the file to get the info you need, and close it.
Keeping libvirt working properly with save images, disk images, and pools on root-squash NFS is a never-ending battle (and can lead to an intense hatred of NFS!) Basically any time new code is added that needs to access any of these things, operation on root-squash is broken. Now that we have virFileOpenAs(), it should be much more straightforward to code so that root-squash scenarios keep working.
If anyone is adding some code that does anything with files on disk, they don't have a root-squash NFS setup, and they want some testing to make sure they haven't broken it, please point out the patch to me, and I'll make the time to apply it locally and try it on my NFS setup. (I know I should automatically just see it, but the volume on libvir-list has increased *a lot* lately, and I frequently find myself several days behind).
Overall, I'm worried that this patch is repeating some of danpb's bigger efforts to integrate a sanlock disk contention avoidance [1]. If a resource manager is properly hooked to all disks, then we can prevent contention between domains (and not limit ourself to just single-domain contention, as in this patch). On the other hand, this seems like an easy enough check to do for a single domain
My opinion is that it's probably much more likely that someone would mistakenly use the same disk in two different domains (due to cut-paste of XML) rather than using it twice in the same domain (that's much easier to notice since the definitions are close to each other in the same file).
Agreed.
So beyond the fact that this patch can't eliminate all erroneous duplicate usage, it's not looking for the category that is most likely, and thus it will probably have more an effect of providing a false sense of security, rather than of catching any duplicate usage. That makes me think this may actually do more harm than good.
Agreed.
whether or not we get the sanlock code completed (that is, timing wise this looks like it could be ready prior to 0.9.0 while Dan's work is bigger in scope and probably missed the feature freeze for this month's release). So I'm not sure whether to ack this.
Assuming this patch (or something similar) was accepted, would anybody (aside from those of us using upstream builds directly) ever get a release that contained this patch, but didn't contain Dan's? If the answer is no, then I think that's another reason to NACK this and wait for the Dan's patch.
And thanks to all people involved in this thread for your applies. Now it's clear that danp's snalock patch is a much better solution, so plz skip this one. -- Thanks, Hu Tao
participants (6)
-
Eric Blake
-
Hu Tao
-
Josh Durgin
-
Laine Stump
-
Nathan Baum
-
Wen Congyang