Two questions about NVDIMM devices

Hi,

I've met two situations with NVDIMM support in libvirt where I'm not sure all the parties (libvirt & I) are doing things correctly.

The first problem is with memory alignment and size changes. In addition to the size changes applied to NVDIMMs by QEMU, libvirt also makes some NVDIMM size changes for better alignment, in qemuDomainMemoryDeviceAlignSize. This can lead to the size being rounded up, exceeding the size of the backing device and QEMU failing to start the VM for that reason (I've actually experienced that). I work with emulated NVDIMM devices, not bare-metal hardware, so one might argue that in practice the device sizes should already be aligned, but I'm not sure that must always be the case, considering labels or whatever else the user decides to set up. And I still don't feel very comfortable that I have to reckon with two internal size adjustments (libvirt & QEMU) to the `size' value I specify, with the ultimate goal of getting the VM started and having the NVDIMM aligned properly to make (non-NVDIMM) memory hot plug work. Is the size alignment performed by libvirt, especially rounding up, completely correct for NVDIMMs?

The second problem is that a VM fails to start with a backing NVDIMM in devdax mode due to SELinux preventing access to the /dev/dax* device (it doesn't happen with any other NVDIMM mode). Who should be responsible for handling the SELinux label appropriately in that case? libvirt, the system administrator, anybody else? Using <seclabel> in the NVDIMM's source doesn't seem to be accepted by the domain XML schema.

Thanks,
Milan

On Thu, Jul 02, 2020 at 01:21:15PM +0200, Milan Zamazal wrote:
[...] Is the size alignment performed by libvirt, especially rounding up, completely correct for NVDIMMs?
The comment on the function says QEMU aligns to "page size", which can vary depending not only on the architecture, but also on the kernel build configuration for that architecture. E.g. aarch64 has a different page size in RHEL than in other distros because of a different choice of page size in the kernel config.

Libvirt rounds up to 1 MB, essentially so that the size works no matter what architecture or build options were used. I think this is quite compelling, as I don't think mgmt apps are likely to care enough about non-x86 architectures to pick the right rounded sizes.

If we're enforcing this 1 MB rounding though, we really should be documenting it clearly, so that apps can pick the right backing file size. I think we dropped the ball on docs.
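[Editor's note: the round-up described above can be sketched as follows. This is a minimal illustration, not libvirt's actual code; the function name and the 1 MiB constant are assumptions taken from this discussion (and note Milan observes 2 MB in his case later in the thread).]

```python
# Sketch of the round-up discussed above: a requested NVDIMM size (in KiB)
# is aligned up to a fixed boundary so the result is valid regardless of
# the host page size. Names and the constant are illustrative only.

ALIGN_KIB = 1024  # 1 MiB boundary, expressed in KiB

def align_up(size_kib: int, align: int = ALIGN_KIB) -> int:
    """Round size_kib up to the next multiple of align."""
    return (size_kib + align - 1) // align * align

# A 1000000 KiB request gets rounded up; the backing device must be at
# least this large or QEMU will refuse to start:
print(align_up(1000000))  # -> 1000448
print(align_up(1048576))  # already aligned -> 1048576
```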
[...] Who should be responsible for handling the SELinux label appropriately in that case? libvirt, the system administrator, anybody else? Using <seclabel> in NVDIMM's source doesn't seem to be accepted by the domain XML schema.
The expectation is that out of the box SELinux will "just work". So anything that is broken is a bug in either libvirt or the SELinux policy.

There is no expectation/requirement to use <seclabel> unless you want to set up non-default behaviour, which isn't the case here.

IOW this sounds like a genuine bug.

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Daniel P. Berrangé <berrange@redhat.com> writes:
[...]
Libvirt rounds up to 1 MB,
Actually 2 MB, at least in my case, apparently in qemuDomainGetMemoryModuleSizeAlignment. But it's just a detail.
essentially so that the size works no matter what architecture or build options were used. I think this is quite compelling as I don't think mgmt apps are likely to care enough about non-x86 architectures to pick the right rounded sizes.
If we're enforcing this 1 MB rounding though, we really should be documenting it clearly, so that apps can pick the right backing file size. I think we dropped the ball on docs.
Yes, OK. I also wonder how exactly the label size is counted in. It's added to the aligned value in qemuDomainNVDimmAlignSizePseries, with the argument that a label is mandatory on ppc. But labels are also permitted on other architectures and I can't see a similar adjustment for them. I think QEMU handles it fine in either case (by subtracting the label size from the overall size and aligning the result down) and I guess the special handling of ppc in libvirt is just there not to waste 256 MB unnecessarily. Still, all the size shuffling scares me and I can only hope that I compute my target sizes for the domain XML correctly to make everything work well...
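[Editor's note: the pSeries-style adjustment Milan describes can be sketched like this — the label is carved out first, the remaining guest-visible area is aligned *down* to the 256 MiB pSeries boundary, and the label size is added back. Illustrative only; the names, the 256 MiB boundary, and the 128 KiB label size are assumptions from this discussion, not libvirt code.]

```python
# Sketch of a pSeries-style NVDIMM size adjustment (all sizes in KiB).
PSERIES_ALIGN_KIB = 256 * 1024  # 256 MiB boundary, in KiB

def align_size_pseries(size_kib: int, label_kib: int) -> int:
    # Carve out the label, align the guest-visible area *down*,
    # then add the label back on top of the aligned guest area.
    guest_area = size_kib - label_kib
    guest_area = guest_area // PSERIES_ALIGN_KIB * PSERIES_ALIGN_KIB
    return guest_area + label_kib

# A 1 GiB + 1 MiB request with a 128 KiB label loses the odd 1 MiB:
print(align_size_pseries(1024 * 1024 + 1024, 128))  # -> 1048704
```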
[...]
IOW this sounds like a genuine bug.
OK, I'll try to find out what exactly the problem is and where.

Thank you for the clarifications,
Milan

Milan Zamazal <mzamazal@redhat.com> writes:
[...]
OK, I'll try to find out what and where is the problem exactly.
The problem apparently is that /dev/dax* is a character device rather than a block device (such as /dev/pmem*), which is not expected by the SELinux policy rules.

This is an NVDIMM in fsdax mode:

# ls -lZ /dev/pmem0
brw-rw----. 1 root disk system_u:object_r:device_t:s0 259, 0 Jul 9 11:39 /dev/pmem0

This is the same NVDIMM reconfigured as devdax:

# ls -lZ /dev/dax0.0
crw-------. 1 root root system_u:object_r:device_t:s0 252, 5 Jul 9 11:43 /dev/dax0.0

(Unix permissions are different, but when I change them to the `disk' group and 660, the same problem still occurs.)

audit.log reports the following when starting a VM with an NVDIMM device in devdax mode:

type=AVC msg=audit(1594144691.758:913): avc: denied { map } for pid=21659 comm="qemu-kvm" path="/dev/dax0.0" dev="tmpfs" ino=1521557 scontext=system_u:system_r:svirt_t:s0:c216,c981 tcontext=system_u:object_r:svirt_image_t:s0:c216,c981 tclass=chr_file permissive=0
type=AVC msg=audit(1594144691.758:914): avc: denied { map } for pid=21659 comm="qemu-kvm" path="/dev/dax0.0" dev="tmpfs" ino=1521557 scontext=system_u:system_r:svirt_t:s0:c216,c981 tcontext=system_u:object_r:svirt_image_t:s0:c216,c981 tclass=chr_file permissive=0

Indeed, svirt_t map access to svirt_image_t is allowed only for files and block devices:

# sesearch -A -p map -s svirt_t -t svirt_image_t
...
allow svirt_t svirt_image_t:blk_file map;
allow svirt_t svirt_image_t:file map;

What to do about it? Do I handle the NVDIMM in a wrong way or should the sVirt policies be fixed?

Thanks,
Milan
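[Editor's note: the fsdax/devdax distinction diagnosed above comes down to the device node type, which maps directly onto the SELinux object classes involved. As a minimal sketch (paths are examples), it can be checked with a plain stat():]

```python
# Check whether a device node is a block or character device; this is the
# same split SELinux policy makes between blk_file and chr_file, which is
# why only the devdax (char device) case hits the missing 'map' permission.
import os
import stat

def device_kind(path: str) -> str:
    mode = os.stat(path).st_mode
    if stat.S_ISBLK(mode):
        return "block"  # fsdax namespaces expose /dev/pmemN as a block device
    if stat.S_ISCHR(mode):
        return "char"   # devdax namespaces expose /dev/daxX.Y as a char device
    return "other"

# e.g. device_kind("/dev/pmem0") -> "block", device_kind("/dev/dax0.0") -> "char"
```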

On Thu, Jul 09, 2020 at 11:53:19AM +0200, Milan Zamazal wrote:
Milan Zamazal <mzamazal@redhat.com> writes:
[...]
What to do about it? Do I handle the NVDIMM in a wrong way or should sVirt policies be fixed?
File a BZ against the policy as this looks like a valid use case.

Regards,
Daniel

Daniel P. Berrangé <berrange@redhat.com> writes:
[...]
If we're enforcing this 1 MB rounding though, we really should be documenting it clearly, so that apps can pick the right backing file size. I think we dropped the ball on docs.
I still can't see it in the documentation; would it be possible to be clear about it in the docs, please?

First, it's not very intuitive to figure out that (if I've figured it out correctly) on POWER one *must* specify the NVDIMM size S as

  S == aligned_size + label_size

and that size is used for the QEMU device; while on x86_64 one can specify any size S and

  align_up(S)

will be used for the QEMU device (and label size doesn't influence the value). And additional alignment may be required to make any memory hot plug work.

Second, and more importantly, I'm afraid that without documenting it, future changes may break the current behavior without warning. For example, the recent changes regarding POWER alignment in 6.7.0 are good IMHO, and one can use the same size with both the 6.7 and 6.6 versions, but they could still cause pre-6.7 sizes to stop working.

Thanks,
Milan

On Thu, Sep 10, 2020 at 04:26:40PM +0200, Milan Zamazal wrote:
Daniel P. Berrangé <berrange@redhat.com> writes:
[...]
For second, and more importantly, I'm afraid that without documenting it, future changes may break the current behavior without warning. For example, the recent changes regarding POWER alignment in 6.7.0 are for good IMHO and one can use the same size with both 6.7 and 6.6 versions, but they could still cause pre-6.7 sizes stop working.
I don't know what changes you are referring to here, but if they were in libvirt I'd consider that a bug - we shouldn't break a previously working configuration by increasing required alignment.

Regards,
Daniel

Daniel P. Berrangé <berrange@redhat.com> writes:
On Thu, Sep 10, 2020 at 04:26:40PM +0200, Milan Zamazal wrote:
[...]
I don't know what changes you are referring to here, but if they were in libvirt I'd consider that a bug - we shouldn't break a previously working configuration by increasing required alignment.
I mean disabling the auto alignment in

https://gitlab.com/libvirt/libvirt/-/commit/07de813924caf37e535855541c0c1183...

and replacing it with validation in

https://gitlab.com/libvirt/libvirt/-/commit/0ccceaa57c50e5ee528f7073fa8723af...

That change can cause a VM to fail to start, but after (manually) adjusting the device size, all should work all right. Changes that would actually change sizes would be more dangerous.

Regards,
Milan

On Thu, Sep 10, 2020 at 04:54:08PM +0200, Milan Zamazal wrote:
[...]
I mean disabling the auto alignment in https://gitlab.com/libvirt/libvirt/-/commit/07de813924caf37e535855541c0c1183... and replacing it with validation in https://gitlab.com/libvirt/libvirt/-/commit/0ccceaa57c50e5ee528f7073fa8723af...
That change can cause a VM fail to start but after (manually) adjusting the device size, all should work all right. Changes that would actually change sizes would be more dangerous.
Sigh, that second commit even calls out the fact that it breaks existing guests. This needs to be reverted, as that is not acceptable.

Regards,
Daniel

On 9/10/20 4:56 PM, Daniel P. Berrangé wrote:
On Thu, Sep 10, 2020 at 04:54:08PM +0200, Milan Zamazal wrote:
[...]
Sigh, that second commit even calls out the fact that it breaks existing guests. This needs to be reverted, as that is not acceptable.
Thing is, on PPC it was never working, IIRC. I remember discussing this with Andrea. So from my POV, there wasn't really anything to break.

Michal

On Thu, 2020-09-10 at 20:53 +0200, Michal Privoznik wrote:
On 9/10/20 4:56 PM, Daniel P. Berrangé wrote:
On Thu, Sep 10, 2020 at 04:54:08PM +0200, Milan Zamazal wrote:
Daniel P. Berrangé <berrange@redhat.com> writes:
If we're enforcing this 1 MB rounding though, we really should be documenting it clearly, so that apps can pick the right backing file size. I think we dropped the ball on docs.
I still can't see it in the documentation, would it be possible to be clear about it in the docs, please? For first, it's not very intuitive to figure out that (if I've figured out it correctly) on POWER one *must* specify the NVDIMM size S as
S == aligned_size + label_size
and that size is used for the QEMU device; while on x86_64 one can specify any size S and
align_up(S)
will be used for the QEMU device (and the label size doesn't influence the value). Additional alignment may also be required to get memory hot plug working.
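The two sizing rules described above can be sketched in a few lines of Python. This is an illustration only: the helper names are made up, the 1 MiB x86 rounding and the 256 MiB pSeries alignment are the values discussed in this thread, and libvirt's actual logic lives in qemuDomainMemoryDeviceAlignSize / qemuDomainNVDimmAlignSizePseries.

```python
KIB = 1024
MIB = 1024 * KIB

def x86_device_size(requested):
    # x86_64: libvirt rounds the requested size *up* to its alignment
    # (1 MiB, per this thread); the label size does not affect the result.
    align = 1 * MIB
    return (requested + align - 1) // align * align

def ppc64_device_size(requested, label):
    # pSeries: the guest-visible area (size - label) must be a multiple
    # of 256 MiB; libvirt aligns it *down*, so the size handed to QEMU
    # is the aligned guest area plus the label.
    align = 256 * MIB
    guest_area = (requested - label) // align * align
    return guest_area + label

print(x86_device_size(512 * MIB + 3 * KIB))     # 537919488 (513 MiB)
print(ppc64_device_size(512 * MIB, 128 * KIB))  # 268566528 (256 MiB + 128 KiB)
```

Note the practical consequence: on x86_64 the device can end up *larger* than requested (and possibly larger than the backing device), while on pSeries it can end up *smaller*.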
The ppc64-specific requirements were documented with

  commit 8f474ceea05aec349be19726e394a62e300efe77
  Author: Daniel Henrique Barboza <danielhb413@gmail.com>
  Date: Mon Jul 20 13:51:46 2020 -0300

    formatdomain.html.in: mention pSeries NVDIMM 'align down' mechanic

    The reason why we align down the guest area (total-size - label-size) is explained in the body of qemuDomainNVDimmAlignSizePseries(). This behavior must also be documented in the user docs.

    Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
    Reviewed-by: Andrea Bolognani <abologna@redhat.com>

but later reverted. See below.
Second, and more importantly, I'm afraid that without documenting it, future changes may break the current behavior without warning. For example, the recent changes regarding POWER alignment in 6.7.0 are a good thing IMHO, and one can use the same size with both the 6.7 and 6.6 versions, but they could still cause pre-6.7 sizes to stop working.
I don't know what changes you are referring to here, but if they were in libvirt I'd consider that a bug - we shouldn't break a previously working configuration by increasing required alignment.
I mean disabling the auto alignment in https://gitlab.com/libvirt/libvirt/-/commit/07de813924caf37e535855541c0c1183... and replacing it with validation in https://gitlab.com/libvirt/libvirt/-/commit/0ccceaa57c50e5ee528f7073fa8723af...
That change can cause a VM to fail to start, but after (manually) adjusting the device size everything should work all right. Changes that would actually change sizes would be more dangerous.
Sigh, that second commit even calls out the fact that it breaks existing guests. This needs to be reverted, as that is not acceptable.
Thing is, on PPC it was never working IIRC. I remember discussing this with Andrea. So from my POV, there wasn't really anything to break.
This is my fault for not keeping a close enough eye on the patch series when it was being posted and reviewed upstream. Sorry :(

While it is true that QEMU will reject sizes that are not compliant with the ppc64-specific alignment requirements, libvirt was already aware of this and behaved accordingly: so a configuration like

  <memory model='nvdimm'>
    <source>
      <path>/var/lib/libvirt/images/nvdimm.raw</path>
    </source>
    <target>
      <size unit='KiB'>524288</size>
      <label>
        <size unit='KiB'>128</size>
      </label>
    </target>
  </memory>

can't work because the usable area (512 MiB - 128 KiB) is not aligned to 256 MiB; however, libvirt will align it down and generate a QEMU command line that looks like

  -object memory-backend-file,[...],size=268566528
  -device nvdimm,[...],label-size=131072

where 131072 is 128 KiB and 268566528 is 256 MiB + 128 KiB.

In this scenario, assuming the user has allocated a backing file that's 512 MiB in size, ~256 MiB will never be used and will go to waste, but other than that the sizing requirements will be satisfied from QEMU's point of view and the guest will happily start.

So ultimately this change caused existing guests to stop working, and I agree with Dan that it should be reverted.

-- Andrea Bolognani / Red Hat / Virtualization
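For reference, the numbers in Andrea's example can be reproduced with a short sketch (the variable names are made up; the 256 MiB value is the pSeries alignment requirement discussed in this thread):

```python
KIB = 1024
MIB = 1024 * KIB

size = 512 * MIB    # requested NVDIMM size (the 524288 KiB from the XML)
label = 128 * KIB   # <label><size> from the XML
align = 256 * MIB   # pSeries NVDIMM alignment

guest_area = (size - label) // align * align  # usable area aligned *down*
backend_size = guest_area + label             # memory-backend-file size=

print(backend_size)  # 268566528, i.e. 256 MiB + 128 KiB
print(label)         # 131072, the nvdimm label-size= value
```

The difference size - backend_size (just under 256 MiB here) is the part of the backing file that is allocated but never used by the guest.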

On 9/14/20 1:40 PM, Andrea Bolognani wrote:
On Thu, 2020-09-10 at 20:53 +0200, Michal Privoznik wrote:
On 9/10/20 4:56 PM, Daniel P. Berrangé wrote:
On Thu, Sep 10, 2020 at 04:54:08PM +0200, Milan Zamazal wrote:
Daniel P. Berrangé <berrange@redhat.com> writes:
If we're enforcing this 1 MB rounding though, we really should be documenting it clearly, so that apps can pick the right backing file size. I think we dropped the ball on docs.
I still can't see it in the documentation; would it be possible to be clear about it in the docs, please? First, it's not very intuitive to figure out that (if I've figured it out correctly) on POWER one *must* specify the NVDIMM size S as
S == aligned_size + label_size
and that size is used for the QEMU device; while on x86_64 one can specify any size S and
align_up(S)
will be used for the QEMU device (and the label size doesn't influence the value). Additional alignment may also be required to get memory hot plug working.
The ppc64-specific requirements were documented with
commit 8f474ceea05aec349be19726e394a62e300efe77 Author: Daniel Henrique Barboza <danielhb413@gmail.com> Date: Mon Jul 20 13:51:46 2020 -0300
formatdomain.html.in: mention pSeries NVDIMM 'align down' mechanic
The reason why we align down the guest area (total-size - label-size) is explained in the body of qemuDomainNVDimmAlignSizePseries(). This behavior must also be documented in the user docs.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Reviewed-by: Andrea Bolognani <abologna@redhat.com>
but later reverted. See below.
Second, and more importantly, I'm afraid that without documenting it, future changes may break the current behavior without warning. For example, the recent changes regarding POWER alignment in 6.7.0 are a good thing IMHO, and one can use the same size with both the 6.7 and 6.6 versions, but they could still cause pre-6.7 sizes to stop working.
I don't know what changes you are referring to here, but if they were in libvirt I'd consider that a bug - we shouldn't break a previously working configuration by increasing required alignment.
I mean disabling the auto alignment in https://gitlab.com/libvirt/libvirt/-/commit/07de813924caf37e535855541c0c1183... and replacing it with validation in https://gitlab.com/libvirt/libvirt/-/commit/0ccceaa57c50e5ee528f7073fa8723af...
That change can cause a VM to fail to start, but after (manually) adjusting the device size everything should work all right. Changes that would actually change sizes would be more dangerous.
Sigh, that second commit even calls out the fact that it breaks existing guests. This needs to be reverted, as that is not acceptable.
Thing is, on PPC it was never working IIRC. I remember discussing this with Andrea. So from my POV, there wasn't really anything to break.
Yes, this is correct. However, this is a Libvirt design violation, regardless of whether there are existing guests to break or not, and this is why I have already agreed with the revert and I'll post patches soon.
This is my fault for not keeping a close enough eye on the patch series when it was being posted and reviewed upstream. Sorry :(
Nah. I'm the one that posted the patches ignoring the fact that this was breaking the intended design. Let's not blame Brno for a Brazilian mess up :P We'll get this reverted, tidy up what was there before to make the size consistent between what QEMU and the domain XML see (without breaking guests), and get it all wrapped up for the next Libvirt release.
Thanks,
DHB
participants (5):
- Andrea Bolognani
- Daniel Henrique Barboza
- Daniel P. Berrangé
- Michal Privoznik
- Milan Zamazal