Zoned storage support in libvirt

Hi Peter,

Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:

  --blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,...
  --device virtio-blk-pci,drive=drive0

Note that regular --blockdev host_device will not work. For now the virtio-blk device is the only one that supports zoned blockdevs.

This brings to mind a few questions:

1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none" (see the sketch after this message).

2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.

The drawbacks I see when QEMU detects zoned storage automatically:
- You can't easily tell if a blockdev is zoned from the command-line.
- It's possible to mismatch zoned and non-zoned devices across live migration.

We still have time to decide on the QEMU command-line syntax for QEMU 8.0, so I wanted to raise this now.

Thanks,
Stefan
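[As a rough illustration of the probe mentioned in point 1, a libvirt-side check might look like the sketch below. The helper name is hypothetical, not an existing libvirt function; the sysfs values ("none", "host-aware", "host-managed") are the ones the Linux block layer reports in /sys/block/<dev>/queue/zoned.]

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper, not an existing libvirt API: returns true when
     * /sys/block/<bdev>/queue/zoned reports something other than "none"
     * (the kernel reports "none", "host-aware" or "host-managed"). */
    static bool
    bdevIsZoned(const char *bdev)
    {
        char path[256];
        char model[32] = "";
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/zoned", bdev);
        fp = fopen(path, "r");
        if (!fp)
            return false; /* attribute missing: treat as a regular device */
        if (fscanf(fp, "%31s", model) != 1)
            model[0] = '\0';
        fclose(fp);
        return model[0] != '\0' && strcmp(model, "none") != 0;
    }

[A true result would steer the generator toward --blockdev zoned_host_device rather than host_device.]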

On Tue, Jan 10, 2023 at 10:19:51AM -0500, Stefan Hajnoczi wrote:
Hi Peter, Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:
--blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,... --device virtio-blk-pci,drive=drive0
Note that regular --blockdev host_device will not work.
For now the virtio-blk device is the only one that supports zoned blockdevs.
Does the virtio-blk device's exposed guest ABI differ at all when connected to zoned_host_device instead of host_device ?
This brings to mind a few questions:
1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none".
2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.
The drawbacks I see when QEMU detects zoned storage automatically: - You can't easily tell if a blockdev is zoned from the command-line. - It's possible to mismatch zoned and non-zoned devices across live migration.
What happens with existing QEMU impls if you use --blockdev host_device pointing to a /dev/$BDEV that is a zoned device ? If it succeeds and works correctly, then we likely need to continue to support that. This would push towards needing a new XML element.

With regards,
Daniel

On Tue, Jan 10, 2023 at 03:29:47PM +0000, Daniel P. Berrangé wrote:
On Tue, Jan 10, 2023 at 10:19:51AM -0500, Stefan Hajnoczi wrote:
Hi Peter, Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:
--blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,... --device virtio-blk-pci,drive=drive0
Note that regular --blockdev host_device will not work.
For now the virtio-blk device is the only one that supports zoned blockdevs.
Does the virtio-blk device's exposed guest ABI differ at all when connected to zoned_host_device instead of host_device ?
Yes. There is a VIRTIO feature bit, some configuration space fields, etc. virtio-blk-pci detects when the blockdev is zoned and enables the feature bit.
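[For readers following along: the feature bit Stefan mentions is VIRTIO_BLK_F_ZONED in the Linux/virtio headers, and the configuration space fields are the blkcfg.zoned.* ones that appear in the patch hunk quoted later in this thread. A sketch of that part of the config space is below; field names come from that hunk, but the widths and ordering here are assumptions, not text copied from the VIRTIO specification.]

    #include <stdint.h>

    /* Sketch only: field names match the blkcfg.zoned.* assignments quoted
     * later in this thread; widths and ordering are illustrative. */
    struct virtio_blk_zoned_characteristics {
        uint32_t zone_sectors;       /* zone size, in 512-byte sectors */
        uint32_t max_open_zones;
        uint32_t max_active_zones;
        uint32_t max_append_sectors;
        uint32_t write_granularity;
    };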
This brings to mind a few questions:
1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none".
2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.
The drawbacks I see when QEMU detects zoned storage automatically: - You can't easily tell if a blockdev is zoned from the command-line. - It's possible to mismatch zoned and non-zoned devices across live migration.
What happens with existing QEMU impls if you use --blockdev host_device pointing to a /dev/$BDEV that is a zoned device ? If it succeeds and works correctly, then we likely need to continue to support that. This would push towards needing a new XML element.
Pointing host_device at a zoned device doesn't result in useful behavior because the guest is unaware that this is a zoned device. The guest won't be able to access the device correctly (i.e. sequential writes only). Write requests will fail eventually.

I would consider zoned devices totally unsupported in QEMU today and we don't need to worry about preserving any kind of backwards compatibility with --blockdev host_device,filename=/dev/my_zoned_device.

Stefan

On Wed, Jan 11, 2023 at 10:24:30AM -0500, Stefan Hajnoczi wrote:
On Tue, Jan 10, 2023 at 03:29:47PM +0000, Daniel P. Berrangé wrote:
On Tue, Jan 10, 2023 at 10:19:51AM -0500, Stefan Hajnoczi wrote:
Hi Peter, Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:
--blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,... --device virtio-blk-pci,drive=drive0
Note that regular --blockdev host_device will not work.
For now the virtio-blk device is the only one that supports zoned blockdevs.
Does the virtio-blk device's exposed guest ABI differ at all when connected to zoned_host_device instead of host_device ?
Yes. There is a VIRTIO feature bit, some configuration space fields, etc. virtio-blk-pci detects when the blockdev is zoned and enables the feature bit.
I get a general sense of unease when frontend device ABI sensitive features get secretly toggled based on features exposed by the backend. When trying to validate ABI compatibility of guest configs, libvirt would generally compare frontend properties to look for differences. There are a small set of cases where backends affect frontend features, but it is not that common to see.

Consider what happens if we have a guest running on zoned storage, and we need to evacuate the host to a machine without zoned storage available. Could we replace the storage backend on the target host with a raw/qcow2 backend but keep pretending it is zoned storage to the guest? The guest would continue to batch its I/O ops for the zoned storage, which would be redundant for raw/qcow2, but presumably should still work.

If this is possible it would suggest the need to have explicit settings for zoned storage on the virtio-blk frontend. QEMU would "merely" validate that these settings are turned on, if the host storage is zoned too.
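[To make that concrete, the validation being suggested could look roughly like the self-contained sketch below. The struct and field names are hypothetical stand-ins for whatever explicit virtio-blk property ends up existing; this is not actual QEMU code.]

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical explicit frontend setting; stands in for a real
     * virtio-blk property that does not exist today. */
    struct blk_frontend_conf {
        bool zoned;
    };

    /* Sketch of the realize-time check: QEMU "merely" validates that the
     * explicit frontend setting is compatible with the backend. */
    static bool
    validate_zoned_config(const struct blk_frontend_conf *conf,
                          bool backend_is_zoned)
    {
        if (backend_is_zoned && !conf->zoned) {
            fprintf(stderr, "backend is zoned but the frontend was not "
                    "configured for zoned storage\n");
            return false;
        }
        /* A zoned frontend on a non-zoned (raw/qcow2) backend is the
         * evacuation scenario above: the guest keeps honouring zone
         * constraints the backend does not require, which should still
         * work, so it is allowed here. */
        return true;
    }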
This brings to mind a few questions:
1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none".
2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.
The drawbacks I see when QEMU detects zoned storage automatically: - You can't easily tell if a blockdev is zoned from the command-line. - It's possible to mismatch zoned and non-zoned devices across live migration.
What happens with existing QEMU impls if you use --blockdev host_device pointing to a /dev/$BDEV that is a zoned device ? If it succeeds and works correctly, then we likely need to continue to support that. This would push towards needing a new XML element.
Pointing host_device at a zoned device doesn't result in useful behavior because the guest is unaware that this is a zoned device. The guest won't be able to access the device correctly (i.e. sequential writes only). Write requests will fail eventually.
I would consider zoned devices totally unsupported in QEMU today and we don't need to worry about preserving any kind of backwards compatibility with --blockdev host_device,filename=/dev/my_zoned_device.
So I guess I'm not so worried about host_device vs zoned_host_device, if we have explicit settings for controlled zoned behaviour on the virtio-blk frontend.

I feel like we should have something explicit somewhere though, as this is a pretty significant difference in the storage stack that I think mgmt apps should be aware of, as it has implications for how you manage the VMs on an ongoing basis.

We could still have it "do what I mean" by default though, e.g. the virtio-blk setting defaults could imply "match the host", so we get effectively a tri-state (zoned=on/off/auto).

With regards,
Daniel
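[A minimal sketch of that tri-state, assuming it were modelled the way QEMU usually models on/off/auto properties; the enum and its semantics are illustrative, not an existing virtio-blk option.]

    /* Illustrative tri-state for a hypothetical virtio-blk "zoned" property. */
    enum zoned_mode {
        ZONED_MODE_AUTO,  /* default: match whatever the host backend is */
        ZONED_MODE_ON,    /* always expose a zoned device to the guest */
        ZONED_MODE_OFF    /* always expose a regular device to the guest */
    };

[How each value would be resolved against the backend is sketched further down in the thread, once the semantics are spelled out.]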

On 1/30/23 21:21, Daniel P. Berrangé wrote:
On Wed, Jan 11, 2023 at 10:24:30AM -0500, Stefan Hajnoczi wrote:
On Tue, Jan 10, 2023 at 03:29:47PM +0000, Daniel P. Berrangé wrote:
On Tue, Jan 10, 2023 at 10:19:51AM -0500, Stefan Hajnoczi wrote:
Hi Peter, Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:
--blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,... --device virtio-blk-pci,drive=drive0
Note that regular --blockdev host_device will not work.
For now the virtio-blk device is the only one that supports zoned blockdevs.
Does the virtio-blk device's exposed guest ABI differ at all when connected to zoned_host_device instead of host_device ?
Yes. There is a VIRTIO feature bit, some configuration space fields, etc. virtio-blk-pci detects when the blockdev is zoned and enables the feature bit.
I get a general sense of unease when frontend device ABI sensitive features get secretly toggled based on features exposed by the backend.
When trying to validate ABI compatibility of guest configs, libvirt would generally compare frontend properties to look for differences.
There are a small set of cases where backends affect frontend features, but it is not that common to see.
Consider what happens if we have a guest running on zoned storage, and we need to evacuate the host to a machine without zoned storage available. Could we replace the storage backend on the target host with a raw/qcow2 backend but keep pretending it is zoned storage to the guest? The guest would continue to batch its I/O ops for the zoned storage, which would be redundant for raw/qcow2, but presumably should still work. If this is possible it would suggest the need to have explicit settings for zoned storage on the virtio-blk frontend. QEMU would "merely" validate that these settings are turned on, if the host storage is zoned too.
This brings to mind a few questions:
1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none".
2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.
The drawbacks I see when QEMU detects zoned storage automatically: - You can't easily tell if a blockdev is zoned from the command-line. - It's possible to mismatch zoned and non-zoned devices across live migration.
What happens with existing QEMU impls if you use --blockdev host_device pointing to a /dev/$BDEV that is a zoned device ? If it succeeds and works correctly, then we likely need to continue to support that. This would push towards needing a new XML element.
Pointing host_device at a zoned device doesn't result in useful behavior because the guest is unaware that this is a zoned device. The guest won't be able to access the device correctly (i.e. sequential writes only). Write requests will fail eventually.
I would consider zoned devices totally unsupported in QEMU today and we don't need to worry about preserving any kind of backwards compatibility with --blockdev host_device,filename=/dev/my_zoned_device.
So I guess I'm not so worried about host_device vs zoned_host_device, if we have explicit settings for controlled zoned behaviour on the virtio-blk frontend.
I feel like we should have something explicit somewhere though, as this is a pretty significant difference in the storage stack, that I think mgmt apps should be aware of, as it has implications for how you manage the VMs on an ongoing basis.
We could still have it "do what I mean" by default though, e.g. the virtio-blk setting defaults could imply "match the host", so we get effectively a tri-state (zoned=on/off/auto).
What would zoned=on mean ? If the backend is not zoned, virtio will expose a regular block device to the guest as it should.

For zoned=auto, same, I am not sure what that would achieve. If the backend is zoned, it will be seen as zoned by the guest. If the backend is a regular disk, it will be exposed as a regular disk. So what would this option achieve ?

And for zoned=off, I guess you would want to ignore a backend drive if it is zoned ?
--
Damien Le Moal
Western Digital Research

On Mon, Jan 30, 2023 at 09:30:40PM +0900, Damien Le Moal wrote:
On 1/30/23 21:21, Daniel P. Berrangé wrote:
On Wed, Jan 11, 2023 at 10:24:30AM -0500, Stefan Hajnoczi wrote:
On Tue, Jan 10, 2023 at 03:29:47PM +0000, Daniel P. Berrangé wrote:
On Tue, Jan 10, 2023 at 10:19:51AM -0500, Stefan Hajnoczi wrote:
Hi Peter, Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:
--blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,... --device virtio-blk-pci,drive=drive0
Note that regular --blockdev host_device will not work.
For now the virtio-blk device is the only one that supports zoned blockdevs.
Does the virtio-blk device's exposed guest ABI differ at all when connected to zoned_host_device instead of host_device ?
Yes. There is a VIRTIO feature bit, some configuration space fields, etc. virtio-blk-pci detects when the blockdev is zoned and enables the feature bit.
I get a general sense of unease when frontend device ABI sensitive features get secretly toggled based on features exposed by the backend.
When trying to validate ABI compatibility of guest configs, libvirt would generally compare frontend properties to look for differences.
There are a small set of cases where backends affect frontend features, but it is not that common to see.
Consider what happens if we have a guest running on zoned storage, and we need to evacuate the host to a machine without zoned storage available. Could we replace the storage backend on the target host with a raw/qcow2 backend but keep pretending it is zoned storage to the guest? The guest would continue to batch its I/O ops for the zoned storage, which would be redundant for raw/qcow2, but presumably should still work. If this is possible it would suggest the need to have explicit settings for zoned storage on the virtio-blk frontend. QEMU would "merely" validate that these settings are turned on, if the host storage is zoned too.
This brings to mind a few questions:
1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none".
2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.
The drawbacks I see when QEMU detects zoned storage automatically: - You can't easily tell if a blockdev is zoned from the command-line. - It's possible to mismatch zoned and non-zoned devices across live migration.
What happens with existing QEMU impls if you use --blockdev host_device pointing to a /dev/$BDEV that is a zoned device ? If it succeeds and works correctly, then we likely need to continue to support that. This would push towards needing a new XML element.
Pointing host_device at a zoned device doesn't result in useful behavior because the guest is unaware that this is a zoned device. The guest won't be able to access the device correctly (i.e. sequential writes only). Write requests will fail eventually.
I would consider zoned devices totally unsupported in QEMU today and we don't need to worry about preserving any kind of backwards compatibility with --blockdev host_device,filename=/dev/my_zoned_device.
So I guess I'm not so worried about host_device vs zoned_host_device, if we have explicit settings for controlled zoned behaviour on the virtio-blk frontend.
I feel like we should have something explicit somewhere though, as this is a pretty significant difference in the storage stack, that I think mgmt apps should be aware of, as it has implications for how you manage the VMs on an ongoing basis.
We could still have it "do what I mean" by default though, e.g. the virtio-blk setting defaults could imply "match the host", so we get effectively a tri-state (zoned=on/off/auto).
What would zoned=on mean ? If the backend is not zoned, virtio will expose a regular block device to the guest as it should.
Sorry, I should have expanded further, I didn't mean that alone. It would also need to expose the related settings of the virtio-blk device:
+    virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
+                 bs->bl.zone_size / 512);
+    virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
+                 bs->bl.max_active_zones);
+    virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
+                 bs->bl.max_open_zones);
+    virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
+    virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
+                 bs->bl.max_append_sectors);
so e.g.

  -device virtio-blk,zoned=on,zone_sectors=NN,max_active_zones=NN,max_open_zones=NN....

So the guest would be honouring these zone constraints, even though they are not required by a raw/qcow2 file.

In this world

  -device virtio-blk,zoned=on

would be a short hand to say get the rest of the tunables from the backend device, or error if the backend doesn't support them.

  -device virtio-blk,zoned=auto

would be a short hand to say "do the right thing" regardless of whether the backend is zoned or non-zoned.
For zoned=auto, same, I am not sure what that would achieve. If the backend is zoned, it will be seen as zoned by the guest. If the backend is a regular disk, it will be exposed as a regular disk. So what would this option achieve ?
And for zoned=off, I guess you would want to ignore a backend drive if it is zoned ?
It would explicitly report an error, since IIUC from Stefan's reply, this scenario would eventually end in I/O failures.

With regards,
Daniel
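[Pulling the three cases together as described above (zoned=on takes its tunables from the backend or from explicit properties, zoned=auto mirrors the backend, zoned=off rejects a zoned backend), a resolution step might look like this sketch. All names are hypothetical; the illustrative enum from earlier in the thread is repeated so the snippet stands alone.]

    #include <stdbool.h>

    enum zoned_mode { ZONED_MODE_AUTO, ZONED_MODE_ON, ZONED_MODE_OFF };

    /* Sketch: decide what to expose to the guest; returns false for a
     * configuration that should be rejected up front. */
    static bool
    resolve_zoned_mode(enum zoned_mode mode, bool backend_is_zoned,
                       bool *expose_zoned)
    {
        switch (mode) {
        case ZONED_MODE_AUTO:
            /* "do the right thing": mirror the backend */
            *expose_zoned = backend_is_zoned;
            return true;
        case ZONED_MODE_ON:
            /* zone tunables come from a zoned backend, or must be given
             * explicitly (zone_sectors=..., max_open_zones=..., etc.) */
            *expose_zoned = true;
            return true;
        case ZONED_MODE_OFF:
            /* a zoned backend behind a non-zoned frontend eventually hits
             * I/O failures, so report an error instead of proceeding */
            *expose_zoned = false;
            return !backend_is_zoned;
        }
        return false;
    }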

On Mon, Jan 30, 2023 at 12:53:22PM +0000, Daniel P. Berrangé wrote:
On Mon, Jan 30, 2023 at 09:30:40PM +0900, Damien Le Moal wrote:
On 1/30/23 21:21, Daniel P. Berrangé wrote:
On Wed, Jan 11, 2023 at 10:24:30AM -0500, Stefan Hajnoczi wrote:
On Tue, Jan 10, 2023 at 03:29:47PM +0000, Daniel P. Berrangé wrote:
On Tue, Jan 10, 2023 at 10:19:51AM -0500, Stefan Hajnoczi wrote:
Hi Peter, Zoned storage support (https://zonedstorage.io/docs/introduction/zoned-storage) is being added to QEMU. Given a zoned host block device, the QEMU syntax will look like this:
--blockdev zoned_host_device,node-name=drive0,filename=/dev/$BDEV,... --device virtio-blk-pci,drive=drive0
Note that regular --blockdev host_device will not work.
For now the virtio-blk device is the only one that supports zoned blockdevs.
Does the virtio-blk device's exposed guest ABI differ at all when connected to zoned_host_device instead of host_device ?
Yes. There is a VIRTIO feature bit, some configuration space fields, etc. virtio-blk-pci detects when the blockdev is zoned and enables the feature bit.
I get a general sense of unease when frontend device ABI sensitive features get secretly toggled based on features exposed by the backend.
When trying to validate ABI compatibility of guest configs, libvirt would generally compare frontend properties to look for differences.
There are a small set of cases where backends affect frontend features, but it is not that common to see.
Consider what happens if we have a guest running on zoned storage, and we need to evacuate the host to a machine without zoned storage available. Could we replace the storage backend on the target host with a raw/qcow2 backend but keep pretending it is zoned storage to the guest? The guest would continue to batch its I/O ops for the zoned storage, which would be redundant for raw/qcow2, but presumably should still work. If this is possible it would suggest the need to have explicit settings for zoned storage on the virtio-blk frontend. QEMU would "merely" validate that these settings are turned on, if the host storage is zoned too.
This brings to mind a few questions:
1. Does libvirt need domain XML syntax for zoned storage? Alternatively, it could probe /sys/block/$BDEV/queue/zoned and generate the correct QEMU command-line arguments for zoned devices when the contents of the file are not "none".
2. Should QEMU --blockdev host_device detect zoned devices so that --blockdev zoned_host_device is not necessary? That way libvirt would automatically support zoned storage without any domain XML syntax or libvirt code changes.
The drawbacks I see when QEMU detects zoned storage automatically: - You can't easily tell if a blockdev is zoned from the command-line. - It's possible to mismatch zoned and non-zoned devices across live migration.
What happens with existing QEMU impls if you use --blockdev host_device pointing to a /dev/$BDEV that is a zoned device ? If it succeeds and works correctly, then we likely need to continue to support that. This would push towards needing a new XML element.
Pointing host_device at a zoned device doesn't result in useful behavior because the guest is unaware that this is a zoned device. The guest won't be able to access the device correctly (i.e. sequential writes only). Write requests will fail eventually.
I would consider zoned devices totally unsupported in QEMU today and we don't need to worry about preserving any kind of backwards compatibility with --blockdev host_device,filename=/dev/my_zoned_device.
So I guess I'm not so worried about host_device vs zoned_host_device, if we have explicit settings for controlled zoned behaviour on the virtio-blk frontend.
I feel like we should have something explicit somewhere though, as this is a pretty significant difference in the storage stack, that I think mgmt apps should be aware of, as it has implications for how you manage the VMs on an ongoing basis.
We could still have it "do what I mean" by default though, e.g. the virtio-blk setting defaults could imply "match the host", so we get effectively a tri-state (zoned=on/off/auto).
What would zoned=on mean ? If the backend is not zoned, virtio will expose a regular block device to the guest as it should.
Sorry, I should have expanded further, I didn't mean that alone. It would also need to expose the related settings of the virtio-blk device:
+    virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors,
+                 bs->bl.zone_size / 512);
+    virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones,
+                 bs->bl.max_active_zones);
+    virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones,
+                 bs->bl.max_open_zones);
+    virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size);
+    virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors,
+                 bs->bl.max_append_sectors);
so eg
-device virtio-blk,zoned=on,zone_sectors=NN,max_active_zones=NN,max_open_zones=NN....
So the guest would be honouring these zone constraints, even though they are not required by a raw/qcow2 file.
in this world
-device virtio-blk,zoned=on
would be a short hand to say get the rest of the tunables from the backend device, or error if the backend doesn't support them.
-device virtio-blk,zoned=auto
would be a short hand to say "do the right thing" regardless of whether the backend is zoned or non-zoned.
For zoned=auto, same, I am not sure what that would achieve. If the backend is zoned, it will be seen as zoned by the guest. If the backend is a regular disk, it will be exposed as a regular disk. So what would this option achieve ?
And for zoned=off, I guess you would want to ignore a backend drive if it is zoned ?
It would explicitly report an error, since IIUC from Stefan's reply, this scenario would eventually end in I/O failures.
What you've described sounds good to me:

1. By default it exposes the device, no questions asked.

2. Management tools like libvirt can explicitly request zoned=on/off, zone_sectors=..., etc. to prevent misconfiguration.

Best of both worlds.

Stefan