[libvirt] RFC: exposing qemu's block-set-write-threshold

I'm trying to wire up libvirt to do event reporting based on qemu 2.3's BLOCK_WRITE_THRESHOLD event. Doing this will allow management applications to do event-based notification on when to enlarge LVM (or other) storage underlying a qcow2 volume, rather than their current requirement to frequently poll block statistics. But I'm stuck on the best way to expose the new parameter:

One idea is to treat it as part of the domain XML, and have virDomainSetBlockIoTune add one more typed parameter for a disk's current write threshold. Doing this could allow setting a threshold even for an offline domain (although the threshold is only meaningful for a running domain), but might get weird because qemu's event is one-shot (you have to re-arm a new threshold every time an existing threshold fires - so every time it fires, the domain XML is rewritten, even though it is not guest-visible ABI that was changing). At least with this approach, it is also easy for a client to poll the current setting of the threshold, via virDomainGetBlockIoTune. But the threshold isn't quite a tuning parameter (it isn't throttling how fast the guest can write to the block device, only how full the host side can get in order to allow transparent resizing of the host storage prior to running out of space).

Another idea is to add a completely new API, maybe named virDomainBlockSetWriteThreshold(virDomainPtr dom, const char *disk, long long int threshold, unsigned int flags) (with threshold in bytes). Here, virDomainBlockStatsFlags() could be a way to query the current threshold. And if desired, we could add a flag value to treat threshold as a percentage instead of a byte value (but is 1% too large of granularity, and how would you scale the percentage to anything finer while still keeping the parameter as long long int rather than double?)

Of course, I'd want virConnectGetAllDomainStats() to list the current threshold setting (0 if no threshold or if the event has already fired, non-zero if the threshold is still set waiting to fire), so that clients can query thresholds for multiple domains and multiple disks per domain in one API call. But I don't know if we have any good way to set multiple thresholds in one call (at least virDomainSetBlockIoTune must be called once per disk; it might be possible for my proposed virDomainBlockStatsFlags() to set a threshold for multiple disks if the disk name is passed as NULL - but then we're back to the question of what happens if the guest has multiple disks of different sizes; it's better to set per-disk thresholds than to assume all disks must be at the same byte or percentage threshold).

I'm also worried about what happens across libvirtd restarts - if the qemu event fires while libvirtd is unconnected, should libvirt be tracking that a threshold was registered in the XML, and upon reconnection check if qemu still has the threshold? If qemu no longer has a threshold, then libvirt can assume it missed the event, and generate one as part of reconnecting to the domain.

Thoughts? If I don't hear anything, I'll base my first implementation on adding a new API, rather than reusing virDomainSetBlockIoTune.

-- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
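[Editorial note: for concreteness, here is a minimal sketch of how a management application might drive the second idea. The virDomainBlockSetWriteThreshold() signature is only the proposal from this mail (it does not exist in any released libvirt), and the 20 GiB value and helper name are arbitrary illustrations.]

#include <libvirt/libvirt.h>
#include <stdio.h>

/* Proposed, not yet existing, entry point as sketched above: */
int virDomainBlockSetWriteThreshold(virDomainPtr dom, const char *disk,
                                    long long int threshold,
                                    unsigned int flags);

static int
arm_vda_threshold(virDomainPtr dom)
{
    long long int threshold = 20LL * 1024 * 1024 * 1024; /* fire once 20 GiB are allocated */

    /* flags == 0: threshold expressed in bytes, per the proposal */
    if (virDomainBlockSetWriteThreshold(dom, "vda", threshold, 0) < 0) {
        fprintf(stderr, "failed to arm write threshold on vda\n");
        return -1;
    }
    return 0;
}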

On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
I'm trying to wire up libvirt to do event reporting based on qemu 2.3's BLOCK_WRITE_THRESHOLD event. Doing this will allow management applications to do event-based notification on when to enlarge LVM (or other) storage underlying a qcow2 volume, rather than their current requirement to frequently poll block statistics. But I'm stuck on the best way to expose the new parameter:
One idea is to treat it as part of the domain XML, and have virDomainSetBlockIoTune add one more typed parameter for a disk's current write threshold. Doing this could allow setting a threshold
Since virDomainSetBlockIoTune operates on disk-level and the event will need to be registered on a backing-chain element level, using virDomainSetBlockIoTune won't be a good choice, IMO.
even for an offline domain (although the threshold is only meaningful for a running domain), but might get weird because qemu's event is one-shot (you have to re-arm a new threshold every time an existing threshold fires - so every time it fires, the domain XML is rewritten, even though it is not guest-visible ABI that was changing). At least
Having the configuration exposed in the XML might make sense, although currently it won't allow specifying thresholds for anything but the top image, which will require users of libvirt to additionally register the event on the backing chain elements. Since the allocation of the backing chain elements can change only once a block job is started, it should be safe enough to allow this only via the API; once libvirt tracks the full backing chain, we can use the XML config too.
with this approach, it is also easy for a client to poll the current setting of the threshold, via virDomainGetBlockIoTune. But the threshold isn't quite a tuning parameter (it isn't throttling how fast the guest can write to the block device, only how full the host side can get in order to allow transparent resizing of the host storage prior to running out of space).
Again, virDomainGetBlockIoTune won't work on individual elements and using it that way would also be impractical.
Another idea is to add a completely new API, maybe named virDomainBlockSetWriteThreshold(virDomainPtr dom, const char *disk, long long int threshold, unsigned int flags) (with threshold in bytes).
Why is the threshold a signed value? I can't imagine a use case where negative values could be used. As for the @disk parameter it will need to take the target with the index argument since I know that oVirt is using the same approach also for backing chain sub-elements hosted on LVM when doing snapshot merging via the block job APIs. This also implies another required thing for this to be actually usable. Since the block jobs happening on the backing chain can trigger the event on a member of the backing chain, the returned event will need to contain the disk identification in a way that is unique across backing chain alterations. While for local files we could again opt to use the path, this won't be scalable to non-local devices. Thus I think the best way will be to also include the disk target with index. This though will require using node-names for tracking and/or generating the indexes in the backing store in a deterministic way.
Here, virDomainBlockStatsFlags() could be a way to query the current threshold. And if desired, we could add a flag value to treat threshold
virDomainBlockStatsFlags does not operate on backing chain subelements, so it would need to be instrumented to do so.
as a percentage instead of a byte value (but is 1% too large of granularity, and how would you scale the percentage to anything finer while still keeping the parameter as long long int rather than double?)
You can use a proportional unit with a larger fractional part: per mille, parts per million, parts per billion, etc.
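[Editorial note: for example, parts per million keep the typed parameter an integer while giving far finer resolution than whole percent. A small conversion sketch, with the helper name made up for illustration:]

/* Convert a proportional threshold given in parts per million of the
 * device capacity into an absolute byte value, using only integers. */
static unsigned long long
ppm_to_bytes(unsigned long long capacity, unsigned long long ppm)
{
    /* split the computation to avoid overflow;
     * e.g. 995000 ppm of a 100 GiB device is 99.5 GiB */
    return capacity / 1000000 * ppm + capacity % 1000000 * ppm / 1000000;
}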
Of course, I'd want virConnectGetAllDomainStats() to list the current threshold setting (0 if no threshold or if the event has already fired, non-zero if the threshold is still set waiting to fire), so that clients can query thresholds for multiple domains and multiple disks per domain in one API call. But I don't know if we have any good way to set
Not only disks but for separate backing chain elements too.
multiple thresholds in one call (at least virDomainSetBlockIoTune must be called once per disk; it might be possible for my proposed virDomainBlockStatsFlags() to set a threshold for multiple disks if the disk name is passed as NULL - but then we're back to the question of what happens if the guest has multiple disks of different sizes; it's better to set per-disk thresholds than to assume all disks must be at the same byte or percentage threshold).
That is just usage-sugar for the users. I'd rather avoid doing this on multiple disks simultaneously.
I'm also worried about what happens across libvirtd restarts - if the qemu event fires while libvirtd is unconnected, should libvirt be tracking that a threshold was registered in the XML, and upon reconnection check if qemu still has the threshold? If qemu no longer has a threshold, then libvirt can assume it missed the event, and generate one as part of reconnecting to the domain.
Libvirt should have enough information to check whether the event happened, and should be able to decide that it in fact missed the event and emit it itself. The new block copy API should also add a new typed parameter that will allow setting the write threshold once you are using it in a similar way with an LV as the backing store. Peter

On 05/19/2015 05:52 AM, Peter Krempa wrote:
On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
I'm trying to wire up libvirt to do event reporting based on qemu 2.3's BLOCK_WRITE_THRESHOLD event. Doing this will allow management applications to do event-based notification on when to enlarge LVM (or other) storage underlying a qcow2 volume, rather than their current requirement to frequently poll block statistics. But I'm stuck on the best way to expose the new parameter:
One idea is to treat it as part of the domain XML, and have virDomainSetBlockIoTune add one more typed parameter for a disk's current write threshold. Doing this could allow setting a threshold
Since virDomainSetBlockIoTune operates on disk-level and the event will need to be registered on a backing-chain element level, using virDomainSetBlockIoTune won't be a good choice, IMO.
Oh, I wasn't even thinking about backing-chain levels. You are right - it is entirely possible for someone to want to set a threshold reporting for "vda[1]" (the backing element of "vda") - of course, it would only kick in during a blockcommit to vda[1], but it is still important to allow. But the BlockIoTune XML is not (yet) geared for backing images. On the other hand, it might make sense to allow BlockIoTune on backing chains - for a difference in throttling between the main image and its backing image. That is, I could possibly see a case where a local image is based on top of a network backing file, and where we want to read the local image with no throttling, but read the backing file with rate limiting in effect to avoid saturating the network; in such a setup, the user is likely going to do a blockpull to move data off the network onto the local copy, but doesn't want the pull to affect performance. Or conversely, someone could have a setup where the backing file has no rate limit, but the active file is rate-limited (and thus the guest performs faster the closer it is to the original backing file, as a way of measuring how much the guest differs from the golden image). Of course, we're still waiting for per-node throttling to land in qemu: https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg01196.html
even for an offline domain (although the threshold is only meaningful for a running domain), but might get weird because qemu's event is one-shot (you have to re-arm a new threshold every time an existing threshold fires - so every time it fires, the domain XML is rewritten, even though it is not guest-visible ABI that was changing). At least
Having the configuration exposed in the XML might make sense, although currently it won't allow specifying thresholds for anything but the top image, which will require users of libvirt to additionally register the event on the backing chain elements. Since the allocation of the backing chain elements can change only once a block job is started, it should be safe enough to allow this only via the API; once libvirt tracks the full backing chain, we can use the XML config too.
with this approach, it is also easy for a client to poll the current setting of the threshold, via virDomainGetBlockIoTune. But the threshold isn't quite a tuning parameter (it isn't throttling how fast the guest can write to the block device, only how full the host side can get in order to allow transparent resizing of the host storage prior to running out of space).
Again, virDomainGetBlockIoTune won't work on individual elements and using it that way would also be impractical.
Okay, I'm fairly convinced that reusing virDomainGetBlockIoTune is not the right approach, and therefore adding a new API (and requiring the .so bump) is going to be required.
Another idea is to add a completely new API, maybe named virDomainBlockSetWriteThreshold(virDomainPtr dom, const char *disk, long long int threshold, unsigned int flags) (with threshold in bytes).
Why is the threshold a signed value? I can't imagine a use case where negative values could be used.
I've swapped to unsigned long long in my current patches, but see below [1].
As for the @disk parameter it will need to take the target with the index argument since I know that oVirt is using the same approach also for backing chain sub-elements hosted on LVM when doing snapshot merging via the block job APIs.
Good thing we already have code for resolving backing chain index in disk names.
This also implies another required thing for this to be actually usable. Since the block jobs happening on the backing chain can trigger the event on a member of the backing chain, the returned event will need to contain the disk identification in a way that is unique across backing chain alterations.
Right now, I'm planning on the event looking like:

typedef void (*virConnectDomainEventWriteThresholdCallback) (virConnectPtr conn, virDomainPtr dom, const char *devAlias, unsigned long long threshold, unsigned long long length, void *opaque);

Remember, the event callback can only be registered once per domain, so it HAS to include disk information (whether "vda" or "vdb" at the top level), and it is not that much harder to make it include indexed disk information "vda[1]" if the event was triggered due to a commit to a backing file.
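[Editorial note: a client-side sketch of how that callback would be consumed. The event ID constant is hypothetical, something this series would have to add; virConnectDomainEventRegisterAny() and VIR_DOMAIN_EVENT_CALLBACK() are the existing registration mechanism.]

#include <libvirt/libvirt.h>
#include <stdio.h>

static void
write_threshold_cb(virConnectPtr conn, virDomainPtr dom, const char *devAlias,
                   unsigned long long threshold, unsigned long long length,
                   void *opaque)
{
    /* devAlias would be "vda", or "vda[1]" when a backing element fired;
     * 'length' presumably reports how far beyond the threshold the write went */
    printf("threshold %llu exceeded on %s (excess %llu)\n",
           threshold, devAlias, length);
}

static int
register_threshold_event(virConnectPtr conn, virDomainPtr dom)
{
    /* VIR_DOMAIN_EVENT_ID_WRITE_THRESHOLD is hypothetical here */
    return virConnectDomainEventRegisterAny(conn, dom,
                                            VIR_DOMAIN_EVENT_ID_WRITE_THRESHOLD,
                                            VIR_DOMAIN_EVENT_CALLBACK(write_threshold_cb),
                                            NULL, NULL);
}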
While for local files we could again opt to use the path, this won't be scalable to non-local devices. Thus I think the best way will be to also include the disk target with index.
You are right that the event must be tied to the disk alias and not the path name, as not all disks have a unique local path name.
This though will require using node-names for tracking and/or generating the indexes in the backing store in a deterministic way.
Here, virDomainBlockStatsFlags() could be a way to query the current threshold. And if desired, we could add a flag value to treat threshold
virDomainBlockStatsFlags does not operate on backing chain subelements, so it would need to be instrumented to do so.
Hmm, more work. At least that work is not going to affect .so versioning, like the new API for actually setting the threshold will have to do.
as a percentage instead of a byte value (but is 1% too large of granularity, and how would you scale the percentage to anything finer while still keeping the parameter as long long int rather than double?)
You can use a proportional unit with a larger fractional part: per mille, parts per million, parts per billion, etc.
[1] Indeed, we can add more and more flags, but we'll see if it makes sense. It might also be nice to allow a negative threshold, as in setting a threshold of -1*1024*1024*1024 to trigger when the disk comes within 1 gigabyte of running out of space, regardless of how many gigabytes it currently contains (easier than calling virDomainBlockInfo and doing the computation myself). That could be done by allowing a signed threshold, or by keeping threshold unsigned but adding a flag that says the threshold is relative to the tail of the file rather than the beginning. But adding flags can be done later; my first implementation will not define any flags (bytes only, no percentage or relative-to-end values).
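[Editorial note: until such a flag exists, the tail-relative computation would have to be done by hand with the existing virDomainGetBlockInfo(). A sketch follows; using 'physical' as the reference size is an assumption that fits the LV case, and virDomainBlockSetWriteThreshold() is still only the proposed API from this thread.]

#include <libvirt/libvirt.h>

static int
arm_tail_threshold(virDomainPtr dom, const char *disk)
{
    virDomainBlockInfo info;
    unsigned long long margin = 1ULL << 30;   /* 1 GiB before the end */

    if (virDomainGetBlockInfo(dom, disk, &info, 0) < 0)
        return -1;
    if (info.physical <= margin)
        return -1;                            /* already within the margin */

    /* proposed API from this thread, not an existing libvirt function */
    return virDomainBlockSetWriteThreshold(dom, disk, info.physical - margin, 0);
}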
Of course, I'd want virConnectGetAllDomainStats() to list the current threshold setting (0 if no threshold or if the event has already fired, non-zero if the threshold is still set waiting to fire), so that clients can query thresholds for multiple domains and multiple disks per domain in one API call. But I don't know if we have any good way to set
Not only disks but for separate backing chain elements too.
Thankfully GetAllDomainStats is already wired to report backing chain element details.
multiple thresholds in one call (at least virDomainSetBlockIoTune must be called once per disk; it might be possible for my proposed virDomainBlockStatsFlags() to set a threshold for multiple disks if the disk name is passed as NULL - but then we're back to the question of what happens if the guest has multiple disks of different sizes; it's better to set per-disk thresholds than to assume all disks must be at the same byte or percentage threshold).
That is just usage-sugar for the users. I'd rather avoid doing this on multiple disks simultaneously.
Good - then I won't worry about it; the new API will make disk name mandatory. (Setting to a percentage or to a relative-to-tail might make more sense across multiple disks, but on the other hand, setting a threshold will be a rare thing; and while first starting the domain has to set a threshold on all disks, later re-arming of the trigger will be on one disk at a time as events happen; making the startup case more efficient is not going to be the bottleneck in management).
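[Editorial note: so the expected flow, sketched with the proposed API, arms every disk once at startup and then re-arms only the disk named in each event; enlarge_backing_lv() and GROWTH_BYTES are placeholders for whatever the management application does to grow the storage.]

#include <libvirt/libvirt.h>

#define GROWTH_BYTES (5ULL << 30)   /* arbitrary re-arm step: 5 GiB */

/* placeholder for the management-app specific storage resize */
void enlarge_backing_lv(virDomainPtr dom, const char *devAlias);

static void
on_threshold(virDomainPtr dom, const char *devAlias, unsigned long long threshold)
{
    enlarge_backing_lv(dom, devAlias);
    /* qemu's threshold is one-shot, so re-arm it for just this disk
     * (virDomainBlockSetWriteThreshold is the proposed, not existing, API) */
    virDomainBlockSetWriteThreshold(dom, devAlias, threshold + GROWTH_BYTES, 0);
}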
I'm also worried about what happens across libvirtd restarts - if the qemu event fires while libvirtd is unconnected, should libvirt be tracking that a threshold was registered in the XML, and upon reconnection check if qemu still has the threshold? If qemu no longer has a threshold, then libvirt can assume it missed the event, and generate one as part of reconnecting to the domain.
Libvirt should have enough information to check whether the event happened, and should be able to decide that it in fact missed the event and emit it itself.
The new block copy API should also add a new typed parameter that will allow setting the write threshold once you are using it in a similar way with an LV as the backing store.
Ah, as in arm a threshold in the same API that starts a block job. Makes sense. But won't require a .so bump, so doesn't have to be done in my first posting of the series.
Peter
Thanks for the ideas; now for me to crank out my proof-of-concept code. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 05/19/2015 06:42 AM, Eric Blake wrote:
On the other hand, it might make sense to allow BlockIoTune on backing chains - for a difference in throttling between the main image and its backing image. That is, I could possibly see a case where a local image is based on top of a network backing file, and where we want to read the local image with no throttling, but read the backing file with rate limiting in effect to avoid saturating the network; in such a setup, the user is likely going to do a blockpull to move data off the network onto the local copy, but doesn't want the pull to affect performance. Or conversely, someone could have a setup where the backing file has no rate limit, but the active file is rate-limited (and thus the guest performs faster the closer it is to the original backing file, as a way of measuring how much the guest differs from the golden image). Of course, we're still waiting for per-node throttling to land in qemu: https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg01196.html
And _while I was typing_, that got bumped from v7 to v8: https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg03716.html -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

(CCing Nir, from the oVirt/RHEV storage team) ----- Original Message -----
From: "Eric Blake" <eblake@redhat.com> To: "Peter Krempa" <pkrempa@redhat.com> Cc: libvir-list@redhat.com Sent: Tuesday, May 19, 2015 2:42:16 PM Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold
Hi, sorry for joining pretty late. Let me add a few notes from the perspective of a consumer (although a very interested one! :) of the API.
On 05/19/2015 05:52 AM, Peter Krempa wrote:
On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
I'm trying to wire up libvirt to do event reporting based on qemu 2.3's BLOCK_WRITE_THRESHOLD event. Doing this will allow management applications to do event-based notification on when to enlarge LVM (or other) storage underlying a qcow2 volume, rather than their current requirement to frequently poll block statistics. But I'm stuck on the best way to expose the new parameter:
I read the thread and I'm pretty sure this will be a silly question, but I want to make sure I am on the same page and I'm not somehow confused by the terminology.

Let's consider the simplest of the situation we face in oVirt:

(thin provisioned qcow2 disk on LV)

vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]

Isn't the LV here the 'backing file' (actually, backing block device) of the disk? because nowadays we are interested exactly in the events from the LV, hence, IIUC vda[1] below.
One idea is to treat it as part of the domain XML, and have virDomainSetBlockIoTune add one more typed parameter for a disk's current write threshold. Doing this could allow setting a threshold
Since virDomainSetBlockIoTune operates on disk-level and the event will need to be registered on a backing-chain element level, using virDomainSetBlockIoTune won't be a good choice, IMO.
Oh, I wasn't even thinking about backing-chain levels. You are right - it is entirely possible for someone to want to set a threshold reporting for "vda[1]" (the backing element of "vda") - of course, it would only kick in during a blockcommit to vda[1], but it is still important to allow. But the BlockIoTune XML is not (yet) geared for backing images.
Agreed about SetBlockIoTune not looking the best choice yet.
As for the @disk parameter it will need to take the target with the index argument since I know that oVirt is using the same approach also for backing chain sub-elements hosted on LVM when doing snapshot merging via the block job APIs.
That's correct
This also implies another required thing for this to be actually usable. Since the block jobs happening on the backing chain can trigger the event on a member of the backing chain, the returned event will need to contain the disk identification in a way that is unique across backing chain alterations.
Right now, I'm planning on the event looking like:
typedef void (*virConnectDomainEventWriteThresholdCallback) (virConnectPtr conn, virDomainPtr dom, const char *devAlias, unsigned long long threshold, unsigned long long length, void *opaque);
Remember, the event callback can only be registered once per domain, so it HAS to include disk information (whether "vda" or "vdb" at the top level), and it is not that much harder to make it include indexed disk information "vda[1]" if the event was triggered due to a commit to a backing file.
I'm not sure how a client can match the index "vda[1]" with the chain node name. Maybe I'm missing some context here? (RTFM welcome if it contains pointers to the manual :))
as a percentage instead of a byte value (but is 1% too large of granularity, and how would you scale the percentage to anything finer while still keeping the parameter as long long int rather than double?)
You can use a proportional unit with a larger fractional part: per mille, parts per million, parts per billion, etc.
[1] Indeed, we can add more and more flags, but we'll see if it makes sense. It might also be nice to allow a negative threshold, as in setting a threshold of -1*1024*1024*1024 to trigger when the disk comes within 1 gigabyte of running out of space, regardless of how many gigabytes it currently contains (easier than calling virDomainBlockInfo and doing the computation myself). That could be done by allowing a signed threshold, or by keeping threshold unsigned but adding a flag that says the threshold is relative to the tail of the file rather than the beginning.
But adding flags can be done later; my first implementation will not define any flags (bytes only, no percentage or relative-to-end values).
BTW, in VDSM we set high water marks based purely on percentage, regardless of the disk size. We aren't too concerned by the granularity at this stage:
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=vdsm/virt/vmdevices/stor...
However, a larger fractional part and percentage-based thresholds look really cool, and should easily allow automatic re-arming of the event, which is even cooler ;)
Of course, I'd want virConnectGetAllDomainStats() to list the current threshold setting (0 if no threshold or if the event has already fired, non-zero if the threshold is still set waiting to fire), so that clients can query thresholds for multiple domains and multiple disks per domain in one API call. But I don't know if we have any good way to set
Looks nice
multiple thresholds in one call (at least virDomainSetBlockIoTune must be called once per disk; it might be possible for my proposed virDomainBlockStatsFlags() to set a threshold for multiple disks if the disk name is passed as NULL - but then we're back to the question of what happens if the guest has multiple disks of different sizes; it's better to set per-disk thresholds than to assume all disks must be at the same byte or percentage threshold).
That is just usage-sugar for the users. I'd rather avoid doing this on multiple disks simultaneously.
Good - then I won't worry about it; the new API will make disk name mandatory. (Setting to a percentage or to a relative-to-tail might make more sense across multiple disks, but on the other hand, setting a threshold will be a rare thing; and while first starting the domain has to set a threshold on all disks, later re-arming of the trigger will be on one disk at a time as events happen; making the startup case more efficient is not going to be the bottleneck in management).
I agree
I'm also worried about what happens across libvirtd restarts - if the qemu event fires while libvirtd is unconnected, should libvirt be tracking that a threshold was registered in the XML, and upon reconnection check if qemu still has the threshold? If qemu no longer has a threshold, then libvirt can assume it missed the event, and generate one as part of reconnecting to the domain.
Libvirt should have enough information to check whether the event happened, and should be able to decide that it in fact missed the event and emit it itself.
That would be awesome. There are flows (live storage migration?) on which we'll probably still need to poll disks, but definitely the more we (as libvirt API consumers) can depend on reliable delivery of the event, the better. The point here is to avoid racy checks in the management application as much as possible. Thanks and bests,
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani

On Thu, May 21, 2015 at 09:49:43 -0400, Francesco Romani wrote:
(CCing Nir, from the oVirt/RHEV storage team)
----- Original Message -----
From: "Eric Blake" <eblake@redhat.com> To: "Peter Krempa" <pkrempa@redhat.com> Cc: libvir-list@redhat.com Sent: Tuesday, May 19, 2015 2:42:16 PM Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold
Hi, sorry for joining pretty late. Let me add a few notes from the perspective of a consumer (although a very interested one! :) of the API.
On 05/19/2015 05:52 AM, Peter Krempa wrote:
On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
I'm trying to wire up libvirt to do event reporting based on qemu 2.3's BLOCK_WRITE_THRESHOLD event. Doing this will allow management applications to do event-based notification on when to enlarge LVM (or other) storage underlying a qcow2 volume, rather than their current requirement to frequently poll block statistics. But I'm stuck on the best way to expose the new parameter:
I read the thread and I'm pretty sure this will be a silly question, but I want to make sure I am on the same page and I'm not somehow confused by the terminology.
Let's consider the simplest of the situation we face in oVirt:
(thin provisioned qcow2 disk on LV)
vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
Isn't the LV here the 'backing file' (actually, backing block device) of the disk?
because nowadays we are interested exactly in the events from the LV, hence, IIUC vda[1] below.
Technically yes. We use the "backing file" term for anything below the top image that is actually backing the block device.
One idea is to treat it as part of the domain XML, and have virDomainSetBlockIoTune add one more typed parameter for a disk's current write threshold. Doing this could allow setting a threshold
Since virDomainSetBlockIoTune operates on disk-level and the event will need to be registered on a backing-chain element level, using virDomainSetBlockIoTune won't be a good choice, IMO.
Oh, I wasn't even thinking about backing-chain levels. You are right - it is entirely possible for someone to want to set a threshold reporting for "vda[1]" (the backing element of "vda") - of course, it would only kick in during a blockcommit to vda[1], but it is still important to allow. But the BlockIoTune XML is not (yet) geared for backing images.
Agreed about SetBlockIoTune not looking the best choice yet.
As for the @disk parameter it will need to take the target with the index argument since I know that oVirt is using the same approach also for backing chain sub-elements hosted on LVM when doing snapshot merging via the block job APIs.
That's correct
This also implies another required thing for this to be actually usable. Since the block jobs happening on the backing chain can trigger the event on a member of the backing chain, the returned event will need to contain the disk identification in a way that is unique across backing chain alterations.
Right now, I'm planning on the event looking like:
typedef void (*virConnectDomainEventWriteThresholdCallback) (virConnectPtr conn, virDomainPtr dom, const char *devAlias, unsigned long long threshold, unsigned long long length, void *opaque);
Remember, the event callback can only be registered once per domain, so it HAS to include disk information (whether "vda" or "vdb" at the top level), and it is not that much harder to make it include indexed disk information "vda[1]" if the event was triggered due to a commit to a backing file.
I'm not sure how a client can match the index "vda[1]" with the chain node name. Maybe I'm missing some context here? (RTFM welcome if it contains pointers to the manual :))
A live domain XML contains the backing chain information as perceived by libvirt. Every entry in the backing chain has an 'index' attribute which can be used in the square brackets. Currently the IDs are sequential, but you can't rely on that since it will change with node names, where the index will be unique for every backing chain member and will be kept static across operations that manipulate the block tree.
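[Editorial note: for reference, the index Peter mentions shows up in the live XML that virDomainGetXMLDesc() returns for a running domain; a small sketch, with the XML shape in the comment abbreviated.]

#include <libvirt/libvirt.h>
#include <stdio.h>
#include <stdlib.h>

static void
show_backing_indexes(virDomainPtr dom)
{
    char *xml = virDomainGetXMLDesc(dom, 0);   /* live XML for a running domain */
    if (!xml)
        return;
    /* Expect something along the lines of:
     *   <disk type='file' device='disk'>
     *     <target dev='vda' bus='virtio'/>
     *     <backingStore type='block' index='1'> ... </backingStore>
     *   </disk>
     * and "vda[1]" then refers to that indexed backingStore element. */
    puts(xml);
    free(xml);
}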
as a percentage instead of a byte value (but is 1% too large of granularity, and how would you scale the percentage to anything finer while still keeping the parameter as long long int rather than double?)
You can use a proportional unit with a larger fractional part: per mille, parts per million, parts per billion, etc.
[1] Indeed, we can add more and more flags, but we'll see if it makes sense. It might also be nice to allow a negative threshold, as in setting a threshold of -1*1024*1024*1024 to trigger when the disk comes within 1 gigabyte of running out of space, regardless of how many gigabytes it currently contains (easier than calling virDomainBlockInfo and doing the computation myself). That could be done by allowing a signed threshold, or by keeping threshold unsigned but adding a flag that says the threshold is relative to the tail of the file rather than the beginning.
But adding flags can be done later; my first implementation will not define any flags (bytes only, no percentage or relative-to-end values).
BTW, in VDSM we set high water marks based purely on percentage, regardless of the disk size. We aren't too concerned by the granularity at this stage
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=vdsm/virt/vmdevices/stor...
However, a larger fractional part and percentage-based thresholds look really cool, and should easily allow automatic re-arming of the event, which is even cooler ;)
The automatic re-arming depends on the qemu implementation. If qemu accepts only an absolute value, we will not be able to do that without excessive hacking, since libvirt will not know the moment the LV was resized and thus when it could re-arm the event.
Of course, I'd want virConnectGetAllDomainStats() to list the current threshold setting (0 if no threshold or if the event has already fired, non-zero if the threshold is still set waiting to fire), so that clients can query thresholds for multiple domains and multiple disks per domain in one API call. But I don't know if we have any good way to set
Looks nice
multiple thresholds in one call (at least virDomainSetBlockIoTune must be called once per disk; it might be possible for my proposed virDomainBlockStatsFlags() to set a threshold for multiple disks if the disk name is passed as NULL - but then we're back to the question of what happens if the guest has multiple disks of different sizes; it's better to set per-disk thresholds than to assume all disks must be at the same byte or percentage threshold).
That is just usage-sugar for the users. I'd rather avoid doing this on multiple disks simultaneously.
Good - then I won't worry about it; the new API will make disk name mandatory. (Setting to a percentage or to a relative-to-tail might make more sense across multiple disks, but on the other hand, setting a threshold will be a rare thing; and while first starting the domain has to set a threshold on all disks, later re-arming of the trigger will be on one disk at a time as events happen; making the startup case more efficient is not going to be the bottleneck in management).
I agree
I'm also worried about what happens across libvirtd restarts - if the qemu event fires while libvirtd is unconnected, should libvirt be tracking that a threshold was registered in the XML, and upon reconnection check if qemu still has the threshold? If qemu no longer has a threshold, then libvirt can assume it missed the event, and generate one as part of reconnecting to the domain.
Libvirt should have enough information to check whether the event happened, and should be able to decide that it in fact missed the event and emit it itself.
That would be awesome. There are flows (live storage migration?) on which we'll probably still need to poll disks, but definitely the more we (as libvirt API consumers) can depend on reliable delivery of the event, the better.
Hmm, theoretically the event could be armed on the destination of the migration once the live storage migration code starts (if we allow doing so) and then you'd be able to receive an event there too.
The point here is to avoid racy checks in the management application as much as possible.
Thanks and bests,
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani

[adding qemu] On 05/21/2015 07:49 AM, Francesco Romani wrote:
(CCing Nir, from the oVirt/RHEV storage team)
----- Original Message -----
From: "Eric Blake" <eblake@redhat.com> To: "Peter Krempa" <pkrempa@redhat.com> Cc: libvir-list@redhat.com Sent: Tuesday, May 19, 2015 2:42:16 PM Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold
Hi, sorry for joining pretty late. Let me add a few notes from the perspective of a consumer (although a very interested one! :) of the API.
Since you wrote the qemu side with the intent of being the end user, I welcome the feedback. (qemu commit e2462113, for those joining the conversation)
I read the thread and I'm pretty sure this will be a silly question, but I want to make sure I am on the same page and I'm not somehow confused by the terminology.
Let's consider the simplest of the situation we face in oVirt:
(thin provisioned qcow2 disk on LV)
vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
Isn't the LV here the 'backing file' (actually, backing block device) of the disk?
Restating what you wrote into libvirt terminology, I think this means that you have a <disk> where:
  <driver> is qcow2
  <source> is a local file name
  <target> names vda
  <backingStore index='1'> describes the backing LV:
    <driver> is also qcow2 (as polling allocation growth in order to resize on demand only makes sense for qcow2 format)
    <source> is /dev/mapper/$UUID
then indeed, "vda" is the local qcow2 file, and "vda[1]" is the backing file on the LV storage.

Normally, you only care about the write threshold at the active layer (the local file, with name "vda"), because that is the only image that will normally be allocating sectors. But in the case of active commit, where you are taking the thin-provisioned local file and writing its clusters back into the backing LV, the action of commit can allocate sectors in the backing file. Thus, libvirt wants to let you set a write-threshold on both parts of the backing chain (the active wrapper, and the LV backing file), where the event could fire on either node first. The existing libvirt virConnectGetAllDomainStats() can already be used to poll allocation growth (the block.N.allocation statistic in libvirt, or 'virtual-size' in QMP's 'ImageInfo'), but the event would let you drop polling.

However, while starting to code the libvirt side of things, I've hit a couple of snags with interacting with the qemu design. First, the 'block-set-write-threshold' command is allowed to set a threshold by 'node-name' (any BDS, whether active or backing), but libvirt is not yet setting 'node-name' for backing files (so even though libvirt knows how to resolve "vda[1]" to the backing chain, it does not yet have a way to tell qemu to set the threshold on that BDS until libvirt starts naming all nodes). Second, querying for the current threshold value is only possible in struct 'BlockDeviceInfo', which is reported as the top-level of each disk in 'query-block', and also for 'query-named-block-nodes'. However, when it comes to collecting block.N.allocation, libvirt is instead getting information from the sub-struct 'ImageInfo', which is reported recursively for BlockDeviceInfo in 'query-block' but not reported for 'query-named-block-nodes'. So it is that much harder to call 'query-named-block-nodes' and then correlate that information back into the tree of information for anything but the active image. So it may be a while before thresholds on "vda[1]" actually work for block commit; my initial implementation will just focus on the active image "vda".

I'm wondering if qemu can make it easier by duplicating threshold information into 'ImageInfo' rather than just 'BlockDeviceInfo', so that a single call to 'query-block' rather than a second call to 'query-named-block-nodes' can scrape the threshold information for every BDS in the chain. Then again, I know there is work underway to refactor qemu block throttling, to allow throttle values on backing images that differ from the active image; and throttling is currently reported in 'BlockDeviceInfo'. So I'm not sure yet if adding redundant information in 'ImageInfo' would help anything, or get in the way of the throttle group work.
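[Editorial note: for anyone wanting to experiment before the official API lands, the raw QMP command can be exercised through libvirt's qemu monitor passthrough, which taints the domain and is debug-only. A sketch, where the node name is whatever qemu reports for the BDS in question:]

#include <libvirt/libvirt.h>
#include <libvirt/libvirt-qemu.h>
#include <stdio.h>
#include <stdlib.h>

/* Debug-only: arm qemu's write threshold directly over QMP passthrough. */
static int
qmp_set_write_threshold(virDomainPtr dom, const char *node_name,
                        unsigned long long threshold)
{
    char cmd[256];
    char *reply = NULL;

    snprintf(cmd, sizeof(cmd),
             "{\"execute\": \"block-set-write-threshold\","
             " \"arguments\": {\"node-name\": \"%s\","
             " \"write-threshold\": %llu}}",
             node_name, threshold);

    if (virDomainQemuMonitorCommand(dom, cmd, &reply,
                                    VIR_DOMAIN_QEMU_MONITOR_COMMAND_DEFAULT) < 0)
        return -1;

    printf("%s\n", reply);
    free(reply);
    return 0;
}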
Since virDomainSetBlockIoTune operates on disk-level and the event will need to be registered on a backing-chain element level, using virDomainSetBlockIoTune won't be a good choice, IMO.
Oh, I wasn't even thinking about backing-chain levels. You are right - it is entirely possible for someone to want to set a threshold reporting for "vda[1]" (the backing element of "vda") - of course, it would only kick in during a blockcommit to vda[1], but it is still important to allow. But the BlockIoTune XML is not (yet) geared for backing images.
Agreed about SetBlockIoTune not looking the best choice yet.
Except that SetBlockIoTune is the one libvirt API that is currently using the throttle information in 'BlockDeviceInfo', and will soon have to be extended to manage throttle groups and throttle information on backing files anyways.
As for the @disk parameter it will need to take the target with the index argument since I know that oVirt is using the same approach also for backing chain sub-elements hosted on LVM when doing snapshot merging via the block job APIs.
That's correct
Hopefully I can get the initial support for "vda" events in, then we can tackle the additional work to add "vda[1]" events. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

----- Original Message -----
From: "Eric Blake" <eblake@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: libvir-list@redhat.com, "Nir Soffer" <nsoffer@redhat.com>, "Peter Krempa" <pkrempa@redhat.com>, qemu-devel@nongnu.org Sent: Friday, May 22, 2015 6:33:01 AM Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold
[adding qemu]
I read the thread and I'm pretty sure this will be a silly question, but I want to make sure I am on the same page and I'm not somehow confused by the terminology.
Let's consider the simplest of the situation we face in oVirt:
(thin provisioned qcow2 disk on LV)
vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
Isn't the LV here the 'backing file' (actually, backing block device) of the disk?
Restating what you wrote into libvirt terminology, I think this means
that you have a <disk> where:
  <driver> is qcow2
  <source> is a local file name
  <target> names vda
  <backingStore index='1'> describes the backing LV:
    <driver> is also qcow2 (as polling allocation growth in order to resize on demand only makes sense for qcow2 format)
    <source> is /dev/mapper/$UUID
Yes, exactly my point. I just want to be 100% sure that the three (slightly) different parlances of the three groups (oVirt/libvirt/QEMU) are aligned on the same meaning, and that we're not getting anything lost in translation
that you have a <disk> where:
  <driver> is qcow2
  <source> is a local file name
  <target> names vda
  <backingStore index='1'> describes the backing LV:
    <driver> is also qcow2 (as polling allocation growth in order to resize on demand only makes sense for qcow2 format)
    <source> is /dev/mapper/$UUID
For the final confirmation, here's the actual XML we produce:

<disk device="disk" snapshot="no" type="block">
  <address bus="0x00" domain="0x0000" function="0x0" slot="0x05" type="pci"/>
  <source dev="/rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5"/>
  <target bus="virtio" dev="vda"/>
  <serial>ee1295ee-7ddc-4030-be5e-4557538bc4d2</serial>
  <boot order="1"/>
  <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2"/>
</disk>

For the sake of completeness:

$ ls -lh /rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5
lrwxrwxrwx. 1 vdsm kvm 78 May 22 08:49 /rhev/data-center/00000002-0002-0002-0002-00000000014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5 -> /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/05a88a94-5bd6-4698-be69-39e78c84e1a5

$ ls -lh /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/
total 0
lrwxrwxrwx. 1 root root 8 May 22 08:49 05a88a94-5bd6-4698-be69-39e78c84e1a5 -> ../dm-11
lrwxrwxrwx. 1 root root 8 May 22 08:49 54673e6d-207d-4a66-8f0d-3f5b3cda78e5 -> ../dm-12
lrwxrwxrwx. 1 root root 9 May 22 08:49 ids -> ../dm-606
lrwxrwxrwx. 1 root root 9 May 22 08:49 inbox -> ../dm-607
lrwxrwxrwx. 1 root root 9 May 22 08:49 leases -> ../dm-605
lrwxrwxrwx. 1 root root 9 May 22 08:49 master -> ../dm-608
lrwxrwxrwx. 1 root root 9 May 22 08:49 metadata -> ../dm-603
lrwxrwxrwx. 1 root root 9 May 22 08:49 outbox -> ../dm-604

lvs | grep 05a88a94
  05a88a94-5bd6-4698-be69-39e78c84e1a5 12f68692-2a5a-4e48-af5e-4679bca7fd44 -wi-ao---- 14.12g
then indeed, "vda" is the local qcow2 file, and "vda[1]" is the backing file on the LV storage.
Normally, you only care about the write threshold at the active layer (the local file, with name "vda"), because that is the only image that will normally be allocating sectors. But in the case of active commit, where you are taking the thin-provisioned local file and writing its clusters back into the backing LV, the action of commit can allocate sectors in the backing file.
Right
Thus, libvirt wants to let you set a write-threshold on both parts of the backing chain (the active wrapper, and the LV backing file), where the event could fire on either node first. The existing libvirt virConnectGetAllDomainStats() can already be used to poll allocation growth (the block.N.allocation statistic in libvirt, or 'virtual-size' in QMP's 'ImageInfo'), but the event would let you drop polling.
Yes, exactly the intent
However, while starting to code the libvirt side of things, I've hit a couple of snags with interacting with the qemu design. First, the 'block-set-write-threshold' command is allowed to set a threshold by 'node-name' (any BDS, whether active or backing),
Yes, this emerged during the review of my patch. I first took the simplest approach (probably simplistic, in retrospect), but IIRC it was pointed out that setting by node-name grants the most flexible approach, hence it was required. See:
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02503.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02580.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02831.html
but libvirt is not yet setting 'node-name' for backing files (so even though libvirt knows how to resolve "vda[1]" to the backing chain,
I had vague memories of this, hence my clumsy and poorly worded question about how to resolve 'vda[1]' before :\
it does not yet have a way to tell qemu to set the threshold on that BDS until libvirt starts naming all nodes). Second, querying for the current threshold value is only possible in struct 'BlockDeviceInfo', which is reported as the top-level of each disk in 'query-block', and also for 'query-named-block-nodes'. However, when it comes to collecting block.N.allocation, libvirt is instead getting information from the sub-struct 'ImageInfo', which is reported recursively for BlockDeviceInfo in 'query-block' but not reported for 'query-named-block-nodes'.
IIRC 'query-named-block-nodes' was the preferred way to extract this information (see also http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02944.html )
So it is that much harder to call 'query-named-block-nodes' and then correlate that information back into the tree of information for anything but the active image. So it may be a while before thresholds on "vda[1]" actually work for block commit; my initial implementation will just focus on the active image "vda".
I'm wondering if qemu can make it easier by duplicating threshold information into 'ImageInfo' rather than just 'BlockDeviceInfo', so that a single call to 'query-block' rather than a second call to 'query-named-block-nodes' can scrape the threshold information for every BDS in the chain.
I think I just hadn't explored this option back then; I had a vague feeling that it was better not to duplicate information, but I can't recall a real solid reason on my side. Bests,
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani

On Mon, May 18, 2015 at 02:28:09PM -0600, Eric Blake wrote:
I'm trying to wire up libvirt to do event reporting based on qemu 2.3's BLOCK_WRITE_THRESHOLD event. Doing this will allow management applications to do event-based notification on when to enlarge LVM (or other) storage underlying a qcow2 volume, rather than their current requirement to frequently poll block statistics. But I'm stuck on the best way to expose the new parameter:
One idea is to treat it as part of the domain XML, and have virDomainSetBlockIoTune add one more typed parameter for a disk's current write threshold. Doing this could allow setting a threshold even for an offline domain (although the threshold is only meaningful for a running domain), but might get weird because qemu's event is one-shot (you have to re-arm a new threshold every time an existing threshold fires - so every time it fires, the domain XML is rewritten, even though it is not guest-visible ABI that was changing). At least with this approach, it is also easy for a client to poll the current setting of the threshold, via virDomainGetBlockIoTune. But the threshold isn't quite a tuning parameter (it isn't throttling how fast the guest can write to the block device, only how full the host side can get in order to allow transparent resizing of the host storage prior to running out of space).
The issue of re-arming strongly suggests to me that setting in the XML is not appropriate, /unless/ we have it somehow described such that libvirt itself is able to automatically re-arm, but I don't see an obvious way to do that without encoding a policy in libvirt. So it feels more like something that should merely be a runtime tunable with APIs to set/get against a running guest, and leave the XML out of it.
I'm also worried about what happens across libvirtd restarts - if the qemu event fires while libvirtd is unconnected, should libvirt be tracking that a threshold was registered in the XML, and upon reconnection check if qemu still has the threshold? If qemu no longer has a threshold, then libvirt can assume it missed the event, and generate one as part of reconnecting to the domain.
If libvirtd restarts, applications have to reconnect to libvirt. We could say that the application is required to re-register any thresholds it wants, even if the guest is still running, but it might be more pleasant for libvirt to take care of this and emit any missed event itself. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|