On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
I'm trying to wire up libvirt to do event reporting based on qemu 2.3's
BLOCK_WRITE_THRESHOLD event. Doing this will allow management
applications to do event-based notification on when to enlarge LVM (or
other) storage underlying a qcow2 volume, rather than their current
requirement to frequently poll block statistics. But I'm stuck on the
best way to expose the new parameter:
One idea is to treat it as part of the domain XML, and have
virDomainSetBlockIoTune add one more typed parameter for a disk's
current write threshold. Doing this could allow setting a threshold
Since virDomainSetBlockIoTune operates at the disk level and the event
will need to be registered at the backing-chain element level, using
virDomainSetBlockIoTune won't be a good choice, IMO.
even for an offline domain (although the threshold is only meaningful
for a running domain), but might get weird because qemu's event is
one-shot (you have to re-arm a new threshold every time an existing
threshold fires - so every time it fires, the domain XML is rewritten,
even though it is not guest-visible ABI that was changing). At least
Having the configuration exposed in the XML might make sense, although
currently it would only allow specifying thresholds for the top image,
which would require users of libvirt to register the event separately
for the backing chain elements. Since the allocation of the backing
chain elements can change only once a block job is started, it should
be safe enough to allow that only via the API; once libvirt tracks the
full backing chain, we can use the XML config too.
with this approach, it is also easy for a client to poll the current
setting of the threshold, via virDomainGetBlockIoTune. But the
threshold isn't quite a tuning parameter (it isn't throttling how fast
the guest can write to the block device, only how full the host side can
get in order to allow transparent resizing of the host storage prior to
running out of space).
Again, virDomainGetBlockIoTune won't work on individual elements and
using it that way would also be impractical.
Another idea is to add a completely new API, maybe named
virDomainBlockSetWriteThreshold(virDomainPtr dom, const char *disk, long
long int threshold, unsigned int flags) (with threshold in bytes).
Why is the threshold a signed value? I can't imagine a use case where
negative values could be used.
As for the @disk parameter it will need to take the target with the
index argument since I know that oVirt is using the same approach also
for backing chain sub-elements hosted on LVM when doing snapshot
merging via the block job APIs.
This also implies another required thing for this to be actually usable.
Since the block jobs happening on the backing chain can trigger the
event on a member of the backing chain, the returned event will need to
contain the disk identification in a way that is unique across backing
chain alterations.
While for local files we could again opt to use the path this won't be
scalable to non-local devices. Thus I think the best way will be to
include also the disk target with index.
This though will require using node-names for tracking and/or generating
the indexes in the backing store in a deterministic way.
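
To make the "target with index" idea concrete, here is a minimal sketch
of how such an identifier could be split into its two parts. It assumes
the target[index] spelling (for example "vda[1]"); the helper name
parse_disk_target and the choice to treat a bare target as index 0 are
hypothetical and only for illustration, not an existing libvirt API.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a libvirt API: split "vda[1]" into the
 * target name "vda" and the backing-chain index 1. A bare "vda" yields
 * index 0, which this sketch assumes to mean the top image.
 * Returns 0 on success, -1 on malformed input or if the target name
 * does not fit into the caller's buffer. */
static int
parse_disk_target(const char *spec, char *target, size_t target_len,
                  unsigned int *index)
{
    const char *open = strchr(spec, '[');
    size_t name_len = open ? (size_t)(open - spec) : strlen(spec);
    char *end = NULL;
    unsigned long val;

    if (name_len == 0 || name_len >= target_len)
        return -1;

    memcpy(target, spec, name_len);
    target[name_len] = '\0';
    *index = 0;

    if (!open)
        return 0;

    /* Require at least one digit, then exactly "]" and end of string. */
    if (open[1] < '0' || open[1] > '9')
        return -1;

    errno = 0;
    val = strtoul(open + 1, &end, 10);
    if (errno != 0 || val > UINT_MAX || end[0] != ']' || end[1] != '\0')
        return -1;

    *index = (unsigned int)val;
    return 0;
}
```

An event carrying the same target and index pair would then let the
client match the notification to the backing chain element it armed.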
Here, virDomainBlockStatsFlags() could be a way to query the current
threshold. And if desired, we could add a flag value to treat threshold
virDomainBlockStatsFlags does not operate on backing chain subelements,
so it would need to be instrumented to do so.
as a percentage instead of a byte value (but is 1% too large of
granularity, and how would you scale the percentage to anything finer
while still keeping the parameter as long long int rather than double?)
You can use a proportional unit with a larger fractional part: per
mille, parts per million, parts per billion etc.
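
As a rough sketch of that suggestion, a parts-per-million threshold can
be turned into a byte offset with integer arithmetic only, so no double
is needed. The helper name threshold_ppm_to_bytes and the use of an
unsigned type are assumptions made for this example, not part of any
proposed API.

```c
#define PPM_SCALE 1000000ULL

/* Hypothetical helper: convert a threshold given in parts per million
 * of the image capacity into a byte value, rounding down.
 * Returns 0 and stores the result in *bytes, or -1 if ppm is larger
 * than 1000000 (more than 100% of the capacity). */
static int
threshold_ppm_to_bytes(unsigned long long capacity,
                       unsigned long long ppm,
                       unsigned long long *bytes)
{
    unsigned long long q;
    unsigned long long r;

    if (ppm > PPM_SCALE)
        return -1;

    /* Split capacity as q * PPM_SCALE + r so that neither product can
     * overflow: q * ppm never exceeds capacity, and r * ppm stays
     * below PPM_SCALE squared. */
    q = capacity / PPM_SCALE;
    r = capacity % PPM_SCALE;
    *bytes = q * ppm + (r * ppm) / PPM_SCALE;
    return 0;
}
```

With a 100 GiB image (107374182400 bytes) and 800000 ppm, i.e. 80%,
this yields 85899345920 bytes.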
Of course, I'd want virConnectGetAllDomainStats() to list the current
threshold setting (0 if no threshold or if the event has already fired,
non-zero if the threshold is still set waiting to fire), so that clients
can query thresholds for multiple domains and multiple disks per domain
in one API call. But I don't know if we have any good way to set
Not only for disks, but for separate backing chain elements too.
multiple thresholds in one call (at least virDomainSetBlockIoTune must
be called once per disk; it might be possible for my proposed
virDomainBlockSetWriteThreshold() to set a threshold for multiple disks if the
disk name is passed as NULL - but then we're back to the question of
what happens if the guest has multiple disks of different sizes; it's
better to set per-disk thresholds than to assume all disks must be at
the same byte or percentage threshold).
That is just usage-sugar for the users. I'd rather avoid doing this on
multiple disks simultaneously.
I'm also worried about what happens across libvirtd restarts - if the
qemu event fires while libvirtd is unconnected, should libvirt be
tracking that a threshold was registered in the XML, and upon
reconnection check if qemu still has the threshold? If qemu no longer
has a threshold, then libvirt can assume it missed the event, and
generate one as part of reconnecting to the domain.
Libvirt should have enough information to actually check if the event
happened and should be able to decide that it in fact missed the event
and it should be emitted by libvirt.
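
The reconnect check described above could look roughly like the
following sketch. The struct and function names are hypothetical and do
not correspond to libvirt internals; it only encodes the rule that a
recorded threshold which qemu no longer reports is treated as a missed
event.

```c
#include <stdbool.h>

/* Hypothetical per-image state, not a libvirt structure. */
struct threshold_state {
    /* Threshold libvirt recorded before the restart, 0 if none. */
    unsigned long long recorded;
    /* Threshold qemu still reports after reconnect, 0 if none armed. */
    unsigned long long qemu;
};

/* Returns true if libvirt had armed a threshold but qemu no longer
 * has one, meaning the one-shot event fired while libvirtd was away
 * and libvirt should emit it on qemu's behalf. */
static bool
missed_threshold_event(const struct threshold_state *st)
{
    return st->recorded != 0 && st->qemu == 0;
}
```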
The new block copy API should also add a new typed parameter that
allows setting the write threshold when you are using it in a similar
way, with an LV as backing.
Peter