On Thu, May 21, 2015 at 09:49:43 -0400, Francesco Romani wrote:
(CCing Nir, from the oVirt/RHEV storage team)
----- Original Message -----
> From: "Eric Blake" <eblake(a)redhat.com>
> To: "Peter Krempa" <pkrempa(a)redhat.com>
> Cc: libvir-list(a)redhat.com
> Sent: Tuesday, May 19, 2015 2:42:16 PM
> Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold
Hi, sorry for joining pretty late.
Let me add a few notes from the perspective of a consumer (a very
interested one! :) of the API.
> On 05/19/2015 05:52 AM, Peter Krempa wrote:
> > On Mon, May 18, 2015 at 14:28:09 -0600, Eric Blake wrote:
> >> I'm trying to wire up libvirt to do event reporting based on qemu 2.3's
> >> BLOCK_WRITE_THRESHOLD event. Doing this will allow management
> >> applications to do event-based notification on when to enlarge LVM (or
> >> other) storage underlying a qcow2 volume, rather than their current
> >> requirement to frequently poll block statistics. But I'm stuck on the
> >> best way to expose the new parameter:
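For context, the underlying qemu 2.3 QMP interface being wrapped here is a
one-shot arm-and-fire exchange, roughly like this (node name made up):

    { "execute": "block-set-write-threshold",
      "arguments": { "node-name": "drive-virtio-disk0",
                     "write-threshold": 17179869184 } }

    <- { "event": "BLOCK_WRITE_THRESHOLD",
         "data": { "node-name": "drive-virtio-disk0",
                   "amount-exceeded": 65536,
                   "write-threshold": 17179869184 } }

qemu unregisters the threshold once the event has fired, so whoever
consumes it has to re-arm it explicitly.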
I read the thread and I'm pretty sure this will be a silly question, but I want
to make sure I am on the same page and I'm not somehow confused by the terminology.
Let's consider the simplest of the situations we face in oVirt:
(thin provisioned qcow2 disk on LV)
vda=[format=qcow2] -> lv=[path=/dev/mapper/$UUID]
Isn't the LV here the 'backing file' (actually, backing block device) of the
disk?
I ask because nowadays we are interested exactly in the events from the LV;
hence, IIUC, the vda[1] notation below.
Technically yes. We use the "backing file" term for anything below the
top image that is actually backing the block device.
> >> One idea is to treat it as part of the domain XML, and have
> >> virDomainSetBlockIoTune add one more typed parameter for a disk's
> >> current write threshold. Doing this could allow setting a threshold
> >
> > Since virDomainSetBlockIoTune operates on disk-level and the event will
> > need to be registered on a backing-chain element level, using
> > virDomainSetBlockIoTune won't be a good choice, IMO.
>
> Oh, I wasn't even thinking about backing-chain levels. You are right -
> it is entirely possible for someone to want to set a threshold reporting
> for "vda[1]" (the backing element of "vda") - of course, it
would only
> kick in during a blockcommit to vda[1], but it is still important to
> allow. But the BlockIoTune XML is not (yet) geared for backing images.
Agreed about SetBlockIoTune not looking like the best choice here.
> >
> > As for the @disk parameter it will need to take the target with the
> > index argument since I know that oVirt is using the same approach also
> > for backing chain sub-elements hosted on LVM when doing snapshot
> > merging via the block job APIs.
That's correct
> > This also implies another required thing for this to be actually usable.
> > Since the block jobs happening on the backing chain can trigger the
> > event on a member of the backing chain, the returned event will need to
> > contain the disk identification in a way that is unique across backing
> > chain alterations.
>
> Right now, I'm planning on the event looking like:
>
> typedef void (*virConnectDomainEventWriteThresholdCallback)
> (virConnectPtr conn,
> virDomainPtr dom,
> const char *devAlias,
> unsigned long long threshold,
> unsigned long long length,
> void *opaque);
>
> Remember, the event callback can only be registered once per domain, so
> it HAS to include disk information (whether "vda" or "vdb" at the top
> level), and it is not that much harder to make it include indexed disk
> information "vda[1]" if the event triggered due to a commit to a backing
> file.
I'm not sure how a client can match the index "vda[1]" with the chain
node name. Maybe I'm missing some context here? (RTFM welcome if it
contains pointers to the manual :))
A live domain XML contains the backing chain information as perceived by
libvirt. Every entry in the backing chain has an 'index' attribute which
can be used in the square brackets. Currently the IDs are sequential, but
you can't rely on that: this will change with node names, where the index
will be unique for every backing chain member and will be kept stable
across block-tree-manipulating operations.
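For the example chain above, the live XML would look roughly like this
(paths made up); the index attribute is what goes into the square
brackets, so "vda[1]" names the LV-backed base image:

    <disk type='block' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source dev='/dev/mapper/UUID-top'/>
      <backingStore type='block' index='1'>
        <format type='qcow2'/>
        <source dev='/dev/mapper/UUID-base'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
    </disk>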
> >> as a percentage instead of a byte value (but is 1% too large of
> >> granularity, and how would you scale the percentage to anything finer
> >> while still keeping the parameter as long long int rather than double?)
> >
> > You can use a proportional unit with a larger fractional part: per mille,
> > parts per million, parts per billion etc.
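(For scale: with parts per million a 64-bit value stays plenty precise -
on a 2 TiB disk one ppm is ~2 MiB, i.e.
threshold_bytes = capacity / 1000000 * ppm.)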
>
> [1] Indeed, we can add more and more flags, but we'll see if it makes
> sense. It might also be nice to allow a negative threshold, as in
> setting a threshold of -1*1024*1024*1024 to trigger when the disk
> reaches within 1 gigabyte of space, regardless of how many gigabytes it
> currently contains (easier than calling virDomainBlockInfo and doing the
> computation myself). That could be done by allowing a signed threshold,
> or by keeping threshold unsigned but adding a flag that says the
> threshold is relative to the tail of the file rather than the beginning.
>
> But adding flags can be done later; my first implementation will not
> define any flags (bytes only, no percentage or relative-to-end values).
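To make the relative-to-end case concrete, this is what the management app
has to do by hand today - a sketch, where the final setter call stands in
for whatever the new API ends up being called:

    virDomainBlockInfo info;
    unsigned long long threshold = 0;

    /* "1 GiB before the end of the device", computed manually */
    if (virDomainGetBlockInfo(dom, "vda", &info, 0) == 0 &&
        info.physical > (1ULL << 30))
        threshold = info.physical - (1ULL << 30);
    /* ...then pass 'threshold' to the proposed setter API */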
BTW, in VDSM we set high water marks based purely on a percentage, regardless
of the disk size. We aren't too concerned about the granularity at this stage:
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=vdsm/virt/vmdevices/s...
However, a larger fractional part and percentage-based thresholds look really
cool, and should easily allow automatic re-arming of the event, which
is even cooler ;)
The automatic re-arming depends on the qemu implementation. If qemu
accepts only an absolute value, we will not be able to do that without
excessive hacking, since libvirt will not know the moment the LV was
resized and thus could not re-arm the event.
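Until then the re-arming stays in the management app; the consumer loop
would look roughly like this sketch (VIR_DOMAIN_EVENT_ID_WRITE_THRESHOLD
and virDomainSetBlockWriteThreshold() are placeholders for the proposed
API, not existing symbols):

    static void
    writeThresholdCb(virConnectPtr conn, virDomainPtr dom,
                     const char *devAlias,
                     unsigned long long threshold,
                     unsigned long long length,
                     void *opaque)
    {
        fprintf(stderr, "%s: %s exceeded threshold %llu by %llu bytes\n",
                virDomainGetName(dom), devAlias, threshold, length);
        /* grow the LV here (e.g. run lvextend), then re-arm, because
         * the qemu threshold is one-shot: */
        /* virDomainSetBlockWriteThreshold(dom, devAlias, new_threshold, 0); */
    }

    /* registration, once per connection, NULL = all domains: */
    virConnectDomainEventRegisterAny(conn, NULL,
                                     VIR_DOMAIN_EVENT_ID_WRITE_THRESHOLD,
                                     VIR_DOMAIN_EVENT_CALLBACK(writeThresholdCb),
                                     NULL, NULL);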
> >> Of course, I'd want virConnectGetAllDomainStats() to list the current
> >> threshold setting (0 if no threshold or if the event has already fired,
> >> non-zero if the threshold is still set waiting to fire), so that clients
> >> can query thresholds for multiple domains and multiple disks per domain
> >> in one API call. But I don't know if we have any good way to set
Looks nice
> >> multiple thresholds in one call (at least virDomainSetBlockIoTune must
> >> be called once per disk; it might be possible for my proposed
> >> virDomainBlockStatsFlags() to set a threshold for multiple disks if the
> >> disk name is passed as NULL - but then we're back to the question of
> >> what happens if the guest has multiple disks of different sizes; it's
> >> better to set per-disk thresholds than to assume all disks must be at
> >> the same byte or percentage threshold).
> >
> > That is just usage-sugar for the users. I'd rather avoid doing this on
> > multiple disks simultaneously.
>
> Good - then I won't worry about it; the new API will make disk name
> mandatory. (Setting to a percentage or to a relative-to-tail might make
> more sense across multiple disks, but on the other hand, setting a
> threshold will be a rare thing; and while first starting the domain has
> to set a threshold on all disks, later re-arming of the trigger will be
> on one disk at a time as events happen; making the startup case more
> efficient is not going to be the bottleneck in management).
I agree
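For the GetAllDomainStats side above, I'd expect the polling counterpart
to look roughly like this (a sketch; "block.<n>.threshold" is my guess at
the field name, nothing settled yet):

    virDomainStatsRecordPtr *stats = NULL;
    int i, j;
    int nstats = virConnectGetAllDomainStats(conn, VIR_DOMAIN_STATS_BLOCK,
                                             &stats, 0);

    for (i = 0; i < nstats; i++) {
        for (j = 0; j < stats[i]->nparams; j++) {
            virTypedParameterPtr p = &stats[i]->params[j];
            /* 0 means no threshold set (or it already fired) */
            if (strstr(p->field, ".threshold"))
                printf("%s: %s = %llu\n",
                       virDomainGetName(stats[i]->dom), p->field,
                       p->value.ul);
        }
    }
    virDomainStatsRecordListFree(stats);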
> >> I'm also worried about what happens across libvirtd restarts - if the
> >> qemu event fires while libvirtd is unconnected, should libvirt be
> >> tracking that a threshold was registered in the XML, and upon
> >> reconnection check if qemu still has the threshold? If qemu no longer
> >> has a threshold, then libvirt can assume it missed the event, and
> >> generate one as part of reconnecting to the domain.
> >
> > Libvirt should have enough information to actually check if the event
> > happened and should be able to decide that it in fact missed the event
> > and it should be emitted by libvirt.
That would be awesome.
There are flows (live storage migration?) in which we'll probably still
need to poll disks, but definitely the more we (as libvirt API consumers)
can depend on reliable delivery of the event, the better.
Hmm, theoretically the event could be armed on the destination of the
migration once the live storage migration code starts (if we allow doing
so) and then you'd be able to receive the event there too.
The point here is to avoid racy checks in the management application as
much as possible.
Thanks and best regards,
--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani