Re: [libvirt] [Qemu-devel] IO accounting overhaul

5 Sep 2014

      The Friday 05 Sep 2014 à 16:30:31 (+0200), Kevin Wolf wrote :
...
Am 01.09.2014 um 13:41 hat Markus Armbruster geschrieben:
...
Benoît Canet <benoit.canet@irqsave.net> writes:
...
The Monday 01 Sep 2014 à 11:52:00 (+0200), Markus Armbruster wrote :
...
Cc'ing libvirt following Stefan's lead.
Benoît Canet <benoit.canet@irqsave.net> writes:
...
/* the following would compute latecies for slices of 1 seconds then toss the
 * result and start a new slice. A weighted sumation of the instant latencies
 * could help to implement this.
 */
1s_read_average_latency
1s_write_average_latency
1s_flush_average_latency
/* the former three numbers could be used to further compute a 1
minute slice value */
1m_read_average_latency
1m_write_average_latency
1m_flush_average_latency
/* the former three numbers could be used to further compute a 1 hours
slice value */
1h_read_average_latency
1h_write_average_latency
1h_flush_average_latency
This is something like "what we added to total_FOO_time in the last
completed 1s / 1m / 1h time slice divided by the number of additions".
Just another way to accumulate the same raw data, thus no worries.
...
/* 1 second average number of requests in flight */
1s_read_queue_depth
1s_write_queue_depth
/* 1 minute average number of requests in flight */
1m_read_queue_depth
1m_write_queue_depth
/* 1 hours average number of requests in flight */
1h_read_queue_depth
1h_write_queue_depth
I asked some input from a cloud provider.
...
I don't think I agree with putting fixed time periods like 1 s/min/h
into qemu. What you need there is policy and we should probably make
it configurable.
yes, baking policy in qemu is bad.
...
Do we need accounting for multiple time periods at the same time or
would it be enough to have one and make its duration an option?
Having multiple time periods (up to 3) at once would be cool.
Having big knobs to turn in the configuration (one per period) would be
preferable than a configuration nightmare of one or more setting per device.
...
...
...
Optionally collecting the same data for each BDS of the graph.
If that's the case, keeping the shared infrastructure in the block layer
makes sense.
BDS member acct then holds I/O stats for the BDS.  We currently use it
for something else: I/O stats of the device model backed by this BDS.
That needs to move elsewhere.  Two places come to mind:
1. BlockBackend, when it's available (I resumed working on it last week
   for a bit).  Superficially attractive, because it's close to what we
   have now, but then we have to deal with what to do when the backend
   gets disconnected from its device model, then connected to another
   one.
2. The device models that actually implement I/O accounting.  Since
   query-blockstats names a backend rather than a device model, we need
   a BlockDevOps callback to fetch the stats.  Fetch fails when the
   callback is null.  Lets us distinguish "no stats yet" and "device
   model can't do stats", thus permits a QMP interface that doesn't lie.
Right now, I like (2) better.
So let's say I have some block device, which is attached to a guest
device for a while, but then I detach it and continue using it in a
different place (maybe another guest device or a block job). Should we
really reset all counters in query-blockstats to 0?
I think as I user I would be surprised about this, because I still refer
to it by the same name (the device_name, which will be in the BB), so
it's the same thing for me and the total requests include everything
that was ever issued against it.
This particular cloud provider think that associating the stats with the
emulated hardware is the less worisome to manage.

So now we need to discuss futher in the community about this.

Best regards

Benoît
...
...
...
-API wize I think about adding
bdrv_acct_invalid() and
bdrv_acct_failed() and systematically issuing a bdrv_acct_start() asap.
Complication: partial success.  Example:
1. Guest requests a read of N sectors.
2. Device model calls
   bdrv_acct_start(s->bs, &req->acct, N * BDRV_SECTOR_SIZE, BDRV_ACCT_READ)
3. Device model examines the request, and deems it valid.
4. Device model passes it to the block layer.
5. Block layer does its thing, but for some reason only M < N sectors
   can be read.  Block layer returns M.
No, it returns -errno.
...
6. What's the device model to do now?  Both bdrv_acct_failed() and
   bdrv_acct_done() would be wrong.
Should the device model account for a read of size M?  This ignores
   the partial failure.
Should it split the read into a successful and a failed part for
   accounting purposes?  This miscounts the number of reads.
I think we should simply account it as a failed request because this is
what it will look like for the guest. If you want the partial data that
was internally issued, you need to look at different statistics
(probably those of bs->file).
Kevin
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list