Re: [libvirt] [Qemu-devel] IO accounting overhaul

Monday, 1 September 2014

Cc'ing libvirt following Stefan's lead.

Benoît Canet <benoit.canet(a)irqsave.net> writes:

...
 Hi,

 I collected some items of a cloud provider wishlist regarding I/O accounting.
Feedback from real power-users, lovely!

...
 In a cloud, I/O accounting can have 3 purposes: billing, helping the customers,
 and doing metrology to help the cloud provider seek hidden costs.

 I'll cover the two former topics in this mail because they are the most important
 business-wise.

 1) preferred place to collect billing IO accounting data:
 ---------------------------------------------------------
 For billing purposes the collected data must be as close as possible to what the
 customer would see by using iostat in his VM.
Good point.

...
 The first conclusion we can draw is that the choice of collecting I/O accounting
 data used for billing in the block device models is right.
Slightly rephrasing: doing I/O accounting in the block device models is
right for billing.

There may be other uses for I/O accounting, with different preferences.
For instance, data on how exactly guest I/O gets translated to host I/O
as it flows through the nodes in the block graph could be useful.

Doesn't diminish the need for accurate billing information, of course.

...
 2) what to do with occurrences of rare events:
 ----------------------------------------------

 Another point is that QEMU developers agree that they don't know which policy
 to apply to some I/O accounting events.
 Must QEMU discard invalid write I/Os or account them as done?
 Must QEMU count a failed read I/O as done?

 When discussing this with a cloud provider the following appears: these decisions
 are really specific to each cloud provider and QEMU should not implement them.
Good point, consistent with the old advice to avoid baking policy into
inappropriately low levels of the stack.

...
 The right thing to do is to add accounting counters to collect these events.

 Moreover these rare events are precious troubleshooting data so it's an additional
 reason not to toss them.
Another good point.

...
 3) list of block I/O accounting metrics wished for billing and helping the customers
 ------------------------------------------------------------------------------------

 Basic I/O accounting data will end up making the customers' bills.
 Extra I/O accounting information would be a precious help for the cloud provider
 to implement a monitoring panel like Amazon CloudWatch.
These are the first two from your list of three purposes, i.e. the ones
you promised to cover here.

...
 Here is the list of counters and statistics I would like to help
 implement in QEMU.

 This is the most important part of the mail and the one I would like the community
 to review the most.

 Once this list is settled I would proceed to implement the required infrastructure
 in QEMU before using it in the device models.
For context, let me recap how I/O accounting works now.

The BlockDriverState abstract data type (short: BDS) can hold the
following accounting data:

    uint64_t nr_bytes[BDRV_MAX_IOTYPE];
    uint64_t nr_ops[BDRV_MAX_IOTYPE];
    uint64_t total_time_ns[BDRV_MAX_IOTYPE];
    uint64_t wr_highest_sector;

where the array index is a BlockAcctType (read, write, flush) and
BDRV_MAX_IOTYPE is the number of such types.

wr_highest_sector is a high watermark updated by the block layer as it
writes sectors.

The other three are *not* touched by the block layer.  Instead, the
block layer provides a pair of functions for device models to update
them:

    void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
            int64_t bytes, enum BlockAcctType type);
    void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);

bdrv_acct_start() initializes cookie for a read, write, or flush
operation of a certain size.  The size of a flush is always zero.

bdrv_acct_done() adds the operations to the BDS's accounting data.
total_time_ns is incremented by the time between _start() and _done().

You may call _start() without calling _done().  That's a feature.
Device models use it to avoid accounting some requests.

Device models are not supposed to mess with cookie directly, only
through these two functions.
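
To make the start/done protocol concrete, here is a minimal standalone
sketch of it.  It is not the QEMU code: AcctStats, AcctCookie,
acct_start() and acct_done() are hypothetical stand-ins, and the clock
value is passed in explicitly (an assumption made for testability)
rather than read from get_clock().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the QEMU types; all names are illustrative. */
enum AcctType { ACCT_READ, ACCT_WRITE, ACCT_FLUSH, ACCT_MAX_IOTYPE };

typedef struct AcctStats {
    uint64_t nr_bytes[ACCT_MAX_IOTYPE];
    uint64_t nr_ops[ACCT_MAX_IOTYPE];
    uint64_t total_time_ns[ACCT_MAX_IOTYPE];
} AcctStats;

typedef struct AcctCookie {
    int64_t bytes;
    int64_t start_time_ns;
    enum AcctType type;
} AcctCookie;

/* Record size, type and start time in the cookie; stats are untouched. */
static void acct_start(AcctCookie *cookie, int64_t bytes,
                       enum AcctType type, int64_t now_ns)
{
    assert(type < ACCT_MAX_IOTYPE);
    cookie->bytes = bytes;
    cookie->start_time_ns = now_ns;
    cookie->type = type;
}

/* Fold the finished operation described by the cookie into the stats. */
static void acct_done(AcctStats *stats, const AcctCookie *cookie,
                      int64_t now_ns)
{
    stats->nr_bytes[cookie->type] += (uint64_t)cookie->bytes;
    stats->nr_ops[cookie->type]++;
    stats->total_time_ns[cookie->type] +=
        (uint64_t)(now_ns - cookie->start_time_ns);
}
```

In this sketch a request that calls acct_start() but never acct_done()
simply leaves no trace in the stats, which mirrors the "feature"
described above.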

Some device models implement accounting, some don't.  The ones that do
don't agree on how to count invalid guest requests (the ones not passed
to block layer) and failed requests (passed to block layer and failed
there).  It's a mess in part caused by us never writing down what
exactly device models are expected to do.

Accounting data is used by "query-blockstats", and nothing else.

Corollary: even though every BDS holds accounting data, only the ones in
"top" BDSes ever get used.  This is a common block layer blemish, and
we're working on cleaning it up.

If a device model doesn't implement accounting, query-blockstats lies.
Fortunately, its lies are pretty transparent (everything's zero) as long
as you don't do things like connecting a backend to a device model that
doesn't implement accounting after disconnecting it from a device model
that does.  Still, I'd welcome a more honest QMP interface.

For me, this accounting data belongs to the device model, not the BDS.
Naturally, the block device models should use common infrastructure.  I
guess they use the block layer only because it's obvious infrastructure
they share.  Clumsy design.

...
 /* volume of data transferred by the IOs */
 read_bytes
 write_bytes 
This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].

nr_bytes[BDRV_ACCT_FLUSH] is always zero.

Should this count only actual I/O, i.e. accumulated size of successful
operations?

...
 /* operation count */
 read_ios
 write_ios
 flush_ios

 /* how many invalid IOs the guest submits */
 invalid_read_ios
 invalid_write_ios
 invalid_flush_ios

 /* how many I/O errors happened */
 read_ios_error
 write_ios_error
 flush_ios_error 
This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
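
One way such a split could be represented, sketched standalone: one
counter per (type, outcome) pair, so that the policy question (does a
failed or invalid request count as "done"?) is left to whoever reads
the numbers.  IoType, IoOutcome, OpCounters and both functions are
hypothetical names, not existing QEMU interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types; names are illustrative only. */
enum IoType { IO_READ, IO_WRITE, IO_FLUSH, IO_MAX_TYPE };
enum IoOutcome { IO_OK, IO_INVALID, IO_FAILED, IO_MAX_OUTCOME };

typedef struct OpCounters {
    uint64_t ops[IO_MAX_TYPE][IO_MAX_OUTCOME];
} OpCounters;

/* Count one operation under its type and outcome; no policy applied. */
static void count_op(OpCounters *c, enum IoType type, enum IoOutcome outcome)
{
    assert(type < IO_MAX_TYPE && outcome < IO_MAX_OUTCOME);
    c->ops[type][outcome]++;
}

/* Example of a consumer-side policy: count every outcome as an operation. */
static uint64_t total_ops(const OpCounters *c, enum IoType type)
{
    uint64_t sum = 0;
    for (int o = 0; o < IO_MAX_OUTCOME; o++) {
        sum += c->ops[type][o];
    }
    return sum;
}
```

A provider who bills only successful I/O would read ops[type][IO_OK]
instead of calling total_ops().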

...
 /* account the time passed doing IOs */
 total_read_time
 total_write_time
 total_flush_time 
This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
total_time_ns[BDRV_ACCT_FLUSH].

I guess this should count both successful and failed I/O.  Could throw
in invalid, too, but it's probably too quick to matter.

Please specify the unit clearly.  Both total_FOO_time_ns or total_FOO_ns
would work for me.

...
 /* since when the volume is iddle */
 qvolume_iddleness_time 
"idle"

The obvious way to maintain this information with the current code
would be saving the value of get_clock() in bdrv_acct_done().
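
A minimal sketch of that idea, with hypothetical names (IdleTracker,
idle_note_done(), idle_time_ns()) and the clock value passed in by the
caller instead of read from get_clock().  It ignores requests still in
flight, which a real implementation would presumably have to consider.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tracker; remembers when the last operation completed. */
typedef struct IdleTracker {
    int64_t last_done_ns;
} IdleTracker;

/* To be called from the "done" path with the current clock value. */
static void idle_note_done(IdleTracker *t, int64_t now_ns)
{
    t->last_done_ns = now_ns;
}

/* Time elapsed since the last completed operation. */
static int64_t idle_time_ns(const IdleTracker *t, int64_t now_ns)
{
    assert(now_ns >= t->last_done_ns);
    return now_ns - t->last_done_ns;
}
```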

...
 /* the following would compute latencies for slices of 1 second then toss the
  * result and start a new slice. A weighted summation of the instant latencies
  * could help to implement this.
  */
 1s_read_average_latency
 1s_write_average_latency
 1s_flush_average_latency

 /* the former three numbers could be used to further compute a 1 minute slice value */
 1m_read_average_latency
 1m_write_average_latency
 1m_flush_average_latency

 /* the former three numbers could be used to further compute a 1 hour slice value */
 1h_read_average_latency
 1h_write_average_latency
 1h_flush_average_latency 
This is something like "what we added to total_FOO_time in the last
completed 1s / 1m / 1h time slice divided by the number of additions".
Just another way to accumulate the same raw data, thus no worries.
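
Sketched standalone under my reading of the proposal: accumulate
latency and a count during the current slice, and publish sum / count
when the slice completes.  LatencySlice and the slice_*() functions are
hypothetical, the clock is passed in by the caller, and reporting 0 for
an empty slice is an arbitrary choice of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-type slice accumulator; names are illustrative. */
typedef struct LatencySlice {
    int64_t slice_len_ns;    /* e.g. 1s expressed in nanoseconds */
    int64_t slice_start_ns;  /* start of the slice being accumulated */
    uint64_t sum_ns;         /* latency added during the current slice */
    uint64_t count;          /* number of additions in the current slice */
    uint64_t last_avg_ns;    /* average of the last completed slice */
} LatencySlice;

static void slice_init(LatencySlice *s, int64_t slice_len_ns, int64_t now_ns)
{
    assert(slice_len_ns > 0);
    s->slice_len_ns = slice_len_ns;
    s->slice_start_ns = now_ns;
    s->sum_ns = 0;
    s->count = 0;
    s->last_avg_ns = 0;
}

/* Close the current slice if its time is up and publish its average. */
static void slice_roll(LatencySlice *s, int64_t now_ns)
{
    int64_t elapsed = (now_ns - s->slice_start_ns) / s->slice_len_ns;

    if (elapsed < 1) {
        return;
    }
    /* If several slices went by, the most recently completed one is empty. */
    s->last_avg_ns = (elapsed == 1 && s->count > 0) ? s->sum_ns / s->count : 0;
    s->slice_start_ns += elapsed * s->slice_len_ns;
    s->sum_ns = 0;
    s->count = 0;
}

/* Account one completed operation's latency at time now_ns. */
static void slice_add(LatencySlice *s, int64_t latency_ns, int64_t now_ns)
{
    slice_roll(s, now_ns);
    s->sum_ns += (uint64_t)latency_ns;
    s->count++;
}

/* Average latency of the last completed slice, 0 if it had no operations. */
static uint64_t slice_last_avg(LatencySlice *s, int64_t now_ns)
{
    slice_roll(s, now_ns);
    return s->last_avg_ns;
}
```

The 1m and 1h values could be derived the same way with a longer
slice_len_ns, or by aggregating completed 1s slices.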

...
 /* 1 second average number of requests in flight */
 1s_read_queue_depth
 1s_write_queue_depth

 /* 1 minute average number of requests in flight */
 1m_read_queue_depth
 1m_write_queue_depth

 /* 1 hour average number of requests in flight */
 1h_read_queue_depth
 1h_write_queue_depth 
I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
The "you need not call bdrv_acct_done()" feature may get in the way.
Solvable.

Permit me a short detour into the other use for I/O accounting I
mentioned: data on how exactly guest I/O gets translated to host I/O as
it flows through the nodes in the block graph.  Do you think this would
be pretty much the same data, just collected at different points?

...
 4) Making this happen
 -------------------------

 Outscale wants to make these I/O stats happen and gave me the go-ahead to do whatever
 grunt work is required to do so.
 That said we could collaborate on some parts of the work.
Cool!

A quick stab at tasks:

* QMP interface, either a compatible extension of query-blockstats or a
  new one.

* Rough idea on how to do the shared infrastructure.

* Implement (can be split up into several tasks if desired)
