Re: [libvirt] [PATCH 3/5] conf, schema, docs: Add support for TSEG size setting

1 Jun 2018

      [...]

First, thanks for taking the time to elaborate - it is helpful. Much
better than just stating no because I don't like it ;-).
...
...
...
...
1. Add poll-max-ns property of each iothread:
https://www.redhat.com/archives/libvir-list/2017-February/msg01047.html
This is about tunables.  It might change the performance/latency of QEMU
slightly, but that's about it.
and there are those that would find it useful to have (bz 1545732)... If
Much of what I read (underscore emphasis mine) suggests otherwise:
- "There is a lot of sentiment *against* providing too many low level knobs
   like this _without proper guidance_ on how they should be set."
- "To address this issue QEMU implements self-tuning algorithm that modifies
   the current polling time to _adapt to different workloads_ and it can also
   fallback to blocking syscalls."
- "The QEMU commits say the tunables all default to sane parameters so I'm
   inclined to say we ignore them at the libvirt level entirely."
- "I'm fine if libvirt doesn't add a dedicated API for setting <iothread>
   polling parameters.  It's _unlikely_ that users will need to change the
   setting.  In an emergency (e.g. disabling it due to a performance
   regression) _they can_ use <qemu:arg value='-newarg'/>."
The only points for the polling to be enabled were along the lines of:
It _may_ help in _some_ workloads when you want a bit more throughput
for the price of more CPU cycles.
With vague definitions of how much CPU, throughput and without
description of how to find out if a particular workload fits this.  Even
when all of that is there, then you need yet another explanation on how
to calculate the value to be set.  And then it all goes down back to the
fact that QEMU is already doing some automated balancing for this
(because they can, because this is not part of the guest ABI).  That way
you can never actually say if it will help and how much.
So for this one it is a clear "NO".
Another opposing viewpoint is:

https://bugzilla.redhat.com/show_bug.cgi?id=1545732#c8

If it were only a "documentation issue" - someone would have figured out
much earlier how to get beyond that.

FWIW: Without guidance in the Contributor's Guide over what is/isn't
acceptable I have a feeling we'll continue to see patches such as this
and the one below.

Still my point was less about the actual feature or details of it, but rather
my feeling that there are more examples where exposing low level knobs has
been panned in the past. Given the lack of details/knowledge I have/had
about TSEG during my original review, I saw it as just another low
level knob, and while I saw value in the knob, I knew there had been a
high degree of sentiment in previous patches against adding such knobs,
so I wanted to "be sure" it was a desired/necessary adjustment by more
than just my opinion (it is a community after all, right?).

BTW: I don't disagree that poll-max-ns is an odd tunable
to add, for many of the reasons supplied. Of course I'm the one stuck
with the bz /-| and either providing the "bad news" or just continually
moving the bz to a future release ;-)
...
...
you don't have enough memory and your VM is paging like crazy, you just
add more memory.  Requires a reboot. Likewise, if your VM doesn't boot
you add/alter the magic TSEG value using some algorithm as described
above. From a 90,000 foot customer view is there a difference? It's just
a knob that the hypervisor has to allow something to be accomplished for
which libvirt provides the attribute to fine tune.
Yeah, you're right.  That's why I think both of them should be exposed.  Some
small differences to other knobs, just for completeness:
- by the time you realize that the VM doesn't have enough memory, it might be
  too late as reboot isn't that easy of a thing for some production workloads
- on the other hand, you have a way to see that happening (compare it to the
  polling interval above which you have no idea without proper benchmarks)
One more thing that's common to the memory size (and I hope TSEG in the future)
is that in mgmt apps the TSEG setting already has a place to live and it
is exactly where the memory size lives currently.  In templates.  You have
"small vm" template and "ginormous vm".  For the latter one you can just add a
setting of TSEG _once_ per file.  How's TSEG better and easier than memory?  You
figure it once for the VM settings (and possibly firmware, but that's not going
to change much) and then it doesn't depend on the workflow, not even a little
bit.
Anyway.
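To make the template idea concrete, here is a minimal sketch of how a mgmt app
might carry TSEG next to the memory size. The template names and numbers are
made up for illustration, and the <smm>/<tseg unit='MiB'> layout is assumed to
match what libvirt's domain XML documents for this feature:

```python
import xml.etree.ElementTree as ET

# Hypothetical management-app templates; the names and sizes are illustrative
# only and do not come from any real product.
TEMPLATES = {
    "small": {"memory_gib": 4, "vcpus": 2, "tseg_mib": None},
    "ginormous": {"memory_gib": 1024, "vcpus": 128, "tseg_mib": 48},
}


def features_xml(template_name):
    """Build the <features> fragment of a domain definition for a template."""
    tmpl = TEMPLATES[template_name]
    features = ET.Element("features")
    smm = ET.SubElement(features, "smm", state="on")
    # TSEG is set once per template; leaving it unset keeps the default.
    if tmpl["tseg_mib"] is not None:
        tseg = ET.SubElement(smm, "tseg", unit="MiB")
        tseg.text = str(tmpl["tseg_mib"])
    return ET.tostring(features, encoding="unicode")


print(features_xml("ginormous"))
```

Only the "ginormous" template emits a <tseg> element; the "small" one leaves
the hypervisor default untouched.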
True, TSEG is much more bounded and in your face when it doesn't work.
There still is this voice rumbling around in the back of my head that
says QEMU should be the owner of deciding upon the algorithm for the
value. Unlike a performance knob, it seems there's a solid way to
calculate a 'correct value' to make the boot work. The problem is if
that automatic calculation ended up being wrong at some point, then
there'd be no way to change the value without adding a knob. So, in a
way the knob could be the exception rather than the rule. It's a
mechanism to make sure the guest can boot given outside interference.
...
What I see as the differences between tunables that make it in and tunables that
don't is that:
- the former are usually understandable and easy to see what they are in bare
  metal.  Everyone knows what memory is in the hardware, what it looks like, how
  much is "not enough" and how much is "more than needed".  We are used to
  those things back from the hardware times, even to changing them.
Still the calculation of a proper TSEG value is based on multiple
factors (memory/vcpus). Historically I've found these also need a fudge
factor built in - it's the fudge factor that is the sticking point. On
real hardware you'd be told - well you don't have enough memory, so buy
some more - it's like printing money at that point for the sales guy.
You'll be guided to buy a more expensive and larger piece than you may
need to "ensure future expand-ability". You may not use the entire
thing, but you have it. For software it's a much easier knob.
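As a rough sketch of what "multiple factors plus a fudge factor" could look
like: the function below is purely hypothetical. The per-GiB and per-vCPU
coefficients, the fudge factor, the rounding step and the floor are all
placeholders supplied by the caller, not values taken from QEMU, OVMF or the
bz.

```python
import math


def estimate_tseg_mib(memory_gib, vcpus, kib_per_gib, kib_per_vcpu,
                      fudge=1.25, step_mib=8, floor_mib=16):
    """Hypothetical TSEG sizing heuristic; every coefficient is a placeholder."""
    # Linear estimate from the two factors mentioned in the thread.
    raw_kib = memory_gib * kib_per_gib + vcpus * kib_per_vcpu
    # Pad by the fudge factor and convert KiB to MiB.
    padded_mib = raw_kib * fudge / 1024
    # Round up to a multiple of step_mib and never go below floor_mib.
    rounded = math.ceil(padded_mib / step_mib) * step_mib
    return max(floor_mib, rounded)


# Illustrative inputs only: 1 TiB of guest memory and 64 vCPUs.
print(estimate_tseg_mib(1024, 64, kib_per_gib=32, kib_per_vcpu=64))
```

The fudge factor is the part nobody can justify precisely, which is the reason
for wanting an override knob in case the estimate turns out wrong.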
...
- The latter is usually something we were not able to control in HW or didn't even
  know existed.  For virtual workloads it might be completely different, but
  sometimes people are forgetting that.
Sounds like a job for virtuned (or virtunefixd or virhighavaild). Years
ago I worked on a project that would essentially show bottlenecks for
the OS and provide the capability to "fix" those via various means
(whether it was CPU, memory, or disk overutilization... even deadlocks
and cluster quorum hangs).
...
...
...
...
2. Add support for qcow2 cache (many times, but most recently):
https://www.redhat.com/archives/libvir-list/2017-September/msg00553.html
Similarly here, it allows setting something that can be (at least
slightly) abstracted and in the worst case the performance will be
slightly hindered.
This one I understand more why it would be rejected, but still providing
the value allows certain things to work a whole lot better. I also know
Berto has been "fine tuning" the algorithm in later QEMU releases - so
that's like hitting a moving target.
This is very similar, it's just that there is no automatic balancing
done by QEMU.  But it usually is also about how you write the docs.  The
option can make very much sense, but if someone writes "Setting asdf
allows fine-tuning of the asdf value in the underlying hypervisor", then
no matter how much that value makes sense it is not reflected in the
docs.  That's why I tried to add all the relevant info into the docs so
that it's clear what it is doing, how to set it, to what values and
when.
Apart from the fact that there is a "link" to some file in the QEMU
repository that someone is supposed to read, plus the reasoning behind the
value determination is written there (but not why the values are not
automatically calculated, or maybe I missed it), it:
- is not possible to try using the <qemu:arg value='-newarg'/> approach
- the docs say:
  <b>In general you should leave this option alone, unless you
     are very certain you know what you are doing.</b>
So in this particular case I wouldn't be totally against having it
there.  If you don't want to use it, then "just don't touch that" is an
approach that shouldn't hurt anyone.
Search the formatdomain page for 'unless' - there are examples where
knobs have been added that aren't well described and the consumer better
know what they're doing in order to use them. Perhaps another case of
alibistic behaviors (a/k/a CYA).

[...]
...
...
And there are those that could say that if the underlying hypervisor knows
that for certain memory sizes and/or vCPU counts the TSEG will be
too small for specific machine types, then the underlying hypervisor
should be the one to "choose" a value that's programmatically appropriate,
which to a degree IIUC is the argument being used against allowing a
libvirt knob for the poll-max-ns and qcow2 cache sizes.
And they would be wrong as for TSEG the hypervisor a) doesn't know that
and b) cannot change that once it was started.
I think you lost me here.... From the bz problem statement:

"The necessary size is technically predictable (see bug 1468526 comment
8 point (2a) e.g.), but the formula is neither exact nor easy to
describe, so as a first step, libvirt should please expose this value in
an optional element or attribute."

I read that as a proper size could be calculated by the hypervisor, but
"just in case" let's make sure we have a fallback option. Perfectly
reasonable to me and even more pointed, (so far) only for q35. Of course
it's possible I read it wrong.

John

[...]

John Ferlan