Re: [libvirt] [PATCH 3/5] conf, schema, docs: Add support for TSEG size setting

31 May 2018

On Wed, May 30, 2018 at 12:00:26PM -0400, John Ferlan wrote:
...
On 05/21/2018 11:00 AM, Martin Kletzander wrote:
...
TSEG (Top of Memory Segment) is one of many regions that SMM (System Management
Mode) can occupy.  This one, however is special, because a) most of the SMM code
lives in TSEG nowadays and b) QEMU just (well, some time ago) added support for
so called 'extended' TSEG.  The difference to the TSEG implemented in real q35's
MCH (Memory Controller Hub) is that it can have any size from 1 MiB up to 65534
MiB in 1 MiB increments.  But more about that in the QEMU patch.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Which some reader, but not this one, may be eager to find ;-)
Still is there a valid range QEMU would accept? Or do we just let QEMU
fail if the range is too high?
I think QEMU has MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_MAX
Rather than promising some value, I adjusted it so that it is no longer false,
no matter what the max is there.
...
...
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
---
 docs/formatdomain.html.in           | 39 +++++++++++++++++++
 docs/schemas/domaincommon.rng       |  5 +++
 src/conf/domain_conf.c              | 60 ++++++++++++++++++++++++++++-
 src/conf/domain_conf.h              |  1 +
 tests/genericxml2xmlindata/tseg.xml | 23 +++++++++++
 tests/genericxml2xmltest.c          |  2 +
 6 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 tests/genericxml2xmlindata/tseg.xml
In the category of I hate it when that happens, git am -3 "merged" in
two chunks incorrectly!  Probably wouldn't have happened if I'd done
You can enable/disable 3-way merges if you do (not) like them.
...
this sooner!  The virDomainDefFeaturesCheckABIStability hunk got merged
into virDomainRedirFilterDefCheckABIStability and the tseg grammar got
merged under "vmport" and just before "gic".  As you can imagine the
results weren't pretty ;-).
Yeah, happened to me as well, I should've resent this, but I forgot about the
merge issue and I also wanted to show that this was posted way before the
freeze.  Anyway, it's pointless to force it now, I'll leave it for later
(meaning "after the release").

Anyway, I keep my branches updated (every now and then) on my github repo [1],
so if you want to check that, you always can.

[1] https://github.com/nertpinx/libvirt
...
...

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 403b638bd4bd..39ebfe398bd7 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -1897,6 +1897,9 @@
   <ioapic driver='qemu'/>
   <hpt resizing='required'/>
   <vmcoreinfo state='on'/>
+  <smm state='on'>
+    <tseg unit='MiB'>48</tseg>
+  </smm>
 </features>
 ...</pre>
@@ -2056,6 +2059,42 @@
           <code>off</code>, default <code>on</code>) enable or disable
           System Management Mode.
           <span class="since">Since 2.1.0</span>
+
+          Optional sub-element <code>tseg</code> can be used to specify the
+          amount of memory dedicated to SMM TSEG. The size can be specified as a
+          value of that element, optional attribute <code>unit</code> can be
+          used to specify the unit of the aforementioned value (defaults to
+          'MiB').
+
If this is to be a true paragraph break then these paragraphs needs to
be wrapped in <p> ... </p>; otherwise, this just becomes one long run on
(and quite ugly IMO) paragraph.
...
+          This value is configurable due to the fact that the calculation cannot
+          be done right with the guarantee that it will work correctly.  For
+          QEMU TSEG was disabled up to and including <code>pc-q35-2.9</code> (it
+          does not make sense fo any other machine type than q35).
s/fo/for/
This also appears to be a paragraph break...
...
+          From <code>pc-q35-2.10</code> the default value was changed to 16 MiB.
s/From/Starting with/ ??? (not required, just a though)
...
+          That should suffice for up to 272 VCPUs, 5 GiB guest RAM in total, no
+          hotplug memory range, and 32 GiB of 64-bit PCI MMIO aperture.  Or for
+          48 VCPUs, with 1TB of guest RAM, no hotplug DIMM range, and 32GB of
+          64-bit PCI MMIO aperture. The values may also vary based on the loader
+          the VM is using.
+
+          Additional size might be needed for significantly higher VCPU counts
+          or increased address space (that can be memory, maxMemory, 64-bit PCI
+          MMIO aperture size; roughly 8 MiB of TSEG per 1 TiB of address space)
+          which can also be rounded up.
Uh, oh, hmmm... We seem to have this (perhaps more recent) "thing" about
libvirt having to supply some attribute based on some (mostly difficult
to describe) algorithm that QEMU would use in order to create the
"optimum" size or use for some alternate algorithm. Of course, a few
libvir-list patches like that have been NACK'd because of the feeling
that providing a useful algorithm for a user to "decide upon" a
satisfactory attribute value cannot really be done. Off the top of my
head I can come up with two:
It's kind of a different story.  Think of this as a memory size.  You
cannot determine the "right" amount of memory the VM should have.  You
can try to boot with X and double it until the OS installation succeeds.
And hope you won't need to change it later.
...
1. Add poll-max-ns property of each iothread:
https://www.redhat.com/archives/libvir-list/2017-February/msg01047.html
This is about tunables.  It might change the performance/latency of QEMU
slightly, but that's about it.
...
2. Add support for qcow2 cache (many times, but most recently):
https://www.redhat.com/archives/libvir-list/2017-September/msg00553.html
Similarly here, it allows setting something that can be (at least
slightly) abstracted and in the worst case the performance will be
slightly hindered.

Contrast this with TSEG which, in case it is set incorrectly, will
prevent the machine from booting at all.  If we go to the extreme, not
only can you easily try to find out the right amount to set for a
particular machine, but you can even do that programmatically since when
OVMF fails due to small extended TSEG size it will reboot very fast.
And you can get that in form of events.  When I tried it now it even
looks like you get rtc-change event when the domain doesn't reboot
immediately due to small TSEG size.

I will not put it in the docs because I will not guarantee that this is
the right way to go, but this is how events look for default TSEG size
for a guest that needs a lot more (it has 240 possible vCPUs and 256 TiB
of maximum memory, but because it starts with only 1 vCPU and 1GiB
memory I can try it out easily on my machine:

  virsh # event --domain nixos --all --loop --timestamp
  2018-05-31 09:42:21.060+0000: event 'lifecycle' for domain nixos: Resumed Unpaused
  2018-05-31 09:42:21.066+0000: event 'lifecycle' for domain nixos: Started Booted
  2018-05-31 09:42:21.514+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:21.964+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:22.414+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:22.868+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:23.325+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:23.778+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:24.230+0000: event 'reboot' for domain nixos
  2018-05-31 09:42:24.681+0000: event 'reboot' for domain nixos
  ...

you get the point.

And this is how it looks when I start it with increased size:

  virsh # event --domain nixos --all --loop --timestamp
  2018-05-31 09:43:24.578+0000: event 'lifecycle' for domain nixos: Resumed Unpaused
  2018-05-31 09:43:24.584+0000: event 'lifecycle' for domain nixos: Started Booted
  2018-05-31 09:43:31.808+0000: event 'rtc-change' for domain nixos: 0
  2018-05-31 09:43:32.808+0000: event 'rtc-change' for domain nixos: 0

The reasons for this not being done automatically are (from the top of my head):

- The above is just something I figured out myself, but it's not the
	recommended way written anywhere.  Maybe I'm wrong and it doesn't
	really work, but it still can be done manually.

- You cannot change it however you would like automatically, it is part
  of the guest ABI and we are striving for keeping that stable.

- Trying to figure this out by 1 MiB increments might take some time,
  but increasing it faster might be wasteful.

Basically there is no one-size-fits-all value, no easy way to do it
automatically (maybe what I tried), but very good explanation how to do
that manually and very easy way to do that.  Also, from the SW POV, it
doesn't even depend on the guest OS, just on the loader/bios so if you
have two same domains (like a template in OpenStack for example) you try
it out once and then you have the value that just works and will
continue working until you change something for the domain.  And what
it depends on is clearly written in the documentation.
...
...
+
+          Due to the nature of this setting being similar to "how much RAM
+          should the guest have" users are advised to either consult the
+          documentation of the guest OS or loader (if there is any), or test
+          this by trial-and-error changing the value until the VM boots
+          successfully.  Yet another guiding value for users might be the fact
+          that 48 MiB should be enough for pretty large guests (240 VCPUs and
+          4TB guest RAM), but it is on purpose not set as default as 48 MiB of
+          unavailable RAM might be too much for small guests (e.g. with 512 MiB
+          of RAM).
and this is the exact reason why patches like this get NACKd - because
trial and error should not be a 'desired' means to calculate.
It is not.  They are rejected because either a) there is no
documentation how to properly check if the value is the right value when
doing trial-end-error (this is not the case here since you can see if
the machine boots or not) or b) the values being set are too specific
instead of being abstracted -- setting value in KiB between 0 and the
size of a disk instead of "max_performance" or "min_latency" (this is
not the case here, the documentation explains what the size is and why
it is not about few guessable values).

Basically we are NACKing simple pass-through values without
understanding them and adding some documentation for them.  For example
stuff for which we have documentation along the lines of: "Element asdf
can be used to set the asdf of the domain."
...
bz referenced in patch 5 has an incredible amount of data and
calculations that provide even more insight and details that are lost
when we try to summarize in a libvirt meaningful patch.
Let me know what relevant information from the bz you are missing in the
documentation and I'll gladly add it.
...
What it seems is really needed is an attribute that libvirt provides
that tells QEMU to calculate the optimum size.
...
+
+          See <a href="#elementsMemoryAllocation">Memory Allocation</a>
+          for more details about the <code>unit</code> attribute.
+          <span class="since">Since 4.5.0</span> (QEMU only)
haha - see you put 4.5.0 and this is the 4.4.0 release - so it was
ignored until 4.5.0 was "on the clock" ;-)
Ironically this is pointed out as QEMU only; however, genericxml2xmltest
is used/updated.
So?
...
So, I personally don't mind if this attribute is added; however, I think
we become hypocrites to a certain degree if patches continue to be
blocked/NACKed to other subsystems that have attributes that allow
certain amount of control, but don't come with exact sizing references.
Still if this is pushed, then perhaps those others can use this as the
means to provide a reference to other knobs added.
How much more exact would you wanted to be in terms of sizing?  If it
could be any more exact we wouldn't need the tunable at all.  Please
don't compare it to other tunables that we didn't want exposed just
because it "sounds similar".  I hear lot of people just put stuff like
this into "unknown knobs" box and they treat it the same.  But there are
differences and it's not all
...
You can have my :
Reviewed-by: John Ferlan <jferlan@redhat.com>
I wasn't ever against the R-b, but I guess I'm missing the point.  You
disagree with me on the change made here, but then I get a R-b? :) And
then I go further down and read that the R-b has actually no value at
all and I should wait for another one :D Maybe I'm overthinking this,
but it didn't used to happen back when we used ACKs :)
...
with a few adjustments above; however, another R-By should be obtained
here as well as perhaps a policy change so that other similar such
series could be merged... I guess I'm curious what "thoughts" others may
have regarding adding this "knob" while not allowing others.
John