Re: [libvirt] [virt-devel] RFC: Modelling timers / clocks & tick policies in libvirt

On 03/05/2010 04:27 AM, Daniel P. Berrange wrote:
This mail describes how I'm suggesting libvirt addresses Dor's RFE for timers/tick policies in libvirt
https://bugzilla.redhat.com/show_bug.cgi?id=557285
Thanks to those who've answered my previous mail / irc questions. Please point out any mistakes I made in understanding / modelling this problem
Daniel
Virtual machine timer management in libvirt
===========================================
On PC hardware there are a number of terrible timers / clock sources available to operating systems
* PIT
    Timer with periodic interrupts
* RTC
    Time of Day clock, continuous running
    Timer with periodic interrupts
* Local APIC Timer
    Timer with periodic interrupts
* ACPI Timer
    Timer with periodic interrupts
* TSC
    Read via rdtsc instruction. No interrupts

Never, ever tell anyone this, but it is possible to generate interrupts from the TSC. Doing so is a million orders of perverse because of the problems you list:

    Unreliable on some hardware. eg changes frequency.
    Not synced between cores
    Different HZ across hosts
* HPET
    Multiple timers with periodic interrupts
    Can replace PIT/RTC timers
They all generally suck in real hardware, and this gets worse in virtual machines. Many different approaches to making them suck less in VMWare, Xen & KVM, but there are some reasonably common concepts....
HPET doesn't suck.
Virtual timekeeping problems
----------------------------
Three primary problems / areas to deal with:
* Time of day clock (RTC)
- Initialized to UTC/Localtime/Timezone/UTC+offset
- Two modes of operation:
    1. Guest wallclock: only runs when guest is executing. ie stopped across save/restore, etc
    2. Host wallclock: runs continuously with host wall time.
Windows wants RTC in localtime, Linux can do either but RTC in UTC is easiest to maintain. Guest / host wall clock is a general consideration for all timers, obviously RTC is supposed to be a wallclock, but there is a question: do we allow the guest to see host time exactly or do we hide it?
* Interrupt timers
- Ticks can not always be delivered on time
Policies to deal with "missed" ticks:
1. Deliver at normal rate without catchup
2. Deliver at higher rate to catch up
3. Merge into 1 tick & deliver asap
4. Discard all missed ticks
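To make the four policies concrete, here is a minimal Python sketch of what a guest would be owed after missing some number of tick periods. The function name and the return convention are hypothetical, and the policy labels are the `tickpolicy` values proposed later in this mail; it illustrates the bookkeeping only and is not how any hypervisor implements it.

```python
def ticks_after_stall(policy, missed):
    """Return (ticks_owed, faster_than_normal) after `missed` ticks were missed.

    ticks_owed:         how many of the missed ticks are still injected
    faster_than_normal: True if they are injected ahead of the normal schedule
    """
    if policy == "none":       # 1. deliver at normal rate, guest time lags behind
        return missed, False
    if policy == "catchup":    # 2. deliver at a higher rate until caught up
        return missed, True
    if policy == "merge":      # 3. collapse the backlog into one tick, asap
        return min(missed, 1), True
    if policy == "discard":    # 4. drop the backlog entirely
        return 0, False
    raise ValueError("unknown tick policy: %r" % policy)


for policy in ("none", "catchup", "merge", "discard"):
    print(policy, ticks_after_stall(policy, 5))
```

Only policies 1 and 2 preserve the tick count, so only those keep a tick-counting guest's notion of elapsed time intact.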
The issue is actually more complex than just these policies. A naive implementation of the policy leads to a guest DOS of the host.

We actually have such a bug, and it demands a policy which merges ticks over a certain threshold and does not deliver ASAP. It's tricky and complex to fix because it means our notion of timers for the guest is wrong, and we need to introduce a higher order scheduling behaviour.

In general, there isn't much we can tune here, but what we can tune is whether the other counters (RTC / HPET / TSC / ACPI) stay in sync with ticks delivered. It's not perfect or completely well defined because the tick can't actually be delivered until a fairly complex set of hardware rules is obeyed. This may not be apparent now, because it gets worse as we implement more hardware support for NMIs and SMIs. An ideal solution would sync the other counters when the tick is generated, not when it is injected. However, this leads us back to the DOS attack. There are also problems with SMP timing here (which CPU gets timer interrupts can change, and are they broadcast?). These problems are made worse because we don't gang schedule.
* TSC
- rdtsc instruction can be exposed to guests in two ways
    1. Trap + emulate (slow, but more reliable)
    2. Native (fast, but possibly unreliable)
- Optionally also expose a 'rdtscp' instruction
- Possibly set a fixed HZ independent of host.
There is also
3) a mixed approach; trap and emulate only when required, allow native access and offset appropriately at each exit; and
4) a SMP safe approach; trap and emulate always, and interlock SMP access to the clock so it is globally consistent
5) a secure approach; trap and emulate always and hide host time. This precludes the possibility of SMP, as timing differences can be observed since we don't gang schedule. This obviously has implications for the other timers.
So this variable is not a simple boolean, but a multi-choice.
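The difference between an emulated TSC at a fixed HZ and a native TSC with an offset comes down to simple arithmetic, sketched below in Python. Both function names and the nanosecond guest clock are assumptions made for illustration; real hypervisors do this in the trap handler or via hardware offsetting.

```python
NSEC_PER_SEC = 1_000_000_000


def emulated_tsc(guest_ns, frequency_hz):
    """Trap + emulate: derive the counter from guest time at a fixed rate,
    so the value does not depend on the host's TSC frequency."""
    return guest_ns * frequency_hz // NSEC_PER_SEC


def native_tsc(host_tsc, offset):
    """Native: the guest sees the host counter plus a per-guest offset,
    so it inherits whatever instability the host TSC has."""
    return host_tsc + offset


print(emulated_tsc(2 * NSEC_PER_SEC, 1_000_000))  # 2 seconds at 1 MHz
print(native_tsc(5_000, -4_000))
```

The emulated value stays meaningful across hosts with different HZ, which is the point of the fixed frequency option; the native value is cheap but only as good as the host counter.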
VMWare timekeeping
------------------
* All timers run in "apparent time" ie track guest wallclock
* Missed tick policy is to deliver at higher rate to catchup
* TSC can be switched between native/emulate (virtual_rdtsc=TRUE|FALSE)
* TSC can have hardcoded HZ in emulate mode (apparentHZ=VALUE)
* RTC time of day is synced to host at startup (rtc.diffFromUTC or rtc.startTime)
* VMWare tools reset guest TOD if it gets out of sync
There is also lateness hiding (timeTracker.hideLateness): adjust TSC to compensate for lateness of injected interrupts (it's the slightly buggy counter compensation at each tick I mention above).
Xen timekeeping
---------------
* Virtual platform timer (VPT) used as source for other timers
* VPT has 4 modes
0: delay_for_missed_ticks
Missed ticks are delivered when next scheduled, at the normal rate. RTC runs in guest wallclock, so is delayed. No catchup is attempted
1: no_delay_for_missed_ticks
Missed ticks are delivered when next scheduled, at the normal rate. RTC runs in host wallclock, so is not delayed.
2: no_missed_ticks_pending
Missed ticks are discarded& next tick is delivered normally. RTC runs in host wallclock.
3: one_missed_tick_pending
Missed interrupts are collapsed into a single late tick. RTC runs in host wallclock.
* HPET
Optionally enabled
* TSC. Can run in 4 modes
- auto: emulate if host TSC is unstable. native with invariant TSC
- native: always native regardless of host TSC stability
- emulate: trap + emulate regardless of host TSC invariant
- pvrdtsc: native, requiring invariant TSC. Also exposes rdtscp instruction
TSC is complex enough without RDTSCP. Let's consider rdtscp as a host optimization for vendors of hardware with buggy clocks who want fast gettimeofday system calls. We already are compensating to try to keep virtual TSC in sync on KVM and probably don't need this mode.
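The four VPT modes listed above can be summarised as a lookup table. This Python sketch pairs each `timer_mode` number with a tick policy and an RTC wallclock; the policy labels are the `tickpolicy` values proposed later in this mail, and the pairings for modes 0 to 2 are one reading of the descriptions above (only mode 3 appears in the Xen mapping example). The dictionary and function names are hypothetical.

```python
# timer_mode -> (Xen name, tick policy, RTC wallclock), read off the mode list.
XEN_TIMER_MODES = {
    0: ("delay_for_missed_ticks", "none", "guest"),
    1: ("no_delay_for_missed_ticks", "none", "host"),
    2: ("no_missed_ticks_pending", "discard", "host"),
    3: ("one_missed_tick_pending", "merge", "host"),
}


def describe_timer_mode(mode):
    """Render one table entry as a short human-readable string."""
    name, tickpolicy, wallclock = XEN_TIMER_MODES[mode]
    return "%s: tickpolicy=%s wallclock=%s" % (name, tickpolicy, wallclock)


print(describe_timer_mode(3))
```

Note that modes 0 and 1 differ only in the wallclock, which is why the proposal keeps tick policy and wallclock as separate attributes.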
KVM timekeeping
---------------
* PIT: can be in kernel, or userspace (userspace deprecated for KVM)
Default tick policies differ for both impls
- Userspace: Default: missed ticks are delivered when next scheduled at normal rate
-tdf flag enables tick reinjection to catch up
- Kernel: Default: Missed ticks are delivered at higher rate to catch up
-no-kvm-pit-reinjection to disable tick reinjection catchup
* RTC
TOD clock can run in host or guest wallclock (clock=host|guest)
Default: missed ticks are delivered when next scheduled at normal rate
-rtc-td-hack or -clock driftfix=slew: missed ticks are delivered at a higher rate to catchup
* TSC
- Always runs native.
* HPET
- Optionally enabled/disabled
Mapping in libvirt XML
----------------------
Currently supports setting Time of Day clock via
<clock offset="utc"/>
Always sync to UTC
<clock offset="localtime"/>
Always sync to host timezone
<clock offset="timezone" timezone='Europe/Paris'/>
Sync to arbitrary timezone
<clock offset="variable" adjustment='123456'/>
Sync to UTC + arbitrary offset
Proposal to model all timer policies as sub-elements of this <clock/>. In general we will allow zero or more <timer/> elements following the syntax:

<timer name='platform|pit|rtc|hpet|tsc'
       wallclock='host|guest'
       tickpolicy='none|catchup|merge|discard'
       frequency='123'
       mode='auto|native|emulate|paravirt'
       present='yes|no' />
Meaning of 'name':
Names map to regular PC timers / clocks. 'platform' refers to the (optional) master virtual clock source that may be used to drive the policy of other clocks (eg as used in Xen). Which clocks are controlled by the platform clock is left undefined, because it has varied in Xen over time.
Meaning of 'tickpolicy':
none: continue to deliver at normal rate (ie ticks are delayed)
catchup: deliver at higher rate to catchup
merge: ticks merged into 1 single tick
discard: all missed ticks are discarded
Meaning of 'wallclock':
Only valid for name='rtc' or 'platform'
host: RTC wallclock always tracks host time
guest: RTC wallclock always tracks guest time
Meaning of 'frequency':
Set a fixed frequency in HZ.
NB: Only relevant for TSC. All other timers are fixed (PIT, RTC), or fully guest controlled frequency (HPET)
Actually, the guest doesn't control the HPET base frequency, only the divider. I think. I'd need to recheck the spec.
Meaning of 'mode':
Control how the clock is exposed to guest.
auto: native if safe, otherwise emulate
native: always native
emulate: always emulate
paravirt: native + paravirtualize
NB: Only relevant for TSC. All other timers are always emulated.
auto, native, emulate can map nicely for us, but it would be good to have an smp safe mode. (A secure mode is more of a global setting for all timers).
Meaning of 'present':
Used to override default set of timers visible to the guest. eg to enable or disable the HPET
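The attribute rules above are simple enough to check mechanically. The following Python sketch validates a `<timer/>` element against the allowed values and the "only valid for" notes in this proposal. The `check_timer` helper is hypothetical, the proposal may still change, and the `offset` attribute on `<clock/>` is taken from the existing syntax shown earlier.

```python
import xml.etree.ElementTree as ET

# Allowed values, copied from the proposed <timer/> syntax.
ALLOWED = {
    "name": {"platform", "pit", "rtc", "hpet", "tsc"},
    "wallclock": {"host", "guest"},
    "tickpolicy": {"none", "catchup", "merge", "discard"},
    "mode": {"auto", "native", "emulate", "paravirt"},
    "present": {"yes", "no"},
}


def check_timer(timer):
    """Return a list of problems found in one <timer/> element."""
    problems = []
    name = timer.get("name")
    if name is None:
        problems.append("missing name")
    for attr, value in timer.attrib.items():
        if attr == "frequency":
            if not value.isdigit():
                problems.append("frequency must be an integer in HZ")
        elif attr not in ALLOWED:
            problems.append("unknown attribute %s" % attr)
        elif value not in ALLOWED[attr]:
            problems.append("bad %s value %r" % (attr, value))
    # Restrictions stated in the text of the proposal.
    if "wallclock" in timer.attrib and name not in ("rtc", "platform"):
        problems.append("wallclock is only valid for rtc or platform")
    for attr in ("frequency", "mode"):
        if attr in timer.attrib and name != "tsc":
            problems.append("%s is only relevant for tsc" % attr)
    return problems


clock = ET.fromstring(
    "<clock offset='utc'>"
    "<timer name='rtc' tickpolicy='catchup' wallclock='guest'/>"
    "<timer name='pit' frequency='100'/>"
    "</clock>"
)
for timer in clock.findall("timer"):
    print(timer.get("name"), check_timer(timer))
```

Keeping the allowed values in one table makes it cheap to extend the enumerations as hypervisors grow new policies.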
Mapping to VMWare
-----------------
eg with guest config showing
diffFromUTC='123456'
apparentHZ='123456'
virtual_rdtsc=False
libvirt XML gets:
<clock offset='variable' adjustment='123456'>
  <timer name='tsc' frequency='123456' mode='native'/>
</clock>
Mapping to Xen
--------------
eg with guest config showing
timer_mode=3
hpet=1
tsc_mode=2
localtime=1

libvirt XML gets:

<clock offset='localtime'>
  <timer name='platform' tickpolicy='merge' wallclock='host'/>
  <timer name='hpet'/>
  <timer name='tsc' mode='native'/>
</clock>
Mapping to KVM
--------------
eg with guest ARGV showing
-no-kvm-pit-reinjection -clock base=localtime,clock=guest,driftfix=slew -no-hpet
libvirt XML gets:

<clock offset='localtime'>
  <timer name='rtc' tickpolicy='catchup' wallclock='guest'/>
  <timer name='pit' tickpolicy='none'/>
  <timer name='hpet' present='no'/>
</clock>
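The KVM mapping above can be sketched as a small translator from ARGV to the proposed XML. This Python example handles only the three flags used in the example, with spellings copied verbatim from this mail rather than checked against any particular QEMU release. The function name is hypothetical, and the `<clock/>` attribute is written as `offset`, as in the existing syntax shown earlier.

```python
import xml.etree.ElementTree as ET


def kvm_argv_to_clock(argv):
    """Translate the flags from the example ARGV into a <clock/> element."""
    clock = ET.Element("clock")
    args = iter(argv)
    for arg in args:
        if arg == "-clock":
            # The option value is a comma-separated list of key=value pairs.
            opts = dict(kv.split("=", 1) for kv in next(args).split(","))
            if "base" in opts:
                clock.set("offset", opts["base"])
            rtc = ET.SubElement(clock, "timer", name="rtc")
            if opts.get("driftfix") == "slew":
                rtc.set("tickpolicy", "catchup")
            if "clock" in opts:
                rtc.set("wallclock", opts["clock"])
        elif arg == "-no-kvm-pit-reinjection":
            ET.SubElement(clock, "timer", name="pit", tickpolicy="none")
        elif arg == "-no-hpet":
            ET.SubElement(clock, "timer", name="hpet", present="no")
    return clock


argv = ["-no-kvm-pit-reinjection",
        "-clock", "base=localtime,clock=guest,driftfix=slew",
        "-no-hpet"]
print(ET.tostring(kvm_argv_to_clock(argv), encoding="unicode"))
```

Attributes are only emitted when the corresponding option is present, so hypervisor defaults are never guessed at.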
Further reading
---------------
VMWare has the best doc:
http://www.vmware.com/pdf/vmware_timekeeping.pdf
Xen:
Docs on 'tsc_mode' at
$SOURCETREE/docs/misc/tscmode.txt
Docs for 'timer_mode' in the source code only:
xen/include/public/hvm/params.h
KVM:
No docs at all. Guess from -help descriptions, reading source code & asking clever people about it :-)
Let me propose an XML mapping a bit later today. I haven't had coffee yet, and we know what that can do.

Zach

On Fri, Mar 05, 2010 at 06:50:47AM -1000, Zachary Amsden wrote:
On 03/05/2010 04:27 AM, Daniel P. Berrange wrote:
* HPET Multiple timers with periodic interrupts Can replace PIT/RTC timers
They all generally suck in real hardware, and this gets worse in virtual machines. Many different approaches to making them suck less in VMWare, Xen & KVM, but there are some reasonably common concepts....
HPET doesn't suck.
The VMWare timekeeping docs mention that it has timeout race conditions, a poorly defined spec for timer granularity, drift & speed of access, & bad implementations in the real world, which I read as 'sucks' ;-)
* Interrupt timers
- Ticks can not always be delivered on time
Policies to deal with "missed" ticks:
1. Deliver at normal rate without catchup
2. Deliver at higher rate to catch up
3. Merge into 1 tick & deliver asap
4. Discard all missed ticks
The issue is actually more complex than just these policies. A naive implementation of the policy leads to a guest DOS of the host.
We actually have such a bug, and it demands a policy which merges ticks over a certain threshold and does not deliver ASAP. It's tricky and complex to fix because it means our notion of timers for the guest is wrong, and we need to introduce a higher order scheduling behaviour.
In general, there isn't much we can tune here, but what we can tune is whether the other counters (RTC / HPET / TSC / ACPI) stay in sync with ticks delivered. It's not perfect or completely well defined because the tick can't actually be delivered until a fairly complex set of hardware rules is obeyed. This may not be apparent now, because it gets worse as we implement more hardware support for NMIs and SMIs. An ideal solution would sync the other counters when the tick is generated, not when it is injected. However, this leads us back to the DOS attack. There are also problems with SMP timing here (which CPU gets timer interrupts can change, and are they broadcast?). These problems are made worse because we don't gang schedule.
FYI, I wasn't trying to suggest good / bad policies here. I was just attempting to document the policies that I see have been implemented so far. For the libvirt XML the key issue is to identify a way to list possible policies that can be extended as new ones appear in hypervisors.
* TSC
- rdtsc instruction can be exposed to guests in two ways
    1. Trap + emulate (slow, but more reliable)
    2. Native (fast, but possibly unreliable)
- Optionally also expose a 'rdtscp' instruction
- Possibly set a fixed HZ independent of host.
There is also
3) a mixed approach; trap and emulate only when required, allow native access and offset appropriately at each exit; and
4) a SMP safe approach; trap and emulate always, and interlock SMP access to the clock so it is globally consistent
5) a secure approach; trap and emulate always and hide host time. This precludes the possibility of SMP, as timing differences can be observed since we don't gang schedule. This obviously has implications for the other timers.
So this variable is not a simple boolean, but a multi-choice.
Yep, I captured this increased range of options later after seeing that Xen has 4 possible choices now!
VMWare timekeeping
------------------
* All timers run in "apparent time" ie track guest wallclock
* Missed tick policy is to deliver at higher rate to catchup
* TSC can be switched between native/emulate (virtual_rdtsc=TRUE|FALSE)
* TSC can have hardcoded HZ in emulate mode (apparentHZ=VALUE)
* RTC time of day is synced to host at startup (rtc.diffFromUTC or rtc.startTime)
* VMWare tools reset guest TOD if it gets out of sync
There is also lateness hiding (timeTracker.hideLateness): adjust TSC to compensate for lateness of injected interrupts (it's the slightly buggy counter compensation at each tick I mention above).
Thanks, I'd not seen any reference to that one in the docs.
Xen timekeeping
---------------
* TSC. Can run in 4 modes
- auto: emulate if host TSC is unstable. native with invariant TSC
- native: always native regardless of host TSC stability
- emulate: trap + emulate regardless of host TSC invariant
- pvrdtsc: native, requiring invariant TSC. Also exposes rdtscp instruction
TSC is complex enough without RDTSCP. Let's consider rdtscp as a host optimization for vendors of hardware with buggy clocks who want fast gettimeofday system calls. We already are compensating to try to keep virtual TSC in sync on KVM and probably don't need this mode.
I included rdtscp because it is one of the things that latest Xen 4.0 tree now implements, so we need to be able to represent it in the libvirt XML.
Meaning of 'mode':
Control how the clock is exposed to guest.
auto: native if safe, otherwise emulate
native: always native
emulate: always emulate
paravirt: native + paravirtualize
NB: Only relevant for TSC. All other timers are always emulated.
auto, native, emulate can map nicely for us, but it would be good to have an smp safe mode. (A secure mode is more of a global setting for all timers).
For any of the enumerations I fully expect that we would add further allowed values to the libvirt XML over time. The goal is to get the baseline on current implementations & try to keep it easily extensible for future ideas
Let me propose an XML mapping a bit later today. I haven't had coffee yet, and we know what that can do.
Ok, thanks for the feedback so far.

Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 03/05/2010 07:05 AM, Daniel P. Berrange wrote:
On Fri, Mar 05, 2010 at 06:50:47AM -1000, Zachary Amsden wrote:
On 03/05/2010 04:27 AM, Daniel P. Berrange wrote:
* HPET Multiple timers with periodic interrupts Can replace PIT/RTC timers
They all generally suck in real hardware, and this gets worse in virtual machines. Many different approaches to making them suck less in VMWare, Xen & KVM, but there are some reasonably common concepts....
HPET doesn't suck.
The VMWare timekeeping docs mention that it has timeout race conditions, a poorly defined spec for timer granularity, drift & speed of access, & bad implementations in the real world, which I read as 'sucks' ;-)
Which can also be read as nearly perfectly virtualizable due to extreme variations in tolerance and hardened guest code ;)