On Fri, Mar 05, 2010 at 06:50:47AM -1000, Zachary Amsden wrote:
On 03/05/2010 04:27 AM, Daniel P. Berrange wrote:
>
> * HPET
> Multiple timers with periodic interrupts
> Can replace PIT/RTC timers
>
>They all generally suck in real hardware, and this gets worse in virtual
>machines. Many different approaches to making them suck less in VMWare,
>Xen & KVM, but there are some reasonably common concepts....
>
HPET doesn't suck.
The VMWare timekeeping docs mention that it has timeout race conditions,
poorly defined spec for timer granularity, drift & speed of access, & bad
implementations in the real world which I read as 'sucks' ;-)
> * Interrupt timers
>
> - Ticks can not always be delivered on time
>
> Policies to deal with "missed" ticks:
>
> 1. Deliver at normal rate without catchup
> 2. Deliver at higher rate to catch up
> 3. Merge into 1 tick & deliver asap
>
> 4. Discard all missed ticks
>
The issue is actually more complex than just these policies. A naive
implementation of the policy leads to a guest DOS of the host.
We actually have such a bug, and it demands a policy which merges ticks
over a certain threshold and does not deliver ASAP. It's tricky and
complex to fix because it means our notion of timers for the guest is
wrong, and we need to introduce a higher order scheduling behaviour.
In general, there isn't much we can tune here, but what we can tune is
whether the other counters (RTC / HPET / TSC / ACPI) stay in sync with
ticks delivered. It's not perfect or completely well defined because
the tick can't actually be delivered until a fairly complex set of
hardware rules is obeyed. This may not be apparent now, because it gets
worse as we implement more hardware support for NMIs and SMIs. An ideal
solution would sync the other counters when the tick is generated, not
when it is injected. However, this leads us back to the DOS attack.
There are also problems with SMP timing here (which CPU gets timer
interrupts can change, and are they broadcast?). These problems are
made worse because we don't gang schedule.
FYI, I wasn't trying to suggest good / bad policies here. I was just
attempting to document the policies that I see have been implemented
so far. For the libvirt XML the key issue is to identify a way to
list possible policies that can be extended as new ones appear in
hypervisors.
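
To make the four policies listed above concrete, here is a minimal Python
sketch of how many of the missed ticks a guest would still be sent under
each one. The names TickPolicy and ticks_owed and the cap parameter are
hypothetical, picked only for illustration. cap stands in for the "merge
ticks over a certain threshold" idea and is not taken from any hypervisor.

```python
from enum import Enum


class TickPolicy(Enum):
    """Hypothetical names for the four policies in the list above."""
    DELAY = "delay"      # 1. deliver at normal rate, no catch-up
    CATCHUP = "catchup"  # 2. deliver at a higher rate to catch up
    MERGE = "merge"      # 3. merge into one tick, deliver asap
    DISCARD = "discard"  # 4. discard all missed ticks


def ticks_owed(policy, missed, cap=None):
    """Return how many of `missed` ticks are still delivered to the guest.

    `cap` is an assumed safety threshold: any backlog beyond it is merged
    away so that replaying ticks cannot monopolise the host.
    """
    if missed < 0:
        raise ValueError("missed must be non-negative")
    if policy is TickPolicy.DISCARD:
        return 0
    if policy is TickPolicy.MERGE:
        return 1 if missed > 0 else 0
    # DELAY and CATCHUP both replay every tick; they differ only in rate.
    owed = missed
    if cap is not None:
        owed = min(owed, cap)
    return owed
```

Keeping the policy as a named enumeration rather than a boolean is what
lets new values be added later without changing existing ones.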
> * TSC
> - rdtsc instruction can be exposed to guests in two ways
>
> 1. Trap + emulate (slow, but more reliable)
> 2. Native (fast, but possibly unreliable)
>
> Optionally also expose a 'rdtscp' instruction
>
> Possibly set a fixed HZ independent of host.
>
There is also
3) a mixed approach; trap and emulate only when required, allow native
access and offset appropriately at each exit; and
4) a SMP safe approach; trap and emulate always, and interlock SMP
access to the clock so it is globally consistent
5) a secure approach; trap and emulate always and hide host time. This
precludes the possibility of SMP, as timing differences can be observed
since we don't gang schedule. This obviously has implications for the
other timers.
So this variable is not a simple boolean, but a multi-choice.
Yep, I captured this increased range of options later after seeing that
Xen has 4 possible choices now!
>------------------
>
> * All timers run in "apparent time" ie track guest wallclock
> * Missed tick policy is to deliver at higher rate to catchup
> * TSC can be switched between native/emulate (virtual_rdtsc=TRUE|FALSE)
> * TSC can have hardcoded HZ in emulate mode (apparentHZ=VALUE)
> * RTC time of day is synced to host at startup (rtc.diffFromUTC or
> rtc.startTime)
> * VMWare tools reset guest TOD if it gets out of sync
>
There is also lateness hiding (timeTracker.hideLateness): adjust TSC to
compensate for lateness of injected interrupts (it's the slightly buggy
counter compensation at each tick I mention above).
Thanks, I'd not seen any reference to that one in the docs.
>Xen timekeeping
>---------------
>
> * TSC. Can run in 4 modes
>
> - auto: emulate if host TSC is unstable. native with invariant TSC
> - native: always native regardless of host TSC stability
> - emulate: trap + emulate regardless of host TSC invariant
> - pvrdtsc: native, requiring invariant TSC. Also exposes rdtscp
> instruction
>
TSC is complex enough without RDTSCP. Let's consider rdtscp as a host
optimization for vendors of hardware with buggy clocks who want fast
gettimeofday system calls. We already are compensating to try to keep
virtual TSC in sync on KVM and probably don't need this mode.
I included rdtscp because it is one of the things that latest Xen 4.0 tree
now implements, so we need to be able to represent it in the libvirt XML.
>Meaning of 'mode':
>
> Control how the clock is exposed to guest.
>
> auto: native if safe, otherwise emulate
> native: always native
> emulate: always emulate
> paravirt: native + paravirtualize
>
> NB: Only relevant for TSC. All other timers are always emulated.
>
auto, native, emulate can map nicely for us, but it would be good to
have an smp safe mode. (A secure mode is more of a global setting for
all timers).
For any of the enumerations I fully expect that we would add further allowed
values to the libvirt XML over time. The goal is to get the baseline on
current implementations & try to keep it easily extensible for future ideas.
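
As a rough illustration of the 'mode' values described above, the
following Python sketch resolves a requested mode to how rdtsc would end
up being exposed. The function name resolve_tsc_mode and the
host_tsc_invariant flag are hypothetical; how a hypervisor actually decides
that the host TSC is safe is outside the scope of this sketch.

```python
def resolve_tsc_mode(mode, host_tsc_invariant):
    """Return "native" or "emulate" for a requested TSC mode.

    `host_tsc_invariant` is an assumed boolean meaning "native access is
    safe on this host".
    """
    if mode == "auto":
        # native if safe, otherwise emulate
        return "native" if host_tsc_invariant else "emulate"
    if mode == "native":
        return "native"
    if mode == "emulate":
        return "emulate"
    if mode == "paravirt":
        # native + paravirtualize; Xen's pvrdtsc requires an invariant
        # TSC, which this sketch does not enforce
        return "native"
    raise ValueError("unknown TSC mode: %r" % (mode,))
```

Rejecting unknown strings explicitly leaves room for further values, such
as an SMP safe mode, to be added to the enumeration later.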
>Mapping to VMWare
>-----------------
>
>eg with guest config showing
>
> diffFromUTC='123456'
> apparentHZ='123456'
> virtual_rdtsc=False
>
>libvirt XML gets:
>
> <clock mode='variable' adjustment='123456'>
> <timer name='tsc' frequency='123456' mode='native'/>
> </clock>
>
>
>Mapping to Xen
>--------------
>
>eg with guest config showing
>
> timer_mode=3
> hpet=1
> tsc_mode=2
> localtime=1
>
> <clock mode='localtime'>
> <timer name='platform' tickpolicy='merge' wallclock='host'/>
> <timer name='hpet'/>
> <timer name='tsc' mode='native'/>
> </clock>
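
The Xen example above can be sketched as a small Python translation
function. This is only an illustration under stated assumptions: the
function name xen_clock_xml and the lookup tables are hypothetical, and the
tables contain only the values that appear in the example (timer_mode=3 and
tsc_mode=2). Other values are deliberately left unmapped rather than
guessed.

```python
import xml.etree.ElementTree as ET

# Only the values shown in the example are filled in (assumption: the
# remaining timer_mode / tsc_mode values would be added the same way).
TIMER_MODE = {3: {"tickpolicy": "merge", "wallclock": "host"}}
TSC_MODE = {2: "native"}


def xen_clock_xml(cfg):
    """Build a <clock> element string from a dict of Xen guest settings."""
    clock = ET.Element("clock")
    if cfg.get("localtime") == 1:
        clock.set("mode", "localtime")
    if "timer_mode" in cfg:
        # Raises KeyError for values this sketch does not know about.
        attrs = TIMER_MODE[cfg["timer_mode"]]
        ET.SubElement(clock, "timer", name="platform", **attrs)
    if cfg.get("hpet") == 1:
        ET.SubElement(clock, "timer", name="hpet")
    if "tsc_mode" in cfg:
        ET.SubElement(clock, "timer", name="tsc",
                      mode=TSC_MODE[cfg["tsc_mode"]])
    return ET.tostring(clock, encoding="unicode")


if __name__ == "__main__":
    print(xen_clock_xml({"timer_mode": 3, "hpet": 1,
                         "tsc_mode": 2, "localtime": 1}))
```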
>
>
>Mapping to KVM
>--------------
>
>eg with guest ARGV showing
>
> -no-kvm-pit-reinjection
> -clock base=localtime,clock=guest,driftfix=slew
> -no-hpet
>
>
> <clock mode='localtime'>
> <timer name='rtc' tickpolicy='catchup' wallclock='guest'/>
> <timer name='pit' tickpolicy='none'/>
> <timer name='hpet' present='no'/>
> </clock>
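
The KVM example above could be translated along the following lines. This
is a minimal Python sketch, not an actual libvirt implementation: the
function name kvm_clock_xml is hypothetical, the flags are read exactly as
spelled in the example ARGV, and the driftfix=slew to tickpolicy='catchup'
pairing is inferred from that single example.

```python
import xml.etree.ElementTree as ET


def kvm_clock_xml(argv):
    """Build a <clock> element string from a KVM argument list."""
    opts = {}
    for i, arg in enumerate(argv):
        if arg == "-clock" and i + 1 < len(argv):
            # e.g. "base=localtime,clock=guest,driftfix=slew"
            opts = dict(kv.split("=", 1) for kv in argv[i + 1].split(","))

    clock = ET.Element("clock")
    if opts.get("base") == "localtime":
        clock.set("mode", "localtime")

    rtc = {}
    if opts.get("driftfix") == "slew":
        rtc["tickpolicy"] = "catchup"  # inferred from the example only
    if "clock" in opts:
        rtc["wallclock"] = opts["clock"]
    if rtc:
        ET.SubElement(clock, "timer", name="rtc", **rtc)

    if "-no-kvm-pit-reinjection" in argv:
        ET.SubElement(clock, "timer", name="pit", tickpolicy="none")
    if "-no-hpet" in argv:
        ET.SubElement(clock, "timer", name="hpet", present="no")
    return ET.tostring(clock, encoding="unicode")


if __name__ == "__main__":
    print(kvm_clock_xml(["-no-kvm-pit-reinjection", "-clock",
                         "base=localtime,clock=guest,driftfix=slew",
                         "-no-hpet"]))
```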
>
>
>
>Further reading
>---------------
>
>VMWare has the best doc:
>
> http://www.vmware.com/pdf/vmware_timekeeping.pdf
>
>Xen:
>
> Docs on 'tsc_mode' at
>
> $SOURCETREE/docs/misc/tscmode.txt
>
> Docs for 'timer_mode' in the source code only:
>
> xen/include/public/hvm/params.h
>
>KVM:
>
> No docs at all. Guess from -help descriptions, reading source code &
> asking clever people about it :-)
>
Let me propose an XML mapping a bit later today. I haven't had coffee
yet, and we know what that can do.
Ok, thanks for the feedback so far.
Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|