On a Monday in 2021, BiaoxiangYe wrote:
> From: BiaoXiang Ye <yebiaoxiang(a)huawei.com>
>
> Setting the system time backward would lead to a
> multiplication overflow in function virKeepAliveStart.
> FunctionvirKeepAliveTimerInternal got the same bug too.
>
> Backtrace below:
> #0 0x0000ffffae898470 in raise () from /usr/lib64/libc.so.6
> #1 0x0000ffffae89981c in abort () from /usr/lib64/libc.so.6
> #2 0x0000ffffaf9a36a8 in __mulvsi3 () from /usr/lib64/libvirt.so.0
> #3 0x0000ffffaf8fd9e8 in virKeepAliveStart (ka=0xaaaaf954ce10,
interval=interval@entry=0,
> count=count@entry=0) at ../../src/rpc/virkeepalive.c:283
> #4 0x0000ffffaf908560 in virNetServerClientStartKeepAlive (client=0xaaaaf954cbe0)
> at ../../src/rpc/virnetserverclient.c:1628
> #5 0x0000aaaac57eb6dc in remoteDispatchConnectSupportsFeature
(server=0xaaaaf95309d0,
> msg=0xaaaaf9549d90, ret=0xffff8c007fc0, args=0xffff8c002e70,
rerr=0xffff9ea054a0,
> client=0xaaaaf954cbe0) at ../../src/remote/remote_daemon_dispatch.c:5063
> #6 remoteDispatchConnectSupportsFeatureHelper (server=0xaaaaf95309d0,
client=0xaaaaf954cbe0,
> msg=0xaaaaf9549d90, rerr=0xffff9ea054a0, args=0xffff8c002e70,
ret=0xffff8c007fc0)
> at ./remote/remote_daemon_dispatch_stubs.h:3503
> #7 0x0000ffffaf9053a4 in virNetServerProgramDispatchCall(msg=0xaaaaf9549d90,
client=0xaaaaf954cbe0,
> server=0x0, prog=0xaaaaf953a170) at ../../src/rpc/virnetserverprogram.c:451
> #8 virNetServerProgramDispatch (prog=0xaaaaf953a170, server=0x0,
server@entry=0xaaaaf95309d0,
> client=0xaaaaf954cbe0, msg=0xaaaaf9549d90) at
../../src/rpc/virnetserverprogram.c:306
> #9 0x0000ffffaf90a6bc in virNetServerProcessMsg (msg=<optimized out>,
prog=<optimized out>,
> client=<optimized out>, srv=0xaaaaf95309d0) at
../../src/rpc/virnetserver.c:137
> #10 virNetServerHandleJob (jobOpaque=0xaaaaf950df80, opaque=0xaaaaf95309d0)
> at ../../src/rpc/virnetserver.c:154
> #11 0x0000ffffaf812e14 in virThreadPoolWorker (opaque=<optimized out>)
> at ../../src/util/virthreadpool.c:163
> #12 0x0000ffffaf81237c in virThreadHelper (data=<optimized out>) at
../../src/util/virthread.c:246
> #13 0x0000ffffaea327ac in ?? () from /usr/lib64/libpthread.so.0
> #14 0x0000ffffae93747c in ?? () from /usr/lib64/libc.so.6
> (gdb) frame 3
> #3 0x0000ffffaf8fd9e8 in virKeepAliveStart (ka=0xaaaaf954ce10,
interval=interval@entry=0,
> count=count@entry=0) at ../../src/rpc/virkeepalive.c:283
> 283 timeout = ka->interval - delay;
> (gdb) list
> 278 now = time(NULL);
> 279 delay = now - ka->lastPacketReceived; <='delay' got a negative
value
> 280 if (delay > ka->interval)
> 281 timeout = 0;
> 282 else
> 283 timeout = ka->interval - delay;
> 284 ka->intervalStart = now - (ka->interval - timeout);
> 285 ka->timer = virEventAddTimeout(timeout * 1000, virKeepAliveTimer, <=
multiplication overflow
> 286 ka, virObjectFreeCallback);
> 287 if (ka->timer < 0)
> (gdb) p now
> $2 = 18288001
> (gdb) p ka->lastPacketReceived
> $3 = 1609430405
>
> Signed-off-by: BiaoXiang Ye <yebiaoxiang(a)huawei.com>
> ---
> src/rpc/virkeepalive.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/src/rpc/virkeepalive.c b/src/rpc/virkeepalive.c
> index 860b91b6b1..e662484ea6 100644
> --- a/src/rpc/virkeepalive.c
> +++ b/src/rpc/virkeepalive.c
> @@ -119,12 +119,25 @@ virKeepAliveTimerInternal(virKeepAlivePtr ka,
> if (ka->interval <= 0 || ka->intervalStart == 0)
> return false;
>
> + if (now < ka->intervalStart) {
> + VIR_WARN("system time jump back detected. now=%ld,"
> + " intervalStart=%ld",
> + now, ka->intervalStart);
> + ka->intervalStart = now;
> + }
> +
Instead of detecting a jump backwards, we could use a monotonic timer
that does not go backwards.
g_get_monotonic_time () from GLib should do that
https://developer.gnome.org/glib/stable/glib-Date-and-Time-Functions.html...
(Although they say the timer might not tick while the machine is
suspended - not sure whether we need that here or not)
Your connections are doomed no matter what if you suspend for any
non-negligible amount of time, as the low level TCP connection
will get timed out by the OS most likely.
IOW regardless of which timer we use for keepalive, we're
going to start counting the keepalive failures from when we
resume. It is probably only going to show a difference for any
keepalive responses that we had received, but not yet processed
at time of suspend.
So I think using monotonic timer would likely be ok here.
Regards,
Daniel
--
|: