On 5/13/19 12:17 PM, Daniel P. Berrangé wrote:
This is a long mail about ENOMEM (OOM) handling in libvirt. The
executive
summary is that it is not worth the maint cost because:
- this code will almost never run on Linux hosts
- if it does run it will likely have bad behaviour silently dropping
data or crashing the process
- apps using libvirt often do so via a non-C language that aborts/exits
the app on OOM regardless, or use other C libraries that abort
- we can build a system that is more reliable when OOM happens by
not catching OOM, instead simply letting apps exit, restart and
carry on where they left off
The long answer follows...
I'm up for dropping OOM handling. Since we're linking with a lot of
libraries I don't think we can be confident that we won't abort() even
now anyway.
The background
==============
Since the first commit libvirt has attempted to handle out of memory (OOM)
errors in the same way it does for any other problem. Namely, a virError will
be raised, and the method will jump to its cleanup/error label. The error
bubbles up the stack and in theory someone or something will catch this and do
something sensible/useful. This has long been considered "best practice" for
most C libraries, especially those used for system services. This mail makes
the argument that it is in fact /not/ "best practice" to try to handle OOM.
OOM handling is very difficult to get correct because it is something that
developers almost never encounter and thus the code paths are rarely run. We
designed our memory allocation APIs such that we get compile time errors if
code forgets to check the return value for failure. This is good as it
eliminates an entire class of bugs. Our error handling goto label pattern
tries to align OOM handling with other general error handling which is more
commonly tested.
We have code in the allocators which lets us run unit tests simulating the
failure of any allocation that is made during the test. Executing this is
extraordinarily time consuming as some of our unit tests have many 1000's or
even 10's of 1000's of allocations. The running time to simulate OOM for each
allocation is O(N^2) which does not scale well. As a result we've probably
only run these OOM tests a handful of times over the years.
Well, I've tried this and so far, the only problems I've found were in
tests where we're ignoring revals of VIR_ALLOC() or its friends (e.g.
vir*New()) or we're checking it after it's used already.
BTW: to make it work I had to disable failing from mocks (obviously).
The tests show we generally do remarkly well at OOM handling, but none the
less we have *always* found bugs in our handling where we fail to report the
OOM (as in, continue to run but with missing data), or we crash when cleaning
up from OOM. Our unit tests don't cover anywhere near all of our codebase
though. We have great coverage of XML parsing/formatting, and sporadic coverage
of everything else.
IOW, despite our clever API designs & coding patterns designed to improve our
OOM handling, we can not have confidence that we will actually operate correctly
in an OOM scenario.
I beg to disagree. The fact that our cleanup paths are in 99% the same
as error paths means that if there are no mem-leaks (and if
transformation to VIR_AUTOFREE is done the chances are low), then we
propably do good on OOM too.
<snip/>
The implementation
==================
Assuming a decision to abort on OOM, libvirt can nwo follow QEMU's lead and
make use of the GLib library.
No, please no. Firstly, glib is a new C dialect that only a few of us
speak. Secondly, glib adds some padding around its objects => libvirt's
memory footprint will grow. Thirdly, it's yet another library to link
with (on my system libvirt links with 53 libraries already).
Michal