On Fri, Feb 02, 2018 at 14:57:54 +0100, Martin Kletzander wrote:
On Fri, Feb 02, 2018 at 02:29:03PM +0100, Peter Krempa wrote:
> On Fri, Feb 02, 2018 at 12:44:03 +0100, Martin Kletzander wrote:
> > Since one of the things in capabilities (info from resctrl updated with data
> > about caches) can be change on the system by remounting the /sys/fs/resctrl
with
> > different options, the capabilities need to be refreshed. There is a better
fix
> > in the works, but it's going to be way bigger than this (hence the XXX
note
> > there), so for the time being let's workaround this. And in order not to
slow
> > down the domain starting, only get the capabilities if there are any
cachetunes.
> >
> > Resolves:
https://bugzilla.redhat.com/show_bug.cgi?id=1540780
>
> This BZ describes a crash if the filesystem is remounted, but you are
> attempting to fix this not by fixing the code that crashed but by
> re-loading the information if possibly somebody remounted it.
>
> This does not seem to be the correct fix since you still have a race
> window, where the options can be changed after the refresh is executed
> and prior to using them in the code where it actually crashed.
Yeah, I'm looking at that as well. It will need a restructuring (moving some
conf code to util - it'll also look nicer), but fix for exactly what is
happening here is enough for now.
Well, I'm not okay with selling this as a fix for the crash described in
the bugzilla. I might be okay with doing this as a mitigation for stale
data, but this is not a fix for the crash in any way.
We have the same kind of issue (minus the crash) with hugetlbfs mount
data or host cpu maps in the numa host description and we don't refresh
the capabilities at every start of those.
So there's the problem that the cache filesystem data is stale and your
host might fail to start which is inconvenient but not a big of a
problem.
The second issue is the crash if the data is stale and this certainly
does not fix that.
I will be okay with this patch (since it's in an unlikely code path) if
you rewrite the commit message and completely drop the mention of the
bugzilla above. This patch simply does not fix that BZ.