On 6/19/23 17:02, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote:
> On 6/19/23 14:31, Daniel P. Berrangé wrote:
>> On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
>>> It's weird that in 2023 there's no reliable and portable way to
>>> parse strings, but okay.
>>>
>>> We started with strtol(). Except, it doesn't work the same on
>>> Linux and Windows. On Windows it behaves a bit different when it
>>> comes to parsing strings with base = 0. So we've switched to
>>> g_ascii_strtoll() which promised to solve this. And it did.
>>> Except, it introduced another problem. I don't really understand
>>> why, but I've seen random test failures and all of them claimed
>>> inability to parse a number (specifically <memory/> from the
>>> hard coded domain XML from test driver, which IMO is just a
>>> coincidence because it's the first number we parse).
>>
>> What platforms are you seeing the failures on ? All platforms or just
>> on Windows, or just some other specific one?
>
> I've tried only Linux so far. Windows is out of question.
snip
>> On the Linux case get_C_locale is
>>
>> static locale_t
>> get_C_locale (void)
>> {
>> static gsize initialized = FALSE;
>> static locale_t C_locale = NULL;
>>
>> if (g_once_init_enter (&initialized))
>> {
>> C_locale = newlocale (LC_ALL_MASK, "C", NULL);
>> g_once_init_leave (&initialized, TRUE);
>> }
>>
>> return C_locale;
>> }
>>
>> and only way that could fail is if newlocale isn't threadsafe, and
>> we have other code in libvirt calling newlocale at exact same time
>> as the *first* call to get_C_locale.
>
> Yeah, that's why I don't understand what's going on. Anyway, try running
> tests repeatedly (you can use oneliner from the commit message) and
> you'll run into the problem.
Yes, I've hit the problem with the error message
error: XML error: Invalid value '8388608' for element or attribute
'./memory[1]'
interestingly, I've also hit a few other error messages, some failures
without error message (I suspect virsh simply crashed with no output),
and most intriguing is that it 'virshtest' simply hangs when I run it
directly.
<snip/>
So the cause of the hang is exceedingly obvious.
Very few glib APIs are safe to uses in between fork & exec, and we are
(yet again) being burnt.
Yep, and switching to plain strtol() fixes this hang. But it does not
solve ...
What I can't explain, however, is why we sometimes get an error message
instead of a hang.
.. this. I bet it has something to do with fork() + exec() because when
I set up logging and run those tests I can see which one is failing
(with that "unable to parse NNN" message), but when I run it manually
with the exact arguments I don't see any hiccups.
If I modify 'virGlobalInit' to call g_ascii_strtoll("0", NULL, 10);
then
neither the hang or error messages occur, but the hang *should* still
occurr as the mutex locking race fundamentally stil exists. I think it
just changes the timing enough to avoid it in our case.
My only thought with the error messages is that somehow the 'locale_t'
is getting its initialization missed somehow.
> I've tried to debug the problem over weekend but was really
> unsuccessful. Firstly, I suspected that glib version of pthread_once()
> was broken. So I've rewritten get_C_locale() to use pthread_once() but
> that didn't help. Which tends to point into direction of glibc,
> supported by the fact that glibc impl of newlocale() does indeed have a
> global RW lock. But I haven't found anything obviously wrong there either.
If glib was actually using pthread_once() we wouldn't have a problem.
Their g_once_init_enter/leave stuff doesn't use pthread_once() and their
impl is not safe due to its mutex usage.
Anyway, I understand why your proposed change here is avoiding problems,
even if I don't understand the full root cause.
Ultimately I think our virExec() impl is verging on broken by design. We
have got too much functionality in it to be safe. Especially since we
switched from gnulib to glib, we're not so concious of what higher level
constructs we're accidentally relying on - eg no one would guess g_ascii_strtol
would acquire locks.
Rather than change our virstring.c impl, I'm inclined to think we need to
be more aggressive and eliminate as much use of glib as possible from the
virExec() and virFork() functions. Limit them to pure POSIX only.
Basically g_new/g_free are the only bits of glib I would entirely
trust in between fork/exec, without analyznig the code of other glib
APIs.
Specifically for this virStrToLong_i problem, I would suggest that we
just directly call strtol from virCommandMassCloseGetFDsLinux(), and
leave the main virStrToLong_i impl unchanged.
I've done this locally and it helped with deadlocks, but not with the
number parsing problem.
Second, we should modify the FreeBSD impl of virCommandMassClose so
that it works on Linux, when close_range() is available. That would
avoid calling the problem code in the first place for modern linux.
Yeah, I remember I wanted to do this when the syscall was introduced to
the Linux kernel, but for some reason didn't. BTW: glibc has closefrom()
which is just a wrapper over close_range(N, ~0U, 0); so we might get the
FreeBSD implementation for nothing.
Third, we should move virExec and virFork and related helpers into a
standalone file, so that it is very clear which bits of code are running
in between fork+exec and thus need careful design to avoid anything
except async safe POSIX.
I've tried to do this, but it's bigger task than I thought. Plenty of
dependencies across the file.
Michal