
On 6/19/23 17:02, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote:
On 6/19/23 14:31, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
It's weird that in 2023 there's no reliable and portable way to parse strings, but okay.
We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse).
What platforms are you seeing the failures on ? All platforms or just on Windows, or just some other specific one?
I've tried only Linux so far. Windows is out of question.
snip
On the Linux case get_C_locale is
static locale_t get_C_locale (void) { static gsize initialized = FALSE; static locale_t C_locale = NULL;
if (g_once_init_enter (&initialized)) { C_locale = newlocale (LC_ALL_MASK, "C", NULL); g_once_init_leave (&initialized, TRUE); }
return C_locale; }
and only way that could fail is if newlocale isn't threadsafe, and we have other code in libvirt calling newlocale at exact same time as the *first* call to get_C_locale.
Yeah, that's why I don't understand what's going on. Anyway, try running tests repeatedly (you can use oneliner from the commit message) and you'll run into the problem.
Yes, I've hit the problem with the error message
error: XML error: Invalid value '8388608' for element or attribute './memory[1]'
interestingly, I've also hit a few other error messages, some failures without error message (I suspect virsh simply crashed with no output), and most intriguing is that it 'virshtest' simply hangs when I run it directly.
<snip/>
So the cause of the hang is exceedingly obvious.
Very few glib APIs are safe to uses in between fork & exec, and we are (yet again) being burnt.
Yep, and switching to plain strtol() fixes this hang. But it does not solve ...
What I can't explain, however, is why we sometimes get an error message instead of a hang.
.. this. I bet it has something to do with fork() + exec() because when I set up logging and run those tests I can see which one is failing (with that "unable to parse NNN" message), but when I run it manually with the exact arguments I don't see any hiccups.
If I modify 'virGlobalInit' to call g_ascii_strtoll("0", NULL, 10); then neither the hang or error messages occur, but the hang *should* still occurr as the mutex locking race fundamentally stil exists. I think it just changes the timing enough to avoid it in our case.
My only thought with the error messages is that somehow the 'locale_t' is getting its initialization missed somehow.
I've tried to debug the problem over weekend but was really unsuccessful. Firstly, I suspected that glib version of pthread_once() was broken. So I've rewritten get_C_locale() to use pthread_once() but that didn't help. Which tends to point into direction of glibc, supported by the fact that glibc impl of newlocale() does indeed have a global RW lock. But I haven't found anything obviously wrong there either.
If glib was actually using pthread_once() we wouldn't have a problem. Their g_once_init_enter/leave stuff doesn't use pthread_once() and their impl is not safe due to its mutex usage.
Anyway, I understand why your proposed change here is avoiding problems, even if I don't understand the full root cause.
Ultimately I think our virExec() impl is verging on broken by design. We have got too much functionality in it to be safe. Especially since we switched from gnulib to glib, we're not so concious of what higher level constructs we're accidentally relying on - eg no one would guess g_ascii_strtol would acquire locks.
Rather than change our virstring.c impl, I'm inclined to think we need to be more aggressive and eliminate as much use of glib as possible from the virExec() and virFork() functions. Limit them to pure POSIX only.
Basically g_new/g_free are the only bits of glib I would entirely trust in between fork/exec, without analyznig the code of other glib APIs.
Specifically for this virStrToLong_i problem, I would suggest that we just directly call strtol from virCommandMassCloseGetFDsLinux(), and leave the main virStrToLong_i impl unchanged.
I've done this locally and it helped with deadlocks, but not with the number parsing problem.
Second, we should modify the FreeBSD impl of virCommandMassClose so that it works on Linux, when close_range() is available. That would avoid calling the problem code in the first place for modern linux.
Yeah, I remember I wanted to do this when the syscall was introduced to the Linux kernel, but for some reason didn't. BTW: glibc has closefrom() which is just a wrapper over close_range(N, ~0U, 0); so we might get the FreeBSD implementation for nothing.
Third, we should move virExec and virFork and related helpers into a standalone file, so that it is very clear which bits of code are running in between fork+exec and thus need careful design to avoid anything except async safe POSIX.
I've tried to do this, but it's bigger task than I thought. Plenty of dependencies across the file. Michal