[PATCH 0/2] virstring: Fix string parsing. Again.

*** BLURB HERE *** Michal Prívozník (2): virstring: Move locale wrappers at the beginning virstring: Prefer strtoll_l() and strtoull_l() whenever possible meson.build | 1 + src/util/virstring.c | 239 ++++++++++++++++++++++++++++--------------- 2 files changed, 158 insertions(+), 82 deletions(-) -- 2.39.3

We will soon need the virLocaleRaw variable. Move it and functions around it at the beginning of the file, so that later code doesn't need to introduce forward declarations. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virstring.c | 157 ++++++++++++++++++++++--------------------- 1 file changed, 79 insertions(+), 78 deletions(-) diff --git a/src/util/virstring.c b/src/util/virstring.c index 6b728ff047..1a13570d30 100644 --- a/src/util/virstring.c +++ b/src/util/virstring.c @@ -35,6 +35,85 @@ VIR_LOG_INIT("util.string"); +/* In case thread-safe locales are available */ +#if WITH_NEWLOCALE + +typedef locale_t virLocale; +static virLocale virLocaleRaw; + +static int +virLocaleOnceInit(void) +{ + virLocaleRaw = newlocale(LC_ALL_MASK, "C", (locale_t)0); + if (!virLocaleRaw) + return -1; + return 0; +} + +VIR_ONCE_GLOBAL_INIT(virLocale); + +/** + * virLocaleSetRaw: + * + * @oldlocale: set to old locale pointer + * + * Sets the locale to 'C' to allow operating on non-localized objects. + * Returns 0 on success -1 on error. + */ +static int +virLocaleSetRaw(virLocale *oldlocale) +{ + if (virLocaleInitialize() < 0) + return -1; + *oldlocale = uselocale(virLocaleRaw); + return 0; +} + +static void +virLocaleRevert(virLocale *oldlocale) +{ + uselocale(*oldlocale); +} + +static void +virLocaleFixupRadix(char **strp G_GNUC_UNUSED) +{ +} + +#else /* !WITH_NEWLOCALE */ + +typedef int virLocale; + +static int +virLocaleSetRaw(virLocale *oldlocale G_GNUC_UNUSED) +{ + return 0; +} + +static void +virLocaleRevert(virLocale *oldlocale G_GNUC_UNUSED) +{ +} + +static void +virLocaleFixupRadix(char **strp) +{ + char *radix, *tmp; + struct lconv *lc; + + lc = localeconv(); + radix = lc->decimal_point; + tmp = strstr(*strp, radix); + if (tmp) { + *tmp = '.'; + if (strlen(radix) > 1) + memmove(tmp + 1, tmp + strlen(radix), strlen(*strp) - (tmp - *strp)); + } +} + +#endif /* !WITH_NEWLOCALE */ + + /* Like strtol with C locale, but produce an "int" result, and check more carefully. Return 0 upon success; return -1 to indicate failure. When END_PTR is NULL, the byte after the final valid digit must be NUL. @@ -233,84 +312,6 @@ virStrToLong_ullp(char const *s, char **end_ptr, int base, return 0; } -/* In case thread-safe locales are available */ -#if WITH_NEWLOCALE - -typedef locale_t virLocale; -static virLocale virLocaleRaw; - -static int -virLocaleOnceInit(void) -{ - virLocaleRaw = newlocale(LC_ALL_MASK, "C", (locale_t)0); - if (!virLocaleRaw) - return -1; - return 0; -} - -VIR_ONCE_GLOBAL_INIT(virLocale); - -/** - * virLocaleSetRaw: - * - * @oldlocale: set to old locale pointer - * - * Sets the locale to 'C' to allow operating on non-localized objects. - * Returns 0 on success -1 on error. - */ -static int -virLocaleSetRaw(virLocale *oldlocale) -{ - if (virLocaleInitialize() < 0) - return -1; - *oldlocale = uselocale(virLocaleRaw); - return 0; -} - -static void -virLocaleRevert(virLocale *oldlocale) -{ - uselocale(*oldlocale); -} - -static void -virLocaleFixupRadix(char **strp G_GNUC_UNUSED) -{ -} - -#else /* !WITH_NEWLOCALE */ - -typedef int virLocale; - -static int -virLocaleSetRaw(virLocale *oldlocale G_GNUC_UNUSED) -{ - return 0; -} - -static void -virLocaleRevert(virLocale *oldlocale G_GNUC_UNUSED) -{ -} - -static void -virLocaleFixupRadix(char **strp) -{ - char *radix, *tmp; - struct lconv *lc; - - lc = localeconv(); - radix = lc->decimal_point; - tmp = strstr(*strp, radix); - if (tmp) { - *tmp = '.'; - if (strlen(radix) > 1) - memmove(tmp + 1, tmp + strlen(radix), strlen(*strp) - (tmp - *strp)); - } -} - -#endif /* !WITH_NEWLOCALE */ - /** * virStrToDouble -- 2.39.3

It's weird that in 2023 there's no reliable and portable way to parse strings, but okay. We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse). Anyway, it's not 100% reproducible and usually occurs when my system is under heavy load, but there's a way to reproduce it: $ meson test -C _build --repeat 2000 virshtest virsh-vcpupin The nature of the problem suggests there's a race condition somewhere. I suspected glib but I don't know their code well to prove it. Anyway, using strtoll_l() directly and taking glib out of the picture solves the issue. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- meson.build | 1 + src/util/virstring.c | 82 +++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 79 insertions(+), 4 deletions(-) diff --git a/meson.build b/meson.build index aa391e7178..cd713795f5 100644 --- a/meson.build +++ b/meson.build @@ -596,6 +596,7 @@ functions = [ 'sched_setscheduler', 'setgroups', 'setrlimit', + 'strtoll_l', 'symlink', 'sysctlbyname', ] diff --git a/src/util/virstring.c b/src/util/virstring.c index 1a13570d30..076e885624 100644 --- a/src/util/virstring.c +++ b/src/util/virstring.c @@ -127,8 +127,17 @@ virStrToLong_i(char const *s, char **end_ptr, int base, int *result) char *p; int err; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoll_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoll(s, &p, base); +#endif err = (errno || (!end_ptr && *p) || p == s || (int) val != val); if (end_ptr) *end_ptr = p; @@ -148,13 +157,23 @@ virStrToLong_ui(char const *s, char **end_ptr, int base, unsigned int *result) char *p; bool err = false; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif /* This one's tricky. We _want_ to allow "-1" as shorthand for * UINT_MAX regardless of whether long is 32-bit or 64-bit. But - * g_ascii_strtoull treats "-1" as ULLONG_MAX, and going from ullong back - * to uint differs depending on the size of uint. */ + * strtoull_l/g_ascii_strtoull treats "-1" as ULLONG_MAX, and + * going from ullong back to uint differs depending on the size of + * uint. */ if (memchr(s, '-', p - s)) { if (-val > UINT_MAX) err = true; @@ -180,8 +199,17 @@ virStrToLong_uip(char const *s, char **end_ptr, int base, unsigned int *result) char *p; bool err = false; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (memchr(s, '-', p - s) || errno || (!end_ptr && *p) || p == s || (unsigned int) val != val); if (end_ptr) @@ -204,13 +232,23 @@ virStrToLong_ul(char const *s, char **end_ptr, int base, unsigned long *result) char *p; bool err = false; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif /* This one's tricky. We _want_ to allow "-1" as shorthand for * ULONG_MAX regardless of whether long is 32-bit or 64-bit. But - * g_ascii_strtoull treats "-1" as ULLONG_MAX, and going from ullong back - * to ulong differs depending on the size of ulong. */ + * strtoull_l/g_ascii_strtoull treats "-1" as ULLONG_MAX, and + * going from ullong back to ulong differs depending on the size + * of ulong. */ if (memchr(s, '-', p - s)) { if (-val > ULONG_MAX) err = true; @@ -237,8 +275,17 @@ virStrToLong_ulp(char const *s, char **end_ptr, int base, char *p; int err; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (memchr(s, '-', p - s) || errno || (!end_ptr && *p) || p == s || (unsigned long) val != val); if (end_ptr) @@ -257,8 +304,17 @@ virStrToLong_ll(char const *s, char **end_ptr, int base, long long *result) char *p; int err; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoll_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoll(s, &p, base); +#endif err = (errno || (!end_ptr && *p) || p == s); if (end_ptr) *end_ptr = p; @@ -279,8 +335,17 @@ virStrToLong_ull(char const *s, char **end_ptr, int base, char *p; int err; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (errno || (!end_ptr && *p) || p == s); if (end_ptr) *end_ptr = p; @@ -300,8 +365,17 @@ virStrToLong_ullp(char const *s, char **end_ptr, int base, char *p; int err; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (memchr(s, '-', p - s) || errno || (!end_ptr && *p) || p == s); if (end_ptr) -- 2.39.3

On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
It's weird that in 2023 there's no reliable and portable way to parse strings, but okay.
We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse).
What platforms are you seeing the failures on ? All platforms or just on Windows, or just some other specific one?
The nature of the problem suggests there's a race condition somewhere. I suspected glib but I don't know their code well to prove it.
The code in question is: guint64 g_ascii_strtoull (const gchar *nptr, gchar **endptr, guint base) { #if defined(USE_XLOCALE) && defined(HAVE_STRTOULL_L) return strtoull_l (nptr, endptr, base, get_C_locale ()); #else gboolean negative; guint64 result; result = g_parse_long_long (nptr, (const gchar **) endptr, base, &negative); /* Return the result of the appropriate sign. */ return negative ? -result : result; #endif } gint64 g_ascii_strtoll (const gchar *nptr, gchar **endptr, guint base) { #if defined(USE_XLOCALE) && defined(HAVE_STRTOLL_L) return strtoll_l (nptr, endptr, base, get_C_locale ()); #else gboolean negative; guint64 result; result = g_parse_long_long (nptr, (const gchar **) endptr, base, &negative); if (negative && result > (guint64) G_MININT64) { errno = ERANGE; return G_MININT64; } else if (!negative && result > (guint64) G_MAXINT64) { errno = ERANGE; return G_MAXINT64; } else if (negative) return - (gint64) result; else return (gint64) result; #endif } On Linux, we should take the first branch on the #ifdef, which is calling strtoll_l just as your patch is switching us to. I guess Windows would use the g_parse_long_long fallback. I don't see anything in that code which could be non-thread safe. It is just a copy of strtol cdoe from glibc. On the Linux case get_C_locale is static locale_t get_C_locale (void) { static gsize initialized = FALSE; static locale_t C_locale = NULL; if (g_once_init_enter (&initialized)) { C_locale = newlocale (LC_ALL_MASK, "C", NULL); g_once_init_leave (&initialized, TRUE); } return C_locale; } and only way that could fail is if newlocale isn't threadsafe, and we have other code in libvirt calling newlocale at exact same time as the *first* call to get_C_locale.
Anyway, using strtoll_l() directly and taking glib out of the picture solves the issue.
I'm pretty surprised if it fixes anything ! I'm a little concerned we might have just masked a bug that's still hiding elsewhere.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- meson.build | 1 + src/util/virstring.c | 82 +++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 79 insertions(+), 4 deletions(-)
diff --git a/meson.build b/meson.build index aa391e7178..cd713795f5 100644 --- a/meson.build +++ b/meson.build @@ -596,6 +596,7 @@ functions = [ 'sched_setscheduler', 'setgroups', 'setrlimit', + 'strtoll_l', 'symlink', 'sysctlbyname', ] diff --git a/src/util/virstring.c b/src/util/virstring.c index 1a13570d30..076e885624 100644 --- a/src/util/virstring.c +++ b/src/util/virstring.c @@ -127,8 +127,17 @@ virStrToLong_i(char const *s, char **end_ptr, int base, int *result) char *p; int err;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoll_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoll(s, &p, base); +#endif err = (errno || (!end_ptr && *p) || p == s || (int) val != val); if (end_ptr) *end_ptr = p; @@ -148,13 +157,23 @@ virStrToLong_ui(char const *s, char **end_ptr, int base, unsigned int *result) char *p; bool err = false;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif
/* This one's tricky. We _want_ to allow "-1" as shorthand for * UINT_MAX regardless of whether long is 32-bit or 64-bit. But - * g_ascii_strtoull treats "-1" as ULLONG_MAX, and going from ullong back - * to uint differs depending on the size of uint. */ + * strtoull_l/g_ascii_strtoull treats "-1" as ULLONG_MAX, and + * going from ullong back to uint differs depending on the size of + * uint. */ if (memchr(s, '-', p - s)) { if (-val > UINT_MAX) err = true; @@ -180,8 +199,17 @@ virStrToLong_uip(char const *s, char **end_ptr, int base, unsigned int *result) char *p; bool err = false;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (memchr(s, '-', p - s) || errno || (!end_ptr && *p) || p == s || (unsigned int) val != val); if (end_ptr) @@ -204,13 +232,23 @@ virStrToLong_ul(char const *s, char **end_ptr, int base, unsigned long *result) char *p; bool err = false;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif
/* This one's tricky. We _want_ to allow "-1" as shorthand for * ULONG_MAX regardless of whether long is 32-bit or 64-bit. But - * g_ascii_strtoull treats "-1" as ULLONG_MAX, and going from ullong back - * to ulong differs depending on the size of ulong. */ + * strtoull_l/g_ascii_strtoull treats "-1" as ULLONG_MAX, and + * going from ullong back to ulong differs depending on the size + * of ulong. */ if (memchr(s, '-', p - s)) { if (-val > ULONG_MAX) err = true; @@ -237,8 +275,17 @@ virStrToLong_ulp(char const *s, char **end_ptr, int base, char *p; int err;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (memchr(s, '-', p - s) || errno || (!end_ptr && *p) || p == s || (unsigned long) val != val); if (end_ptr) @@ -257,8 +304,17 @@ virStrToLong_ll(char const *s, char **end_ptr, int base, long long *result) char *p; int err;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoll_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoll(s, &p, base); +#endif err = (errno || (!end_ptr && *p) || p == s); if (end_ptr) *end_ptr = p; @@ -279,8 +335,17 @@ virStrToLong_ull(char const *s, char **end_ptr, int base, char *p; int err;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (errno || (!end_ptr && *p) || p == s); if (end_ptr) *end_ptr = p; @@ -300,8 +365,17 @@ virStrToLong_ullp(char const *s, char **end_ptr, int base, char *p; int err;
+#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + if (virLocaleInitialize() < 0) + return -1; +#endif + errno = 0; +#if defined(WITH_STRTOLL_L) && defined(WITH_NEWLOCALE) + val = strtoull_l(s, &p, base, virLocaleRaw); +#else val = g_ascii_strtoull(s, &p, base); +#endif err = (memchr(s, '-', p - s) || errno || (!end_ptr && *p) || p == s); if (end_ptr) -- 2.39.3
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 6/19/23 14:31, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
It's weird that in 2023 there's no reliable and portable way to parse strings, but okay.
We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse).
What platforms are you seeing the failures on ? All platforms or just on Windows, or just some other specific one?
I've tried only Linux so far. Windows is out of question.
The nature of the problem suggests there's a race condition somewhere. I suspected glib but I don't know their code well to prove it.
The code in question is:
guint64 g_ascii_strtoull (const gchar *nptr, gchar **endptr, guint base) { #if defined(USE_XLOCALE) && defined(HAVE_STRTOULL_L) return strtoull_l (nptr, endptr, base, get_C_locale ()); #else gboolean negative; guint64 result;
result = g_parse_long_long (nptr, (const gchar **) endptr, base, &negative);
/* Return the result of the appropriate sign. */ return negative ? -result : result; #endif }
gint64 g_ascii_strtoll (const gchar *nptr, gchar **endptr, guint base) { #if defined(USE_XLOCALE) && defined(HAVE_STRTOLL_L) return strtoll_l (nptr, endptr, base, get_C_locale ()); #else gboolean negative; guint64 result;
result = g_parse_long_long (nptr, (const gchar **) endptr, base, &negative);
if (negative && result > (guint64) G_MININT64) { errno = ERANGE; return G_MININT64; } else if (!negative && result > (guint64) G_MAXINT64) { errno = ERANGE; return G_MAXINT64; } else if (negative) return - (gint64) result; else return (gint64) result; #endif }
On Linux, we should take the first branch on the #ifdef, which is calling strtoll_l just as your patch is switching us to.
I guess Windows would use the g_parse_long_long fallback. I don't see anything in that code which could be non-thread safe. It is just a copy of strtol cdoe from glibc.
On the Linux case get_C_locale is
static locale_t get_C_locale (void) { static gsize initialized = FALSE; static locale_t C_locale = NULL;
if (g_once_init_enter (&initialized)) { C_locale = newlocale (LC_ALL_MASK, "C", NULL); g_once_init_leave (&initialized, TRUE); }
return C_locale; }
and only way that could fail is if newlocale isn't threadsafe, and we have other code in libvirt calling newlocale at exact same time as the *first* call to get_C_locale.
Yeah, that's why I don't understand what's going on. Anyway, try running tests repeatedly (you can use oneliner from the commit message) and you'll run into the problem. I've tried to debug the problem over weekend but was really unsuccessful. Firstly, I suspected that glib version of pthread_once() was broken. So I've rewritten get_C_locale() to use pthread_once() but that didn't help. Which tends to point into direction of glibc, supported by the fact that glibc impl of newlocale() does indeed have a global RW lock. But I haven't found anything obviously wrong there either.
Anyway, using strtoll_l() directly and taking glib out of the picture solves the issue.
I'm pretty surprised if it fixes anything ! I'm a little concerned we might have just masked a bug that's still hiding elsewhere.
I understand that. And if you have any idea how to get to the root of the problem, I'm more than happy to try it. Michal

On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote:
On 6/19/23 14:31, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
It's weird that in 2023 there's no reliable and portable way to parse strings, but okay.
We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse).
What platforms are you seeing the failures on ? All platforms or just on Windows, or just some other specific one?
I've tried only Linux so far. Windows is out of question.
snip
On the Linux case get_C_locale is
static locale_t get_C_locale (void) { static gsize initialized = FALSE; static locale_t C_locale = NULL;
if (g_once_init_enter (&initialized)) { C_locale = newlocale (LC_ALL_MASK, "C", NULL); g_once_init_leave (&initialized, TRUE); }
return C_locale; }
and only way that could fail is if newlocale isn't threadsafe, and we have other code in libvirt calling newlocale at exact same time as the *first* call to get_C_locale.
Yeah, that's why I don't understand what's going on. Anyway, try running tests repeatedly (you can use oneliner from the commit message) and you'll run into the problem.
Yes, I've hit the problem with the error message error: XML error: Invalid value '8388608' for element or attribute './memory[1]' interestingly, I've also hit a few other error messages, some failures without error message (I suspect virsh simply crashed with no output), and most intriguing is that it 'virshtest' simply hangs when I run it directly. The hang shows this stack trace in primary virsh process Thread 2 (Thread 0x7f6da5c036c0 (LWP 3342182) "vshEventLoop"): #0 0x00007f6da746a35d in poll () at /lib64/libc.so.6 #1 0x00007f6da75fd3a9 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0 #2 0x00007f6da759ca23 in g_main_context_iteration () at /lib64/libglib-2.0.so.0 #3 0x00007f6da78fe8e0 in virEventGLibRunOnce () at ../src/util/vireventglib.c:515 #4 0x00007f6da79619c5 in virEventRunDefaultImpl () at ../src/util/virevent.c:362 #5 0x00005563004442d5 in vshEventLoop (opaque=0x7ffedebdd4f0) at ../tools/vsh.c:2043 #6 0x00007f6da79529c4 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256 #7 0x00007f6da73f1907 in start_thread () at /lib64/libc.so.6 #8 0x00007f6da7477870 in clone3 () at /lib64/libc.so.6 Thread 1 (Thread 0x7f6da5c07140 (LWP 3342175) "virsh"): #0 0x00007f6da746a35d in poll () at /lib64/libc.so.6 #1 0x00007f6da793dde8 in poll (__timeout=-1, __nfds=1, __fds=0x7ffedebdd3d8) at /usr/include/bits/poll2.h:39 #2 virPolkitAgentCreate () at ../src/util/virpolkit.c:209 #3 0x00005563003f8e5d in virshConnect (ctl=ctl@entry=0x7ffedebdd4f0, uri=0x55630113c190 "test:///default", readonly=false) at ../tools/virsh.c:127 #4 0x00005563003f9093 in virshReconnect (ctl=ctl@entry=0x7ffedebdd4f0, name=name@entry=0x0, readonly=<optimized out>, readonly@entry=false, force=force@entry=false) at ../tools/virsh.c:216 #5 0x00005563003f86cb in virshInit (ctl=0x7ffedebdd4f0) at ../tools/virsh.c:371 #6 main (argc=<optimized out>, argv=<optimized out>) at ../tools/virsh.c:890 and in the spawned child we have: #0 0x00007f6da746fb5d in syscall () at /lib64/libc.so.6 #1 0x00007f6da75f4554 in g_mutex_lock_slowpath () at /lib64/libglib-2.0.so.0 #2 0x00007f6da75c9b51 in g_once_init_enter () at /lib64/libglib-2.0.so.0 #3 0x00007f6da75c1cc0 in get_C_locale () at /lib64/libglib-2.0.so.0 #4 0x00007f6da75c1fff in g_ascii_strtoll () at /lib64/libglib-2.0.so.0 #5 0x00007f6da794b564 in virStrToLong_i (s=0x556301141033 "0", end_ptr=end_ptr@entry=0x0, base=base@entry=10, result=result@entry=0x7ffedebdd34c) at ../src/util/virstring.c:52 #6 0x00007f6da78f586b in virCommandMassCloseGetFDsLinux (cmd=0x55630113f340, fds=0x55630113aa50) at ../src/util/vircommand.c:502 #7 virCommandMassClose (childerr=<optimized out>, childout=<optimized out>, childin=0, cmd=0x55630113f340) at ../src/util/vircommand.c:609 #8 virExec (cmd=0x55630113f340) at ../src/util/vircommand.c:741 #9 virCommandRunAsync (cmd=0x55630113f340, pid=pid@entry=0x0) at ../src/util/vircommand.c:2633 #10 0x00007f6da793ddba in virPolkitAgentCreate () at ../src/util/virpolkit.c:203 #11 0x00005563003f8e5d in virshConnect (ctl=ctl@entry=0x7ffedebdd4f0, uri=0x55630113c190 "test:///default", readonly=false) at ../tools/virsh.c:127 #12 0x00005563003f9093 in virshReconnect (ctl=ctl@entry=0x7ffedebdd4f0, name=name@entry=0x0, readonly=<optimized out>, readonly@entry=false, force=force@entry=false) at ../tools/virsh.c:216 #13 0x00005563003f86cb in virshInit (ctl=0x7ffedebdd4f0) at ../tools/virsh.c:371 #14 main (argc=<optimized out>, argv=<optimized out>) at ../tools/virsh.c:890 This problem is sadly all too familiar - the fork()d child is waiting on a lock that was held by another thread in the parent process at the time fork() was called. So what we have going on is: * virsh spawns a thread to run the main event loop. This runs the glib main event loop, and some code in there uses g_once_init_enter & thus its mutex * virsh then spawns polkit Inbetween fork and exec we call g_ascii_strtoll which uses g_once_init_enter So the cause of the hang is exceedingly obvious. Very few glib APIs are safe to uses in between fork & exec, and we are (yet again) being burnt. What I can't explain, however, is why we sometimes get an error message instead of a hang. If I modify 'virGlobalInit' to call g_ascii_strtoll("0", NULL, 10); then neither the hang or error messages occur, but the hang *should* still occurr as the mutex locking race fundamentally stil exists. I think it just changes the timing enough to avoid it in our case. My only thought with the error messages is that somehow the 'locale_t' is getting its initialization missed somehow.
I've tried to debug the problem over weekend but was really unsuccessful. Firstly, I suspected that glib version of pthread_once() was broken. So I've rewritten get_C_locale() to use pthread_once() but that didn't help. Which tends to point into direction of glibc, supported by the fact that glibc impl of newlocale() does indeed have a global RW lock. But I haven't found anything obviously wrong there either.
If glib was actually using pthread_once() we wouldn't have a problem. Their g_once_init_enter/leave stuff doesn't use pthread_once() and their impl is not safe due to its mutex usage. Anyway, I understand why your proposed change here is avoiding problems, even if I don't understand the full root cause. Ultimately I think our virExec() impl is verging on broken by design. We have got too much functionality in it to be safe. Especially since we switched from gnulib to glib, we're not so concious of what higher level constructs we're accidentally relying on - eg no one would guess g_ascii_strtol would acquire locks. Rather than change our virstring.c impl, I'm inclined to think we need to be more aggressive and eliminate as much use of glib as possible from the virExec() and virFork() functions. Limit them to pure POSIX only. Basically g_new/g_free are the only bits of glib I would entirely trust in between fork/exec, without analyznig the code of other glib APIs. Specifically for this virStrToLong_i problem, I would suggest that we just directly call strtol from virCommandMassCloseGetFDsLinux(), and leave the main virStrToLong_i impl unchanged. Second, we should modify the FreeBSD impl of virCommandMassClose so that it works on Linux, when close_range() is available. That would avoid calling the problem code in the first place for modern linux. Third, we should move virExec and virFork and related helpers into a standalone file, so that it is very clear which bits of code are running in between fork+exec and thus need careful design to avoid anything except async safe POSIX. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 6/19/23 17:02, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote:
On 6/19/23 14:31, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
It's weird that in 2023 there's no reliable and portable way to parse strings, but okay.
We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse).
What platforms are you seeing the failures on ? All platforms or just on Windows, or just some other specific one?
I've tried only Linux so far. Windows is out of question.
snip
On the Linux case get_C_locale is
static locale_t get_C_locale (void) { static gsize initialized = FALSE; static locale_t C_locale = NULL;
if (g_once_init_enter (&initialized)) { C_locale = newlocale (LC_ALL_MASK, "C", NULL); g_once_init_leave (&initialized, TRUE); }
return C_locale; }
and only way that could fail is if newlocale isn't threadsafe, and we have other code in libvirt calling newlocale at exact same time as the *first* call to get_C_locale.
Yeah, that's why I don't understand what's going on. Anyway, try running tests repeatedly (you can use oneliner from the commit message) and you'll run into the problem.
Yes, I've hit the problem with the error message
error: XML error: Invalid value '8388608' for element or attribute './memory[1]'
interestingly, I've also hit a few other error messages, some failures without error message (I suspect virsh simply crashed with no output), and most intriguing is that it 'virshtest' simply hangs when I run it directly.
<snip/>
So the cause of the hang is exceedingly obvious.
Very few glib APIs are safe to uses in between fork & exec, and we are (yet again) being burnt.
Yep, and switching to plain strtol() fixes this hang. But it does not solve ...
What I can't explain, however, is why we sometimes get an error message instead of a hang.
.. this. I bet it has something to do with fork() + exec() because when I set up logging and run those tests I can see which one is failing (with that "unable to parse NNN" message), but when I run it manually with the exact arguments I don't see any hiccups.
If I modify 'virGlobalInit' to call g_ascii_strtoll("0", NULL, 10); then neither the hang or error messages occur, but the hang *should* still occurr as the mutex locking race fundamentally stil exists. I think it just changes the timing enough to avoid it in our case.
My only thought with the error messages is that somehow the 'locale_t' is getting its initialization missed somehow.
I've tried to debug the problem over weekend but was really unsuccessful. Firstly, I suspected that glib version of pthread_once() was broken. So I've rewritten get_C_locale() to use pthread_once() but that didn't help. Which tends to point into direction of glibc, supported by the fact that glibc impl of newlocale() does indeed have a global RW lock. But I haven't found anything obviously wrong there either.
If glib was actually using pthread_once() we wouldn't have a problem. Their g_once_init_enter/leave stuff doesn't use pthread_once() and their impl is not safe due to its mutex usage.
Anyway, I understand why your proposed change here is avoiding problems, even if I don't understand the full root cause.
Ultimately I think our virExec() impl is verging on broken by design. We have got too much functionality in it to be safe. Especially since we switched from gnulib to glib, we're not so concious of what higher level constructs we're accidentally relying on - eg no one would guess g_ascii_strtol would acquire locks.
Rather than change our virstring.c impl, I'm inclined to think we need to be more aggressive and eliminate as much use of glib as possible from the virExec() and virFork() functions. Limit them to pure POSIX only.
Basically g_new/g_free are the only bits of glib I would entirely trust in between fork/exec, without analyznig the code of other glib APIs.
Specifically for this virStrToLong_i problem, I would suggest that we just directly call strtol from virCommandMassCloseGetFDsLinux(), and leave the main virStrToLong_i impl unchanged.
I've done this locally and it helped with deadlocks, but not with the number parsing problem.
Second, we should modify the FreeBSD impl of virCommandMassClose so that it works on Linux, when close_range() is available. That would avoid calling the problem code in the first place for modern linux.
Yeah, I remember I wanted to do this when the syscall was introduced to the Linux kernel, but for some reason didn't. BTW: glibc has closefrom() which is just a wrapper over close_range(N, ~0U, 0); so we might get the FreeBSD implementation for nothing.
Third, we should move virExec and virFork and related helpers into a standalone file, so that it is very clear which bits of code are running in between fork+exec and thus need careful design to avoid anything except async safe POSIX.
I've tried to do this, but it's bigger task than I thought. Plenty of dependencies across the file. Michal

On Tue, Jun 20, 2023 at 01:25:23PM +0200, Michal Prívozník wrote:
On 6/19/23 17:02, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:45:57PM +0200, Michal Prívozník wrote:
On 6/19/23 14:31, Daniel P. Berrangé wrote:
On Mon, Jun 19, 2023 at 02:03:48PM +0200, Michal Privoznik wrote:
It's weird that in 2023 there's no reliable and portable way to parse strings, but okay.
We started with strtol(). Except, it doesn't work the same on Linux and Windows. On Windows it behaves a bit different when it comes to parsing strings with base = 0. So we've switched to g_ascii_strtoll() which promised to solve this. And it did. Except, it introduced another problem. I don't really understand why, but I've seen random test failures and all of them claimed inability to parse a number (specifically <memory/> from the hard coded domain XML from test driver, which IMO is just a coincidence because it's the first number we parse).
What platforms are you seeing the failures on ? All platforms or just on Windows, or just some other specific one?
I've tried only Linux so far. Windows is out of question.
snip
On the Linux case get_C_locale is
static locale_t get_C_locale (void) { static gsize initialized = FALSE; static locale_t C_locale = NULL;
if (g_once_init_enter (&initialized)) { C_locale = newlocale (LC_ALL_MASK, "C", NULL); g_once_init_leave (&initialized, TRUE); }
return C_locale; }
and only way that could fail is if newlocale isn't threadsafe, and we have other code in libvirt calling newlocale at exact same time as the *first* call to get_C_locale.
Yeah, that's why I don't understand what's going on. Anyway, try running tests repeatedly (you can use oneliner from the commit message) and you'll run into the problem.
Yes, I've hit the problem with the error message
error: XML error: Invalid value '8388608' for element or attribute './memory[1]'
interestingly, I've also hit a few other error messages, some failures without error message (I suspect virsh simply crashed with no output), and most intriguing is that it 'virshtest' simply hangs when I run it directly.
<snip/>
So the cause of the hang is exceedingly obvious.
Very few glib APIs are safe to uses in between fork & exec, and we are (yet again) being burnt.
Yep, and switching to plain strtol() fixes this hang. But it does not solve ...
What I can't explain, however, is why we sometimes get an error message instead of a hang.
.. this. I bet it has something to do with fork() + exec() because when I set up logging and run those tests I can see which one is failing (with that "unable to parse NNN" message), but when I run it manually with the exact arguments I don't see any hiccups.
Ok, I've found the root cause. g_mutex_lock() sometimes splatters errno on contended locks: https://gitlab.gnome.org/GNOME/glib/-/issues/3034 most of the time this won't matter, as the APIs have a return value that identifies failure, and only then does the caller look at errno. The strtoll/g_ascii_strtoll functions are special though because we have to directly inspect errno to identify failure. The splattering of errno breaks this :-( IOW, as of today, the g_ascii_strtoll functions are all broken and unusable on Linux when GLib is built to natively use futex() instead of pthread_mutex_t. We can avoid this, however, by making virInitialize invoke g_ascii_strtoll() on a dummy string. This triggers the glib one-time initializer, and thereafter we shouldn't hit the locking bug because g_once_init_enter will go via a fast-path that avoids the futex syscall.
Third, we should move virExec and virFork and related helpers into a standalone file, so that it is very clear which bits of code are running in between fork+exec and thus need careful design to avoid anything except async safe POSIX.
I've tried to do this, but it's bigger task than I thought. Plenty of dependencies across the file.
No problem, not an urgent thing. Just something I think we ought to do at somepoint With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (3)
-
Daniel P. Berrangé
-
Michal Privoznik
-
Michal Prívozník