Re: [libvirt] [PATCH v2 1/3] conf: Properly truncate wide character names in virDomainObjGetShortName

29 Aug 2017

On Fri, Aug 25, 2017 at 06:03:15PM -0400, John Ferlan wrote:
...
On 08/25/2017 07:21 AM, Martin Kletzander wrote:
...
We always truncated the name at 20 bytes instead of characters.  In
case 20 bytes were in the middle of a multi-byte character, then the
string became invalid and various parts of the code would error
out (e.g. XML parsing of that string).  Let's instead properly
truncate it after 20 characters instead.
We cannot test this in our test suite because we would need to know
what locales are installed on the system where the tests are ran and
if there is supported one (most probably there will be, but we cannot
be 100% sure), we could initialize gettext in qemuxml2argvtest, but
there would still be a chance of getting two different (both valid,
though) results.
Yeah - I tried - it got ugly fast.... Although I suppose in a test
environment would could just "pick" a charset like en_US.UTF-8 and just
ensure that things work as expected with that. Perhaps even make the
shortened name tests go at the bottom with a FAILURE to compare argv
results before setting the locale and a success after a call on the same
test by name. Not something for this series because I'm not even sure it
would work properly.  Maybe something for the byte sized tasks (rather
literally too ;-)!)
The problem is that we can't pick any charset except "C", which is the
default.  There is no guarantee that you will have the charset installed
in the system.  For example I only have en_GB.utf8, cs_CZ.utf8 and the
defaults, C and POSIX.  We could, maybe, programmatically construct the
locale in the test itself, but that's way big of a hammer IMO.
...
...
In order to test this it is enough to start a machine with a name for
which trimming it after 20 bytes would create invalid sequence (e.g.
1234567890123456789č where č is any multi-byte character).  Then start
the domain and restart libvirtd.  The domain would disappear because
such illegal sequence will not go through the XML parser.  And that's
not a bug of the parser, it should not be in the XML in the first
place, but since we don't use any sophisticated formatter, just
mash some strings together, the formatting succeeds.
If the domain was started before these patches, then the domain is still
hidden since the shortened path names won't match... Such is life...
Such domain will never be loaded.  It's not because the paths won't
match.  We don't generate the paths for running domains, but we get them
from the state XML (that's needed so that we can change the way we
shorten characters and still be able to find domains with older
"shortening rules").  The problem with the domains that get lost is that
the XML consists of an invalid string literal and it is already saved on
the disk.  The parser (libxml2) will fail (rightfully) with an error
that such XML is invalid.  We can't do anything with that.  And,
honestly, I can't say I care that much about that as this really isn't
someone would notice in production.
...
...
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1448766
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
---
 src/conf/domain_conf.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 47eba4dbb315..dd73158f028b 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -27131,6 +27131,8 @@ virDomainDefHasMemballoon(const virDomainDef *def)
 }
+#define VIR_DOMAIN_SHORT_NAME_MAX 20
+
 /**
  * virDomainObjGetShortName:
  * @vm: Machine for which to get a name
@@ -27141,15 +27143,52 @@ virDomainDefHasMemballoon(const virDomainDef *def)
 char *
 virDomainObjGetShortName(const virDomainDef *def)
 {
-    const int dommaxlen = 20;
+    wchar_t wshortname[VIR_DOMAIN_SHORT_NAME_MAX + 1] = {0};
+    size_t len = 0;
+    char *shortname = NULL;
     char *ret = NULL;
-    ignore_value(virAsprintf(&ret, "%d-%.*s",
-                             def->id, dommaxlen, def->name));
+    /* No need to do the whole conversion thing when there are no multibyte
+     * characters.  The same applies for illegal sequences as they can occur
+     * with incompatible locales. */
+    len = mbstowcs(NULL, def->name, 0);
+    if ((len == (size_t) -1 && errno == EILSEQ) ||
+        len == strlen(def->name)) {
+        ignore_value(virAsprintf(&ret, "%d-%.*s", def->id,
+                                 VIR_DOMAIN_SHORT_NAME_MAX, def->name));
+        return ret;
consistently speaking
return NULL;
Not really, ret will be NULL here if and only if virAsprintf() failed.
This is not an error path, but rather a fallback path if either the
locale is incompatible or there are no wide characters (your
suggestion).  I hoped I explained that enough in the comment.
...
...
+    }
+
+    if (len == (size_t) -1 ||
+        mbstowcs(wshortname, def->name, VIR_DOMAIN_SHORT_NAME_MAX) == (size_t) -1) {
+        virReportError(VIR_ERR_INVALID_ARG, "%s",
+                       _("Cannot convert domain name to wide character string"));
You could go with something like :
virReportSystemError(errno,
     _("Cannot convert domain name to wide character string, len=%zd"),
     len);
That way you'd know if you failed because len == -1 or the second
mbstowcs call failed and you'd know why. Not sure ->name at this point
could be considered an invalid argument, but getting the system error
would certainly point you in the right direction.
Good point, I like that.
...
FWIW:
There may be an off by 1 error. With your this, 1234567890123456789č
would be shortened to 19 characters (it's at 21), thus if you have two
domains:
1234567890123456789č
and
1234567890123456789
They'd return the same answer
They won't.  That's the point of this whole function.  Such domains will
differ by their ID that's unique, the shortened name is ID-name.
...
Of course so would 123456789012345678901 and 1234567890123456789012, but
at least for those two the length is > 20.  Not sure if it matters or
not as 19 or 20 characters for uniqueness is a lot.
Still the bz had 7 multibyte characters that took up 21 char's - so it
would seem things would be more limited with that charset. If that last
character was the differentiator between all domains, then the short
name is too short using only 18 chars that appear as 6 multibyte chars
for the short name.
Not really.  I don't shorten the name to 20 bytes, but rather to 20
characters.  Keep in mind that mbstowcs()'s third parameter is the
number of wide characters that you want to convert, not the number of
bytes.
...
Likewise a name using fully wide chars, e.g. ÄèÎØÛÄèÎØÛÄèÎØÛÄèÎØÛÄèÎØÛ
would be shortened to ÄèÎØÛÄèÎØÛ.
At this point I'm ambivalent as to whether that's considered a problem.
There should not be any.  (after writing this sentence, the phrase
"famous last words" springs to mind for an _unknown_ reason :) )
...
...
+        return NULL;
+    }
+
+    len = wcstombs(NULL, wshortname, 0);
+    if (len == (size_t) -1)
+        return NULL;
Would returning NULL here elicit a useful error message?
According to the manual pages there should not be any error case for
this, but I'll add the error reporting, just in case.
...
...
+    if (len > VIR_DOMAIN_SHORT_NAME_MAX)
+        len = VIR_DOMAIN_SHORT_NAME_MAX;
Ah, the remnants of an unfinished patch...  This should not be here at all.
...
Given a name fully using a wide character set if this were altered to:
if (len > VIR_DOMAIN_SHORT_NAME_MAX*2)
       len = VIR_DOMAIN_SHORT_NAME_MAX*2;
Then ÄèÎØÛÄèÎØÛÄèÎØÛÄèÎØÛÄèÎØÛ shortens to ÄèÎØÛÄèÎØÛÄèÎØÛÄèÎØÛ, but
1234567890123456789č! doesn't get shortened to 1234567890123456789č. The
multibyte charsets would need what * 3?  It's somewhat ludicrous - but
why I suggested originally to compare strlen() and mbstowcs() - if
they're the same, less then 20, or having an error on the mbstowcs(),
then just stick with the old logic.  Only fall into the new logic when
wide or multibyte chars are present.
I hope this is explained by previous paragraphs.
...
Of course the really harsh route is to count the width of each character
in the array using mblen, then somehow mathematically get a result of 20
"printed characters" of uniqueness.  Again, IDC really, but since I was
looking and thinking about it I figured I'd at least mention it.
So, I think the only required thing to fix would be the NULL return if
len == -1 without an error message.  Beyond that altering the return to
be NULL instead of ret and altering to use virReportSystemError are may
as well do it since you need to fix the first - with that done:
Reviewed-by: John Ferlan <jferlan@redhat.com>
Thanks for the review.  I'll read the other reviews in this series, fix
it up and push (if appropriate).
...
John
Whether the shortened name displayed "apparent" length rabbit hole needs
adjustment - is up to you. I'm fine with it the way it is. I don't
believe we ever documented our shortening "rules", but wide and
multichar wide charsets really shorten things a lot.  Perhaps that's a
cross that bridge when someone complains type thing!
...
+
+    if (VIR_ALLOC_N(shortname, len + 1) < 0)
+        return NULL;
+
+    if (wcstombs(shortname, wshortname, len) == (size_t) -1) {
+        virReportError(VIR_ERR_INVALID_ARG, "%s",
+                       _("Cannot convert domain name from wide character string"));
+        goto cleanup;
+    }
+
+    ignore_value(virAsprintf(&ret, "%d-%s", def->id, shortname));
+ cleanup:
+    VIR_FREE(shortname);
     return ret;
 }
+#undef VIR_DOMAIN_SHORT_NAME_MAX
int
 virDomainGetBlkioParametersAssignFromDef(virDomainDefPtr def,