Re: [libvirt] [PATCH] avoid one more ctype vs. sign-extension problem

9 May 2008

      On Thu, May 08, 2008 at 04:44:38PM +0200, Jim Meyering wrote:
...
This change demonstrates that the new syntax-check rule's
regexp can be improved.  It missed the unsafe tolower use,
since there was already a to_uchar use on that line.
Since this shows up again, let me argue a bit more why I dislike
using is* and to* in libvirt.

<pedantic topic="string encodings">

  We use to* only for comparing A-F/a-f in ethernet addresses
comparisons, so those use are safe. The use of XML as the front
end for all text parsed in libvirt also safeguards against what
I consider a nasty problems with the use of strings in system code
i.e. their dependance to locale. I think it's not acceptable to 
have a risk of different behaviour if the user locale has changed,
this can lead to untractable bugs, and also mean we allowed at
a given point the problem of string whose encoding is undefined
to occur in the library (in the libxml2 docs I point to [1] and [2]
as introductions to the problem).

  As long as we use XML for input, we are safe IMHO, because 
any string input will automatically come converted as UTF-8 
by the way of libxml2, that means if someone want to use japanese
characters for the name of their domain or files, at the libvirt level 
they can and we should not misparse or break those (it's likely to
break down the line in xend, but i would hope that lxc and QEmu
would be better at handling this though I don't know how we could enforce
the exec'ec programs args to be interpreted as UTF-8 in the forked
process, as a library we just can't touch the locale value, but maybe
we can do that after fork() and before exec()).

  But each time we extract a string from libxml2 we should know and keep
in mind it is an UTF-8 string. In practice the ethernet MAC addresses
strings used with to_lower and to_upper in util.c are UTF-8, not strings
in the user locale. We are safe here because most locales match
with ASCII on the first 128 values, UTF-8 included, and the 
characters for the MAC address range cannot conflict for UTF-8
encoding, but from a pure correctness POV using those convenience
macros on those UTF-8 strings is a mistake.

  And I would rather see automated checks on the code look for those
mistakes, rather than doing a lookup for a wrong cast to then use 
a wrong macro with a false sense of security.

</pedantic>

  okay now I feel better :-)

Daniel

[1] http://www.joelonsoftware.com/articles/Unicode.html
[2] http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/