Re: [Libvir] [PATCH] Enhance virBuffer code

14 Dec 2007

      "Richard W.M. Jones" <rjones@redhat.com> wrote:
...
Jim Meyering wrote:
...
"Richard W.M. Jones" <rjones@redhat.com> wrote:
...
Jim Meyering wrote:
...
...
...
What do you think of using this?
isascii (*p) && isalnum (*p)
I'm not sure I'm qualified to say what this does on EBCDIC, but quite
Jim Meyering wrote:
likely lots of other code breaks there too anyway.  This is nicely
self-documenting anyway.
As Daniel suggested, isalnum is locale-sensitive.
If there's a locale with an alphabetic byte that is outside
"Richard W.M. Jones" <rjones@redhat.com> wrote:
the logical a-zA-Z range, yet still within 0..127, then the above
expression will give a false-positive for that byte.
I've been inclined to stop worrying about EBCDIC for years, but a quick
search on the web finds that people are still stuck using it, and do
report bugs in ASCII-assuming code.
This is why autoconf goes to the trouble of doing this:
  tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
not this:
  tr a-z A-Z
to convert to upper case.
Another factor to consider here is that it doesn't matter if this
function over-escapes, but it does matter if the function
under-escapes. That is to say, it could escape every character as a
%xx hex code, which would be ugly and slightly inefficient but not
wrong.
IMHO, if you don't use the all-enumerating switch-based code that Daniel
objects to, it'd be good to document (in both loops) that the test is
inaccurate with EBCDIC, and explain why it's ok to get false positives.
Without comments, people might be tempted to use a similar test in a
context where the differences matter.
OK, how about this?
Rich.
+    for (p = str; *p; ++p) {
+        /* Want to escape only A-Z and 0-9.  This may not work on
EBCDIC. */
+        if (isascii (*p) && isalnum (*p))
Actually, with that, the code is at the mercy of locale definitions,
(which are notoriously unreliable), and it probably works with EBCDIC.

I wrote the following and tested a few systems:

#include <ctype.h>
#include <stdio.h>
#include <locale.h>
int
is_alphanum (char c)
{
  switch (c)
  {
    /* generated by LC_ALL=C perl -e \
       "print map {qq(case '\$_': )}('a'..'z', 'A'..'Z', '0'..'9')"|fmt */
    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
    case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
    case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
    case 'v': case 'w': case 'x': case 'y': case 'z': case 'A': case 'B':
    case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I':
    case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P':
    case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W':
    case 'X': case 'Y': case 'Z': case '0': case '1': case '2': case '3':
    case '4': case '5': case '6': case '7': case '8': case '9':
      return 1;
    default:
      return 0;
  }
}

int
main ()
{
  setlocale (LC_ALL, "");
  for (unsigned int i = 0; i < 256; i++)
    if (isalnum (i) && isascii (i) && !is_alphanum (i))
      printf ("%d: %c", i, i);

  return 0;
}
-------------------------------
I compiled and ran it against all installed locales like this:

gcc -o k k.c && for i in $(locale -a); do \
  test "$(LC_ALL=$i ./k|wc -l)" = 0 || echo $i;done

On RHEL4, RHEL5, and rawhide, it finds this exception:

  vi_VN.tcvn

Running manually in that locale suggests something is fishy:

  $ LC_ALL=vi_VN.tcvn ./k
  1:
  2:
  4:
  5:
  6:
  17:
  18:
  19:
  20:
  21:
  22:
  23:

Surprise, surprise...
So in this locale, using "isascii (*p) && isalnum (*p)" would
*under*quote.

I didn't expect to find such a convincing argument.
I stand by my suggestion to use the switch statement.