"Richard W.M. Jones" <rjones(a)redhat.com> wrote:
Jim Meyering wrote:
> "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
>
>> Jim Meyering wrote:
>>> "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
>>>> Jim Meyering wrote:
>>>>> What do you think of using this?
>>>>>
>>>>> isascii (*p) && isalnum (*p)
>>>> I'm not sure I'm qualified to say what this does on EBCDIC, but
quite
>>>> likely lots of other code breaks there too anyway. This is nicely
>>>> self-documenting anyway.
>>> As Daniel suggested, isalnum is locale-sensitive.
>>> If there's a locale with an alphabetic byte that is outside
>>> the logical a-zA-Z range, yet still within 0..127, then the above
>>> expression will give a false-positive for that byte.
>>>
>>> I've been inclined to stop worrying about EBCDIC for years, but a quick
>>> search on the web finds that people are still stuck using it, and do
>>> report bugs in ASCII-assuming code.
>>>
>>> This is why autoconf goes to the trouble of doing this:
>>> tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
>>> not this:
>>> tr a-z A-Z
>>> to convert to upper case.
>> Another factor to consider here is that it doesn't matter if this
>> function over-escapes, but it does matter if the function
>> under-escapes. That is to say, it could escape every character as a
>> %xx hex code, which would be ugly and slightly inefficient but not
>> wrong.
>
> IMHO, if you don't use the all-enumerating switch-based code that Daniel
> objects to, it'd be good to document (in both loops) that the test is
> inaccurate with EBCDIC, and explain why it's ok to get false positives.
>
> Without comments, people might be tempted to use a similar test in a
> context where the differences matter.
OK, how about this?
Rich.
+ for (p = str; *p; ++p) {
+ /* Want to escape only A-Z and 0-9. This may not work on
EBCDIC. */
+ if (isascii (*p) && isalnum (*p))
Actually, with that, the code is at the mercy of locale definitions,
(which are notoriously unreliable), and it probably works with EBCDIC.
I wrote the following and tested a few systems:
#include <ctype.h>
#include <stdio.h>
#include <locale.h>
int
is_alphanum (char c)
{
switch (c)
{
/* generated by LC_ALL=C perl -e \
"print map {qq(case '\$_': )}('a'..'z',
'A'..'Z', '0'..'9')"|fmt */
case 'a': case 'b': case 'c': case 'd': case
'e': case 'f': case 'g':
case 'h': case 'i': case 'j': case 'k': case
'l': case 'm': case 'n':
case 'o': case 'p': case 'q': case 'r': case
's': case 't': case 'u':
case 'v': case 'w': case 'x': case 'y': case
'z': case 'A': case 'B':
case 'C': case 'D': case 'E': case 'F': case
'G': case 'H': case 'I':
case 'J': case 'K': case 'L': case 'M': case
'N': case 'O': case 'P':
case 'Q': case 'R': case 'S': case 'T': case
'U': case 'V': case 'W':
case 'X': case 'Y': case 'Z': case '0': case
'1': case '2': case '3':
case '4': case '5': case '6': case '7': case
'8': case '9':
return 1;
default:
return 0;
}
}
int
main ()
{
setlocale (LC_ALL, "");
for (unsigned int i = 0; i < 256; i++)
if (isalnum (i) && isascii (i) && !is_alphanum (i))
printf ("%d: %c", i, i);
return 0;
}
-------------------------------
I compiled and ran it against all installed locales like this:
gcc -o k k.c && for i in $(locale -a); do \
test "$(LC_ALL=$i ./k|wc -l)" = 0 || echo $i;done
On RHEL4, RHEL5, and rawhide, it finds this exception:
vi_VN.tcvn
Running manually in that locale suggests something is fishy:
$ LC_ALL=vi_VN.tcvn ./k
1:
2:
4:
5:
6:
17:
18:
19:
20:
21:
22:
23:
Surprise, surprise...
So in this locale, using "isascii (*p) && isalnum (*p)" would
*under*quote.
I didn't expect to find such a convincing argument.
I stand by my suggestion to use the switch statement.