[Libvir] [PATCH] Enhance virBuffer code

Apart from a couple of minor fixes, this adds the following useful functions to the virBuffer implementation: virBufferAddChar: Append a single character to the buffer. virBufferURIEncodeString: Append a %-encoded string to the buffer (eg. virBufferEscapeString and the similar function in libxml2). Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

Richard W.M. Jones wrote:
(eg. virBufferEscapeString [...]
Erm, that should be 'cf' not 'eg'. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

On Thu, Dec 13, 2007 at 06:27:18PM +0000, Richard W.M. Jones wrote:
Richard W.M. Jones wrote:
(eg. virBufferEscapeString [...]
Erm, that should be 'cf' not 'eg'.
The remote driver in CVS doesn't work with this patch applied - it just refuses the connection everytime. It looks like a conditional got reversed - see attached patch Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

On Mon, Dec 17, 2007 at 05:06:19PM +0000, Daniel P. Berrange wrote:
On Thu, Dec 13, 2007 at 06:27:18PM +0000, Richard W.M. Jones wrote:
Richard W.M. Jones wrote:
(eg. virBufferEscapeString [...]
Erm, that should be 'cf' not 'eg'.
The remote driver in CVS doesn't work with this patch applied - it just refuses the connection everytime. It looks like a conditional got reversed - see attached patch
Yes, I just replied to the wrong thread - this was supposed to be a reply to the thread about remote_internal.c refactoring to use qparams.c Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Daniel P. Berrange wrote:
On Thu, Dec 13, 2007 at 06:27:18PM +0000, Richard W.M. Jones wrote:
Richard W.M. Jones wrote:
(eg. virBufferEscapeString [...] Erm, that should be 'cf' not 'eg'.
The remote driver in CVS doesn't work with this patch applied - it just refuses the connection everytime. It looks like a conditional got reversed - see attached patch
Yes, you're completely right. I'll apply that patchlet now. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

"Richard W.M. Jones" <rjones@redhat.com> wrote: ...
+ * virBufferURIEncodeString: + * @buf: the buffer to append to + * @str: the string argument which will be URI-encoded + * + * Append the string to the buffer. The string will be URI-encoded + * during the append (ie any non alpha-numeric characters are replaced + * with '%xx' hex sequences). + * + * Returns 0 successful, -1 in case of internal or API error. + */ +int +virBufferURIEncodeString (virBufferPtr buf, const char *str) +{ + int grow_size = 0; + const char *p; + unsigned char uc; + const char *hex = "0123456789abcdef"; + + for (p = str; *p; ++p) { + /* Want to leave only strict 7 bit ASCII alphanumerics ... */ + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z')) ... + for (p = str; *p; ++p) { + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z'))
Hi Rich, What do you think of using this? isascii (*p) && isalnum (*p)

On Thu, Dec 13, 2007 at 11:21:31PM +0100, Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote: ...
+ * virBufferURIEncodeString: + * @buf: the buffer to append to + * @str: the string argument which will be URI-encoded + * + * Append the string to the buffer. The string will be URI-encoded + * during the append (ie any non alpha-numeric characters are replaced + * with '%xx' hex sequences). + * + * Returns 0 successful, -1 in case of internal or API error. + */ +int +virBufferURIEncodeString (virBufferPtr buf, const char *str) +{ + int grow_size = 0; + const char *p; + unsigned char uc; + const char *hex = "0123456789abcdef"; + + for (p = str; *p; ++p) { + /* Want to leave only strict 7 bit ASCII alphanumerics ... */ + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z')) ... + for (p = str; *p; ++p) { + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z'))
Hi Rich,
What do you think of using this?
isascii (*p) && isalnum (*p)
I have learned to be very cautious of the is* macros because they tend to be local dependant whichis usually really not what you would like or expect. In that case this may work, but explicit ranges are 100% clear about what you intend to accept or not, that's why I usually prefer them. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Daniel Veillard <veillard@redhat.com> wrote:
On Thu, Dec 13, 2007 at 11:21:31PM +0100, Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote: ...
+ * virBufferURIEncodeString: + * @buf: the buffer to append to + * @str: the string argument which will be URI-encoded + * + * Append the string to the buffer. The string will be URI-encoded + * during the append (ie any non alpha-numeric characters are replaced + * with '%xx' hex sequences). + * + * Returns 0 successful, -1 in case of internal or API error. + */ +int +virBufferURIEncodeString (virBufferPtr buf, const char *str) +{ + int grow_size = 0; + const char *p; + unsigned char uc; + const char *hex = "0123456789abcdef"; + + for (p = str; *p; ++p) { + /* Want to leave only strict 7 bit ASCII alphanumerics ... */ + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z')) ... + for (p = str; *p; ++p) { + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z'))
Hi Rich,
What do you think of using this?
isascii (*p) && isalnum (*p)
I have learned to be very cautious of the is* macros because they tend to be local dependant whichis usually really not what you would like or expect. In that case this may work, but explicit ranges are 100% clear about what you intend to accept or not, that's why I usually prefer them.
You're right that we shouldn't use isalnum here. However, we shouldn't use inequality comparisons, either. While 0 <= c <= 9 is guaranteed to be ok for the digits, the a..z and A..Z ranges need not be contiguous, i.e., with EBCDIC: http://www.natural-innovations.com/computing/asciiebcdic.html so how about this instead? int is_alphanum (char c) { switch (c) { /* generated by LC_ALL=C perl -e \ "print map {qq(case '\$_': )}('a'..'z', 'A'..'Z', '0'..'9')"|fmt case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n': case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z': case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z': case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': return 1; default: return 0; } } Of course, systems for which this can make a difference (z/OS, S390) may not be libvirt portability targets, but better safe than sorry.

On Fri, Dec 14, 2007 at 08:34:13AM +0100, Jim Meyering wrote:
You're right that we shouldn't use isalnum here. However, we shouldn't use inequality comparisons, either. While 0 <= c <= 9 is guaranteed to be ok for the digits, the a..z and A..Z ranges need not be contiguous, i.e., with EBCDIC:
http://www.natural-innovations.com/computing/asciiebcdic.html
so how about this instead?
int is_alphanum (char c) { switch (c) { /* generated by LC_ALL=C perl -e \ "print map {qq(case '\$_': )}('a'..'z', 'A'..'Z', '0'..'9')"|fmt case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n': case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z': case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z': case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9':
This makes it unreadable, really no.
Of course, systems for which this can make a difference (z/OS, S390) may not be libvirt portability targets, but better safe than sorry.
I do that kind range comparison in libxml2. This raised a problem only once ever on an Unisys mainframe, they upgraded the compiler and got an ascii option which allowed to fix it without patching libxml2. Maintainance wise, I really prefer we keep the status quo. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Fri, Dec 14, 2007 at 08:34:13AM +0100, Jim Meyering wrote:
Daniel Veillard <veillard@redhat.com> wrote:
On Thu, Dec 13, 2007 at 11:21:31PM +0100, Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote: ...
+ * virBufferURIEncodeString: + * @buf: the buffer to append to + * @str: the string argument which will be URI-encoded + * + * Append the string to the buffer. The string will be URI-encoded + * during the append (ie any non alpha-numeric characters are replaced + * with '%xx' hex sequences). + * + * Returns 0 successful, -1 in case of internal or API error. + */ +int +virBufferURIEncodeString (virBufferPtr buf, const char *str) +{ + int grow_size = 0; + const char *p; + unsigned char uc; + const char *hex = "0123456789abcdef"; + + for (p = str; *p; ++p) { + /* Want to leave only strict 7 bit ASCII alphanumerics ... */ + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z')) ... + for (p = str; *p; ++p) { + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z'))
Hi Rich,
What do you think of using this?
isascii (*p) && isalnum (*p)
I have learned to be very cautious of the is* macros because they tend to be local dependant whichis usually really not what you would like or expect. In that case this may work, but explicit ranges are 100% clear about what you intend to accept or not, that's why I usually prefer them.
You're right that we shouldn't use isalnum here. However, we shouldn't use inequality comparisons, either. While 0 <= c <= 9 is guaranteed to be ok for the digits, the a..z and A..Z ranges need not be contiguous, i.e., with EBCDIC:
Seriously who gives a damn about EBCDIC anymore. This just makes it totally unreadable. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

"Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Dec 14, 2007 at 08:34:13AM +0100, Jim Meyering wrote:
Daniel Veillard <veillard@redhat.com> wrote:
On Thu, Dec 13, 2007 at 11:21:31PM +0100, Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote: ...
+ * virBufferURIEncodeString: + * @buf: the buffer to append to + * @str: the string argument which will be URI-encoded + * + * Append the string to the buffer. The string will be URI-encoded + * during the append (ie any non alpha-numeric characters are replaced + * with '%xx' hex sequences). + * + * Returns 0 successful, -1 in case of internal or API error. + */ +int +virBufferURIEncodeString (virBufferPtr buf, const char *str) +{ + int grow_size = 0; + const char *p; + unsigned char uc; + const char *hex = "0123456789abcdef"; + + for (p = str; *p; ++p) { + /* Want to leave only strict 7 bit ASCII alphanumerics ... */ + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z')) ... + for (p = str; *p; ++p) { + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z'))
Hi Rich,
What do you think of using this?
isascii (*p) && isalnum (*p)
I have learned to be very cautious of the is* macros because they tend to be local dependant whichis usually really not what you would like or expect. In that case this may work, but explicit ranges are 100% clear about what you intend to accept or not, that's why I usually prefer them.
You're right that we shouldn't use isalnum here. However, we shouldn't use inequality comparisons, either. While 0 <= c <= 9 is guaranteed to be ok for the digits, the a..z and A..Z ranges need not be contiguous, i.e., with EBCDIC:
Seriously who gives a damn about EBCDIC anymore.
Daniel and Daniel, Personally, I don't care at all, but I suspect that people using z/OS and OS/390 care, among others. Some might even be Red Hat customers. What's wrong with making the code portable, just in case?
This just makes it totally unreadable.
This just makes *what* totally unreadable? I proposed to do this: for (p = str; *p; ++p) { is_alphanum (*p) ... for (p = str; *p; ++p) { is_alphanum (*p) in place of the code above with all the comparisons. Or maybe I'm misinterpreting something and you're saying the 11-operations-per-byte version is harder to read/maintain? In that case, I would agree. Encapsulating the test makes it _more_ readable and maintainable, since there is far less syntax, less logic, and less duplication. If you're saying that the definition of is_alphanum is unreadable, I don't buy it. It's verifiably correct, and works properly even for those unfortunate enough to be using EBCDIC. Why bother to read it when it's so simple and automatically generated? Between an understandable name and a comment explaining what it does, I would have thought that'd be enough for anyone. This isn't the first time (by a long shot) that we've had such disagreements. It's obvious that our work philosophies are very different, so, after giving it some thought, I believe it's best that I no longer contribute to this project. I need to use my time more efficiently. Good luck with libvirt.

On Fri, Dec 14, 2007 at 11:53:35PM +0100, Jim Meyering wrote:
"Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Dec 14, 2007 at 08:34:13AM +0100, Jim Meyering wrote:
Daniel Veillard <veillard@redhat.com> wrote:
On Thu, Dec 13, 2007 at 11:21:31PM +0100, Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote: ...
+ * virBufferURIEncodeString: + * @buf: the buffer to append to + * @str: the string argument which will be URI-encoded + * + * Append the string to the buffer. The string will be URI-encoded + * during the append (ie any non alpha-numeric characters are replaced + * with '%xx' hex sequences). + * + * Returns 0 successful, -1 in case of internal or API error. + */ +int +virBufferURIEncodeString (virBufferPtr buf, const char *str) +{ + int grow_size = 0; + const char *p; + unsigned char uc; + const char *hex = "0123456789abcdef"; + + for (p = str; *p; ++p) { + /* Want to leave only strict 7 bit ASCII alphanumerics ... */ + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z')) ... + for (p = str; *p; ++p) { + if ((*p >= '0' && *p <= '9') || + (*p >= 'a' && *p <= 'z') || + (*p >= 'A' && *p <= 'Z'))
Hi Rich,
What do you think of using this?
isascii (*p) && isalnum (*p)
I have learned to be very cautious of the is* macros because they tend to be local dependant whichis usually really not what you would like or expect. In that case this may work, but explicit ranges are 100% clear about what you intend to accept or not, that's why I usually prefer them.
You're right that we shouldn't use isalnum here. However, we shouldn't use inequality comparisons, either. While 0 <= c <= 9 is guaranteed to be ok for the digits, the a..z and A..Z ranges need not be contiguous, i.e., with EBCDIC:
Seriously who gives a damn about EBCDIC anymore.
Daniel and Daniel,
Personally, I don't care at all, but I suspect that people using z/OS and OS/390 care, among others. Some might even be Red Hat customers. What's wrong with making the code portable, just in case?
Portability while certainly worthwhile isn't absolute - we only have finite development time. While we could spend that time worrying about EBCDIC and other platforms we're not likely to encounter any time soon, there is a tradeoff to be made at some point where we have to get back to actually developing features that matter to the 99.99999% of our userbase who isn't EBCDIC. I'm not saying we should go out of our way to break other platforms, but there's a cost/benefit tradeoff to be had.
This just makes it totally unreadable.
This just makes *what* totally unreadable?
The giant switch statements is what I was referring to.
I proposed to do this:
for (p = str; *p; ++p) { is_alphanum (*p) ...
for (p = str; *p; ++p) { is_alphanum (*p)
I've no objection to this - just the giant switch statement. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

On Fri, Dec 14, 2007 at 11:53:35PM +0100, Jim Meyering wrote:
"Daniel P. Berrange" <berrange@redhat.com> wrote: Personally, I don't care at all, but I suspect that people using z/OS and OS/390 care, among others. Some might even be Red Hat customers. What's wrong with making the code portable, just in case?
You can spend a week making the code portable to 0.01% of the user base and in the process introduce suble problems which may impact far more than those 0.01%, that's why I think keeping the range checks and adding a warning about it not being okay on EBCDIC is the best ATM.
This just makes it totally unreadable.
This just makes *what* totally unreadable? I proposed to do this:
for (p = str; *p; ++p) { is_alphanum (*p) ...
for (p = str; *p; ++p) { is_alphanum (*p)
in place of the code above with all the comparisons. Or maybe I'm misinterpreting something and you're saying the 11-operations-per-byte version is harder to read/maintain? In that case, I would agree.
The big switch was unreadable.
Encapsulating the test makes it _more_ readable and maintainable, since there is far less syntax, less logic, and less duplication.
If you're saying that the definition of is_alphanum is unreadable, I don't buy it. It's verifiably correct, and works properly even for those unfortunate enough to be using EBCDIC. Why bother to read it when
You always base your decision criteria on a GNU toolchain, where you know how to go and look for the definitions of the macros and the code. I guess a lot of our potential users (and even Red Hat customers) may be tempted to use libvirt and the related code in completely different environment, and your decision criteria which are perfectly valid in a GNU framework may not be that correct there (how do you verify is_alphanum behaviour w.r.t. locale settings when you are compiling for example on the Chinese Windows version ? I have no idea, and sometimes I don't want to take the risk.) I think it is good to have different kind of expertise around, as long as all the inputs can be taken into account to make the final decision. But that means there will be time where someone viewpoint will have to be ruled off in order to progress. To me the key point of an open source process is that the various points and people are recorded, for example to be able to appeal or explain if the issue is raised in the future.
it's so simple and automatically generated? Between an understandable name and a comment explaining what it does, I would have thought that'd be enough for anyone.
Well the understandable name of is_* happens to be locale dependant in a number of cases you can't tell offhand which ones, and that is a serious side effect which is not clear at all. I have been bitten by that in the past, and now I'm just cautious about using them when for grammar definition and not string processing. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Jim Meyering wrote:
What do you think of using this?
isascii (*p) && isalnum (*p)
I'm not sure I'm qualified to say what this does on EBCDIC, but quite likely lots of other code breaks there too anyway. This is nicely self-documenting anyway. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

"Richard W.M. Jones" <rjones@redhat.com> wrote:
Jim Meyering wrote:
What do you think of using this?
isascii (*p) && isalnum (*p)
I'm not sure I'm qualified to say what this does on EBCDIC, but quite likely lots of other code breaks there too anyway. This is nicely self-documenting anyway.
As Daniel suggested, isalnum is locale-sensitive. If there's a locale with an alphabetic byte that is outside the logical a-zA-Z range, yet still within 0..127, then the above expression will give a false-positive for that byte. I've been inclined to stop worrying about EBCDIC for years, but a quick search on the web finds that people are still stuck using it, and do report bugs in ASCII-assuming code. This is why autoconf goes to the trouble of doing this: tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ not this: tr a-z A-Z to convert to upper case.

Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote:
What do you think of using this?
isascii (*p) && isalnum (*p) I'm not sure I'm qualified to say what this does on EBCDIC, but quite
Jim Meyering wrote: likely lots of other code breaks there too anyway. This is nicely self-documenting anyway.
As Daniel suggested, isalnum is locale-sensitive. If there's a locale with an alphabetic byte that is outside the logical a-zA-Z range, yet still within 0..127, then the above expression will give a false-positive for that byte.
I've been inclined to stop worrying about EBCDIC for years, but a quick search on the web finds that people are still stuck using it, and do report bugs in ASCII-assuming code.
This is why autoconf goes to the trouble of doing this: tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ not this: tr a-z A-Z to convert to upper case.
Another factor to consider here is that it doesn't matter if this function over-escapes, but it does matter if the function under-escapes. That is to say, it could escape every character as a %xx hex code, which would be ugly and slightly inefficient but not wrong. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

"Richard W.M. Jones" <rjones@redhat.com> wrote:
Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote:
What do you think of using this?
isascii (*p) && isalnum (*p) I'm not sure I'm qualified to say what this does on EBCDIC, but quite
Jim Meyering wrote: likely lots of other code breaks there too anyway. This is nicely self-documenting anyway.
As Daniel suggested, isalnum is locale-sensitive. If there's a locale with an alphabetic byte that is outside the logical a-zA-Z range, yet still within 0..127, then the above expression will give a false-positive for that byte.
I've been inclined to stop worrying about EBCDIC for years, but a quick search on the web finds that people are still stuck using it, and do report bugs in ASCII-assuming code.
This is why autoconf goes to the trouble of doing this: tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ not this: tr a-z A-Z to convert to upper case.
Another factor to consider here is that it doesn't matter if this function over-escapes, but it does matter if the function under-escapes. That is to say, it could escape every character as a %xx hex code, which would be ugly and slightly inefficient but not wrong.
IMHO, if you don't use the all-enumerating switch-based code that Daniel objects to, it'd be good to document (in both loops) that the test is inaccurate with EBCDIC, and explain why it's ok to get false positives. Without comments, people might be tempted to use a similar test in a context where the differences matter.

Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote:
Jim Meyering wrote:
What do you think of using this?
isascii (*p) && isalnum (*p) I'm not sure I'm qualified to say what this does on EBCDIC, but quite
Jim Meyering wrote: likely lots of other code breaks there too anyway. This is nicely self-documenting anyway. As Daniel suggested, isalnum is locale-sensitive. If there's a locale with an alphabetic byte that is outside
"Richard W.M. Jones" <rjones@redhat.com> wrote: the logical a-zA-Z range, yet still within 0..127, then the above expression will give a false-positive for that byte.
I've been inclined to stop worrying about EBCDIC for years, but a quick search on the web finds that people are still stuck using it, and do report bugs in ASCII-assuming code.
This is why autoconf goes to the trouble of doing this: tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ not this: tr a-z A-Z to convert to upper case. Another factor to consider here is that it doesn't matter if this function over-escapes, but it does matter if the function under-escapes. That is to say, it could escape every character as a %xx hex code, which would be ugly and slightly inefficient but not wrong.
IMHO, if you don't use the all-enumerating switch-based code that Daniel objects to, it'd be good to document (in both loops) that the test is inaccurate with EBCDIC, and explain why it's ok to get false positives.
Without comments, people might be tempted to use a similar test in a context where the differences matter.
OK, how about this? Rich. + for (p = str; *p; ++p) { + /* Want to escape only A-Z and 0-9. This may not work on EBCDIC. */ + if (isascii (*p) && isalnum (*p)) + grow_size++; + else + grow_size += 3; /* %ab */ + } + + if (virBufferGrow (buf, grow_size) == -1) + return -1; + + for (p = str; *p; ++p) { + /* See comment above. */ + if (isascii (*p) && isalnum (*p)) + buf->content[buf->use++] = *p; + else { + uc = (unsigned char) *p; + buf->content[buf->use++] = '%'; + buf->content[buf->use++] = hex[uc >> 4]; + buf->content[buf->use++] = hex[uc & 0xf]; + } + } -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

"Richard W.M. Jones" <rjones@redhat.com> wrote:
Jim Meyering wrote:
"Richard W.M. Jones" <rjones@redhat.com> wrote:
Jim Meyering wrote:
What do you think of using this?
isascii (*p) && isalnum (*p) I'm not sure I'm qualified to say what this does on EBCDIC, but quite
Jim Meyering wrote: likely lots of other code breaks there too anyway. This is nicely self-documenting anyway. As Daniel suggested, isalnum is locale-sensitive. If there's a locale with an alphabetic byte that is outside
"Richard W.M. Jones" <rjones@redhat.com> wrote: the logical a-zA-Z range, yet still within 0..127, then the above expression will give a false-positive for that byte.
I've been inclined to stop worrying about EBCDIC for years, but a quick search on the web finds that people are still stuck using it, and do report bugs in ASCII-assuming code.
This is why autoconf goes to the trouble of doing this: tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ not this: tr a-z A-Z to convert to upper case. Another factor to consider here is that it doesn't matter if this function over-escapes, but it does matter if the function under-escapes. That is to say, it could escape every character as a %xx hex code, which would be ugly and slightly inefficient but not wrong.
IMHO, if you don't use the all-enumerating switch-based code that Daniel objects to, it'd be good to document (in both loops) that the test is inaccurate with EBCDIC, and explain why it's ok to get false positives.
Without comments, people might be tempted to use a similar test in a context where the differences matter.
OK, how about this?
Rich.
+ for (p = str; *p; ++p) { + /* Want to escape only A-Z and 0-9. This may not work on EBCDIC. */ + if (isascii (*p) && isalnum (*p))
Actually, with that, the code is at the mercy of locale definitions, (which are notoriously unreliable), and it probably works with EBCDIC. I wrote the following and tested a few systems: #include <ctype.h> #include <stdio.h> #include <locale.h> int is_alphanum (char c) { switch (c) { /* generated by LC_ALL=C perl -e \ "print map {qq(case '\$_': )}('a'..'z', 'A'..'Z', '0'..'9')"|fmt */ case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n': case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z': case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z': case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': return 1; default: return 0; } } int main () { setlocale (LC_ALL, ""); for (unsigned int i = 0; i < 256; i++) if (isalnum (i) && isascii (i) && !is_alphanum (i)) printf ("%d: %c", i, i); return 0; } ------------------------------- I compiled and ran it against all installed locales like this: gcc -o k k.c && for i in $(locale -a); do \ test "$(LC_ALL=$i ./k|wc -l)" = 0 || echo $i;done On RHEL4, RHEL5, and rawhide, it finds this exception: vi_VN.tcvn Running manually in that locale suggests something is fishy: $ LC_ALL=vi_VN.tcvn ./k 1: 2: 4: 5: 6: 17: 18: 19: 20: 21: 22: 23: Surprise, surprise... So in this locale, using "isascii (*p) && isalnum (*p)" would *under*quote. I didn't expect to find such a convincing argument. I stand by my suggestion to use the switch statement.

On Fri, Dec 14, 2007 at 02:17:31PM +0100, Jim Meyering wrote:
I stand by my suggestion to use the switch statement.
I stand by the satus quo, with an EBCDIC comment if you really care Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Jim Meyering wrote:
$ LC_ALL=vi_VN.tcvn ./k 1: 2: 4: 5: 6: 17: 18: 19: 20: 21: 22: 23:
Surprise, surprise... So in this locale, using "isascii (*p) && isalnum (*p)" would *under*quote.
I didn't expect to find such a convincing argument. I stand by my suggestion to use the switch statement.
Oh that's bizarre ... Looks like switch statement it is then! Rich.

On Fri, Dec 14, 2007 at 11:06:36AM +0000, Richard W.M. Jones wrote:
OK, how about this?
Rich.
+ for (p = str; *p; ++p) { + /* Want to escape only A-Z and 0-9. This may not work on EBCDIC. */ + if (isascii (*p) && isalnum (*p)) + grow_size++; + else + grow_size += 3; /* %ab */ + }
Huh ? How much time and context informations are needed to understand what that code does. Compare to the time needed for the explicit ranges. I really don't see any progress there, and still object to the EBCDIC argument, on such machine you usually don't even get c99 like compilers. This makes no sense to me, really ! Add /* Not EBCDIC safe */ comment on top of the range if you really believe someone is gonna cut and paste libvirt escaping code into a different context, but please keep the semantic of the code clear, it's #1 IMHO for maintainance. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Fri, Dec 14, 2007 at 08:22:24AM -0500, Daniel Veillard wrote:
On Fri, Dec 14, 2007 at 11:06:36AM +0000, Richard W.M. Jones wrote:
OK, how about this?
Rich.
+ for (p = str; *p; ++p) { + /* Want to escape only A-Z and 0-9. This may not work on EBCDIC. */ + if (isascii (*p) && isalnum (*p)) + grow_size++; + else + grow_size += 3; /* %ab */ + }
Huh ? How much time and context informations are needed to understand what that code does. Compare to the time needed for the explicit ranges. I really don't see any progress there, and still object to the EBCDIC argument, on such machine you usually don't even get c99 like compilers. This makes no sense to me, really ! Add /* Not EBCDIC safe */ comment on top of the range if you really believe someone is gonna cut and paste libvirt escaping code into a different context, but please keep the semantic of the code clear, it's #1 IMHO for maintainance.
In a nutshell, I fine with the original version of that patch, Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
participants (5)
-
Daniel P. Berrange
-
Daniel Veillard
-
Jim Meyering
-
Richard Jones
-
Richard W.M. Jones