Re: [libvirt] [PATCH v2 3/5] Extend nwfilter schema to accept comment attributes

Tuesday, 28 September 2010

On 09/28/2010 04:28 AM, Stefan Berger wrote:
...
> okay.  It also leaves out 8-bit bytes - could that be a problem for i18n
> where people want comments with native-language accented characters?
> That is, are we being too strict here?  Maybe a better pattern would be
> to reject specific non-printing ASCII bytes we want to avoid, assuming
> you can use escape sequences like [^\001]?

 Looking at

 http://www.asciitable.com/

 I should probably include 0x20-0x7E and 128-175, 224-238 - maybe even
 more? So the regex then becomes

 [&#x20;-&#x7E;&#128;-&#175;&#224;-&#238;]{0,256}

True ASCII is strictly 7-bit; any locale where isprint() returns true on 
8-bit bytes is a superset single-byte encoding, such as ISO-8859-1, or 
'extended ascii' from the URL you posted above.  But I'm also thinking 
about multi-byte encodings, like UTF-8, where we cannot a priori write a 
regex that will accept all valid Unicode printable characters, in part 
because you have to look at more than one byte at a time to determine if 
you have a printable character.  Which goes back to my suggestion of an 
inverse charset - rejecting bytes that are known to be non-printable 
ASCII, and letting everything else through, whether or not it is a printable 
byte sequence in the current locale.  So what about this idea: exclude 
control characters except for tab, and let space and everything after 
through (I don't know if it needs to be adjusted to also reject &#x00;):

[^&#x01;-&#x08;&#x0A;-&#x1F;]{0,256}
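
For illustration only, here is a rough sketch of how the inverse
charset behaves, with the pattern above transcribed into Python escapes.
This is not the schema or any libvirt code: the names COMMENT_RE and
is_valid_comment are made up for the example, and Python's re module
matches Unicode code points rather than bytes, so the 256 limit here
counts characters rather than bytes.

```python
import re

# Hypothetical transcription of the proposed pattern: reject U+0001-U+0008
# and U+000A-U+001F, let tab (U+0009), space and everything above through,
# and cap the length at 256 characters. NUL (U+0000) is NOT rejected here,
# matching the open question in the text.
COMMENT_RE = re.compile(r"[^\x01-\x08\x0a-\x1f]{0,256}")


def is_valid_comment(text: str) -> bool:
    """Return True if the whole string is accepted by the inverse charset."""
    return COMMENT_RE.fullmatch(text) is not None


if __name__ == "__main__":
    samples = ["plain ascii", "r\u00e8gle caf\u00e9", "tab\tok", "bell\x07", "two\nlines"]
    for sample in samples:
        print(repr(sample), is_valid_comment(sample))
```

With this approach accented text and tabs are accepted, while embedded
newlines and other control characters are rejected, without having to
enumerate which high code points count as printable.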

-- 
Eric Blake   eblake(a)redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
