On Thu, Sep 30, 2010 at 07:31:56AM -0400, Stefan Berger wrote:
On 09/28/2010 04:06 PM, Stefan Berger wrote:
[...]
>> you have a printable character. Which goes back to my suggestion of an
>> inverse charset - rejecting bytes that are known to be non-printable
>> ASCII, and letting everything else whether or not it is a printable
>> byte sequence in the current locale. So what about this idea: exclude
>> control characters except for tab, and let space and everything after
>> through (I don't know if it needs to be adjusted to also reject &#x7f;):
>>
>> [^&#x1;-&#x8;&#xa;-&#x1f;]{0,256}
>
>Fine by me. We may just give the impression of accepting unicode
>while the code does not handle it.
... except that xmllint does not like &#x1; with or without
preceding ^ (among other things):
xmllint --relaxng ./docs/schemas/nwfilter.rng tests/nwfilterxml2xmlout/comment-test.xml
./docs/schemas/nwfilter.rng:862: parser error : xmlParseCharRef: invalid xmlChar value 1
<param name="pattern">[^&#x1;-&#x8;&#xa;-&#x1f;]{0,256}</param>
^
The set of characters allowed in XML documents is defined by the
XML specification:
http://www.w3.org/TR/REC-xml/#NT-Char
It's a subset of Unicode; it refuses 0 and most ASCII control chars.
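To make that rule concrete, here is a small Python sketch of the Char
production from the XML 1.0 specification. It is only an illustration, not
libvirt or libxml2 code: the helper names is_xml_char and parses are made up
here, and Python's bundled ElementTree parser stands in for xmllint on the
assumption that it rejects the same invalid character references.

```python
import xml.etree.ElementTree as ET

# Char production from XML 1.0:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = (
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(codepoint):
    """Return True if the code point may appear in an XML 1.0 document."""
    return any(lo <= codepoint <= hi for lo, hi in XML_CHAR_RANGES)


def parses(document):
    """Return True if the string is accepted as well-formed XML.

    Hypothetical helper: uses Python's ElementTree as a stand-in parser.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    for cp in (0x0, 0x1, 0x9, 0xA, 0x1F, 0x20, 0x7F, 0xE9):
        verdict = "allowed" if is_xml_char(cp) else "refused"
        print("U+%04X %s" % (cp, verdict))

    # A character reference to a refused code point makes the schema
    # itself ill-formed, whatever the regular expression around it means.
    print(parses('<param name="pattern">[^&#x1;-&#x8;]</param>'))
    print(parses('<param name="pattern">[^&#x9;]</param>'))
```

Note that DEL (0x7F) is inside the allowed range, so XML itself does not
refuse it even though it is a control character.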
If the comments are really just comments embedded in XML, I don't see why
you should try to limit their content beyond what XML imposes.
On the other hand, if the comments are exposed somewhere else, we need
to look at the conversion issues. Limiting to ASCII is safe, but it
can be really inconvenient in practice.
Do you really need to put any extra restriction?
Daniel
--
Daniel Veillard       | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel(a)veillard.com | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/  | virtualization library  http://libvirt.org/