Re: [RFC 00/29] RFC: Generate object-model code based on relax-ng files

21 Apr 2020

      On Wed, Mar 25, 2020 at 03:11:40PM +0800, Shi Lei wrote:
...
Outline
=========
In the libvirt world, objects (like domain, network, etc.) are described with
two representations: structures in C and XML specified by RELAX NG.
Since the C implementation and the XML constrained by RELAX NG are merely
different expressions of the same object model, we can make a tool to
generate the C code from the RELAX NG files.
...
Problems
=========
If we replace hardcoded code with generated code directly, there are still
several problems.
Problem1 - naming convention
  Generated code follows a specific naming convention, and there are some
  differences between the generated code and current libvirt code.
  For example, a generated struct's name will be like virXXXDef, a parsefunc's
  name will be like virXXXDefParseXML; the 'XXX' is a kind of Camel-Case name
  which implies the struct's place in the hierarchy.
  But in current code, there are many variations of name.
  So if we replace current code with generated code directly, it requires
  a lot of modifications in the calling code.
Regardless of whether or how we auto-generate code, cleaning up current
naming conventions to be more consistent is a benefit. So as a general
rule, instead of having a code generator which knows how to generate
every weird name, we should just fix the weird names to follow a standard
convention.
...
Problem2 - error-checking code in the parsefunc
  Most of the current parse functions have error-checking code after parsing
  the XML. This code validates existence and parsing results.
  RNG2C currently can't generate error-checking code in the parsefunc, so we
  should find a way to handle it.
In the virDomainDef XML code, we've been slowly splitting out the code
for doing error validation into separate methods.

These two bugs illustrate the core ideas:

  https://gitlab.com/libvirt/libvirt/-/issues/7
  https://gitlab.com/libvirt/libvirt/-/issues/8

The same concept would apply to other XML parsers such as the network
parser, but it hasn't been a high priority for non-domain XML schemas.
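
To make the split concrete, here is a minimal sketch of the shape I mean:
the parser only copies values out of the XML, and a separate method applies
the semantic rules afterwards. Everything in it (exampleSrvDef,
exampleSrvDefValidate, and the specific rules checked) is hypothetical and
for illustration only, it is not the actual libvirt code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a parsed <srv> element */
typedef struct {
    const char *service;   /* NULL if the attribute was absent */
    const char *protocol;  /* NULL if the attribute was absent */
    const char *target;    /* NULL if the attribute was absent */
    unsigned int port;     /* 0 if the attribute was absent */
} exampleSrvDef;

/*
 * Post-parse validation, kept separate from the parser. The parser
 * (hand-written or generated) only fills in exampleSrvDef; every
 * semantic rule lives here. The rules below are illustrative only.
 *
 * Returns 0 if valid, -1 otherwise with *err set to a static message.
 */
int
exampleSrvDefValidate(const exampleSrvDef *def, const char **err)
{
    assert(def && err);
    *err = NULL;

    if (!def->service) {
        *err = "missing required 'service' attribute";
        return -1;
    }
    if (!def->protocol) {
        *err = "missing required 'protocol' attribute";
        return -1;
    }
    if (def->port > 65535) {
        *err = "'port' is out of range";
        return -1;
    }
    if (def->port != 0 && !def->target) {
        *err = "'port' requires 'target' to be set";
        return -1;
    }
    return 0;
}
```

The point of this layout is that a generated parser never needs to know
about such rules; it just has to call the validation hook once the struct
is populated.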
...
Problem3 - hierarchy consistency
  RNG2C generates a 'struct' (C code) for each 'element' (RELAX NG). Most
  of the current code follows this convention, but there are still some
  exceptions. For example, the element 'bridge' (in network.rng) should be
  translated into a struct called 'virNetworkBridgeDef'; but in fact there is
  no such struct in the current code, and its members (stp/delay/...) are
  exposed directly in the struct 'virNetworkDef'.
Yep, this is the really hard problem I think. The idea of generating
the structs from the RNG has a fundamental assumption that the way
we express & modularize the RNG matches the way we want to express
and modularize the C structs.

To "fix" this, either we have to change the C structs into something
that may be worse for the code, or we have to change the RNG schema
into something that may be worse for the schema validation.

There's a case in the domain XML which illustrates how painful this
can be:

  <define name="os">
    <choice>
      <ref name="osxen"/>
      <ref name="oshvm"/>
      <ref name="osexe"/>
    </choice>
  </define>

Here, osxen/oshvm share many attributes, and thus at the C level
we want them all in the same struct, but at the RNG level we wanted
them separate to more strictly validate the XML documents.

...cut all the technical details about the rng2c DIRECTIVE lang....

What I'm really interested in above all is what the end result looks like
for both maintainers, and casual contributors. As a general goal we want
to simplify libvirt to make it easier to contribute to.

Eliminating the need to write custom XML parsing code certainly fits with
that goal at a conceptual level.

The code generator itself though needs some input so that it knows what to
generate, and we have to be mindful of what this input data looks like and
how easy it will be for maintainers/contributors to understand.

In the last 6 months, we've had a general aim to reduce the number of
languages we expose contributors to. We've eliminated a lot of Perl usage.
We've started to replace HTML with RST, and also want to replace our XML/XSL
based website generator with something simpler and non-XML based.

There are some places where we can't avoid another language, such as XDR for
the RPC protocol. We'll never eliminate XML from libvirt entirely, since it is
a fundamental part of our public API, but I am interested in how we can
minimize the visibility of XML to our contributors.

Being able to auto-generate XML parsers is thus interesting, but if the
source input for the auto-generator is another XML document, that is kind
of failing. This is where I'm finding the rng2c tool quite unappealing.

The libvirt code is primarily C based, with a little bit of magic in places
where we auto-generate code. For example the RPC code generator has magic
comments in the XDR protocol definition, that are used to generate client
and server dispatch code. I think of the XDR language as "pseudo-C" - it
is conceptually close enough to C that C programmers can understand the
XDR definitions fairly easily.

The RNG schemas are a very different beast. Since they're XML based they
are obviously completely unlike any of the C code, and I find that people
have quite a strong dislike of using XML in general. With this in mind
I'm not enthusiastic about the idea of auto-generating the XML parsers
from the RNG schemas, as it means everyone needs to know more about the
RNG schema language, and also learn the custom DIRECTIVE language
used by the rng2c tool.

If we consider this series, we've deleted 530 lines from network_conf.c
and added 200 new lines for post-parse validation logic. This is a good
overall improvement. No matter what approach or tool we use for XML
parser auto-generation we'll get a similar result.

Now consider the network_conf.h file, where we have deleted

    typedef struct _virNetworkDNSTxtDef virNetworkDNSTxtDef;
    typedef virNetworkDNSTxtDef *virNetworkDNSTxtDefPtr;
    struct _virNetworkDNSTxtDef {
        char *name;
        char *value;
    };

    typedef struct _virNetworkDNSSrvDef virNetworkDNSSrvDef;
    typedef virNetworkDNSSrvDef *virNetworkDNSSrvDefPtr;
    struct _virNetworkDNSSrvDef {
        char *domain;
        char *service;
        char *protocol;
        char *target;
        unsigned int port;
        unsigned int priority;
        unsigned int weight;
    };

    typedef struct _virNetworkDNSHostDef virNetworkDNSHostDef;
    typedef virNetworkDNSHostDef *virNetworkDNSHostDefPtr;
    struct _virNetworkDNSHostDef {
        virSocketAddr ip;
        size_t nnames;
        char **names;
    };

    typedef struct _virNetworkDNSForwarder virNetworkDNSForwarder;
    typedef virNetworkDNSForwarder *virNetworkDNSForwarderPtr;
    struct _virNetworkDNSForwarder {
        virSocketAddr addr;
        char *domain;
    };

    typedef struct _virNetworkDNSDef virNetworkDNSDef;
    typedef virNetworkDNSDef *virNetworkDNSDefPtr;
    struct _virNetworkDNSDef {
        int enable;            /* enum virTristateBool */
        int forwardPlainNames; /* enum virTristateBool */
        size_t ntxts;
        virNetworkDNSTxtDefPtr txts;
        size_t nhosts;
        virNetworkDNSHostDefPtr hosts;
        size_t nsrvs;
        virNetworkDNSSrvDefPtr srvs;
        size_t nfwds;
        virNetworkDNSForwarderPtr forwarders;
    };

and instead of this, we now have these extra rules in the
RNG schemas:

      <!-- VIRT:DIRECTIVE {
        "structure": {"output": "src/conf/network_conf"},
        "clearfunc": {"output": "src/conf/network_conf"},
        "parsefunc": {
          "output": "src/conf/network_conf",
          "post": true,
          "args.instname": true
        },
        "formatfunc": {"output": "src/conf/network_conf"}
      } -->

      <!-- VIRT:DIRECTIVE {
        "name": "virNetworkDNSForwarder",
        "structure": {"output": "src/conf/network_conf"},
        "clearfunc": {"output": "src/conf/network_conf"},
        "parsefunc": {
          "output": "src/conf/network_conf",
          "post": true,
          "args.noctxt": true,
          "args.instname": true
        },
        "formatfunc": {
          "output": "src/conf/network_conf",
          "order": ["domain", "addr"]
        }
      } -->

      <!-- VIRT:DIRECTIVE {
        "structure": {"output": "src/conf/network_conf"},
        "clearfunc": {"output": "src/conf/network_conf"},
        "parsefunc": {
          "output": "src/conf/network_conf",
          "post": true,
          "args.noctxt": true,
          "args.instname": true,
          "args": [
            {"name": "partialOkay", "type": "Bool"}
          ]
        },
        "formatfunc": {"output": "src/conf/network_conf"},
        "members": [
          {"id": "value", "opt": true}
        ]
      } -->

      <!-- VIRT:DIRECTIVE {
        "structure": {"output": "src/conf/network_conf"},
        "clearfunc": {"output": "src/conf/network_conf"},
        "parsefunc": {
          "output": "src/conf/network_conf",
          "post": true,
          "args.instname": true,
          "args": [
            {"name": "partialOkay", "type": "Bool"}
          ]
        },
        "formatfunc": {"output": "src/conf/network_conf"},
        "members": [
          {"id": "service", "opt": true},
          {"id": "protocol", "opt": true}
        ]
      } -->

      <!-- VIRT:DIRECTIVE {
        "structure": {"output": "src/conf/network_conf"},
        "clearfunc": {"output": "src/conf/network_conf"},
        "parsefunc": {
          "output": "src/conf/network_conf",
          "args.instname": true,
          "post": true,
          "args": [
            {"name": "partialOkay", "type": "Bool"}
          ]
        },
        "formatfunc": {"output": "src/conf/network_conf"},
        "members": [
          {"id": "ip", "opt": true},
          {"id": "hostname", "name": "name", "opt": true}
        ]
      } -->

If I come to libvirt as a contributor with C language skills, but little
experience of RNG, I think this is a significant step backwards in the
ability to understand libvirt code. It is way easier to understand
what's going on from the C structs, than from the RNG schema and the
VIRT:DIRECTIVE IMHO. Even as a maintainer, and having read the cover
letter here, I find the VIRT:DIRECTIVE metadata to be way too verbose
and hard to understand compared to the structs.

So I think that although the RNG based code generator eliminates a lot
of C code, it has the cost of forcing people to know more about the
RNG code. Overall I don't think that's a clear win.

This doesn't mean auto-generating code is a bad idea. I think it just
means that the RNG schema is not the right place to drive auto-generation
from.

I'm wondering if you've ever done any programming with Golang and used
its XML parsing capabilities?

Golang has a very clever approach to XML/JSON/YAML parsing which is
based on directives recorded against the native Go structs. In the
libvirt-go-xml.git repository, we've used this to map all the libvirt
XML schemas into Go structs. IME, this has been the most pleasant
way I've come across for parsing XML.

Consider just the DNS structs that you used to illustrate rng2c in this
patch series. Adding support for Go XML parsing and formatting required
the following annotations (struct tags) against the Golang structs:

    type NetworkDNSTXT struct {
            XMLName xml.Name `xml:"txt"`
            Name    string   `xml:"name,attr"`
            Value   string   `xml:"value,attr"`
    }
    type NetworkDNSSRV struct {
            XMLName  xml.Name `xml:"srv"`
            Service  string   `xml:"service,attr,omitempty"`
            Protocol string   `xml:"protocol,attr,omitempty"`
            Target   string   `xml:"target,attr,omitempty"`
            Port     uint     `xml:"port,attr,omitempty"`
            Priority uint     `xml:"priority,attr,omitempty"`
            Weight   uint     `xml:"weight,attr,omitempty"`
            Domain   string   `xml:"domain,attr,omitempty"`
    }
    type NetworkDNSHostHostname struct {
            Hostname string `xml:",chardata"`
    }

    type NetworkDNSHost struct {
            XMLName   xml.Name                 `xml:"host"`
            IP        string                   `xml:"ip,attr"`
            Hostnames []NetworkDNSHostHostname `xml:"hostname"`
    }
    type NetworkDNSForwarder struct {
            Domain string `xml:"domain,attr,omitempty"`
            Addr   string `xml:"addr,attr,omitempty"`
    }
    type NetworkDNS struct {
            Enable            string                `xml:"enable,attr,omitempty"`
            ForwardPlainNames string                `xml:"forwardPlainNames,attr,omitempty"`
            Forwarders        []NetworkDNSForwarder `xml:"forwarder"`
            TXTs              []NetworkDNSTXT       `xml:"txt"`
            Host              []NetworkDNSHost      `xml:"host"`
            SRVs              []NetworkDNSSRV       `xml:"srv"`
    }

The key thing here is that the programmer is still fundamentally working
with their normal Golang struct types / fields. All that is required to
handle XML parsing & formatting is to add some magic annotations (struct
tags) which serve as directives to the XML parser, telling it
attribute/element names, whether things are optional or not, etc. The
attribute/element names are only needed if the struct field name is
different from the XML name.

My gut feeling is that if we want to go ahead with auto-generating C code
for XML parsing/formatting, we should ignore the RNG schemas, and instead
try to do something similar to the Golang approach to XML.

The hard thing is that this would require us to write something which can
parse the C header files. Generally in our XML parser header files we don't
try to do anything too fancy - it is quite boring C code, so we would not
have to parse the full C language. We can cope with a fairly simplified
parser, that assumes the C header is following certain conventions.

So we would have to add some magic comment directives to each struct
we have:

    typedef struct _virNetworkDNSTxtDef virNetworkDNSTxtDef;
    typedef virNetworkDNSTxtDef *virNetworkDNSTxtDefPtr;
    struct _virNetworkDNSTxtDef {
        char *name; /* xmlattr */
        char *value; /* xmlattr */
    };

    typedef struct _virNetworkDNSSrvDef virNetworkDNSSrvDef;
    typedef virNetworkDNSSrvDef *virNetworkDNSSrvDefPtr;
    struct _virNetworkDNSSrvDef {
        char *domain; /* xmlattr,omitempty */
        char *service; /* xmlattr,omitempty */
        char *protocol; /* xmlattr,omitempty */
        char *target; /* xmlattr,omitempty */
        unsigned int port; /* xmlattr,omitempty */
        unsigned int priority; /* xmlattr,omitempty */
        unsigned int weight; /* xmlattr,omitempty */
    };

    typedef struct _virNetworkDNSHostDef virNetworkDNSHostDef;
    typedef virNetworkDNSHostDef *virNetworkDNSHostDefPtr;
    struct _virNetworkDNSHostDef {
        virSocketAddr ip; /* xmlcallback */
        size_t nnames;
        char **names; /* xmlchardata:hostname,array */
    };

    typedef struct _virNetworkDNSForwarder virNetworkDNSForwarder;
    typedef virNetworkDNSForwarder *virNetworkDNSForwarderPtr;
    struct _virNetworkDNSForwarder {
        virSocketAddr addr; /* xmlcallback */
        char *domain; /* xmlattr */
    };

    typedef struct _virNetworkDNSDef virNetworkDNSDef;
    typedef virNetworkDNSDef *virNetworkDNSDefPtr;
    struct _virNetworkDNSDef {
        int enable;            /* xmlattr */
        int forwardPlainNames; /* xmlattr */
        size_t ntxts;
        virNetworkDNSTxtDefPtr txts; /* xmlelement:txt,array */
        size_t nhosts;
        virNetworkDNSHostDefPtr hosts; /* xmlelement:host,array */
        size_t nsrvs;
        virNetworkDNSSrvDefPtr srvs; /* xmlelement:srv,array */
        size_t nfwds;
        virNetworkDNSForwarderPtr forwarders; /* xmlelement:forwarder,array */
    };

Some explanation:

  - xmlattr

    Parse the field as an XML attribute with the same name
    as the struct field

  - xmlattr:thename

    Parse the field as an XML attribute called "thename"

  - xmlelement

    Parse the field as a child struct, populating from the
    XML element with same name.

  - xmlelement:thename

    Parse the field as a child struct, populating from the
    XML element called 'thename'

  - xmlcallback

    Call out to custom written XML handler methods to handle
    converting the data to/from the custom data type. eg for
    a field virSocketAddr, we'd call virSocketAddrXMLParse
    and virSocketAddrXMLFormat.

  - ,omitempty

    Don't format the attribute/element/chardata if the struct field
    is a NULL pointer, or an integer with value 0.
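
To give a feel for how little is needed to interpret these directives, here
is a minimal sketch of parsing the text of one magic comment into a struct
that a generator could act on. The type and function names (xmlDirective,
xmlDirectiveParse) are hypothetical. It also accepts the "xmlchardata" kind
and the ",array" option, which appear in the struct examples above but are
not in the list of explanations, so their exact meaning here is an
assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical in-memory form of one magic comment */
typedef enum {
    XML_DIRECTIVE_ATTR,
    XML_DIRECTIVE_ELEMENT,
    XML_DIRECTIVE_CHARDATA,
    XML_DIRECTIVE_CALLBACK,
} xmlDirectiveKind;

typedef struct {
    xmlDirectiveKind kind;
    char name[64];    /* "" means: reuse the struct field name */
    bool omitempty;
    bool array;
} xmlDirective;

/*
 * Parse directive text such as "xmlelement:txt,array". The text is
 * assumed to be already stripped of comment markers and whitespace.
 * Returns 0 on success, -1 on unrecognised or oversized input.
 */
int
xmlDirectiveParse(const char *text, xmlDirective *out)
{
    char buf[128];
    char *opts;
    char *name;

    assert(text && out);
    memset(out, 0, sizeof(*out));
    if (strlen(text) >= sizeof(buf))
        return -1;
    strcpy(buf, text);

    /* Split "kind[:name]" from the comma separated options */
    if ((opts = strchr(buf, ',')))
        *opts++ = '\0';

    /* Split the optional ":name" override from the kind */
    if ((name = strchr(buf, ':'))) {
        *name++ = '\0';
        if (*name == '\0' || strlen(name) >= sizeof(out->name))
            return -1;
        strcpy(out->name, name);
    }

    if (strcmp(buf, "xmlattr") == 0)
        out->kind = XML_DIRECTIVE_ATTR;
    else if (strcmp(buf, "xmlelement") == 0)
        out->kind = XML_DIRECTIVE_ELEMENT;
    else if (strcmp(buf, "xmlchardata") == 0)
        out->kind = XML_DIRECTIVE_CHARDATA;
    else if (strcmp(buf, "xmlcallback") == 0)
        out->kind = XML_DIRECTIVE_CALLBACK;
    else
        return -1;

    /* Walk the remaining options one at a time */
    while (opts && *opts) {
        char *next = strchr(opts, ',');
        if (next)
            *next++ = '\0';
        if (strcmp(opts, "omitempty") == 0)
            out->omitempty = true;
        else if (strcmp(opts, "array") == 0)
            out->array = true;
        else
            return -1;
        opts = next;
    }
    return 0;
}
```

With this, "xmlelement:txt,array" yields kind XML_DIRECTIVE_ELEMENT, name
"txt" and the array flag set, while "xmlattr,omitempty" leaves the name
empty so a generator would fall back to the struct field name.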

BTW, when looking at libvirt-go-xml.git, you'll see there are still a bunch
of places where we have to hand-write parsing/formatting code. There are
essentially two reasons for this:

 - Golang has no support for unions. As a result you have to fake
   unions by having a struct, with a bunch of optional fields each
   corresponding to a new struct, and rely on the convention that only
   one field can be non-nil. This requires custom parse functions since
   the Golang XML parser has no support for this code pattern.

 - The XML parser has no built-in directive to control whether it
   parses/formats in decimal vs hex.

Both of these are things we won't suffer from if we did this in C, so
potentially the XML parsing annotations needed for our C struct would
result in something even simpler than the libvirt-go-xml.git code.

The key question is just how difficult it will be to write a tool that
can parse the C header files, and magic comments, to output suitable
XML parser/formatter functions? There's no easy way to answer that
without someone trying it.
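
As a rough indication of the scale of the problem, here is a minimal sketch
of the sort of simplified parsing I have in mind: it splits a single struct
field line into its type, its name and the text of any trailing comment.
The names (fieldDecl, fieldDeclParse) are hypothetical, it has not been run
against the real headers, and it assumes the conventions we already follow
(one field per line, comment on the same line, no arrays or bitfields):

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical result of parsing one struct field line */
typedef struct {
    char type[64];       /* eg "char *" or "unsigned int" */
    char name[64];       /* eg "value" */
    char directive[64];  /* text of the trailing comment, "" if none */
} fieldDecl;

/* Copy [start, end) into dst, trimming surrounding whitespace */
static int
copyTrimmed(char *dst, size_t dstlen, const char *start, const char *end)
{
    while (start < end && isspace((unsigned char)*start))
        start++;
    while (end > start && isspace((unsigned char)end[-1]))
        end--;
    if ((size_t)(end - start) >= dstlen)
        return -1;
    memcpy(dst, start, (size_t)(end - start));
    dst[end - start] = '\0';
    return 0;
}

/*
 * Parse one line of the form "TYPE NAME;" optionally followed by a
 * trailing comment. The caller is assumed to have already worked out
 * that the line sits inside a struct body.
 * Returns 0 on success, -1 if the line is not a field declaration.
 */
int
fieldDeclParse(const char *line, fieldDecl *out)
{
    const char *semi;
    const char *nameStart;
    const char *open;
    const char *close;

    assert(line && out);
    memset(out, 0, sizeof(*out));
    if (!(semi = strchr(line, ';')))
        return -1;

    /* The field name is the run of identifier chars just before ';' */
    nameStart = semi;
    while (nameStart > line &&
           (isalnum((unsigned char)nameStart[-1]) || nameStart[-1] == '_'))
        nameStart--;
    if (nameStart == semi)
        return -1;

    /* Everything before the name is the type, eg "char *" */
    if (copyTrimmed(out->type, sizeof(out->type), line, nameStart) < 0 ||
        copyTrimmed(out->name, sizeof(out->name), nameStart, semi) < 0 ||
        out->type[0] == '\0')
        return -1;

    /* Optional trailing comment carrying the directive text */
    if ((open = strstr(semi, "/*")) && (close = strstr(open + 2, "*/"))) {
        if (copyTrimmed(out->directive, sizeof(out->directive),
                        open + 2, close) < 0)
            return -1;
    }
    return 0;
}
```

Fed the line "char *value; /* xmlattr */" this gives type "char *", name
"value" and directive "xmlattr", while a line such as "};" is rejected. A
real tool would also have to track struct boundaries and typedefs, and to
skip ordinary comments that are not directives, which is where the actual
effort would go.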

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Daniel P. Berrangé