[Libvir] RFC [0/3]: Re-factor QEMU daemon protocol to use XDR

I've been thinking about a way to move forward on the remote management capabilities in libvirt since our last list discussion didn't come to any clear agreement. The current situation - QEMUD daemon uses a hand-crafted protocol basically just dumping structs on the wire. Key points: - Fairly simple to understand code for marshalling/demarshalling - Very low overhead on wire - Not safe endianess - Not safe wrt to architecture data alignment rules - No authentication/encryption - Server driver specific to QEMU - The generic libvirt daemon uses SunRPC as the underlying protocol. - Protocol versioning - Safe wrt to endianess - Safe wrt to architecture data alignment rules - Added TLS encryption / SSH tunneling - Generic for any libvirt backend - Doesn't work with non-blocking sockets due to SunRPC limitations - SunRPC api is not well known / 'wierd' - Call + reply only; no async signals from server Looking at the current impl of the generic libvirt daemon there are some things which strike me: 1. We have an extra layer of marshalling/demarshalling calls. ie the methods in the libvirt driver, call to the SunRPC marshalling methods, which populate an XDR generated struct, which call the XDR routines to write the data. 2. The main loop doesn't allow us to listen for other non-SunRPC file handles (eg the QEMU/dnsmasq processes). 3. We've basically re-written the the SunRPC svc*_create methods to get support for TLS / SSH We are going to need to address 2 by creating our own mainloop impl. For point 1 we can have the libvirt driver populate the XDR struct directly and eliminate the SunRPC generated marshalling stubs completely. If we did that, and given points 2 & 3, we're basically left using very little of the SunRPC capabilities at all. Pretty much only thing it is doing for us at that point is the protocol versioning, and giving us a horrific API which doesn't support non-blocking IO so forces us to be multi-threads or multi-process. So, I'm thinking why don't we just use XDR directly & be done with it. As an experiment I took the existing QEMU daemon & driver and re-factored the hand written binary protocol to use XDR instead. Looking at the diffstat - there was basically no appreciable difference in codesize after the re-factoring, and we get the added benefit of being endianness & alignment safe. qemud/Makefile.am | 15 + qemud/conf.c | 1 qemud/dispatch.c | 538 ++++++++++++++++------------------------------ qemud/dispatch.h | 2 qemud/internal.h | 21 + qemud/protocol.h | 329 ---------------------------- qemud/protocol.x | 608 ++++++++++++++++++++++++++++++++++++++++++++++++++++ qemud/qemud.c | 189 +++++++++++----- qemud/uuid.c | 1 src/Makefile.am | 11 src/internal.h | 2 src/qemu_internal.c | 593 ++++++++++++++++++++++++++------------------------ 12 files changed, 1288 insertions(+), 1022 deletions(-) The particular changes I'll explain in the next 3 mails where I attach the patches. Meanwhile though, I'm thinking we could take approach of incremental code refactoring, pulling in all the various capabilities from the remote patch (except for the SunRPC demarshalling itself) to the QEMU daemon. At each stage having a working daemon & client but with ever growing capabilities, until it provides full remote management. 1. Add an extra initial message type to do a major/minor version number exchange. Based on the negotiated versions, the server can activate or deactivate particular requests it allows from the client. 2. Add in the URI parsing routines to the qemud_interal.c driver, but only allow use of QEMU initially. 3. Add in the TLS + SSH client & server socket setup code to give an authenticated & secured remote channel 4. Re-factor qemud/dispatch.c so that it calls into the regular libvirt API for the Xen case, but keeps calling qemud/driver.c for the QEMU case. BTW, these patches are not intended to be applied to CVS - they're for discussion purposes at this stage. That said, they are fully functional so you can apply them to your checkout & you should be able run all the existing virsh commands against the libvirt_qemud. Just make sure you patch & rebuild both the client & server ;-) Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

First off we have the XDR definition file. This is basically a translation of the original qemud/protocol.h into the XDR format. The important changes are that - Instead of a single enum of message types there are now two enums - enum qemud_packet_client_data_type lists valid packets (ie method calls) that a client can make - enum qemud_packet_server_data_type lists valie packets (ie method replies) that a server can make There are two reasons for this split - This lets us add in more server packets which are not neccessaril method replies. ie async notifications ('domain foo start') - We use the enum to discriminate two unions later and there is actually already one type that's used by the server only QEMU_SERVER_PACKET_FAILURE to indicate a method which failed. - The big union qemud_packet_data is now split up - The inline per-message type structs are now all represented at the top level as fully named structs. - There is a qemud_packet_client_data union which is the union of all structs relevant to method calls - There is a qemud_packet_server_data union which is the union of all structs relevant to method replies - Introduce the idea of serial numbers. A serial number allows a client to track the incoming packet from the server which corresponds to a message it just sent. This is because in the future a server may be sending async notifications mixed inbetween method replies. So we have - A qemud_packet_client struct which contains a serial number & a qemud_packet_client_data struct. - A qemud_packet_server struct which contains a serial number, a replyto serial number & a qemud_packet_server_data struct. - Finally there is a qemud_packet_header which contains a magic 4 byte sequence and a length field. This is basically to make pulling the variable length XDR payloads off the wire easier to deal with. The header is always followed by a qemud_packet_client or qemud_packet_server XDR payload. The header length field gives the size of the payload. The magic byte sequence is just a sanity check, which is also useful in debugging. $ diffstat libvirt-qemu-xdr-protocol.patch Makefile.am | 15 + protocol.h | 329 -------------------------------- protocol.x | 608 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 620 insertions(+), 332 deletions(-) Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

This patch updates the client end to use XDR for making requests to the QEMU daemon. The bulk of the patch is basically a simple string replacement to deal with slightly different struct/union nesting & names. The interesting bit of the code is that which actually converts from the qemud_packet_client / qemud_packet_server to the XDR encoded data on the wire. This is, IMHO, much clearer to understand now that its using the XDR routines, although its hard to see this from the diff. Better to look at the two complete impls side-by-side to see the difference. $ diffstat libvirt-qemu-xdr-client.patch Makefile.am | 11 - internal.h | 2 qemu_internal.c | 593 +++++++++++++++++++++++++++++--------------------------- 3 files changed, 327 insertions(+), 279 deletions(-) Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

This patch updates the server end to use XDR for processing requests from the QEMU client. The bulk of the patch is basically a simple string replacement to deal with slightly different struct/union nesting & names. The interesting bit of the code is the changes to the qemud/qemud.c file which is where the XDR encoding/ decoding takes places. As with the client equivalent, I feel this code has become much clearer to understand with the use of XDR. Particularly since it also gives us the endianness & alignment safety we never had before. I'd actually consider going one stage further in this refactoring. In the handcrafted protocol all the strings on the wire had a fixed size as we were basically throwing structs straight onto the wire. With XDR however, it is perfectly possible to have regular variable length strings (ie char*) in the structs and thus the payload on the wire is only as large as actually needed, and not artificially limited in size. $ diffstat libvirt-qemu-xdr-server.patch conf.c | 1 dispatch.c | 538 +++++++++++++++++++++---------------------------------------- dispatch.h | 2 internal.h | 21 +- qemud.c | 189 ++++++++++++++------- uuid.c | 1 6 files changed, 341 insertions(+), 411 deletions(-) Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

As you say, no appreciable difference in code side before and after the change. It's interesting to note what SunRPC gives you over XDR (which you have also noted in your emails): - serial numbers to correlate client requests and server responses (in SunRPC these are called 'XID's) - program numbers and versioning - procedure numbers (same as your qemud_packet_{client,server} _data_type) In fact the header of the SunRPC request and reply is just an XDR-encoded data structure looking like this: struct rpc_msg { u_long rm_xid; // XID // 0 = CALL, 1 = REPLY enum msg_type rm_direction; union { struct call_body RM_cmb; struct reply_body RM_rmb; } ru; }; struct call_body { u_long cb_rpcvers; // always 2 u_long cb_prog; // program number u_long cb_vers; // program version u_long cb_proc; // procedure number struct opaque_auth cb_cred; // credentials - ignore struct opaque_auth cb_verf; // verifier - ignore }; struct reply_body is a bit more complex - see <rpc/rpc_msg.h>. Actually there are functions provided to generate the call and reply headers (xdr_callhdr and xdr_replymsg) so we could use those directly to create SunRPC-compatible headers. Advantages: Wire sniffing tools which grok SunRPC will still work (Wireshark does NFS, so presumably it does general Sun/ONC-RPC; Systemtap apparently supports it too). There are various other efforts to run SunRPC over TLS (including an OCaml one!) which we may still be able to interoperate with. In addition we have provided encryption and authentication. Your patch punts on that - I understand why. About strings: char foo[MAX] (as you pointed out) could be replaced by variable-length strings. There are at least three catches: (1) Who owns/frees the strings? - I was never able to work that out. (2) NULL is not supported directly, you have to use the option-looks-like- a-pointer type (string *). (3) If the client and server don't trust each other then you need to check the length of incoming strings carefully to make sure that they aren't huge and potentially could cause a DoS.
1. Add an extra initial message type to do a major/minor version number exchange. Based on the negotiated versions, the server can activate or deactivate particular requests it allows from the client.
2. Add in the URI parsing routines to the qemud_interal.c driver, but only allow use of QEMU initially.
3. Add in the TLS + SSH client & server socket setup code to give an authenticated & secured remote channel
4. Re-factor qemud/dispatch.c so that it calls into the regular libvirt API for the Xen case, but keeps calling qemud/driver.c for the QEMU case.
You'd definitely want to call it something other than 'qemu' ... Rich. -- Emerging Technologies, Red Hat http://et.redhat.com/~rjones/ 64 Baker Street, London, W1U 7DF Mobile: +44 7866 314 421 "[Negative numbers] darken the very whole doctrines of the equations and make dark of the things which are in their nature excessively obvious and simple" (Francis Maseres FRS, mathematician, 1759)

Richard W.M. Jones wrote:
As you say, no appreciable difference in code side before and after the change.
s/side/size/ ... Rich. -- Emerging Technologies, Red Hat http://et.redhat.com/~rjones/ 64 Baker Street, London, W1U 7DF Mobile: +44 7866 314 421 "[Negative numbers] darken the very whole doctrines of the equations and make dark of the things which are in their nature excessively obvious and simple" (Francis Maseres FRS, mathematician, 1759)

On Fri, Mar 23, 2007 at 10:02:33AM +0000, Richard W.M. Jones wrote:
As you say, no appreciable difference in code side before and after the change.
It's interesting to note what SunRPC gives you over XDR (which you have also noted in your emails):
- serial numbers to correlate client requests and server responses (in SunRPC these are called 'XID's)
Unfortunately they crippled their use by only providing a single 'rm_xid' field. So the only way to do tracking is to mandate that the rm_xid of the method reply matches the rm_xid of the method call. This is why it is impossible to do asynchronous notifications from the server - ie let the server send you a message without any corresponding message having been received from the client. To do this correctly requires 2 independantly incremented sequences, one owned by the server & the other by the client. For method replies you need to have a completely separate field to indicate the serial number of the message you are replying to. I don't see any easy way to work around this limitation in SunRPC wire format :-(
- program numbers and versioning
Not very different to replicate though - and in a less onerous way than SunRPC approach. It would seem that for every new version you need to provide a new set of all the marshalling methods defined by your program, even if you're only adding one new one.
- procedure numbers (same as your qemud_packet_{client,server} _data_type)
In fact the header of the SunRPC request and reply is just an XDR-encoded data structure looking like this:
struct rpc_msg { u_long rm_xid; // XID // 0 = CALL, 1 = REPLY enum msg_type rm_direction; union { struct call_body RM_cmb; struct reply_body RM_rmb; } ru; };
struct call_body { u_long cb_rpcvers; // always 2 u_long cb_prog; // program number u_long cb_vers; // program version u_long cb_proc; // procedure number struct opaque_auth cb_cred; // credentials - ignore struct opaque_auth cb_verf; // verifier - ignore };
struct reply_body is a bit more complex - see <rpc/rpc_msg.h>.
Actually there are functions provided to generate the call and reply headers (xdr_callhdr and xdr_replymsg) so we could use those directly to create SunRPC-compatible headers. Advantages: Wire sniffing tools which grok SunRPC will still work (Wireshark does NFS, so presumably it does general Sun/ONC-RPC; Systemtap apparently supports it too). There are various other efforts to run SunRPC over TLS (including an OCaml one!) which we may still be able to interoperate with.
Hmm, interesting point
In addition we have provided encryption and authentication. Your patch punts on that - I understand why.
Yep, that's a side issue for now - the existing code you've written could quite easily be ported over, since pretty much just have to replace the socket connection setup code & the read/write calls. So there's not really all that much to compare between the two impls.
About strings: char foo[MAX] (as you pointed out) could be replaced by variable-length strings. There are at least three catches: (1) Who owns/frees the strings? - I was never able to work that out. (2) NULL is not supported directly, you have to use the option-looks-like- a-pointer type (string *). (3) If the client and server don't trust each other then you need to check the length of incoming strings carefully to make sure that they aren't huge and potentially could cause a DoS.
In the XDR definition you can give a maximum size eg string foo<>; Allows an unlimited size string (MAX_INT32), while string foo<500>; Would ensure that the generated XDR routines will refuse a string larger than 500 bytes. The most compelling reason for string foo<500> instead of char foo[500] is that every array element is encoded on a 4 byte boundary, so your 1 byte chars end up taking 4 bytes each. The string datatype has a 4 byte length header, following by single bytes per character. A minor detail, but worth bearing in mind.
1. Add an extra initial message type to do a major/minor version number exchange. Based on the negotiated versions, the server can activate or deactivate particular requests it allows from the client.
2. Add in the URI parsing routines to the qemud_interal.c driver, but only allow use of QEMU initially.
3. Add in the TLS + SSH client & server socket setup code to give an authenticated & secured remote channel
4. Re-factor qemud/dispatch.c so that it calls into the regular libvirt API for the Xen case, but keeps calling qemud/driver.c for the QEMU case.
You'd definitely want to call it something other than 'qemu' ...
Yes, the binary would obviously be renamed to just 'libvirtd', not that the user would really see the difference, becuase we already called the init script libvirtd to hide the details. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Daniel P. Berrange wrote:
On Fri, Mar 23, 2007 at 10:02:33AM +0000, Richard W.M. Jones wrote:
As you say, no appreciable difference in code side before and after the change.
It's interesting to note what SunRPC gives you over XDR (which you have also noted in your emails):
- serial numbers to correlate client requests and server responses (in SunRPC these are called 'XID's)
Unfortunately they crippled their use by only providing a single 'rm_xid' field. So the only way to do tracking is to mandate that the rm_xid of the method reply matches the rm_xid of the method call. This is why it is impossible to do asynchronous notifications from the server - ie let the server send you a message without any corresponding message having been received from the client.
Well ... I don't think it's "impossible". We may have to invent a magic XID for asynchronous replies however. I rather question why full XIDs are really necessary unless we are proposing that the server can answer RPCs out of order. All we really need to know is: "is this the reply for my previous request, or is it an asynchronous message?" and for that XID != 0 / XID == 0 is fine.
- program numbers and versioning
Not very different to replicate though - and in a less onerous way than SunRPC approach. It would seem that for every new version you need to provide a new set of all the marshalling methods defined by your program, even if you're only adding one new one.
I think you can add new procedures, because if the server doesn't understand a procedure number then it sends back a well-defined response (ar_stat == PROC_UNAVAIL). So this is similar to the state of the current C API where we can add new procedures, but if we ever wanted to change an existing procedure then we'd need a whole new version of the library.
About strings: char foo[MAX] (as you pointed out) could be replaced by variable-length strings. There are at least three catches: (1) Who owns/frees the strings? - I was never able to work that out. (2) NULL is not supported directly, you have to use the option-looks-like- a-pointer type (string *). (3) If the client and server don't trust each other then you need to check the length of incoming strings carefully to make sure that they aren't huge and potentially could cause a DoS.
In the XDR definition you can give a maximum size eg
string foo<>;
Allows an unlimited size string (MAX_INT32), while
string foo<500>;
Would ensure that the generated XDR routines will refuse a string larger than 500 bytes.
'Tis true. I'll need to check that the XDR receiving routines actually enforce the maxima. Rich. -- Emerging Technologies, Red Hat http://et.redhat.com/~rjones/ 64 Baker Street, London, W1U 7DF Mobile: +44 7866 314 421 "[Negative numbers] darken the very whole doctrines of the equations and make dark of the things which are in their nature excessively obvious and simple" (Francis Maseres FRS, mathematician, 1759)

Hi Dan, Only getting around to looking at this now ... This all looks perfectly reasonable to me. I don't see a good reason why you shouldn't just go ahead with this next time we feel like de-stabilising the tree for a while. The only downside is things might break a little for people during upgrades. One comment ... On Thu, 2007-03-22 at 20:46 +0000, Daniel P. Berrange wrote:
1. Add an extra initial message type to do a major/minor version number exchange. Based on the negotiated versions, the server can activate or deactivate particular requests it allows from the client.
I think the approach I'd take would be: - Never remove a request from the server - Allow the server to tell the client the latest version of the protocol it supports - The client, where it can, falls back to an older request if the client doesn't support the appropriate newer request. Otherwise, it returns an error to the caller. The way a newer client has a chance of dealing with an older server and an older client client works with a newer server. Cheers, Mark.

I don't think those patches got memory allocation / deallocation of XDR structures right. Not surprising really since it's totally undocumented! In a bid to rectify this, I have documented how to do it here: http://et.redhat.com/~rjones/xdr_tests/ Rich. -- Emerging Technologies, Red Hat http://et.redhat.com/~rjones/ 64 Baker Street, London, W1U 7DF Mobile: +44 7866 314 421 Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA) and David Owens (Ireland)

On Wed, Apr 11, 2007 at 01:01:30PM +0100, Richard W.M. Jones wrote:
I don't think those patches got memory allocation / deallocation of XDR structures right. Not surprising really since it's totally undocumented! In a bid to rectify this, I have documented how to do it here:
Nice. BTW on the subject of record streams - xdrrec_create - while very nice looking on the surface, it is utterly useless because the impl relies on the underlying FD / socket being in blocking mode. If you use non-blocking sockets marshalling/de-marshalling will fail on the first -EAGAIN the routines see, and they have no way to restart where they left off. This is why I serialized to/from a xdrmem buffer, and then used my own read/write code to send to the socket where I could correctly deal with non-blocking mode. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Daniel P. Berrange wrote:
On Wed, Apr 11, 2007 at 01:01:30PM +0100, Richard W.M. Jones wrote:
I don't think those patches got memory allocation / deallocation of XDR structures right. Not surprising really since it's totally undocumented! In a bid to rectify this, I have documented how to do it here:
Nice. BTW on the subject of record streams - xdrrec_create - while very nice looking on the surface, it is utterly useless because the impl relies on the underlying FD / socket being in blocking mode. If you use non-blocking sockets marshalling/de-marshalling will fail on the first -EAGAIN the routines see, and they have no way to restart where they left off. This is why I serialized to/from a xdrmem buffer, and then used my own read/write code to send to the socket where I could correctly deal with non-blocking mode.
Oh right - that totally passed me by. When are you using non-blocking sockets? I was under the impression they were all blocking in qemu_internal / libvirt_qemud. Rich. -- Emerging Technologies, Red Hat http://et.redhat.com/~rjones/ 64 Baker Street, London, W1U 7DF Mobile: +44 7866 314 421 Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA) and David Owens (Ireland)

On Wed, Apr 11, 2007 at 01:54:52PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
On Wed, Apr 11, 2007 at 01:01:30PM +0100, Richard W.M. Jones wrote:
I don't think those patches got memory allocation / deallocation of XDR structures right. Not surprising really since it's totally undocumented! In a bid to rectify this, I have documented how to do it here:
Nice. BTW on the subject of record streams - xdrrec_create - while very nice looking on the surface, it is utterly useless because the impl relies on the underlying FD / socket being in blocking mode. If you use non-blocking sockets marshalling/de-marshalling will fail on the first -EAGAIN the routines see, and they have no way to restart where they left off. This is why I serialized to/from a xdrmem buffer, and then used my own read/write code to send to the socket where I could correctly deal with non-blocking mode.
Oh right - that totally passed me by. When are you using non-blocking sockets? I was under the impression they were all blocking in qemu_internal / libvirt_qemud.
The client end is always blocking because its only got a single connection to worry about. The server end is completely non-blocking because it has to deal with multiple client connections as well as I/O for the guest VM monitor / stderr / stdout and forking/threading just adds uneccessary complexity. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
participants (3)
-
Daniel P. Berrange
-
Mark McLoughlin
-
Richard W.M. Jones