[libvirt] RFC / Braindump: public APIs needing data streams

The patches for secure migration raise an interesting question wrt the handling of data streams and their effects on the internal driver API and the public API. Although the migration helper APIs are not technically public, they do map onto the remote wire protocol, and as such we have the same long term compatibility issues to worry about.

The way the migration APIs fit together obscures the picture a little, so for the sake of clarity, the remainder of this mail is going to talk about a fictitious public API 'virDomainRestoreStream' which allows a guest domain to be restored from a generic data stream, rather than a named file. If we can solve this API problem, then the design will trivially apply to secure migration.

I'll now outline some possible approaches at the public API level:

1. Pass a file handle to the public API

   Application usage:

       fd = open(filename);
       ret = virDomainRestoreStream(dom, fd);

   Driver internal usage:

       int virDomainRestoreStreamImpl(virDomainPtr dom, int fd) {
           char buf[4096];
           int ret;
           int qemuFD;

           qemuFD = runQEMUWithIncomingFD(dom);
           do {
               ret = read(fd, buf, sizeof buf);
               if (ret > 0)
                   write(qemuFD, buf, ret);
           } while (ret > 0);
       }

   Good: Restore functionality all in one driver method
   Good: Public API is very simple
   Good: Internal driver can poll() on the FD to avoid blocking
   Bad:  Application API is blocked
   Bad:  Data read from the FD might need transformation, e.g. uncompression, or TLS/SASL decryption

2. Provide public APIs for starting restore, feeding data, and completing.
   This matches the proposal in the secure migration patchset.

   Application usage:

       fd = open(filename);
       ret = virDomainRestorePrepare(dom);
       do {
           char buf[4096];
           ret = read(fd, buf, sizeof buf);
           virDomainRestoreData(dom, buf, ret);
       } while (ret > 0);
       virDomainRestoreFinish(dom, ret == 0 ? 0 : 1);

   Driver internal usage:

       int virDomainRestorePrepareImpl(virDomainPtr dom) {
           qemuFD = runQEMUWithIncomingFD(dom);
       }

       int virDomainRestoreDataImpl(virDomainPtr dom, const char *buf, int buflen) {
           qemuFD = ...find previously opened qemuFD...
           write(qemuFD, buf, buflen);
       }

       int virDomainRestoreFinishImpl(virDomainPtr dom, int error) {
           if (error)
               ...kill QEMU...
           else {
               qemuFD = ...find previously opened qemuFD...
               close(qemuFD);
           }
       }

   Good: Application can easily decrypt input data
   Good: Application can use an event loop to feed in data as it becomes available, e.g. poll() on a socket
   Good: Application API never blocks execution for a long time
   Bad:  Driver has to maintain state across calls indefinitely
   Bad:  Cannot guarantee that the same client calls prepare/data, i.e. different clients can get mixed up feeding data
   Bad:  Public API is fairly complex
   Bad:  Lots of public API entry points for each method needing streams

3. Provide a stream object for feeding data to the driver from the application.
   Similar to option 2, but provides easier state management for the driver.
   The driver will set callbacks on the data stream to receive data from the client.

   Application usage:

       virDataStream stream = virDataStreamNew();
       ret = virDomainRestore(dom, stream);
       do {
           char buf[4096];
           ret = read(fd, buf, sizeof buf);
           virDataStreamWrite(stream, buf, ret);
       } while (ret > 0);
       virDataStreamFinish(stream, ret == 0 ? 0 : 1);
   Driver internal usage:

       int virDomainRestoreStreamImpl(virDomainPtr dom, virDataStream stream) {
           qemuFD = runQEMUWithIncomingFD(dom);
           virDataStreamSetCallbacks(stream,
                                     virDomainRestoreDataImpl,
                                     virDomainRestoreFinishImpl,
                                     (void *)qemuFD);
       }

       int virDomainRestoreDataImpl(virDomainPtr dom, const char *buf, int buflen, void *opaque) {
           int qemuFD = (int)opaque;
           write(qemuFD, buf, buflen);
       }

       int virDomainRestoreFinishImpl(virDomainPtr dom, int error, void *opaque) {
           if (error)
               ...kill QEMU...
           else {
               int qemuFD = (int)opaque;
               close(qemuFD);
           }
       }

   Good: Application can easily decrypt input data
   Good: Application can use an event loop to feed in data as it becomes available, e.g. poll() on a socket
   Good: Application API never blocks execution for a long time
   Good: New APIs reusable for any public API with a data stream
   Bad:  Public API is fairly complex
   Bad:  Driver has to maintain state across calls indefinitely

4. Provide a callback for the driver to fetch data from the client app.
   Similar to option 1, but avoids the need to expose the concept of an 'fd' in the public API directly.

   Application usage:

       int appreader(virDomainPtr dom, char *buf, int buflen, void *opaque) {
           int fd = (int)opaque;
           return read(fd, buf, buflen);
       }

       fd = open(filename);
       ret = virDomainRestoreStream(dom, appreader, (void *)fd);

   Driver internal usage:

       int virDomainRestoreStreamImpl(virDomainPtr dom,
                                      int (*reader)(virDomainPtr, char *, int, void *),
                                      void *opaque) {
           char buf[4096];
           int ret;
           int qemuFD;

           qemuFD = runQEMUWithIncomingFD(dom);
           do {
               ret = (*reader)(dom, buf, sizeof buf, opaque);
               if (ret > 0)
                   write(qemuFD, buf, ret);
           } while (ret > 0);
       }

   Good: Restore functionality all in one driver method
   Good: Public API is very simple
   Good: Client app callback can decrypt data
   Bad:  Application API is blocked
   Bad:  Internal driver code will block on executing the callback if no data is available; can't integrate with an event loop

5. Provide a generic public stream API to fetch data from the client app.
   Similar to option 4, but adding a stream object to manage the callback allows more functionality to be added to the public API later without changing the restore API contract.

   Application usage:

       int appreader(virDomainPtr dom, char *buf, int buflen, void *opaque) {
           int fd = (int)opaque;
           return read(fd, buf, buflen);
       }

       fd = open(filename);
       stream = virDataStreamNewReader(appreader, (void *)fd);
       ret = virDomainRestoreStream(dom, stream);

   Driver internal usage:

       int virDomainRestoreStreamImpl(virDomainPtr dom, virDataStream stream) {
           char buf[4096];
           int ret;
           int qemuFD;

           qemuFD = runQEMUWithIncomingFD(dom);
           do {
               ret = virDataStreamRead(stream, buf, sizeof buf);
               if (ret > 0)
                   write(qemuFD, buf, ret);
           } while (ret > 0);
       }

   Good: Restore functionality all in one driver method
   Good: Public API is fairly simple
   Good: Client app callback can decrypt data
   Bad:  Application API is blocked
   Bad:  Internal driver code will block on executing the callback if no data is available; can't integrate with an event loop

All the APIs have good and bad points to them; in particular there is a difficult tradeoff between the simplicity of the public API application code vs the internal API implementation code. Some important goals though:

 - There must be a way to invoke the public API without blocking the application code
 - The driver must be able to receive data from an encrypted channel, because in libvirtd the FD might be the SASL/TLS socket
 - The internal driver API should not block on callbacks to app code, since it might need to be polling on another FD concurrently with reading data
 - The number of new APIs to support streaming should not increase for each new method needing stream support
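For concreteness, here is a minimal sketch of the public declarations a generic stream object (as used in options 3 and 5 above) might provide. Every name is hypothetical and simply mirrors the pseudo-code above; this is not an existing libvirt API.

    #include <libvirt/libvirt.h>

    /* Hypothetical opaque stream object */
    typedef struct _virDataStream *virDataStream;

    /* Option 3 style: the application pushes data in, the driver has
     * registered callbacks to consume it. */
    virDataStream virDataStreamNew(void);
    int  virDataStreamWrite(virDataStream st, const char *data, int len);
    int  virDataStreamFinish(virDataStream st, int error);

    /* Driver-side registration used by option 3 */
    typedef int (*virDataStreamDataFunc)(virDomainPtr dom, const char *buf,
                                         int buflen, void *opaque);
    typedef int (*virDataStreamFinishFunc)(virDomainPtr dom, int error,
                                           void *opaque);
    int  virDataStreamSetCallbacks(virDataStream st,
                                   virDataStreamDataFunc dataCb,
                                   virDataStreamFinishFunc finishCb,
                                   void *opaque);

    /* Option 5 style: the driver pulls data through a reader callback
     * supplied by the application when constructing the stream. */
    typedef int (*virDataStreamReader)(virDomainPtr dom, char *buf,
                                       int buflen, void *opaque);
    virDataStream virDataStreamNewReader(virDataStreamReader reader,
                                         void *opaque);
    int  virDataStreamRead(virDataStream st, char *buf, int buflen);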
Ultimately I think options 3 or 5 are the most promising, because the addition of a generic 'virDataStream' public object makes it easier to manage the processing of the data stream without adding huge numbers of new APIs. Option 3 is a little more cumbersome to use from application code, but it avoids blocking either the client app or the internal driver. The downside is that the driver impl code is split across several methods. With option 5 it is harder to avoid blocking the client and the internal driver, since that would require the driver to integrate with an event loop, but there is no direct FD for the driver to poll() on.

The choices made also have possible implications on the design of the remote wire protocol to support these methods. Ignoring the design of the public API, there are a handful of ways to stream data between client and server:

1. Invoke the primary method, e.g. "restore domain", then feed the data in a sequence of following RPC calls.

       C --------------> S   Restore domain call
       C <-------------- S   Restore domain reply
       C --------------> S   Restore data call 1
       C <-------------- S   Restore data reply 1
       C --------------> S   Restore data call 2
       C <-------------- S   Restore data reply 2
           ...............
       C --------------> S   Restore data call n
       C <-------------- S   Restore data reply n
       C --------------> S   Restore data complete
       C <-------------- S   Restore data reply

   If the server wants to abort a restore operation, it'll send an error on one of the replies.

2. Invoke the primary method, e.g. "restore domain", then feed the data in a sequence of following async messages.

       C --------------> S   Restore domain call
       C <-------------- S   Restore domain reply
       C --------------> S   Restore data msg 1
       C --------------> S   Restore data msg 2
           ...............
       C --------------> S   Restore data msg n
       C --------------> S   Restore data complete
       C <-------------- S   Restore data reply

   If the server wants to stop the client without closing the socket, it needs an async 'stop' message from server to client. This is pretty much the same as option 1, but killing off the explicit replies for each data packet.

3. Invoke the primary method, e.g. "restore domain", but require the data to be streamed to the server before the reply is sent back.

       C --------------> S   Restore domain call
       C --------------> S   Stream data msg 1
       C --------------> S   Stream data msg 2
           ...............
       C --------------> S   Stream data msg n
       C --------------> S   Stream finish msg
       C <-------------- S   Restore domain reply

   If the server wants to stop the client without closing the socket, it simply sends back the 'restore domain reply' message as an error before the client finishes sending data, and ignores any further data messages.

Options 2 or 3 have potential benefits on links with noticeable latency, since they're not blocking the client on synchronous replies from the server. That said, the remote protocol does allow for interleaving of calls & replies, so with option 1 the client could send multiple data packets without waiting for their replies, and deal with possible delayed error replies. If doing that though, the benefit of having a 1-to-1 call-to-reply ratio is minimal; might as well go for an n-to-1 call-to-reply approach.

It is hard to match public API option 3 with wire protocol option 3, because of the delayed 'restore domain reply' message.
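By contrast, combining public API option 3 with wire option 2 is straightforward to glue together inside libvirtd. A rough sketch of that glue, reusing the hypothetical virDataStream calls from the earlier sketch; the remoteDispatch* names and the single static stream are purely illustrative, real code would track per-client state:

    /* Sketch: daemon-side glue for "public API 3 + RPC 2".  The restore
     * call creates the stream and hands it to the driver; each later
     * async data message is pushed into the same stream. */

    static virDataStream restoreStream;    /* per-client state in reality */

    static int
    remoteDispatchDomainRestoreStream(virDomainPtr dom)
    {
        restoreStream = virDataStreamNew();
        /* The driver registers its data/finish callbacks on the stream
         * inside its restore implementation. */
        return virDomainRestore(dom, restoreStream);
    }

    static int
    remoteDispatchRestoreDataMsg(const char *payload, int len)
    {
        /* Async data message: push the opaque payload into the stream. */
        return virDataStreamWrite(restoreStream, payload, len);
    }

    static int
    remoteDispatchRestoreComplete(int error)
    {
        return virDataStreamFinish(restoreStream, error);
    }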
The options I'm really thinking are most viable are:

 - Public API 3 + RPC 2
 - Public API 5 + RPC 3

Both of these are a little more complex to implement in the libvirtd daemon than Chris' current secure migration patches, but they also have functional & design benefits.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

FWIW this is the libguestfs RPC protocol:

http://et.redhat.com/~rjones/libguestfs/guestfs.3.html#communication_protoco...
http://git.et.redhat.com/?p=libguestfs.git;a=blob;f=src/guestfs_protocol.x;h...

It's not directly relevant because at present the server is single-threaded and answers calls in order. These are the relevant points of the file transfer system:

- At the API level, you pass in filenames. The caller is responsible for creating a named pipe in the filesystem, or passing in names like "/dev/fd/N".

- File transfers are sent using chunked encoding. The key was to allow cancellation *initiated from either side* (not as easy as it seems). So if an error occurs at either end, the transfer can be stopped almost immediately, and synchronization can be reestablished. The details are in the link above.

Rich.

--
Richard Jones, Emerging Technologies, Red Hat  http://et.redhat.com/~rjones
virt-p2v converts physical machines to virtual machines. Boot with a live CD
or over the network (PXE) and turn machines into Xen guests.
http://et.redhat.com/~rjones/virt-p2v

On Tue, May 26, 2009 at 06:57:18PM +0100, Richard W.M. Jones wrote:
> FWIW this is the libguestfs RPC protocol:
>
> http://et.redhat.com/~rjones/libguestfs/guestfs.3.html#communication_protoco...
> http://git.et.redhat.com/?p=libguestfs.git;a=blob;f=src/guestfs_protocol.x;h...
>
> It's not directly relevant because at present the server is single-threaded and answers calls in order.
It is actually pretty relevant from the wire protocol POV, and matches the ideas I'd been having. With your chunked encoding, you've only got 4 bytes overhead per chunk sent.

I was thinking of introducing a new message type to the existing three:

    enum remote_message_direction {
        REMOTE_CALL = 0,            /* client -> server */
        REMOTE_REPLY = 1,           /* server -> client */
        REMOTE_MESSAGE = 2          /* server -> client, asynchronous [NYI] */
    };

aka

        REMOTE_DATA_CHUNK = 3

This indicates a message which has a 'struct remote_message_header', followed by the data. The idea of this new type, instead of REMOTE_MESSAGE, is that we treat the payload of REMOTE_DATA_CHUNK as totally opaque and thus avoid the extra data copies inherent in defining the payload to be an XDR byte array. So my idea would have 24 bytes overhead per chunk instead of your four. It would also allow us to maintain concurrency, with other threads able to make RPC calls over the same socket, interleaved with individual data chunk messages.
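For illustration, the framing would look roughly like this. The header below is the existing six 4-byte fields that account for the 24 bytes of per-chunk overhead mentioned above; how each field would be used for a data chunk is a sketch, not committed code.

    struct remote_message_header {
        unsigned prog;                       /* REMOTE_PROGRAM */
        unsigned vers;                       /* REMOTE_PROTOCOL_VERSION */
        remote_procedure proc;               /* the call the data belongs to */
        remote_message_direction direction;  /* REMOTE_DATA_CHUNK */
        unsigned serial;                     /* could match the serial of the
                                                originating call, so chunks can
                                                be routed when calls interleave */
        remote_message_status status;        /* REMOTE_OK, or an error to abort */
    };

    /* On the wire a chunk would then be:
     *
     *     4-byte length | remote_message_header | raw payload bytes
     *
     * with the payload deliberately left out of the XDR description so it
     * can be handed to the driver without an extra deserialization copy. */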
> These are the relevant points of the file transfer system:
>
> - At the API level, you pass in filenames. The caller is responsible for creating a named pipe in the filesystem, or passing in names like "/dev/fd/N".
That has the problem, though, that you can't necessarily assume that the file handle you have carries the data in the same encoding you want to process it in. In the case of libvirtd invoking a libvirt API to handle an RPC request, the data is coming in off the client socket and thus needs passing through SASL/TLS decryption. To do this with an API taking a filename, you'd need to create a named pipe, read off the socket, write into the pipe, and then pass the pipe name to the API, which adds several more data copies. With the RAM size of VMs this will have a significant impact on CPU & memory bandwidth utilization during migration. If we can pass the data directly from SASL/TLS decryption to the driver, then we can limit ourselves to 2 data copies in the libvirt space. Normal RPC calls have 3 copies in libvirt, the 3rd coming from the XDR format deserialization, but we avoid the third with the custom message type for data streams.
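As a rough sketch of the direct path being argued for here (TLS case only; daemonStreamRecv and the driverDataCb hook are hypothetical names, error handling elided), the daemon hands the decrypted buffer straight to the driver rather than round-tripping it through a named pipe:

    #include <gnutls/gnutls.h>
    #include <libvirt/libvirt.h>

    /* Decrypt a record into a local buffer and pass it directly to the
     * driver's data callback (e.g. something shaped like the
     * virDomainRestoreDataImpl from the earlier mail), with no
     * intermediate pipe write/read. */
    static int
    daemonStreamRecv(gnutls_session_t session,
                     virDomainPtr dom,
                     int (*driverDataCb)(virDomainPtr, const char *, int, void *),
                     void *opaque)
    {
        char buf[4096];
        ssize_t got = gnutls_record_recv(session, buf, sizeof buf);
        if (got <= 0)
            return -1;                  /* EOF, or a TLS error */
        return driverDataCb(dom, buf, (int)got, opaque);
    }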
> - File transfers are sent using chunked encoding. The key was to allow cancellation *initiated from either side* (not as easy as it seems). So if an error occurs at either end, the transfer can be stopped almost immediately, and synchronization can be reestablished. The details are in the link above.
Yes, those are the points that are particularly fun / interesting. It looks like the scenarios you've identified there all match up to those I've been worrying about, so that's good reassurance that I'm thinking along the right lines. I reckon the extra 20 bytes of overhead per chunk from using an explicit message type, instead of just sending a series of len+payload chunks, is a worthwhile tradeoff in libvirt's case to allow better message interleaving on the socket.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|