[libvirt] kernel summit topic - 'containers end-game' - Devel


Serge E. Hallyn

23 Jun 2009

4:56 p.m.

A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control
2. lightweight virtual servers
3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition)
4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart.

thanks,
-serge


Balbir Singh

29 Jun

12:35 p.m.

On Tue, Jun 23, 2009 at 8:26 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:

...

A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control

We intend to hold an io-controller minisummit before KS, so we should have updates on that front. We also need to discuss CPU hard limits and memory soft limits. We need controls for memory large pages, mlock, OOM notification support, shared page accounting, etc. Eventually, on the libvirt front, we want to isolate cgroup and lxc support into individual components (long term).

...

2. lightweight virtual servers
3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition)
4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart.

Balbir Singh

Serge E. Hallyn

30 Jun

10:06 p.m.

Quoting Balbir Singh (balbir@linux.vnet.ibm.com):

...

On Tue, Jun 23, 2009 at 8:26 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control

We intend to hold a io-controller minisummit before KS, we should have updates on that front. We also need to discuss CPU hard limits and Memory soft limits. We need control for memory large page, mlock, OOM notification support, shared page accounting, etc. Eventually on the libvirt front, we want to isolate cgroup and lxc support into individual components (long term)

Thanks, Balbir. By the last sentence, are you talking about having cgroup in its own libcgroup, or do you mean something else?

On the topic of cgroups, does anyone not agree that we should try to get rid of the ns cgroup, at least once user namespaces can prevent root in a container from escaping their cgroup?

thanks,
-serge

Balbir Singh

1 Jul

6:29 a.m.

* Serge E. Hallyn <serue@us.ibm.com> [2009-06-30 15:06:13]:

...

Quoting Balbir Singh (balbir@linux.vnet.ibm.com):

...
On Tue, Jun 23, 2009 at 8:26 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control

We intend to hold a io-controller minisummit before KS, we should have updates on that front. We also need to discuss CPU hard limits and Memory soft limits. We need control for memory large page, mlock, OOM notification support, shared page accounting, etc. Eventually on the libvirt front, we want to isolate cgroup and lxc support into individual components (long term)

Thanks, Balbir. By the last sentence, are you talking about having cgroup in its own libcgroup, or do you mean something else?

On the topic of cgroups, does anyone not agree that we should try to get rid of the ns cgroup, at least once user namespaces can prevent root in a container from escaping their cgroup?

I would have no objections to trying to obsolete ns cgroup once user namespaces can do what you suggest.

-- Balbir

Daniel Lezcano

2 Jul

6:58 p.m.

Serge E. Hallyn wrote:

...

Quoting Balbir Singh (balbir@linux.vnet.ibm.com):

...
On Tue, Jun 23, 2009 at 8:26 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control

We intend to hold a io-controller minisummit before KS, we should have updates on that front. We also need to discuss CPU hard limits and Memory soft limits. We need control for memory large page, mlock, OOM notification support, shared page accounting, etc. Eventually on the libvirt front, we want to isolate cgroup and lxc support into individual components (long term)

Thanks, Balbir. By the last sentence, are you talking about having cgroup in its own libcgroup, or do you mean something else?

On the topic of cgroups, does anyone not agree that we should try to get rid of the ns cgroup, at least once user namespaces can prevent root in a container from escaping their cgroup?

I agree, provided there is a compatibility flag to clone the parent when creating a new cgroup, as Paul suggested.

Thanks
-- Daniel

Daniel Lezcano

6:43 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Serge E. Hallyn wrote:

...

A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers

Hi Serge,

here are a few suggestions for the containers in general and most of these suggestions are pre-requisites for CR (maybe not the highest priority but just to keep in mind).

* time virtualization : for absolute timer CR, TCP socket timestamps, ...
* inode virtualization : without this you won't be able to migrate some applications, e.g. samba, which rely on the inode numbers.
* debugging tools for the containers : at present we are not able to debug a multi-threaded application from outside of the container.
* poweroff / reboot from inside the container : at poweroff / reboot, all the processes are killed except the init process, which will stay there, leaving the container blocked. Maybe we can send a SIGINFO signal to the init's parent with some information, so it will be up to the parent to:
  - ignore the signal
  - stop the container (poweroff/halt)
  - stop and start the container again (reboot).

...

3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition)
4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And

Right.

...

checkpoint/restart needs... checkpoint/restart.

I know you are working hard on a CR patchset, and most of the questions / suggestions below were already addressed on the mailing list some months ago, but IMO they were eluded :) If you can talk about these points and clarify what approach would be preferable, that would be nice.

IMHO the all-in-kernel-monolithic approach raises some problems:

* the tasks are checkpointed from an external process and most of the kernel code is designed to run as current
* if a checkpoint or a restart fails, how do we debug that ? How can someone in the community using the CR report that the checkpoint failed in a particular place ? The same for the restart. And a much harder case is when a restart succeeds but a resource was badly restored, so the application continues its execution but fails 1 hour later.
* how can this be maintained ? who will port the CR each time a subsystem design changes ?
* the current patchset is full kernel but needs an external tool to create the process tree by digging in the statefile, weird.
* the container and the checkpoint/restart are not clearly decorrelated, which brings a dangerous heuristic into the kernel, especially with nested namespaces and partial resource checkpoint. IMHO, the checkpoint / restart should succeed even if the resources are not isolated; we should not CR some boundaries like the namespaces.

Regarding these points and the comments of Kerrighed and google guys, maybe it would be interesting to discuss the following design of the CR:

1) create a synchronism barrier (not the freezer), where all the tasks can set the checkpoint or restart status. That allows a task to abort the checkpoint at any time by setting an error status in the synchronism barrier. The initiator of the checkpoint / restart is blocked on this barrier until the checkpoint / restart finishes or fails. If the initiator exits, that cancels the current operation, making it possible to do Ctrl+C at checkpoint or restart time.

2) make a vdso which is the entry point of the checkpoint and set this entry as a signal handler for a new signal SIGCKPT, the same for SIGRESTART (AFAIR this is defined in posix 1003.m). This approach allows checkpointing from the current context, which is less arch dependent, and/or overriding the handler with a specific library, making it possible to do some work before calling the sys_checkpoint itself. That will allow building the CR step by step by making in userspace a best-effort library to checkpoint/restart what is not supported in the kernel.

3) a process gains the checkpointable property with a specific flag or whatever. All the children inherit this flag. That will allow identifying all the tasks which are checkpointable without isolating anything, and then opens the door to the checkpoint/restart of a subset of a process tree.

4) dump everything in a core-file-like format and improve the interpreter to recreate the process tree from this file.

Dynamic behaviour would be:

Checkpoint:
- The initiator of the checkpoint initializes the barrier and sends a signal SIGCKPT to all the checkpointable tasks, and these will jump to the handler and block on the barrier.
- When all these tasks reach this barrier, the initiator of the checkpoint dumps the system wide resources (memory, sysv ipc, struct files, etc ...).
- When this is done, the tasks are released and they store their process wide resources (semundo, file descriptor, etc ...) to a current->ckpt_restart buffer and then set the status of the operation and block on the barrier.
- The initiator of the checkpoint then collects all this information and dumps it.
- Finally the initiator of the checkpoint releases the tasks.

Restart:
- The user executes the statefile, which spawns the process tree, and all the processes are blocked in the barrier.
- The initiator of the restart restores the system wide resources and fills the restarted processes' current->ckpt_restart buffer.
- The initiator sends a SIGRESTART to all the tasks and unblocks the tasks
- all the tasks restore their process wide resources from the current->ckpt_restart buffer.
- all the tasks write their status and block on the barrier
- the initiator of the restart releases the tasks, which will return to the execution context they had when they were checkpointed.

This approach is different from what you are doing but I am pretty sure most of the code is re-usable. I see several advantages of this approach:

- because the process resources are checkpointed / restarted from current, it would be easy to reuse some syscalls code (from the kernel POV) and that would reduce the code duplication and maintenance overhead.
- the approach is more fine grained as we can implement the checkpoint / restart piece by piece.
- as the statefile is in the elf format, gdb could be used to debug a statefile as a core file
- as each process checkpoints / restarts itself, most of the execution context is stored in the stack, which is CR with the memory, so when returning from the signal handler, the process returns to the right context. That is less complicated and more generic than externally checkpointing the execution context of a frozen task, which would be potentially different for the restart.

I hope Serge you can present this approach as an alternative to the current patchset __if__ this one is not acceptable.

Regards
-- Daniel
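The checkpoint sequence proposed in the message above can be sketched in userspace as follows. This is only an illustration under stated assumptions: Python threads stand in for tasks, `threading.Barrier` stands in for the proposed synchronism barrier, a dictionary stands in for the `current->ckpt_restart` buffer, and the names `run_checkpoint` and `failing` are hypothetical. No signal delivery or kernel interface is modelled.

```python
import threading


def run_checkpoint(task_names, failing=()):
    """Simulate the proposed checkpoint phases (illustrative only).

    Returns (success, statefile), where statefile is a list of entries.
    """
    # The barrier counts every task plus the initiator.
    barrier = threading.Barrier(len(task_names) + 1)
    buffers = {}     # stand-in for each task's current->ckpt_restart
    status = {}      # per-task status set at the barrier
    statefile = []   # stand-in for the dumped image
    lock = threading.Lock()

    def task(name):
        barrier.wait()   # phase 1: "SIGCKPT" handler entered, task parked
        barrier.wait()   # phase 2: initiator has dumped the shared resources
        ok = name not in failing
        with lock:
            status[name] = "ok" if ok else "error"
            if ok:
                buffers[name] = {"fds": [0, 1, 2]}   # made-up per-task state
        barrier.wait()   # phase 3: per-task state and status are stored
        barrier.wait()   # phase 4: released by the initiator

    threads = [threading.Thread(target=task, args=(n,)) for n in task_names]
    for t in threads:
        t.start()

    barrier.wait()       # all tasks have reached the barrier
    statefile.append(("shared", ["memory", "sysv ipc", "struct files"]))
    barrier.wait()       # let the tasks store their own resources
    barrier.wait()       # every task has written its status
    success = all(s == "ok" for s in status.values())
    if success:
        for name in task_names:
            statefile.append((name, buffers[name]))
    barrier.wait()       # release the tasks in either case
    for t in threads:
        t.join()
    return success, statefile
```

Calling `run_checkpoint(["a", "b"])` yields a successful result whose state list starts with the shared entry; naming a task in `failing` makes the whole operation report failure, which mirrors the idea that any task can abort by setting an error status.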

Oren Laadan

8:27 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Hi Daniel,

This is a fair-sized list of issues ... must have been cooking for a while ? ...

Daniel Lezcano wrote:

...

Serge E. Hallyn wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers

Hi Serge,

here are a few suggestions for the containers in general and most of these suggestions are pre-requisites for CR (maybe not the highest priority but just to keep in mind).

* time virtualization : for absolute timer CR, TCP socket timestamps, ...

Good point.

...

* inode virtualization : without this you won't be able to migrate some applications eg. samba which rely on the inode numbers.

Hmmm... have you given it a thought ?

...

* debugging tools for the containers: at present we are not able to debug a multi-threaded application from outside of the container.

Why not ? does ptrace-ing from parent container not work ?

...

* poweroff / reboot from inside the container : at poweroff / reboot, all the processes are killed except the init process, which will stay there, leaving the container blocked. Maybe we can send a SIGINFO signal to the init's parent with some information, so it will be up to the parent to:
  - ignore the signal
  - stop the container (poweroff/halt)
  - stop and start the container again (reboot).

...
3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition)
4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And

Right.

...
checkpoint/restart needs... checkpoint/restart.

I know you are working hard on a CR patchset and most of the questions / suggestions below were already addressed in the mailing list since some month ago but IMO they were eluded :) If you can talk about these points and clarify what approach would be preferable that would be nice.

IMHO the all-in-kernel-monolithic approach raise some problems:

Hmmm... another round ? :( So, clearly, I couldn't resist :p

...

* the tasks are checkpointed from an external process and most of the kernel code is designed to run as current

I think we are already mostly reusing code, with few exceptions. Can you elaborate on where the problem is ?

...

* if a checkpoint or a restart fails, how do we debug that ? How someone in the community using the CR can report an information about the checkpoint has failed in a particular place ? The same for the restart. And a much more harder case is if a restart succeeded but a resource was badly restored making the application to continue its execution but failing 1 hour later.

For checkpoint we have a nice mechanism that adds (a) record(s) to the checkpoint image that describe the error when it occurs. There are a few examples already in the code. We haven't made much progress on the restart front, yet. I'm pretty sure any idea to this end is applicable in either approach.
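The idea of appending an error record to the checkpoint image at the point of failure can be sketched as below. The record layout (a 4-byte type and 4-byte length followed by the payload), the constants, and the function names are all assumptions made for illustration; they are not the format used by the patchset under discussion.

```python
import io
import struct

# Hypothetical record layout: little-endian type and payload length.
REC_DATA = 1
REC_ERROR = 2
_HDR = struct.Struct("<II")


def write_record(stream, rec_type, payload):
    """Append one typed, length-prefixed record to the image."""
    stream.write(_HDR.pack(rec_type, len(payload)))
    stream.write(payload)


def checkpoint(stream, objects):
    """Dump each (name, dump_fn) pair; on failure, record the error and stop."""
    for name, dump in objects:
        try:
            write_record(stream, REC_DATA, dump())
        except Exception as exc:
            # The error description travels inside the image itself.
            write_record(stream, REC_ERROR, f"{name}: {exc}".encode())
            return False
    return True


def find_errors(stream):
    """Scan an image from its current position and return error messages."""
    errors = []
    while True:
        header = stream.read(_HDR.size)
        if len(header) < _HDR.size:
            break
        rec_type, length = _HDR.unpack(header)
        payload = stream.read(length)
        if rec_type == REC_ERROR:
            errors.append(payload.decode())
    return errors
```

With a layout like this, someone reporting a failed checkpoint could send the image (or just the output of `find_errors`) and the report would name the object that could not be dumped.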

...

* how this can be maintained ? who will port the CR each time a subsystem design changes ?

* the current patchset is full kernel but needs an external tool to create the process tree by digging in the statefile, weird.

It uses the head of the data to create the process hierarchy. What's weird about it ? The main advantage is the flexibility it provides. The alternative is to start all tasks in the kernel (a la OpenVZ), or what you suggest, which sounds like .. hmm .. external tool to create the process tree by digging in the statefile :p
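As a rough illustration of what creating the process hierarchy from the head of the data involves, the sketch below takes a list of (pid, parent pid) pairs and produces an order in which every parent is created before its children. The header format and the name `creation_order` are hypothetical; the real tool and image layout are not described in this thread.

```python
from collections import defaultdict


def creation_order(header):
    """Return pids so that each parent precedes its children.

    header: list of (pid, ppid) pairs. A task whose ppid is not itself
    listed is treated as a root of the checkpointed tree.
    """
    pids = {pid for pid, _ in header}
    children = defaultdict(list)
    roots = []
    for pid, ppid in header:
        if ppid in pids:
            children[ppid].append(pid)
        else:
            roots.append(pid)

    order = []
    stack = list(reversed(roots))
    while stack:
        pid = stack.pop()
        order.append(pid)
        # Push children in reverse so they are visited in header order.
        stack.extend(reversed(children[pid]))
    return order
```

A restart helper in userspace could walk this order and fork each task from its already-created parent before handing control to the kernel for the rest of the restore.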

...

* the container and the checkpoint/restart are not clearly decorrelated, that brings a dangerous heuristic in the kernel, especially with nested namespace and partial resources checkpoint. IMHO, the checkpoint / restart should succeed even if the resources are not isolated, we should not CR some boundaries like the namespaces.

That's already possible in the current approach.

...

Regarding these points and the comments of Kerrighed and google guys, maybe it would be interesting to discuss the following design of the CR:

1) create a synchronism barrier (not the freezer), where all the tasks can set the checkpoint or restart status

This is already how it works in restart.

...

That allows to have a task to abort the checkpoint at any time by

^^^^^^^^^^^ Is this an issue with current approach ? BTW, to be able to checkpoint at _any time_, preemptively, you _must_ be able to checkpoint externally to the tasks. For instance, how would you handle a ptraced task ? STOPed task ?

...

setting a status error in the synchronism barrier. The initiator of the checkpoint / restart is blocked on this barrier until the checkpoint / restart finishes or fails. If the initiator exits, that's cancel the current operation making possible to do Ctrl+C at checkpoint or restart time.

Aborting using ctrl-c or any other method is already possible now with no harm done. In fact, with less harm than when requiring the cooperation of participating tasks.

...

2) make a vdso which is the entry point of the checkpoint and set this entry as a signal handler for a new signal SIGCKPT, the same for SIGRESTART (AFAIR this is defined in posix 1003.m).

This approach allows to checkpoint from the current context which is less arch dependent and/or to override the handler with a specific

Why is it less arch dependent ? The only arch dependent code in the current patchset is what is defined differently by separate archs (cpus, mm-context).

...

library making possible to do some work before calling the sys_checkpoint itself. That will allows to build the CR step by step by making in userspace a best-effort library to checkpoint/restart what is not supported in the kernel.

This sort of notification is indeed desirable and can be added to either approach.

...

3) a process gains the checkpointable property with a specific flag or whatever. All the children inherit this flag. That will allow identifying all the tasks which are checkpointable without isolating anything, and then opens the door to the checkpoint/restart of a subset of a process tree.

Already possible. Isolation is a nice feature, not a requirement (at least if you ask me :)
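To make the "subset of a process tree" idea concrete, here is a small sketch that selects the tasks carrying an inherited checkpointable flag. It assumes a simplification: a task counts as flagged if it, or any ancestor, set the flag, ignoring the question of whether a child was forked before or after the flag was set. The names `checkpointable_tasks`, `parent_of`, and `flagged` are hypothetical.

```python
def checkpointable_tasks(parent_of, flagged):
    """Return the set of pids that carry the checkpointable flag.

    parent_of: dict mapping pid -> parent pid (None for the top task).
    flagged:   set of pids that set the flag explicitly.
    """
    selected = set()
    for pid in parent_of:
        cur = pid
        # Walk up the ancestry until a flagged task or the top is found.
        while cur is not None:
            if cur in flagged:
                selected.add(pid)
                break
            cur = parent_of.get(cur)
    return selected
```

Flagging a single task therefore selects that task and its descendants, while siblings and unrelated branches stay outside the checkpoint without any namespace isolation being involved.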

...

4) dump everything in a core-file-like and improve the interpreter to recreate the process tree from this file.

How is this different from above ?

...

Dynamic behaviour would be:

Checkpoint: - The initiator of the checkpoint initialize the barrier and send a signal SIGCKPT to all the checkpointable tasks and these ones will jump on the handler and block on the barrier.

- When all these tasks reach this barrier, the initiator of the checkpoint dumps the system wide resources (memory, sysv ipc, struct files, etc ...).

Note that with namespaces, there are no "system wide resources", but instead there are multiple namespaces with resources.

...

- When this is done, the tasks are released and they store their process wide resources (semundo, file descriptor, etc ...) to a current->ckpt_restart buffer and then set the status of the operation and block on the barrier.

- The initiator of the checkpoint then collects all these informations and dump them.

- Finally the initiator of the checkpoint release the tasks.

Can you explain why this approach is better than the current one ? Rename "initiator" to "external checkpointer", and all the rest is nearly the same. Only that instead of relying on the freezer code (which is, clearly, reuse of existing code!), your approach requires a delicate mechanism to allow all tasks to cooperate at the initiator's will.

...

Restart: - The user executes the statefile, that spawns the process tree and all the processes are blocked in the barrier.

Done already.

...

- The initiator of the restart restore the system wide resources and fill the restarted processes' current->ckpt_restart buffer.

- The initiator sends a SIGRESTART to all the tasks and unblock the tasks

- all the tasks restore their process wide resources regarding the current->ckpt_restart buffer.

Done already (with the exception that they do it one by one because the checkpoint image is streamed).

...

- all the tasks write their status and block on the barrier

Done.

...

- the initiator of the restart release the tasks which will return to their execution context when they were checkpointed.

Ditto.

...

This approach is different from what you are doing but I am pretty sure most of the code is re-usable. I see several advantages of this approach:

- because the process resources are checkpointed / restarted from current, it would be easy to reuse some syscalls code (from the kernel POV) and that would reduce the code duplication and maintenance overhead.

Checkpoint and restart are asymmetric: checkpoint needs to _observe_ and record, and restart needs to _create_ and build. That's why reusing existing syscalls is extremely helpful for restart, but not so much for checkpoint. In the current approach, restart indeed is done in the current context. And that's where you'd like to reuse syscalls. Checkpoint is done by observing tasks (out of their context), and I believe the code will be pretty much the same for in-context. Being out of context requires a little bit of glue to guarantee safe access to non-current resources.

...

- the approach is more fine grained as we can implement piece by piece the checkpoint / restart.

Can do. This was discussed on the containers mailing list some time ago with Kerrighed, IIRC regarding IPC namespaces.

...

- as the statefile is in the elf format, gdb could be used to debug a statefile as a core file

- as each process checkpoint / restart themselves, most of the execution context is stored in the stack which is CR with the memory, so when returning from the signal handler, the process returns to the right context. That is less complicated and more generic than externally checkpoint the execution context of a frozen task which would be potentially different for the restart.

Ehh ? The code is actually straight forward. No kernel stack, and user stack is in memory anyway. Take a look at the code, it's pretty straightforward.

...

I hope Serge you can present this approach as an alternative of the current patchset __if__ this one is not acceptable.

There you go. I could not resist :O

Now, before I go hide (...) - some of these points require attention, e.g. error reporting on restart, notification mechanisms, partial containers and selected resources, etc.

Oren.

Serge E. Hallyn

6 Jul

4:51 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Quoting Daniel Lezcano (dlezcano@fr.ibm.com):

...

Serge E. Hallyn wrote: ... Checkpoint: - The initiator of the checkpoint initialize the barrier and send a signal SIGCKPT to all the checkpointable tasks and these ones will jump on the handler and block on the barrier.

- When all these tasks reach this barrier, the initiator of the checkpoint dumps the system wide resources (memory, sysv ipc, struct files, etc ...).

- When this is done, the tasks are released and they store their process wide resources (semundo, file descriptor, etc ...) to a current->ckpt_restart buffer and then set the status of the operation and block on the barrier.

- The initiator of the checkpoint then collects all these informations and dump them.

Do you envision all of the dumping being done in kernel or by userspace? ...

...

- Finally the initiator of the checkpoint release the tasks.

Restart: - The user executes the statefile, that spawns the process tree and all the processes are blocked in the barrier.

- The initiator of the restart restore the system wide resources and fill the restarted processes' current->ckpt_restart buffer.

Same question about restore...

...

- The initiator sends a SIGRESTART to all the tasks and unblock the tasks

- all the tasks restore their process wide resources regarding the current->ckpt_restart buffer.

- all the tasks write their status and block on the barrier

- the initiator of the restart release the tasks which will return to their execution context when they were checkpointed.

This approach is different from what you are doing but I am pretty sure most of the code is re-usable. I see several advantages of this approach:

- because the process resources are checkpointed / restarted from current, it would be easy to reuse some syscalls code (from the kernel POV) and that would reduce the code duplication and maintenance overhead.

- the approach is more fine grained as we can implement piece by piece the checkpoint / restart.

- as the statefile is in the elf format, gdb could be used to debug a statefile as a core file

Note btw that Dave has found that a checkpoint is faster than a core-dump at the moment :) That's not entirely an aside - I need to reread your email a few times and really process your suggestion, but given that some users want to dump hundreds of gigabytes of memory, not slowing down the checkpoint is a big consideration.

...

- as each process checkpoint / restart themselves, most of the execution context is stored in the stack which is CR with the memory, so when returning from the signal handler, the process returns to the right context. That is less complicated and more generic than externally checkpoint the execution context of a frozen task which would be potentially different for the restart.

I hope Serge you can present this approach as an alternative of the current patchset __if__ this one is not acceptable.

I'll try to understand it better than I do right now - I don't think it's for discussing at ksummit, but definitely if we have a mini-summit or during the next round of discussions (during or immediately after the ckpt-v17 publish).

thanks,
-serge

Daniel Lezcano

8 Jul

9:55 a.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Serge E. Hallyn wrote:

...

Quoting Daniel Lezcano (dlezcano@fr.ibm.com):

...
Serge E. Hallyn wrote: ... Checkpoint: - The initiator of the checkpoint initialize the barrier and send a signal SIGCKPT to all the checkpointable tasks and these ones will jump on the handler and block on the barrier.

- When all these tasks reach this barrier, the initiator of the checkpoint dumps the system wide resources (memory, sysv ipc, struct files, etc ...).

- When this is done, the tasks are released and they store their process wide resources (semundo, file descriptor, etc ...) to a current->ckpt_restart buffer and then set the status of the operation and block on the barrier.

- The initiator of the checkpoint then collects all these informations and dump them.

Do you envision all of the dumping being done in kernel or by userspace?

Dumping is done by the kernel.

...

...
- Finally the initiator of the checkpoint release the tasks.

Restart: - The user executes the statefile, that spawns the process tree and all the processes are blocked in the barrier.

- The initiator of the restart restore the system wide resources and fill the restarted processes' current->ckpt_restart buffer.

Same question about restore...

The process tree is recreated from userspace, the rest from the kernel. This is very similar to what you have currently; the differences are that the tasks are checkpointed from "current", the statefile is in elf format, and a synchro is used instead of the freezer (allowing us to get rid of the cgroup). The checkpoint is like a 'super-abort' and the restart a 'super-exec' :)

...

...
- The initiator sends a SIGRESTART to all the tasks and unblock the tasks

- all the tasks restore their process wide resources regarding the current->ckpt_restart buffer.

- all the tasks write their status and block on the barrier

- the initiator of the restart release the tasks which will return to their execution context when they were checkpointed.

This approach is different from what you are doing but I am pretty sure most of the code is re-usable. I see several advantages of this approach:

- because the process resources are checkpointed / restarted from current, it would be easy to reuse some syscalls code (from the kernel POV) and that would reduce the code duplication and maintenance overhead.

- the approach is more fine-grained, as we can implement checkpoint / restart piece by piece.

- as the statefile is in the elf format, gdb could be used to debug a statefile as a core file

Note btw that Dave has found that a checkpoint is faster than a core-dump at the moment :) That's not entirely an aside - I need to reread your email a few times and really process your suggestion, but given that some users want to dump hundreds of gigabytes of memory, not slowing down the checkpoint is a big consideration.

Interesting, any idea of why the core dump is slower ?

...

...
- as each process checkpoints / restarts itself, most of the execution context is stored in the stack, which is checkpointed and restored with the memory, so when returning from the signal handler the process returns to the right context. That is less complicated and more generic than externally checkpointing the execution context of a frozen task, which could be different at restart.

Serge, I hope you can present this approach as an alternative to the current patchset __if__ that one is not acceptable.

I'll try to understand it better than I do right now - I don't think it's for discussing at ksummit, but definitely if we have a mini-summit or during the next round of discussions (during or immediately after the ckpt-v17 publish).

Maybe the current patchset will be considered good; in that case, discard my comments and drop this email :) Or maybe some people will argue against the current approach, perhaps for the reasons I gave previously; in that case you have a set of ideas / modifications to the patchset to propose as an alternative and to discuss. That was the purpose of my email :)

Do you plan to send the minutes of the ksummit?

Thanks.
-- Daniel

Serge E. Hallyn

3:45 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Quoting Daniel Lezcano (dlezcano@fr.ibm.com):

...

Do you plan to send the minutes of the ksummit?

Absolutely. Of course it's not until October. I'll be sending out a copy of the notes I take with me (including the info from this thread) beforehand. thanks, -serge

Oren Laadan

2 Jul

8:38 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Serge E. Hallyn wrote:

...

A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers 3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition) 4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart.

Heh ... it does need ... checkpoint/restart; and a few issues which we should think about sometime --

* Encapsulation of machine/OS config capabilities
  - how to detect (versioning, capabilities) ?
  - how to deal with mismatches ? (bail ? emulate ? hope for the best ?)
  - what happens if, e.g. VDSO page changes, or how to detect FPU changes...

* Conversion of checkpoint image between kernel versions (and automation)

* Network namespaces, mnt namespaces - what's the best approach ?

* Security assessment and brainstorming

* Appealing use-cases for everyday use:
  - for hibernation
  - to reboot to new kernel without losing your session
  - to time travel back to before you lost in "Bejeweled"

* Userspace tools - mainly for inspection of checkpoint images

* Testing frameworks

* Distributed c/r ?

* Optimizations: low downtime, pre-copy, post-copy, cow, parallelization

Now I really go hide :p

Oren.
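
To make the first item a bit more concrete, here is a minimal sketch of the "bail" policy for mismatches: compare what a checkpoint image says it needs against what the restoring host offers, and refuse with a list of reasons if anything is missing. Everything in it is an assumption made up for illustration (the ImageHeader fields, the feature names, the check_compat() helper); it does not describe any existing image format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageHeader:
    """Hypothetical checkpoint image header; none of these fields exist anywhere real."""
    format_version: int
    arch: str
    features: frozenset   # e.g. CPU/FPU or kernel features the image depends on


def check_compat(header, host_version, host_arch, host_features):
    """Return (ok, reasons). Policy sketched here is 'bail': any mismatch refuses."""
    reasons = []
    if header.format_version > host_version:
        reasons.append(
            f"image format {header.format_version} is newer than host {host_version}")
    if header.arch != host_arch:
        reasons.append(f"arch mismatch: image {header.arch}, host {host_arch}")
    missing = header.features - frozenset(host_features)
    if missing:
        reasons.append("missing features: " + ", ".join(sorted(missing)))
    return (not reasons, reasons)
```

An "emulate" or "convert" policy would hook in at the same point, replacing the refusal with a translation step for the cases it knows how to handle.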

Serge E. Hallyn

6 Jul

4:34 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Quoting Oren Laadan (orenl@cs.columbia.edu):

...

Serge E. Hallyn wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers 3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition) 4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart.

Heh ... it does need ... checkpoint/restart; and a few issues which we should think about sometime --

Yup, these are all things we need to discuss. For some of them we might just need to flail about and code a few approaches until we figure out an answer, but then I think that everyone has thought about a few of these in some detail, so there probably is much we could gain from talking.

... Does this mean we should try to have a mini-summit in the next 6 months or so? I'd recommend having one right before kernel summit so we can get our act together, but getting everyone to Tokyo to chat seems uneconomical :) It'd be good to chat about at least the first two items before the summit, though.

Maybe after we finish v17, we pick a few of these and try a focused push to get answers?


Oren Laadan

7:30 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Serge E. Hallyn wrote:

...

Quoting Oren Laadan (orenl@cs.columbia.edu):

...
Serge E. Hallyn wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers 3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition) 4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart. Heh ... it does need ... checkpoint/restart; and a few issues which we should think about sometime --

Yup, these are all things we need to discuss. For some of them we might just need to flail about and code a few approaches until we figure out an answer, but then I think that everyone has thought about a few of these in some detail, so there probably is much we could gain from talking.

... Does this mean we should try to have a mini-summit in the next 6 months or so? I'd recommend having one right before kernel summit so we can get our act together, but getting everyone to tokyo to chat seems uneconomical :) It'd be good to chat about at least the first two items before the summit, though.

How about Linux Plumbers ? Oren.


Serge E. Hallyn

8:48 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Quoting Oren Laadan (orenl@cs.columbia.edu):

...

Serge E. Hallyn wrote:

...
Quoting Oren Laadan (orenl@cs.columbia.edu):

...
Serge E. Hallyn wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers 3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition) 4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart. Heh ... it does need ... checkpoint/restart; and a few issues which we should think about sometime --

Yup, these are all things we need to discuss. For some of them we might just need to flail about and code a few approaches until we figure out an answer, but then I think that everyone has thought about a few of these in some detail, so there probably is much we could gain from talking.

... Does this mean we should try to have a mini-summit in the next 6 months or so? I'd recommend having one right before kernel summit so we can get our act together, but getting everyone to tokyo to chat seems uneconomical :) It'd be good to chat about at least the first two items before the summit, though.

How about linux plumbers ?

Well it seems like an appropriate place for it. Alas there is almost no chance of my being there, but let's hear a roll call - how many people (interested in checkpoint/restart) will be or can be at plumber's? I'm pretty sure Suka and Dave will be there. -serge

Oren Laadan

7 Jul

5:36 p.m.

New subject: [libvirt] Re: kernel summit topic - 'containers end-game'

Serge E. Hallyn wrote:

...

Quoting Oren Laadan (orenl@cs.columbia.edu):

...
Serge E. Hallyn wrote:

...
Quoting Oren Laadan (orenl@cs.columbia.edu):

...
Serge E. Hallyn wrote:

...
A topic on ksummit agenda is 'containers end-game and how do we get there'.

So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable?

More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include:

1. resource control 2. lightweight virtual servers 3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition) 4. checkpoint, restart, and migration

For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart. Heh ... it does need ... checkpoint/restart; and a few issues which we should think about sometime -- Yup, these are all things we need to discuss. For some of them we might just need to flail about and code a few approaches until we figure out an answer, but then I think that everyone has thought about a few of these in some detail, so there probably is much we could gain from talking.

... Does this mean we should try to have a mini-summit in the next 6 months or so? I'd recommend having one right before kernel summit so we can get our act together, but getting everyone to tokyo to chat seems uneconomical :) It'd be good to chat about at least the first two items before the summit, though.

How about linux plumbers ?

Well it seems like an appropriate place for it. Alas there is almost no chance of my being there, but let's hear a roll call - how many people (interested in checkpoint/restart) will be or can be at plumber's?

I'm pretty sure Suka and Dave will be there.

Seems like I can make it. Oren.

participants (5)

Balbir Singh
Daniel Lezcano
Daniel Lezcano
Oren Laadan
Serge E. Hallyn