Daniel P. Berrange wrote:
On Tue, May 05, 2009 at 11:38:13PM -0500, Matthew Farrellee wrote:
> Daniel P. Berrange wrote:
>> On Tue, May 05, 2009 at 04:13:38PM -0400, Hugh O. Brock wrote:
>>> Not too long ago we took a patch that allowed QEMU VMs to keep running
>>> even if libvirtd died or was restarted.
>>>
>>> I was talking to Matt Farrellee (cc'd) this afternoon about
>>> manageability, and he feels fairly strongly that this behavior should be
>>> optional -- in other words, it should be possible to guarantee that if
>>> libvirtd dies, it will take all the VMs with the "die-with-libvirtd"
>>> flag set down with it.
>>>
>>> I'm not sure this API is portable to Xen, but it would work on any
>>> hypervisor that represents the VM as a normal process.
>>>
>>> Does this strike anyone else as useful behavior?
>> This isn't really a model we want in the architecture. That the QEMU
>> instances used to die when libvirtd died was an unfortunate artifact
>> of the fact that libvirtd was the parent process leader. These days all VMs
>> are fully daemonized, so there is no parent/child relationship. In fact
>> QEMU was really the odd-ball in this respect, because with Xen/OpenVZ/LXC
>> and VirtualBox, VMs have always happily continued when libvirtd stopped
>> or died, as do storage pools and virtual networks.
>>
>> This is important because it ensures we can automatically restart the
>> libvirtd daemon during RPM upgrades, and provides robustness should a
>> bug cause the daemon to crash - the daemon can be trivially restarted
>> and continue with no interruption to services being managed.
>>
> It doesn't appear to be the case that the libvirtd daemon can trivially
> restart and continue with no interruptions. Right now it loses track of VMs.
That is a bug then; if you can reproduce it, please file a BZ ticket
so we can track it down & fix it.
> In a scenario where VMs are not deployed and locked to specific physical
> nodes, it can be highly valuable to have ways to ensure a VM is no
> longer running when a layer of its management stops functioning.
IMHO this is a problem to be solved by clustering software. If the
clustering software detects a failure with the management service,
then it should power fence the entire node. Relying on management
service failure to kill the VMs will never be reliable enough.
Daniel
Assuming clustering software were the answer, it is often too
specialized and does not scale nearly well enough. There are other
alternatives to layers and layers of management software, but for many
years layers have been what we get to work with.
Best,
matt