On Mon, Oct 28, 2013 at 09:17:16AM -0400, John Ferlan wrote:
> On 10/28/2013 07:52 AM, Daniel P. Berrange wrote:
> > From: "Daniel P. Berrange" <berrange(a)redhat.com>
> >
> > The following sequence
> >
> >   1. Define a persistent QMEU guest

s/QMEU/QEMU/

> >   2. Start the QEMU guest
> >   3. Stop libvirtd
> >   4. Kill the QEMU process
> >   5. Start libvirtd
> >   6. List persistent guets

s/guets/guests/

> >
> > At the last step, the previously running persistent guest
> > will be missing. This is because of a race condition in the
> > QEMU driver startup code. It does
> >
> >   1. Load all VM state files
> >   2. Spawn thread to reconnect to each VM
> >   3. Load all VM config files
> >
> > Only at the end of step 3 does the 'virDomainObjPtr' get
> > marked as "persistent". There is therefore a window where
> > the thread reconnecting to the VM will remove the persistent
> > VM from the list.
> >
> > The easy fix is to simply switch the order of steps 2 & 3.
> >
> > Signed-off-by: Daniel P. Berrange <berrange(a)redhat.com>
> > ---
> >  src/qemu/qemu_driver.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> The fix seems reasonable, although I immediately wondered why at some
> point it was considered OK to reconnect before the persistent flag was
> set for inactive guests. The condition/fault initially described
> includes a host reboot in the processing; in that case (I would assume)
> the restart of the guest would occur if autostart was set. The external
> action of kill()'ing a guest outside of libvirtd's control results in
> some unknown/unpredictable state for the guest. Is there something in
> that initial load that could detect this condition better?
>
> I tried following the steps without the patch, but on my host the guest
> was still listed after the restart - so yes, a timing condition - but
> what causes it?
> Would setting dom->persistent before the virObjectUnlock(dom) in
> virDomainObjListLoadAllConfigs() change the results?
No, I tried that and it didn't help. None the less, that should be
fixed separately.
> Beyond that, keeping the virConnectOpen() and qemuProcessReconnectAll()
> calls together after the loading of the inactive persistent configs
> seems to keep the code flow more normal. Whether that comes before or
> after the Snapshot/ManagedSave load is, I suppose, just an
> "implementation detail".
>
> Also, the other drivers follow the same load running, reconnect, then
> load inactive/persistent configs order. Should those have similar
> patches as well?
Historically it was safe because reconnecting to QEMU was
serialized. At some point we switched to using a separate
thread for reconnecting to QEMU which introduced a race.
The other drivers are still serialized IIUC.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|