Wow, you guys certainly wake up bright and early on Monday morning. Regarding the patch, I knew before I submitted it that it likely wouldn't be to final solution to the overall problem. I just have issues leaving things unresolved (or at least un-hacked-around, anyway) before I can get to sleep. Besides, it at least hacks around my problem of domain restarts, and by posting it, if someone else has the same issue, they could roll their own package with that patch to resolve it at least temporarily.

As far as the actual problem, I would agree that the storage driver should complete initialization and autostart before the QEMU driver does anything. I originally implemented a system like that (basically, I added an enum-backed field to the struct used for the state drivers), then created two separate lists in the driver initialization function (one for hypervisors, the other for everything else). Finally, I initialized and auto-started everything else, then did the hypervisors. I wanted to implement a proper driver dependency tree, but just couldn't figure out how to do it other than building the tree manually.

That, unfortunately, showed a new bug. The storage driver opens a connection (hardcoded) to 'qemu:///' during auto-start because it (or rather its backends) needs to be able to lookup storage pools and secrets by name/uuid. Both of those existing API functions require a connection in order to accomplish. After thinking about it over the weekend, I came up with a possible solution for the circular dependency issue, though.

What if a new connection URI were defined? The URI would not implement any domain-related API functions, only the API functions that would be considered global, like secret lookups, storage pool lookups, etc. The URI could be something like 'conf:///' and while it may be available to anyone, would likely only be useful within the libvirt code. That would allow the storage backends (and other sections of code) to have a valid hardcoded connection URI for when they require data from a connection, but are called when a connection won't exist, like during their auto-start sequence. Outside of a new connection URI, the other option would be to provide libvirt code a method to perform those lookups without needing a connection. That may already exist for all I know, but if it does, not everything was changed to use it. Luckily, it looks like only two sections of code actually open hard-coded connections (storage/storage_driver.c and lxc/lxc_process.c). I haven't looked into why lxc_process opens one, just found it in a grep of the code.

Lastly, I also agree that translating domain XML into a QEMU command line during a restart of libvirt shouldn't be required. If the command line arguments are required during initialization, perhaps they could be added to the domain XML after the initial translation? It'd be ugly to look at, but would serve to persist that data across libvirt restarts since the "running" version of the domain XML is stored under /var/run (or /run) alongside the PID. That, however, is actually a separate issue from the storage driver requiring a connection to QEMU during auto-start.

On Mon, Dec 9, 2013 at 6:20 AM, Daniel P. Berrange <berrange@redhat.com> wrote:

On Mon, Dec 09, 2013 at 12:15:52PM +0100, Peter Krempa wrote:
> On 12/09/13 11:56, Daniel P. Berrange wrote:
> > On Sat, Dec 07, 2013 at 02:03:00AM -0500, Adam Walters wrote:
> >> This patch works around a race condition present during
> >> libvirt startup. The race condition only applies when
> >> using storage pool volumes for domain disks, and even
> >> then, only when restarting libvirt with running domains.
> >>
> >> The gist of the patch is simply to enter a (limited)
> >> retry loop during qemuTranslateDiskSourcePool. This
> >> particular implementation does have a slight drawback,
> >> though. Within that function, I can not determine if
> >> we are currently starting libvirt, or if we are just
> >> starting up a domain. In the latter case, this could
> >> cause a 800ms delay in reporting an error that the
> >> storage pool is inactive.
> >>
> >> I am happy to report, however, that with this patch,
> >> domains continue to run without restarts regardless
> >> of how often I restart libvirt. I have a fairly fast
> >> hypervisor, and have never seen it try more than one
> >> iteration of the retry loop unless I purposely set
> >> one of the storage pools to be inactive.
> >> ---
> >> src/qemu/qemu_conf.c | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >>
> >> diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
> >> index c28908a..2e52fbf 100644
> >> --- a/src/qemu/qemu_conf.c
> >> +++ b/src/qemu/qemu_conf.c
> >> @@ -1359,6 +1359,8 @@ qemuTranslateDiskSourcePool(virConnectPtr conn,
> >> virStorageVolInfo info;
> >> int ret = -1;
> >> virErrorPtr savedError = NULL;
> >> + int attempt = 0;
> >> + int poolAutostart;
> >>
> >> if (def->type != VIR_DOMAIN_DISK_TYPE_VOLUME)
> >> return 0;
> >> @@ -1369,7 +1371,16 @@ qemuTranslateDiskSourcePool(virConnectPtr conn,
> >> if (!(pool = virStoragePoolLookupByName(conn, def->srcpool->pool)))
> >> return -1;
> >>
> >> +retry:
> >> if (virStoragePoolIsActive(pool) != 1) {
> >> + if (!(virStoragePoolGetAutostart(pool, &poolAutostart) < 0))
> >> + if (poolAutostart && attempt < 4) {
> >> + VIR_DEBUG("Waiting for storage pool '%s' to activate",
> >> + def->srcpool->pool);
> >> + usleep(200*1000); /* sleep 200ms */
> >> + attempt++;
> >> + goto retry;
> >> + }
> >> virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
> >> _("storage pool '%s' containing volume '%s' "
> >> "is not active"),
> >
> > This is rather dubious and points toa bug elsewhere. The storage driver
> > should complete its autostart before the QEMU driver even starts its
> > own initialization.
>
> We definitely need to make sure that storage is available at that point.
>
> This particular regression was introduced by my commit e1a4d08baf9a
> although was latently present for a few releases now as the volume code
> isn't used that much probably.
>
> Prior to my patch that added the check whether the pool is available we
> were blindly assuming that the pool was online. Pool drivers like
> gluster don't have their internal structures initialized if the pool
> isn't started and thus the translation would fail either way.
>
> Also the translation function is called separately from the reconnection
> code, thus we can pass different arguments to it so we don't spoil the
> normal code paths with unnecessary delays and other stuff.
>
> The question is what to do with domains that have storage on pools that
> can't be initialized. Should we kill those? Should we skip translation
> of the source and then something later may fail?

We do we need todo translation when restarting libvirtd ? Surely we
should only do translation when initialy starting a guest, and then
remember that data thereafter. If we ever try re-translating data later
for a running guest, we risk getting different answers which would be
bad.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|