On Fri, Jan 24, 2014 at 12:56:43PM +0000, Daniel P. Berrange wrote:
> On Thu, Jan 23, 2014 at 07:47:54PM +0200, Pavel Fux wrote:
> > There are 8 servers with 8 VMs on each server. All the qcow images are on
> > an NFS share on the same external server, and we are starting all 64 VMs
> > at the same time.
> > Each VM image is 2.5 GB, so 2.5 GB x 64 VMs = 160 GB = 1280 Gb.
> > Reading all of that data over a 1 GbE interface would take 1280 s = 21.3 min.
> > Not all of the image is read on boot, so it only takes about 5 minutes.
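Those back-of-the-envelope numbers check out, by the way; here is a quick
Python sketch just re-doing the arithmetic quoted above (the 2.5 GB, 64 VM
and 1 GbE figures are Pavel's, not measurements of mine):

  image_size_gb = 2.5     # size of each qcow image, from the report above
  num_vms = 64            # 8 servers x 8 VMs
  link_gbit_per_s = 1.0   # shared 1 GbE link to the NFS server

  total_gb = image_size_gb * num_vms    # 160 GB
  total_gbit = total_gb * 8             # 1280 Gb
  minutes = total_gbit / link_gbit_per_s / 60
  print(f"{total_gb:.0f} GB = {total_gbit:.0f} Gb -> {minutes:.1f} min at line rate")
  # -> 160 GB = 1280 Gb -> 21.3 min at line rate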
> That's interesting, but it still doesn't explain the failures. QEMU will
> start listening on its monitor socket before it even opens any of the
> disk images, so the time it takes to read the disk images on boot should
> have no bearing on timeouts waiting for the monitor socket. All QEMU does
> between the exec of its binary and listening on the monitor socket is load
> the libraries it is linked against plus a few misc pieces like the BIOS
> firmware blobs. I just can't see a reason why this would take anywhere
> near 5 minutes - it should be a matter of a few seconds at worst.
I think it does a little bit more than that, but I have no proof of it.
If you look at the occurrences of this error, most of them are reported
against virt-manager (I'm not sure why; maybe people using virsh deal
with it themselves), and most of them turn out to be caused by a managed
save. When QEMU is starting up, it takes more than the 3 seconds we
allowed before, and the machine fails to start. The thing is that there
is nothing else unusual on those machines; removing the managed save
fixes everything. That's why I think QEMU at least loads some
initialization values (in some special cases), although I haven't been
able to reproduce that.
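For anyone hitting that case, checking for and dropping the managed save
image is easy enough, either with virsh ("virsh managedsave-remove <domain>")
or through the Python bindings; a minimal sketch, with the connection URI
and domain name as placeholders:

  import libvirt

  conn = libvirt.open("qemu:///system")
  dom = conn.lookupByName("example-vm")

  if dom.hasManagedSaveImage(0):
      dom.managedSaveRemove(0)   # same as "virsh managedsave-remove example-vm"
  dom.create()                   # plain cold boot, nothing to restore

  conn.close()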
The host can also be under heavy load on the very resource that is the
bottleneck for QEMU's first lines of code (or even for binary
initialization, as Michal mentioned, IIRC), and once past that point the
machine is perfectly fine.
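FWIW, the wait for the monitor socket that Daniel describes is conceptually
nothing more than polling the UNIX socket until QEMU accepts a connection
or the timeout runs out. A rough Python sketch of the idea (the path and
timeout values are made up for the example, this is not libvirt's actual
code):

  import socket
  import time

  def wait_for_monitor(path, timeout=30.0, interval=0.2):
      """Poll a UNIX monitor socket until it accepts a connection or we give up."""
      deadline = time.monotonic() + timeout
      while time.monotonic() < deadline:
          s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
          try:
              s.connect(path)       # succeeds once QEMU is actually listening
              return s
          except OSError:
              s.close()
              time.sleep(interval)  # QEMU hasn't bound the socket yet, retry
      raise TimeoutError(f"monitor socket {path} did not show up within {timeout}s")

Whatever keeps QEMU from reaching the listening state - disk, loader, load
on the host - just shows up as that timeout expiring.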
Martin