Hi,
My problem can be described simply: libvirt can't handle starting dozens of VMs at the
same time.
(Technically it can, but it's really slow.)
We have an AMD machine with 256 logical cores and 1.5 TB of RAM.
On that machine there are roughly 200 VMs.
Each VM is the same: 8 GB of RAM, 4 vCPUs. Half of them are Win7 x86, the other half are Win7
x64.
The VMs use qcow2 disk images, which reside on a ramdisk (tmpfs).
We use these machines for automatic malware analysis, so our scenario consists of this
cycle:
- revert the VM to a running state
- execute a sample inside the VM for ~1-2 minutes
- shut down the VM
Of course, this results in multiple VMs trying to start at the same time.
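To make the cycle concrete, here is a minimal sketch of one iteration for a single VM using virsh. The snapshot name, the VM name in the usage line, and the fixed sleep standing in for the sample run are placeholders, and using "virsh destroy" for the shutdown step is an assumption. Our framework does the equivalent through the API (revertToSnapshot).

```shell
# Sketch of one analysis cycle for a single VM (placeholders, not our real framework).
# VIRSH can be overridden, e.g. VIRSH=echo, to do a dry run without libvirt.
VIRSH="${VIRSH:-virsh}"

run_cycle() {
    vm="$1"        # domain name
    snap="$2"      # snapshot taken while the guest was running (name is an assumption)
    runtime="$3"   # seconds the sample is allowed to run

    # Revert to the snapshot and leave the guest in the running state.
    "$VIRSH" snapshot-revert "$vm" "$snap" --running || return 1

    # Stand-in for executing the sample inside the guest.
    sleep "$runtime"

    # Hard power-off (assumption); the next cycle reverts to the snapshot anyway.
    "$VIRSH" destroy "$vm"
}
```

For example, "run_cycle win7-x64-001 clean 90" would run one 90-second cycle on a hypothetical domain named win7-x64-001.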
At first, reverts/starts are really fast - a second or two.
After about a minute, "revertToSnapshot" suddenly takes 10-15 seconds, which
is really unacceptable.
For comparison, we're running the same scenario on Proxmox, where
revertToSnapshot usually takes 2 seconds.
A few notes:
- Because of this fast cycle (~2-3 minutes) and because the VMs take 10-15 seconds to
start, there are rarely more than 25-30 VMs running at once.
We would really love to use the full potential of this beast of a machine and
have at least ~100 VMs running at any given time.
- During a run, the average CPU load doesn't exceed 25%. Also, only about
280 GB of RAM is used. Therefore, we are not limited by our resources.
- When the framework is running and libvirt is doing its best to start our VMs, I noticed
that every libvirt operation is suddenly very slow.
Even a simple "virsh list [--all]" takes a few seconds to complete, even though it
finishes instantly when no VM is running/starting.
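One way to put a number on this slowdown is to time "virsh list --all" once while the host is idle and once while the VMs are cycling. The helper below is only a sketch: the function name is made up, and it assumes GNU date (for the %N nanosecond format), which should be available on a typical Linux host.

```shell
# Sketch: measure how long a single "virsh list --all" takes, in milliseconds.
# Assumes GNU date (%N). VIRSH can be overridden for testing without libvirt.
VIRSH="${VIRSH:-virsh}"

time_list_ms() {
    start=$(date +%s%N)                    # nanoseconds since the epoch
    "$VIRSH" list --all >/dev/null 2>&1    # output is irrelevant, only latency matters
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))    # convert ns to ms
}
```

Calling time_list_ms in a loop while the analysis is running shows whether the latency grows along with the number of VMs being started.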
I searched for this issue, but didn't really find anything besides this
presentation:
https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Scalabili...
However, I couldn't find the commits it mentions in upstream libvirt.
Is this a known issue? Or is there some setting I don't know of which would magically
make the VMs start faster?
As for steps to reproduce - I don't think anything special is needed. Just try
to start/destroy several VMs in a loop.
The presentation above even provides a one-liner for that:
```
# For multiple domains:
# while virsh start $vm && virsh destroy $vm; do : ; done
# → ~30s hang ups of the libvirtd main loop
```
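To run that loop against several domains at once, something like the following sketch could be used. The function names are made up for illustration, and unlike the one-liner above the loop is bounded by an iteration count so that it terminates on its own.

```shell
# Sketch: start/destroy several domains concurrently, a fixed number of times each.
# VIRSH can be overridden (e.g. VIRSH=echo) for a dry run.
VIRSH="${VIRSH:-virsh}"

# Start and destroy one domain n times; stop at the first failure.
churn() {
    vm="$1"; n="$2"; i=0
    while [ "$i" -lt "$n" ]; do
        "$VIRSH" start "$vm" && "$VIRSH" destroy "$vm" || return 1
        i=$((i + 1))
    done
}

# Run churn for every listed domain in the background, then wait for all of them.
churn_all() {
    n="$1"; shift
    for vm in "$@"; do
        churn "$vm" "$n" &
    done
    wait
}
```

For example, "churn_all 50 vm1 vm2 vm3" would run 50 start/destroy iterations on each of three hypothetical domains in parallel.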
Best Regards,
Petr