On Tue, 2019-12-10 at 14:54 +0000, Daniel P. Berrangé wrote:
On Tue, Dec 10, 2019 at 02:54:22PM +0100, Andrea Bolognani wrote:
> The only reason why I'm even questioning whether this should be done
> is capacity for the hypervisor host: the machine we're running all
> builders on has
>
> CPUs: 8
> Memory: 32 GiB
> Storage: 450 GiB
>
> and each of the guests is configured to use
>
> CPUs: 2
> Memory: 2 GiB
> Storage: 20 GiB
>
> So while we're good, and actually have plenty of room to grow, on
> the memory and storage front, we're already overcommitting our CPUs
> pretty significantly, which I guess is at least part of the reason
> why builds take so long.
NB the memory that's free is not really free - it is being useful
as I/O cache for the VM disks. So more VMs will reduce I/O cache.
Whether that will actually impact us I don't know though.
Oh yeah, I'm aware of that. But like you, I'm unsure about the exact
impact that has.
We could arguably reduce the amount of memory assigned to guests to
1 GiB: building libvirt and friends is not exactly memory-intensive,
and we could potentially benefit from the additional I/O cache or
lessen the blow caused by adding more VMs.
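
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The 32 GiB host size and the 2 GiB / 1 GiB guest sizes come from
this thread; the guest count is a hypothetical placeholder, and the
estimate ignores host OS and QEMU overhead, so treat the result as an
upper bound on what is left for the page cache.

```python
HOST_MEMORY_GIB = 32  # host memory, as listed earlier in the thread


def host_memory_left(guest_count, guest_memory_gib, host_gib=HOST_MEMORY_GIB):
    """Memory (GiB) not assigned to guests, i.e. potentially usable as I/O cache.

    Ignores host OS and per-VM QEMU overhead, so this is an upper bound.
    """
    return host_gib - guest_count * guest_memory_gib


if __name__ == "__main__":
    guests = 8  # hypothetical guest count, not taken from the thread
    for size in (2, 1):
        left = host_memory_left(guests, size)
        print(f"{guests} guests at {size} GiB each: {left} GiB left on the host")
```
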
More importantly though, AFAICT, those are not 8 real CPUs.
virsh nodeinfo reports 8 cores, but virsh capabilities
reports it as a 1 socket, 4 core, 2 thread CPU.
IOW we haven't really got 8 CPUs, more like the equivalent of 5 CPUs,
as HT only really gives a x1.3 boost in the best case, and I suspect
builds are not likely to be hitting the best case.
That's also true.
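
For reference, the estimate above works out as 4 cores x 1.3 = 5.2
effective CPUs. A minimal Python sketch of the resulting overcommit
ratio follows; the x1.3 factor and 2 vCPUs per guest are from this
thread, while the guest counts in the loop are hypothetical.

```python
PHYSICAL_CORES = 4         # from virsh capabilities: 1 socket, 4 cores, 2 threads
HT_BEST_CASE_BOOST = 1.3   # best-case hyperthreading gain quoted in the thread
VCPUS_PER_GUEST = 2        # current guest configuration


def effective_cpus(cores=PHYSICAL_CORES, boost=HT_BEST_CASE_BOOST):
    """Best-case CPU capacity once hyperthreading is taken into account."""
    return cores * boost


def overcommit_ratio(guest_count, vcpus_per_guest=VCPUS_PER_GUEST):
    """Ratio of vCPUs handed out to guests versus effective host CPUs."""
    return guest_count * vcpus_per_guest / effective_cpus()


if __name__ == "__main__":
    print(f"effective CPUs: {effective_cpus():.1f}")
    for guests in (8, 10, 12):  # hypothetical guest counts
        print(f"{guests} guests: {overcommit_ratio(guests):.2f}x overcommit")
```
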
> Can we afford to add 50% more load on the machine without making it
> unusable? I don't know. But I think it would be worthwhile to at
> least try and see how it handles an additional 25%, which is exactly
> what this series does.
Giving it a try is ok I guess.
Can I have an R-b then? O:-)
I expect there's probably more we can do to optimize the setup
too.
For example, what actual features of qcow2 are we using? We're
not snapshotting VMs, we don't need grow-on-demand allocation.
AFAICT we're paying the performance cost of qcow2 (l1/l2 table
lookups & metadata caching), for no reason. Switching the VMs to
fully pre-allocated raw files may improve I/O performance.
Raw LVM VGs would be even better but that will be painful
to set up given how the host is installed.
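
As an illustration of the qcow2-to-raw switch, the sketch below builds
(but does not run) a qemu-img command line for converting an image to a
fully pre-allocated raw file. The image paths are hypothetical, and the
guest would need to be shut down and its disk XML updated to match; this
is not something the current tooling is known to do.

```python
def build_convert_command(src_qcow2, dst_raw):
    """Return the argv for converting a qcow2 image to a fully
    pre-allocated raw file. Nothing is executed here."""
    return [
        "qemu-img", "convert",
        "-f", "qcow2",              # source format
        "-O", "raw",                # destination format
        "-o", "preallocation=full", # allocate all blocks up front
        src_qcow2,
        dst_raw,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_convert_command(
        "/var/lib/libvirt/images/example-guest.qcow2",
        "/var/lib/libvirt/images/example-guest.raw",
    )
    print(" ".join(cmd))
```
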
I also wonder if we have the optimal aio setting for disks,
as there's nothing in the XML.
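
One way to check what the guests currently use is to look at the
<driver> element of each disk in the domain XML, where libvirt records
the image format plus the optional cache and io attributes. A small
Python sketch follows; the sample XML is made up for illustration, and
in practice the input would come from 'virsh dumpxml'.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML, trimmed to the parts that matter here.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>example-guest</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/example-guest.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_driver_settings(domain_xml):
    """List format, cache and io settings for every disk in a domain XML.

    Attributes missing from the XML are reported as None, meaning the
    hypervisor default applies.
    """
    root = ET.fromstring(domain_xml)
    settings = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        attrs = driver.attrib if driver is not None else {}
        settings.append({
            "target": target.get("dev") if target is not None else None,
            "format": attrs.get("type"),
            "cache": attrs.get("cache"),
            "io": attrs.get("io"),
        })
    return settings


if __name__ == "__main__":
    for entry in disk_driver_settings(SAMPLE_DOMAIN_XML):
        print(entry)
```
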
We could consider using cache=unsafe for VMs, though for
that I think we'd want to split off a separate disk
for /home/jenkins so that if there was a host OS crash,
we wouldn't have to rebuild the entire VMs - just throw
away the data disk & recreate.
The home directory for the jenkins user contains some configuration
as well, so you'd have to run 'lcitool update' anyway after attaching
the new disk... At that point, it might be less work overall to just
rebuild the entire VM from scratch. We have only seen a couple of
unexpected host shutdowns so far, and all were caused by hardware
issues that resulted in a multi-day downtime anyway, so the overhead
of reinstalling the various guest OSes would not be a dealbreaker.
Since we've got plenty of RAM, another obvious thing would be
to turn on huge pages and use them for all guest RAM. This may
well give a very significant performance boost by reducing
CPU overhead, which is our biggest bottleneck.
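
For sizing purposes, a rough Python sketch of how many huge pages the
host would have to reserve to back all guest RAM. It assumes 2 MiB
pages (the usual x86_64 default, not confirmed for this host) and a
hypothetical guest count; the 2 GiB guest size is from this thread. On
the libvirt side, guests would additionally need a <memoryBacking>
element containing <hugepages/> in their XML.

```python
def hugepages_needed(guest_count, guest_memory_mib=2048, page_size_mib=2):
    """Number of huge pages required to back all guest RAM.

    Uses ceiling division so a partially filled page still counts.
    Assumes 2 MiB pages unless told otherwise.
    """
    pages_per_guest = -(-guest_memory_mib // page_size_mib)
    return guest_count * pages_per_guest


if __name__ == "__main__":
    guests = 8  # hypothetical guest count, not taken from the thread
    print(f"{guests} guests need {hugepages_needed(guests)} huge pages of 2 MiB")
```

The result is the kind of value one would feed to the vm.nr_hugepages
sysctl on the host, leaving the remaining memory for the page cache.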
Yeah, all of the above sounds like it could help, but I'm not really
well versed on the performance tuning front so I wouldn't know where
to start or even how to properly measure the impact of each change.
We also have to keep in mind that, while the CentOS CI environment
is the main consumer of the repository, we want it to be possible for
any developer to deploy builders locally relatively painlessly, so
things like hugepages and LVM usage would definitely have to be
optional if we introduced them.
--
Andrea Bolognani / Red Hat / Virtualization