On Tue, Jan 29, 2019 at 06:18:21PM -0800, nico wrote:
Hi folks,
First time contributor, but I felt that what I discovered was (probably) a
very rare situation.
I'm running a CentOS server (my only Linux deployment) to which customers
all over the U.S. connect to process their micro-lender businesses. There
are several VMs, among them one which runs the fortress system, called a2.
In the beginning the .raw file was about 10GB, which was a 5X overkill in
terms of capacity, at the time.
For years we had no problems and the CentOS box would tick over day after
day without as much as a hiccup.
About three months ago a2 started to slow down, almost to the point of
timing out when applications and users log on. The band-aid was to copy an
earlier a2.raw backup over the current one on a regular basis, and it would
rectify the problem. At first, applying this band-aid on Sunday nights would
suffice. But later we had to increase it to twice a week, and these last
couple of weeks we had to do it almost every night. The system also sent
alerts that a "Degraded Array event had been detected on md device
/dev/md1". Inspecting the drives showed no crisis.
Today it folded completely and brought the system down, leaving clients to
tell customers walking into their stores that "our computers are down".
Restarting the box just brought a2 to a paused state, never recovering. We
had to killall to get rid of it.
Having nowhere else to go with it, I decided to rebuild a2 in another,
separate drive to at least address the degraded array alerts. As I edited
the .xml file, I saw the following:
<source file='/var/lib/libvirt/:machines/a2/a2-disk1.raw'/>
What the hell was that colon doing there? I checked the size of the .raw
file. It had grown to over 96GB. As a sanity check, I looked at the
other VMs' .xml files and they didn't have a colon, as I expected.
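A check like this across several domain .xml files could be scripted. The following is only a minimal sketch, not something that was run on this system: it assumes the domain XML is available as a string, and the function names disk_source_files and paths_with_colon are made up for illustration.

```python
import xml.etree.ElementTree as ET


def disk_source_files(domain_xml):
    """Return the 'file' attribute of every <source> element in a domain XML string."""
    root = ET.fromstring(domain_xml)
    return [src.get("file") for src in root.iter("source") if src.get("file")]


def paths_with_colon(paths):
    """Return only the paths that contain a ':' character."""
    return [p for p in paths if ":" in p]


# Sample input built from the <source> line quoted above; the surrounding
# <domain>/<devices>/<disk> wrapper is a simplified assumption.
sample = (
    "<domain><devices><disk type='file'>"
    "<source file='/var/lib/libvirt/:machines/a2/a2-disk1.raw'/>"
    "</disk></devices></domain>"
)
print(paths_with_colon(disk_source_files(sample)))
```

Feeding each VM's XML through the same two functions would list any other disk paths carrying a stray colon.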
I removed the colon and virsh-started a2, which fired up immediately, with
the rest of the system following suit. No doubt that ":" was the culprit!
My question is: Would that colon cause an append-action to the .raw file? We
have no idea when it got in there or how. We haven't worked on that xml file
for a long time. Why would a2 even fire up at all?
QEMU's -drive command line syntax allows for a ":" to denote use of a
particular QEMU block driver backend. So conceivably ":" in the filename
could confuse QEMU, but "/var/lib/libvirt/" would not be interpreted as
any kind of QEMU block driver AFAICT. In fact I'm rather puzzled how
it would work at all unless you really do have a directory called
"/var/lib/libvirt/:machines" on your host. I also can't explain why
QEMU would make that file grow arbitrarily. A raw file is a fixed size
from QEMU's pov and won't ever change unless you issue a "resize"
command to QEMU via libvirt.
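To tell whether the file's nominal size is really changing, or whether a sparse file is merely being filled in, it may help to compare the apparent size with the space actually allocated on disk. This is a minimal sketch assuming a Linux host where os.stat reports st_blocks in 512-byte units; size_report is a hypothetical helper name, not part of libvirt or QEMU.

```python
import os


def size_report(path):
    """Return the apparent size and the allocated size of a file, in bytes."""
    st = os.stat(path)
    return {
        # Size as seen by anything reading the file (the guest's disk size for a raw image).
        "apparent_bytes": st.st_size,
        # Space actually consumed on the host filesystem; assumes 512-byte block units.
        "allocated_bytes": st.st_blocks * 512,
    }
```

Running size_report on the image at intervals and logging the result would show which of the two numbers is moving. If apparent_bytes changes, something outside normal guest I/O is resizing or rewriting the file.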
Is there any way you might have some background job running
that would resize the file, either directly or by talking to libvirt
or QEMU?
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|