[libvirt-users] libvirt possibly ignoring cache=none ?

Hi,

I have an instance with 8G of RAM assigned. All block devices have caching disabled (cache=none) on the host. However, the cgroup is reporting 4G of cache associated with the instance (on the host):

# cgget -r memory.stat libvirt/qemu/i-000009fa
libvirt/qemu/i-000009fa: memory.stat: cache 4318011392
rss 8676360192
...

When I drop all system caches on the host..

# echo 3 > /proc/sys/vm/drop_caches

..the cache associated with the instance drops too:

# cgget -r memory.stat libvirt/qemu/i-000009fa
libvirt/qemu/i-000009fa: memory.stat: cache 122880
rss 8674291712
...

Can somebody explain what is being cached, if there is cache=none everywhere?

Thanks,

Brano Zarnovican

PS: versions:

Scientific Linux release 6.4 (Carbon)
kernel-2.6.32-358.11.1.el6.x86_64
qemu-kvm-0.12.1.2-2.355.el6_4.5.x86_64
libvirt-0.10.2-18.el6_4.5.x86_64

On 08/07/2013 02:24 PM, Brano Zarnovican wrote:
Hi,
I have an instance with 8G of RAM assigned. All block devices have caching disabled (cache=none) on the host. However, the cgroup is reporting 4G of cache associated with the instance (on the host):

# cgget -r memory.stat libvirt/qemu/i-000009fa
libvirt/qemu/i-000009fa: memory.stat: cache 4318011392
rss 8676360192
...

When I drop all system caches on the host..

# echo 3 > /proc/sys/vm/drop_caches

..the cache associated with the instance drops too:

# cgget -r memory.stat libvirt/qemu/i-000009fa
libvirt/qemu/i-000009fa: memory.stat: cache 122880
rss 8674291712
...

Can somebody explain what is being cached, if there is cache=none everywhere?
At first let me explain that libvirt is not ignoring the cache=none. This is propagated to qemu as a parameter for its disk. From qemu's POV (anyone feel free to correct me if I'm mistaken) this means the file is opened with the O_DIRECT flag; and from the open(2) manual, O_DIRECT means "Try to minimize cache effects of the I/O to and from this file...", which doesn't necessarily mean there is no cache at all.

But even if it did, this applies only to the files used as disks, and those disks are not the only files the process is using. You can check what other files the process has mapped, opened etc. from the '/proc' filesystem or using the 'lsof' utility. All the other files can (and probably will) take some cache and there is nothing wrong with that.

Are you trying to resolve an issue or just asking out of curiosity? Because this is expected behavior and there should be no need for anyone to minimize it.

Have a nice day,

Martin
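PS: For example, assuming the emulator binary on your EL6 host is qemu-kvm and only one guest is running (otherwise pick the right PID from ps), something like this lists everything the process has open or mapped besides its disks (shared libraries, log files, monitor sockets and so on):

# lsof -p $(pidof qemu-kvm)
# ls -l /proc/$(pidof qemu-kvm)/fd
# cat /proc/$(pidof qemu-kvm)/maps

Any of those files can end up in the host page cache, because the O_DIRECT hint applies only to the disks.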

On Thu, Aug 8, 2013 at 9:39 AM, Martin Kletzander <mkletzan@redhat.com> wrote:
At first let me explain that libvirt is not ignoring the cache=none. This is propagated to qemu as a parameter for its disk. From qemu's POV (anyone feel free to correct me if I'm mistaken) this means the file is opened with the O_DIRECT flag; and from the open(2) manual, O_DIRECT means "Try to minimize cache effects of the I/O to and from this file...", which doesn't necessarily mean there is no cache at all.
Thanks for the explanation.
But even if it did, this applies only to the files used as disks, and those disks are not the only files the process is using. You can check what other files the process has mapped, opened etc. from the '/proc' filesystem or using the 'lsof' utility. All the other files can (and probably will) take some cache and there is nothing wrong with that.
In my case there was 4GB of cache. Just now, I thrashed one instance with many reads/writes on various devices, tens of GB of data in total, but the cache (on the host) did not grow beyond 3MB. I'm not yet able to reproduce the problem.
Are you trying to resolve an issue or just asking out of curiosity? Because this is expected behavior and there should be no need for anyone to minimize it.
Once or twice, one of our VMs was OOM killed because it reached the 1.5 * memory limit of its cgroup.

Here is an 8GB instance. Libvirt created a cgroup with a 12.3GB memory limit, which we have filled to 98%:

[root@dev-cmp08 ~]# cgget -r memory.limit_in_bytes -r memory.usage_in_bytes libvirt/qemu/i-000009fa
libvirt/qemu/i-000009fa: memory.limit_in_bytes: 13215727616
memory.usage_in_bytes: 12998287360

The 4G difference is the cache. That's why I'm so interested in what is consuming the cache on a VM which should be caching in the guest only.

Regards,

Brano Zarnovican

On 08/08/2013 05:03 PM, Brano Zarnovican wrote:
On Thu, Aug 8, 2013 at 9:39 AM, Martin Kletzander <mkletzan@redhat.com> wrote:
At first let me explain that libvirt is not ignoring the cache=none. This is propagated to qemu as a parameter for its disk. From qemu's POV (anyone feel free to correct me if I'm mistaken) this means the file is opened with the O_DIRECT flag; and from the open(2) manual, O_DIRECT means "Try to minimize cache effects of the I/O to and from this file...", which doesn't necessarily mean there is no cache at all.
Thanks for the explanation.
But even if it did, this applies only to the files used as disks, and those disks are not the only files the process is using. You can check what other files the process has mapped, opened etc. from the '/proc' filesystem or using the 'lsof' utility. All the other files can (and probably will) take some cache and there is nothing wrong with that.
In my case there was 4GB of cache.
Just now, I thrashed one instance with many reads/writes on various devices, tens of GB of data in total, but the cache (on the host) did not grow beyond 3MB. I'm not yet able to reproduce the problem.
Are you trying to resolve an issue or just asking out of curiosity? Because this is expected behavior and there should be no need for anyone to minimize it.
Once or twice, one of our VMs was OOM killed because it reached the 1.5 * memory limit of its cgroup.
Oh, please report this to us. This is one of the problems we will, unfortunately, be dealing with forever, I guess. This limit is just a "guess" at how much qemu might take, and we set it to make sure the host is not overwhelmed in case qemu is faulty or hacked. Since it is never possible to set this exactly, it has already happened that qemu was killed thanks to cgroups, and we had to increase the limit.

I Cc'd Michal, who might be the right person to know about any further increase.

However, this behavior won't change because of caches. The kernel knows that cached data can be discarded, so before killing the process it drops the unneeded caches, and only when there is nothing left to drop does it fall back to killing the process.
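If you want to check whether the limit has actually been hit on your host, the memory controller keeps counters for that; for example (same cgroup path as in your mail):

# cgget -r memory.failcnt -r memory.max_usage_in_bytes libvirt/qemu/i-000009fa

memory.failcnt is the number of times usage bumped into the limit, and memory.max_usage_in_bytes is the recorded peak.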
Here is an 8GB instance. Libvirt created a cgroup with a 12.3GB memory limit, which we have filled to 98%:
The more of it that is filled with caches, the better; but if none of it is cache, whoa, then the limit should be increased.
[root@dev-cmp08 ~]# cgget -r memory.limit_in_bytes -r memory.usage_in_bytes libvirt/qemu/i-000009fa
libvirt/qemu/i-000009fa: memory.limit_in_bytes: 13215727616
memory.usage_in_bytes: 12998287360
You can get rid of these problems by setting your own memory limits. The default limit is applied only if there is no <memtune> setting in the domain XML: http://libvirt.org/formatdomain.html#elementsMemoryTuning
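For example, something like this should set a hard limit of 9 GiB in the persistent config (the value is in kibibytes on this libvirt version; 9 GiB is only an illustration, pick whatever headroom fits your workload, and add --live if you also want to change the running guest):

# virsh memtune i-000009fa --hard-limit 9437184 --config
# virsh memtune i-000009fa

Running 'virsh memtune' with no options prints the current values; the setting is stored as a <hard_limit> element inside <memtune> in the domain XML.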
The 4G difference is the cache. That's why I'm so interested in what is consuming the cache on a VM which should be caching in the guest only.
Regards,
Brano Zarnovican
Hope this helps, Martin

On 09.08.2013 13:39, Martin Kletzander wrote:
On 08/08/2013 05:03 PM, Brano Zarnovican wrote:
On Thu, Aug 8, 2013 at 9:39 AM, Martin Kletzander <mkletzan@redhat.com> wrote:
At first let me explain that libvirt is not ignoring the cache=none. This is propagated to qemu as a parameter for its disk. From qemu's POV (anyone feel free to correct me if I'm mistaken) this means the file is opened with the O_DIRECT flag; and from the open(2) manual, O_DIRECT means "Try to minimize cache effects of the I/O to and from this file...", which doesn't necessarily mean there is no cache at all.
Thanks for the explanation.
But even if it did, this applies only to the files used as disks, and those disks are not the only files the process is using. You can check what other files the process has mapped, opened etc. from the '/proc' filesystem or using the 'lsof' utility. All the other files can (and probably will) take some cache and there is nothing wrong with that.
In my case there was 4GB of cache.
Just now, I thrashed one instance with many reads/writes on various devices, tens of GB of data in total, but the cache (on the host) did not grow beyond 3MB. I'm not yet able to reproduce the problem.
Are you trying to resolve an issue or just asking out of curiosity? Because this is expected behavior and there should be no need for anyone to minimize it.
Once or twice, one of our VMs was OOM killed because it reached the 1.5 * memory limit of its cgroup.
Oh, please report this to us. This is one of the problems we will, unfortunately, be dealing with forever, I guess. This limit is just a "guess" at how much qemu might take, and we set it to make sure the host is not overwhelmed in case qemu is faulty or hacked. Since it is never possible to set this exactly, it has already happened that qemu was killed thanks to cgroups, and we had to increase the limit.
I Cc'd Michal, who might be the right person to know about any further increase.
Sometimes I feel like I should not have added this functionality. Guessing the correct limit for a process is like solving the halting problem: it cannot be calculated by any algorithm, and the best we can do is increase the limit once somebody is already in trouble. D'oh!

Moreover, if somebody comes by and tells us about it, we blindly size the limit up without knowing for sure that qemu is not leaking memory (in which case the limit was right and the OOM killer did the right thing). The more problems are reported, the closer I am to writing a patch that removes this heuristic.

Michal

On Fri, Aug 09, 2013 at 01:54:31PM +0200, Michal Privoznik wrote:
On 09.08.2013 13:39, Martin Kletzander wrote:
On 08/08/2013 05:03 PM, Brano Zarnovican wrote:
On Thu, Aug 8, 2013 at 9:39 AM, Martin Kletzander <mkletzan@redhat.com> wrote:
At first let me explain that libvirt is not ignoring the cache=none. This is propagated to qemu as a parameter for its disk. From qemu's POV (anyone feel free to correct me if I'm mistaken) this means the file is opened with the O_DIRECT flag; and from the open(2) manual, O_DIRECT means "Try to minimize cache effects of the I/O to and from this file...", which doesn't necessarily mean there is no cache at all.
Thanks for the explanation.
But even if it did, this applies only to the files used as disks, and those disks are not the only files the process is using. You can check what other files the process has mapped, opened etc. from the '/proc' filesystem or using the 'lsof' utility. All the other files can (and probably will) take some cache and there is nothing wrong with that.
In my case there was 4GB of cache.
Just now, I thrashed one instance with many reads/writes on various devices, tens of GB of data in total, but the cache (on the host) did not grow beyond 3MB. I'm not yet able to reproduce the problem.
Are you trying to resolve an issue or just asking out of curiosity? Because this is expected behavior and there should be no need for anyone to minimize it.
Once or twice, one of our VMs was OOM killed because it reached the 1.5 * memory limit of its cgroup.
Oh, please report this to us. This is one of the problems we will, unfortunately, be dealing with forever, I guess. This limit is just a "guess" at how much qemu might take, and we set it to make sure the host is not overwhelmed in case qemu is faulty or hacked. Since it is never possible to set this exactly, it has already happened that qemu was killed thanks to cgroups, and we had to increase the limit.
I Cc'd Michal, who might be the right person to know about any further increase.
Sometimes I feel like I should not have added this functionality. Guessing the correct limit for a process is like solving the halting problem: it cannot be calculated by any algorithm, and the best we can do is increase the limit once somebody is already in trouble. D'oh!

Moreover, if somebody comes by and tells us about it, we blindly size the limit up without knowing for sure that qemu is not leaking memory (in which case the limit was right and the OOM killer did the right thing). The more problems are reported, the closer I am to writing a patch that removes this heuristic.
Yeah, it is a real pain. Further to what you say, if we can't figure out how to get the default limit right, how on earth are people supposed to know how to set the limits manually either?

I'm really not sure what we should do here. We need to be able to support memory limits to avoid a DoS attack from a broken or compromised QEMU, but we're clearly lacking understanding / knowledge here, or QEMU's behaviour is just not predictable enough, which is arguably also something that needs addressing.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Hi,

when the instance was OOM killed (output attached), we implemented a Nagios check to monitor instances that are close to the limit. However, now we are getting false alarms, because an instance can get close to the cgroup limit for perfectly valid reasons.
However, this behavior won't change because of caches. The kernel knows that cached data can be discarded, so before killing the process it drops the unneeded caches, and only when there is nothing left to drop does it fall back to killing the process.
I guess, in the check, we will have to subtract the cache size from 'memory.usage_in_bytes'. It's still puzzling me how instances with caching disabled for all their block devices can accumulate such large caches on the host.

Thanks all for your time,

Regards,

Brano Zarnovican
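PS: Something along these lines is what I have in mind for the check (a rough sketch only; the 90% threshold and the cgroup path are just examples, and the parsing assumes cgget's -n/-v options print the bare values):

#!/bin/bash
# compare (usage - cache) against the cgroup limit instead of raw usage
CG=libvirt/qemu/i-000009fa
LIMIT=$(cgget -n -v -r memory.limit_in_bytes $CG)
USAGE=$(cgget -n -v -r memory.usage_in_bytes $CG)
CACHE=$(cgget -n -v -r memory.stat $CG | awk '$1 == "cache" {print $2; exit}')
USED=$((USAGE - CACHE))
# warn only when non-cache usage crosses 90% of the limit
if [ "$USED" -gt $((LIMIT * 90 / 100)) ]; then
    echo "WARNING: $CG uses $USED of $LIMIT bytes (cache excluded)"
fi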
participants (4)
- Brano Zarnovican
- Daniel P. Berrange
- Martin Kletzander
- Michal Privoznik