Re: [libvirt] blkio cgroup

On Fri, Feb 18, 2011 at 03:42:45PM +0100, Dominik Klein wrote:
Hi Vivek
I don't know whether you follow the libvirt list; I assume you don't. So I thought I'd forward you an e-mail involving the blkio controller and a terrible situation arising from using it (maybe in a wrong way).
I'd truly appreciate it if you read it and commented on it. Maybe I did something wrong, but maybe I also found a bug of some kind.
Hi Dominik, thanks for forwarding me this mail. Yes, I am not on libvir-list; I have just now subscribed. A few questions inline.
-------- Original Message -------- Subject: Re: [libvirt] [PATCH 0/6 v3] Add blkio cgroup support Date: Fri, 18 Feb 2011 14:42:51 +0100 From: Dominik Klein <dk@in-telegence.net> To: libvir-list@redhat.com
Hi
back with some testing results.
How about starting the guest with the option "cache=none" to bypass the page cache? This should help, I think.
I will read up on where to set that and give it a try. Thanks for the hint.
So here's what I did and found out:
The host system has two 12-core CPUs and 128 GB of RAM.
I have 8 test VMs named kernel1 to kernel8. Each VM has 4 VCPUs, 2 GB of RAM and one disk, which is an LV on the host. Cache mode is "none":
So you have only one root SATA disk and set up a linear logical volume on that? If not, can you give more info about the storage configuration?
- I am assuming you are using CFQ on your underlying physical disk.
- What kernel version are you testing with?
- cache=none mode is good; it should make all the IO O_DIRECT on the host and it should show up as SYNC IO on CFQ without losing io context info. The only problem is the intermediate dm layer and whether it is changing the io context somehow. I am not sure at this point of time.
- Is it possible to capture a 10-15 second blktrace on your underlying physical device? That should give me some idea of what's happening.
- Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1 on your underlying physical device where CFQ is running and see if it makes any difference?
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
  virsh dumpxml $vm | grep cache
done
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
<driver name='qemu' type='raw' cache='none'/>
My goal is to give more I/O time to kernel1 and kernel2 than to the rest of the VMs.
mount -t cgroup -o blkio none /mnt
cd /mnt
mkdir important
mkdir notimportant

echo 1000 > important/blkio.weight
echo 100 > notimportant/blkio.weight

for vm in kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
  cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
  for task in *; do
    /bin/echo $task > /mnt/notimportant/tasks
  done
done

for vm in kernel1 kernel2; do
  cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
  for task in *; do
    /bin/echo $task > /mnt/important/tasks
  done
done
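For completeness, a quick way to double-check that the weights and the thread placement actually took effect could look like this (a sketch, assuming the cgroup mount at /mnt used above):

# read the configured weights back
cat /mnt/important/blkio.weight /mnt/notimportant/blkio.weight
# verify that every qemu-kvm thread of an important VM really landed in the important group
for vm in kernel1 kernel2; do
  for task in /proc/$(pgrep -f "qemu-kvm.*$vm")/task/*; do
    grep -qx "$(basename $task)" /mnt/important/tasks || echo "$vm thread $(basename $task) is missing"
  done
done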
Then I used cssh to connect to all 8 VMs and execute dd if=/dev/zero of=testfile bs=1M count=1500 in all VMs simultaneously.
Results are:
kernel1: 47.5593 s, 33.1 MB/s
kernel2: 60.1464 s, 26.2 MB/s
kernel3: 74.204 s, 21.2 MB/s
kernel4: 77.0759 s, 20.4 MB/s
kernel5: 65.6309 s, 24.0 MB/s
kernel6: 81.1402 s, 19.4 MB/s
kernel7: 70.3881 s, 22.3 MB/s
kernel8: 77.4475 s, 20.3 MB/s
Results vary a little bit from run to run, but the spread is nothing like what weights of 1000 vs. 100 would suggest.
So I went and tried to throttle I/O of kernel3-8 to 10 MB/s instead of weighting I/O. First I rebooted everything so that no old cgroup configuration was left in place, and then set everything up except the 100 and 1000 weight configuration.
Quote from blkio.txt:
------------
- blkio.throttle.write_bps_device
  - Specifies upper limit on WRITE rate to the device. IO rate is specified in bytes per second. Rules are per device. Following is the format.
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.write_bps_device
-------------
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
  ls -lH /dev/vdisks/$vm
done
brw-rw---- 1 root root 254, 23 Feb 18 13:45 /dev/vdisks/kernel1
brw-rw---- 1 root root 254, 24 Feb 18 13:45 /dev/vdisks/kernel2
brw-rw---- 1 root root 254, 25 Feb 18 13:45 /dev/vdisks/kernel3
brw-rw---- 1 root root 254, 26 Feb 18 13:45 /dev/vdisks/kernel4
brw-rw---- 1 root root 254, 27 Feb 18 13:45 /dev/vdisks/kernel5
brw-rw---- 1 root root 254, 28 Feb 18 13:45 /dev/vdisks/kernel6
brw-rw---- 1 root root 254, 29 Feb 18 13:45 /dev/vdisks/kernel7
brw-rw---- 1 root root 254, 30 Feb 18 13:45 /dev/vdisks/kernel8
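As an aside, the major:minor numbers can also be derived straight from the device nodes instead of being read off the ls output - a small sketch (assuming bash and the same /dev/vdisks paths):

for vm in kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
  dev=/dev/vdisks/$vm
  # stat prints major/minor in hex, so convert to decimal for the cgroup rule
  major=$((0x$(stat -Lc '%t' $dev)))
  minor=$((0x$(stat -Lc '%T' $dev)))
  echo "$dev -> $major:$minor"
done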
/bin/echo 254:25 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
/bin/echo 254:26 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
/bin/echo 254:27 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
/bin/echo 254:28 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
/bin/echo 254:29 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
/bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
/bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
Then I ran the previous test again. This resulted in an ever-increasing load on the host system (last I checked was ~300). (This is perfectly reproducible.)
uptime
Fri Feb 18 14:42:17 2011
 14:42:17 up 12 min, 9 users, load average: 286.51, 142.22, 56.71
Have you run top or something to figure out why the load average is shooting up? I suspect that because of the throttling limit, IO threads have been blocked and qemu is forking more IO threads. Can you just run top/ps and figure out what's happening?

Again, is it some kind of linear volume group from which you have carved out logical volumes for each virtual machine?

For throttling, to begin with, can we do a simple test first? That is, run a single virtual machine, put some throttling limit on the logical volume and try to do READs. Once READs work, let's test WRITES and check why the system load goes up.

Thanks
Vivek
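A possible way to watch how many qemu-kvm threads each VM has spawned while the dd jobs run (a sketch, using the kernelN names from this thread):

for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
  pid=$(pgrep -f "qemu-kvm.*$vm")
  # each entry under /proc/<pid>/task is one thread
  echo "$vm: $(ls /proc/$pid/task | wc -l) threads"
done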

On Fri, Feb 18, 2011 at 11:31:37AM -0500, Vivek Goyal wrote: [..]
So I went and tried to throttle I/O of kernel3-8 to 10MB/s instead of weighing I/O. First I rebooted everything so that no old configuration of cgroup was left in place and then setup everything except the 100 and 1000 weight configuration.
quote from blkio.txt: ------------ - blkio.throttle.write_bps_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in bytes per second. Rules are per device. Following is the format.
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.write_bps_device -------------
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do ls -lH /dev/vdisks/$vm; done brw-rw---- 1 root root 254, 23 Feb 18 13:45 /dev/vdisks/kernel1 brw-rw---- 1 root root 254, 24 Feb 18 13:45 /dev/vdisks/kernel2 brw-rw---- 1 root root 254, 25 Feb 18 13:45 /dev/vdisks/kernel3 brw-rw---- 1 root root 254, 26 Feb 18 13:45 /dev/vdisks/kernel4 brw-rw---- 1 root root 254, 27 Feb 18 13:45 /dev/vdisks/kernel5 brw-rw---- 1 root root 254, 28 Feb 18 13:45 /dev/vdisks/kernel6 brw-rw---- 1 root root 254, 29 Feb 18 13:45 /dev/vdisks/kernel7 brw-rw---- 1 root root 254, 30 Feb 18 13:45 /dev/vdisks/kernel8
/bin/echo 254:25 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:26 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:27 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:28 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:29 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
Then I ran the previous test again. This resulted in an ever increasing load (last I checked was ~ 300) on the host system. (This is perfectly reproducible).
uptime Fri Feb 18 14:42:17 2011 14:42:17 up 12 min, 9 users, load average: 286.51, 142.22, 56.71
Have you run top or something to figure out why the load average is shooting up? I suspect that because of the throttling limit, IO threads have been blocked and qemu is forking more IO threads. Can you just run top/ps and figure out what's happening?
Again, is it some kind of linear volume group from which you have carved out logical volumes for each virtual machine?
For throttling, to begin with, can we do a simple test first? That is, run a single virtual machine, put some throttling limit on the logical volume and try to do READs. Once READs work, let's test WRITES and check why the system load goes up.
Ok, I was playing a bit with the blkio throttling policy and it is working for me. Following is my setup.

- Kernel 2.6.38-rc5

- I have some LUNs exported from a storage array. I created a multipath device on top of one and then partitioned it. I exported one of the partitions to the virtual machine as a virtio disk. So in the guest I see the virtual disk /dev/vdb and the corresponding device on the host is /dev/dm-2. I am using the cache=none option.

- I create a test cgroup /cgroup/blkio/test1/ and move the virtual machine there.

- Run "time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500" with and without any throttling limits.

Without throttling limits:

# time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 6.37301 s, 247 MB/s

real 0m6.446s
user 0m0.003s
sys 0m4.074s

With throttling limit (50MB/s write limit)
-----------------------------------------
# time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 30.1553 s, 52.2 MB/s

real 0m30.422s
user 0m0.004s
sys 0m2.102s

So throttling seems to be working for me as expected. Without throttling I got 247 MB/s bandwidth, and when I put a 50 MB/s limit on the virtual machine I get 52.2 MB/s. I had used the following command to limit the virtual machine.

echo "252:2 50000000" > /cgroup/blkio/test1/blkio.throttle.write_bps_device

I was also running "vmstat 2" when the throttled machine did IO and noticed that the number of blocked processes went up, around 25-35. I am assuming these are all qemu IO threads blocked waiting for throttled IO to finish. I am not sure if blocked processes also contribute towards the load average. While googling a bit, I found one wiki page which says the following.

"Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system."

If this is true, that explains the high system load in your testing. Throttling is working and we have around 30-35 IO threads/processes per qemu instance. You have 8 qemu instances running and roughly 240-280 processes blocked waiting for IO to finish, and that will explain the high load. But that is expected, given the fact that we are throttling IO?

I also tried direct IO in the virtual machine, and that seems to fork only 1 IO thread.

# time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500 oflag=direct
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 31.4301 s, 50.0 MB/s

real 0m31.664s
user 0m0.003s
sys 0m0.819s

While running this I noticed that the number of blocked processes was 1 all the time, and hence the low load average. Try the oflag=direct option in your tests.

Thanks
Vivek
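To see exactly which tasks are sitting in uninterruptible sleep (state D) and feeding the load average, something like this could be used (a sketch):

# list commands with tasks in state D (uninterruptible sleep), counted per command
ps -eo state,comm | awk '$1 == "D"' | sort | uniq -c | sort -rn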

<snip>
I was also running "vmstat 2" when the throttled machine did IO and noticed that the number of blocked processes went up, around 25-35. I am assuming these are all qemu IO threads blocked waiting for throttled IO to finish. I am not sure if blocked processes also contribute towards the load average.
While googling a bit, I found one wiki page which says the following.
"Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system."
If this is true, that explains the high system load in your testing. Throttling is working and we have around 30-35 IO threads/processes per qemu instance. You have 8 qemu instances running and roughly 240-280 processes blocked waiting for IO to finish, and that will explain the high load. But that is expected, given the fact that we are throttling IO?
I also tried direct IO in virtual machine and that seems to be forking only 1 IO thread.
# time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500 oflag=direct 1500+0 records in 1500+0 records out 1572864000 bytes (1.6 GB) copied, 31.4301 s, 50.0 MB/s
real 0m31.664s user 0m0.003s sys 0m0.819s
While running this I noticed that the number of blocked processes was 1 all the time, and hence the low load average. Try the oflag=direct option in your tests.
I can second that. oflag=direct in the vm shows 1 blocked process most of the time on the host. No oflag shows 1-16. Regards Dominik

On Fri, Feb 18, 2011 at 11:31:37AM -0500, Vivek Goyal wrote:
On Fri, Feb 18, 2011 at 03:42:45PM +0100, Dominik Klein wrote:
Hi Vivek
I don't know whether you follow the libvirt list; I assume you don't. So I thought I'd forward you an e-mail involving the blkio controller and a terrible situation arising from using it (maybe in a wrong way).
I'd truly appreciate it if you read it and commented on it. Maybe I did something wrong, but maybe I also found a bug of some kind.
Hi Dominik,
Thanks for forwarding me this mail. Yes, I am not on libvir-list. I have just now subscribed.
Few questions inline.
-------- Original Message -------- Subject: Re: [libvirt] [PATCH 0/6 v3] Add blkio cgroup support Date: Fri, 18 Feb 2011 14:42:51 +0100 From: Dominik Klein <dk@in-telegence.net> To: libvir-list@redhat.com
Hi
back with some testing results.
How about starting the guest with the option "cache=none" to bypass the page cache? This should help, I think.
I will read up on where to set that and give it a try. Thanks for the hint.
So here's what I did and found out:
The host system has two 12-core CPUs and 128 GB of RAM.
I have 8 test VMs named kernel1 to kernel8. Each VM has 4 VCPUs, 2 GB of RAM and one disk, which is an LV on the host. Cache mode is "none":
So you have only one root SATA disk and set up a linear logical volume on that? If not, can you give more info about the storage configuration?
- I am assuming you are using CFQ on your underlying physical disk.
- What kernel version are you testing with.
- cache=none mode is good; it should make all the IO O_DIRECT on the host and it should show up as SYNC IO on CFQ without losing io context info. The only problem is the intermediate dm layer and whether it is changing the io context somehow. I am not sure at this point of time.
- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
- Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1 on your underlying physical device where CFQ is running and see if it makes any difference.
Dominik,

Apart from setting group_isolation=1, I would also recommend doing some tests on READS. Service differentiation is much more visible there. Why? Because in the case of writes I am seeing that there are extended periods where there is no IO on the underlying device from the higher-weight virtual machine. I am not sure what that virtual machine is doing for that duration, but that's what blktrace shows.

First I ran READS. Two partitions exported to two virtual machines. I started "time dd if=/mnt/vdb/testfile of=/dev/zero", and as soon as it finished in the first virtual machine, I stopped the second virtual machine's job as well (manually; there could be a better test script, or use of the fio tool, which allows running timed tests).

[vm1 ~]# time dd if=/mnt/vdb/testfile of=/dev/zero
3072000+0 records in
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 12.35 s, 127 MB/s

real 0m12.503s
user 0m0.527s
sys 0m2.318s

[vm2 ~]# time dd if=/mnt/vdb/testfile of=/dev/zero
420853+0 records in
420852+0 records out
215476224 bytes (215 MB) copied, 12.331 s, 17.5 MB/s

real 0m12.342s
user 0m0.082s
sys 0m0.307s

Here, in the duration of 12 seconds, the first VM did 1.6 GB of READS (weight 1000) and the second VM did 215 MB of READS (weight 100).

Then I did some tests on WRITES, and after setting group isolation, with two virtual machines the following are the results.

[vm1 ~]# time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 6.47411 s, 243 MB/s

real 0m6.711s
user 0m0.002s
sys 0m2.233s

[vm2 ~]# time dd if=/dev/zero of=/mnt/vdb/testfile bs=1M count=1500
388+0 records in
388+0 records out
406847488 bytes (407 MB) copied, 6.68171 s, 60.9 MB/s

real 0m6.739s
user 0m0.002s
sys 0m0.697s

The first machine wrote 1.6 GB while the second machine wrote 400 MB. And some of it could be lying in the second virtual machine's cache and never made it to disk. So this is significant service differentiation, I would say.

Thanks
Vivek

Dominik,

Would you try "oflag=direct" when you do tests in the guests? And make sure /sys/block/xxx/queue/iosched/group_isolation is set to 1.

I guess with such a setting, your tests should go well.

Thanks,
Gui

Vivek Goyal wrote:
On Fri, Feb 18, 2011 at 03:42:45PM +0100, Dominik Klein wrote:
Hi Vivek
I don't know whether you follow the libvirt list; I assume you don't. So I thought I'd forward you an e-mail involving the blkio controller and a terrible situation arising from using it (maybe in a wrong way).
I'd truly appreciate it if you read it and commented on it. Maybe I did something wrong, but maybe I also found a bug of some kind.
Hi Dominik,
Thanks for forwarding me this mail. Yes, I am not on libvir-list. I have just now subscribed.
Few questions inline.
-------- Original Message -------- Subject: Re: [libvirt] [PATCH 0/6 v3] Add blkio cgroup support Date: Fri, 18 Feb 2011 14:42:51 +0100 From: Dominik Klein <dk@in-telegence.net> To: libvir-list@redhat.com
Hi
back with some testing results.
How about starting the guest with the option "cache=none" to bypass the page cache? This should help, I think. I will read up on where to set that and give it a try. Thanks for the hint. So here's what I did and found out:
The host system has two 12-core CPUs and 128 GB of RAM.
I have 8 test VMs named kernel1 to kernel8. Each VM has 4 VCPUs, 2 GB of RAM and one disk, which is an LV on the host. Cache mode is "none":
So you have only one root SATA disk and set up a linear logical volume on that? If not, can you give more info about the storage configuration?
- I am assuming you are using CFQ on your underlying physical disk.
- What kernel version are you testing with.
- cache=none mode is good; it should make all the IO O_DIRECT on the host and it should show up as SYNC IO on CFQ without losing io context info. The only problem is the intermediate dm layer and whether it is changing the io context somehow. I am not sure at this point of time.
- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
- Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1 on your underlying physical device where CFQ is running and see if it makes any difference.
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do virsh dumpxml $vm|grep cache; done <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/>
My goal is to give more I/O time to kernel1 and kernel2 than to the rest of the VMs.
mount -t cgroup -o blkio none /mnt cd /mnt mkdir important mkdir notimportant
echo 1000 > important/blkio.weight echo 100 > notimportant/blkio.weight for vm in kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/notimportant/tasks done done
for vm in kernel1 kernel2; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/important/tasks done done
Then I used cssh to connect to all 8 VMs and execute dd if=/dev/zero of=testfile bs=1M count=1500 in all VMs simultaneously.
Results are: kernel1: 47.5593 s, 33.1 MB/s kernel2: 60.1464 s, 26.2 MB/s kernel3: 74.204 s, 21.2 MB/s kernel4: 77.0759 s, 20.4 MB/s kernel5: 65.6309 s, 24.0 MB/s kernel6: 81.1402 s, 19.4 MB/s kernel7: 70.3881 s, 22.3 MB/s kernel8: 77.4475 s, 20.3 MB/s
Results vary a little bit from run to run, but the spread is nothing like what weights of 1000 vs. 100 would suggest.
So I went and tried to throttle I/O of kernel3-8 to 10MB/s instead of weighing I/O. First I rebooted everything so that no old configuration of cgroup was left in place and then setup everything except the 100 and 1000 weight configuration.
quote from blkio.txt: ------------ - blkio.throttle.write_bps_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in bytes per second. Rules are per device. Following is the format.
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.write_bps_device -------------
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do ls -lH /dev/vdisks/$vm; done brw-rw---- 1 root root 254, 23 Feb 18 13:45 /dev/vdisks/kernel1 brw-rw---- 1 root root 254, 24 Feb 18 13:45 /dev/vdisks/kernel2 brw-rw---- 1 root root 254, 25 Feb 18 13:45 /dev/vdisks/kernel3 brw-rw---- 1 root root 254, 26 Feb 18 13:45 /dev/vdisks/kernel4 brw-rw---- 1 root root 254, 27 Feb 18 13:45 /dev/vdisks/kernel5 brw-rw---- 1 root root 254, 28 Feb 18 13:45 /dev/vdisks/kernel6 brw-rw---- 1 root root 254, 29 Feb 18 13:45 /dev/vdisks/kernel7 brw-rw---- 1 root root 254, 30 Feb 18 13:45 /dev/vdisks/kernel8
/bin/echo 254:25 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:26 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:27 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:28 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:29 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
Then I ran the previous test again. This resulted in an ever increasing load (last I checked was ~ 300) on the host system. (This is perfectly reproducible).
uptime Fri Feb 18 14:42:17 2011 14:42:17 up 12 min, 9 users, load average: 286.51, 142.22, 56.71
Have you run top or something to figure out why load average is shooting up. I suspect that because of throttling limit, IO threads have been blocked and qemu is forking more IO threads. Can you just run top/ps and figure out what's happening.
Again, is it some kind of linear volume group from which you have carved out logical volumes for each virtual machine?
For throttling, to begin with, can we do a simple test first? That is, run a single virtual machine, put some throttling limit on the logical volume and try to do READs. Once READs work, let's test WRITES and check why the system load goes up.
Thanks Vivek
-- Regards Gui Jianfeng

On Mon, Feb 21, 2011 at 03:36:14PM +0800, Gui Jianfeng wrote:
Dominik,
Would you try "oflag=direct" when you do tests in Guests. And make sure /sys/block/xxx/queue/iosched/group_isolation is set to 1.
oflag=direct in the guest might be good for testing and understanding the problem, but in practice we will not have control over what a user is running inside the guest. The only control we will have is to use cache=none for the guest and then control any traffic coming out of the guest.

Thanks
Vivek
I guess with such a setting, your tests should go well.
Thanks, Gui
Vivek Goyal wrote:
On Fri, Feb 18, 2011 at 03:42:45PM +0100, Dominik Klein wrote:
Hi Vivek
I don't know whether you follow the libvirt list; I assume you don't. So I thought I'd forward you an e-mail involving the blkio controller and a terrible situation arising from using it (maybe in a wrong way).
I'd truly appreciate it if you read it and commented on it. Maybe I did something wrong, but maybe I also found a bug of some kind.
Hi Dominik,
Thanks for forwarding me this mail. Yes, I am not on libvir-list. I have just now subscribed.
Few questions inline.
-------- Original Message -------- Subject: Re: [libvirt] [PATCH 0/6 v3] Add blkio cgroup support Date: Fri, 18 Feb 2011 14:42:51 +0100 From: Dominik Klein <dk@in-telegence.net> To: libvir-list@redhat.com
Hi
back with some testing results.
How about starting the guest with the option "cache=none" to bypass the page cache? This should help, I think. I will read up on where to set that and give it a try. Thanks for the hint. So here's what I did and found out:
The host system has two 12-core CPUs and 128 GB of RAM.
I have 8 test VMs named kernel1 to kernel8. Each VM has 4 VCPUs, 2 GB of RAM and one disk, which is an LV on the host. Cache mode is "none":
So you have only one root SATA disk and set up a linear logical volume on that? If not, can you give more info about the storage configuration?
- I am assuming you are using CFQ on your underlying physical disk.
- What kernel version are you testing with.
- cache=none mode is good; it should make all the IO O_DIRECT on the host and it should show up as SYNC IO on CFQ without losing io context info. The only problem is the intermediate dm layer and whether it is changing the io context somehow. I am not sure at this point of time.
- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
- Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1 on your underlying physical device where CFQ is running and see if it makes any difference.
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do virsh dumpxml $vm|grep cache; done <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/> <driver name='qemu' type='raw' cache='none'/>
My goal is to give more I/O time to kernel1 and kernel2 than to the rest of the VMs.
mount -t cgroup -o blkio none /mnt cd /mnt mkdir important mkdir notimportant
echo 1000 > important/blkio.weight echo 100 > notimportant/blkio.weight for vm in kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/notimportant/tasks done done
for vm in kernel1 kernel2; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/important/tasks done done
Then I used cssh to connect to all 8 VMs and execute dd if=/dev/zero of=testfile bs=1M count=1500 in all VMs simultaneously.
Results are: kernel1: 47.5593 s, 33.1 MB/s kernel2: 60.1464 s, 26.2 MB/s kernel3: 74.204 s, 21.2 MB/s kernel4: 77.0759 s, 20.4 MB/s kernel5: 65.6309 s, 24.0 MB/s kernel6: 81.1402 s, 19.4 MB/s kernel7: 70.3881 s, 22.3 MB/s kernel8: 77.4475 s, 20.3 MB/s
Results vary a little bit from run to run, but the spread is nothing like what weights of 1000 vs. 100 would suggest.
So I went and tried to throttle I/O of kernel3-8 to 10MB/s instead of weighing I/O. First I rebooted everything so that no old configuration of cgroup was left in place and then setup everything except the 100 and 1000 weight configuration.
quote from blkio.txt: ------------ - blkio.throttle.write_bps_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in bytes per second. Rules are per device. Following is the format.
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.write_bps_device -------------
for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do ls -lH /dev/vdisks/$vm; done brw-rw---- 1 root root 254, 23 Feb 18 13:45 /dev/vdisks/kernel1 brw-rw---- 1 root root 254, 24 Feb 18 13:45 /dev/vdisks/kernel2 brw-rw---- 1 root root 254, 25 Feb 18 13:45 /dev/vdisks/kernel3 brw-rw---- 1 root root 254, 26 Feb 18 13:45 /dev/vdisks/kernel4 brw-rw---- 1 root root 254, 27 Feb 18 13:45 /dev/vdisks/kernel5 brw-rw---- 1 root root 254, 28 Feb 18 13:45 /dev/vdisks/kernel6 brw-rw---- 1 root root 254, 29 Feb 18 13:45 /dev/vdisks/kernel7 brw-rw---- 1 root root 254, 30 Feb 18 13:45 /dev/vdisks/kernel8
/bin/echo 254:25 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:26 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:27 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:28 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:29 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
Then I ran the previous test again. This resulted in an ever increasing load (last I checked was ~ 300) on the host system. (This is perfectly reproducible).
uptime Fri Feb 18 14:42:17 2011 14:42:17 up 12 min, 9 users, load average: 286.51, 142.22, 56.71
Have you run top or something to figure out why load average is shooting up. I suspect that because of throttling limit, IO threads have been blocked and qemu is forking more IO threads. Can you just run top/ps and figure out what's happening.
Again, is it some kind of linear volume group from which you have carved out logical volumes for each virtual machine?
For throttling, to begin with, can we do a simple test first? That is, run a single virtual machine, put some throttling limit on the logical volume and try to do READs. Once READs work, let's test WRITES and check why the system load goes up.
Thanks Vivek
-- Regards Gui Jianfeng

Hi
First of all: thanks for taking the time.
On 02/18/2011 05:31 PM, Vivek Goyal wrote:
So you have only one root SATA disk and set up a linear logical volume on that? If not, can you give more info about the storage configuration?
The logical volumes reside on a JBOD-device [1] connected with a Dell PERC H800 adapter. I could also test this on the internal SAS disks, which are connected with a Dell PERC H700 adapter.
- I am assuming you are using CFQ on your underlying physical disk.
Yes.
cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
- What kernel version are you testing with.
2.6.37
- cache=none mode is good; it should make all the IO O_DIRECT on the host and it should show up as SYNC IO on CFQ without losing io context info. The only problem is the intermediate dm layer and whether it is changing the io context somehow. I am not sure at this point of time.
- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
Will do, read on.
- Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1 on your underlying physical device where CFQ is running and see if it makes any difference.
echo 1 > /sys/block/sdb/queue/iosched/group_isolation
cat /sys/block/sdb/queue/iosched/group_isolation
1
<snip>
Then I ran the previous test again. This resulted in an ever increasing load (last I checked was ~ 300) on the host system. (This is perfectly reproducible).
uptime Fri Feb 18 14:42:17 2011 14:42:17 up 12 min, 9 users, load average: 286.51, 142.22, 56.71
Have you run top or something to figure out why load average is shooting up.
During the tests described in the first email, it was impossible to run anything on that machine. You couldn't even log in to the console; the machine had to be disconnected from power to reboot it. I have now started just one machine (kernel3) and tried to throttle it as described earlier.
I suspect that because of throttling limit, IO threads have been blocked and qemu is forking more IO threads. Can you just run top/ps and figure out what's happening.
Kind of looks like that. If I run
ps -eL -o pid,lwp,%cpu,blocked,time,args | grep kernel3
I see 48 qemu-kvm threads while that machine is dd'ing zeroes. On the other hand: without throttling, I see 70. So whether 48 is much here ...
Again, is it some kind of linear volume group from which you have carved out logical volumes for each virtual machine?
Setup is:
pv = /dev/sdb
vg = /dev/vdisks
lv = /dev/vdisks/kernel1...kernel8
The VM is then given /dev/vdisks/kernelX as its /dev/vda virtio device.
For throttling to begin with, can we do a simple test first. That is run a single virtual machine, put some throttling limit on logical volume and try to do READs. Once READs work, lets test WRITES and check why does system load go up.
Both reads and writes work, without having to give [io]flag=direct to the dd command, by the way (so cache=none seems to work).

Regards
Dominik

ps. I will now answer your two other emails.
pps. We could also talk on IRC. I'm usually in the #virt channel on OFTC. kleind

[1] http://www.thomas-krenn.com/en/storage-solutions/storage-systems/jbod-system...
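For reference, a hypothetical sketch of the LVM commands behind the pv/vg/lv layout described above (the 20G size is an assumption, not stated in the thread):

pvcreate /dev/sdb
vgcreate vdisks /dev/sdb
for i in $(seq 1 8); do
  # one logical volume per VM, handed to the guest as its virtio disk
  lvcreate -L 20G -n kernel$i vdisks
done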

- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
Will do, read on.
Just realized I missed this one ... I should have done it right away. So here goes.

Setup as in the first email: 8 machines, 2 important, 6 not important ones with a throttle of ~10M. group_isolation=1. Each VM dd'ing zeroes.

blktrace -d /dev/sdb -w 30
=== sdb ===
CPU 0: 4769 events, 224 KiB data
CPU 1: 28079 events, 1317 KiB data
CPU 2: 1179 events, 56 KiB data
CPU 3: 5529 events, 260 KiB data
CPU 4: 295 events, 14 KiB data
CPU 5: 649 events, 31 KiB data
CPU 6: 185 events, 9 KiB data
CPU 7: 180 events, 9 KiB data
CPU 8: 17 events, 1 KiB data
CPU 9: 12 events, 1 KiB data
CPU 10: 6 events, 1 KiB data
CPU 11: 55 events, 3 KiB data
CPU 12: 28005 events, 1313 KiB data
CPU 13: 1542 events, 73 KiB data
CPU 14: 4814 events, 226 KiB data
CPU 15: 389 events, 19 KiB data
CPU 16: 1545 events, 73 KiB data
CPU 17: 119 events, 6 KiB data
CPU 18: 3019 events, 142 KiB data
CPU 19: 62 events, 3 KiB data
CPU 20: 800 events, 38 KiB data
CPU 21: 17 events, 1 KiB data
CPU 22: 243 events, 12 KiB data
CPU 23: 1 events, 1 KiB data
Total: 81511 events (dropped 0), 3822 KiB data

Very constant 296 blocked processes in vmstat during this run. But... apparently no data is written at all (see the "bo" column).

vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd free buff cache si so bi bo in cs us sy id wa
 0 296 0 125254224 21432 142016 0 0 16 633 181 331 0 0 93 7
 0 296 0 125253728 21432 142016 0 0 0 0 17115 33794 0 0 25 75
 0 296 0 125254112 21432 142016 0 0 0 0 17084 33721 0 0 25 74
 1 296 0 125254352 21440 142012 0 0 0 18 17047 33736 0 0 25 75
 0 296 0 125304224 21440 131060 0 0 0 0 17630 33989 0 1 23 76
 1 296 0 125306496 21440 130260 0 0 0 0 16810 33401 0 0 20 80
 4 296 0 125307208 21440 129856 0 0 0 0 17169 33744 0 0 26 74
 0 296 0 125307496 21448 129508 0 0 0 14 17105 33650 0 0 36 64
 0 296 0 125307712 21452 129672 0 0 2 1340 17117 33674 0 0 22 78
 1 296 0 125307752 21452 129520 0 0 0 0 16875 33438 0 0 29 70
 1 296 0 125307776 21452 129520 0 0 0 0 16959 33560 0 0 21 79
 1 296 0 125307792 21460 129520 0 0 0 12 16700 33239 0 0 15 85
 1 296 0 125307808 21460 129520 0 0 0 0 16750 33274 0 0 25 74
 1 296 0 125307808 21460 129520 0 0 0 0 17020 33601 0 0 26 74
 1 296 0 125308272 21460 129520 0 0 0 0 17080 33616 0 0 20 80
 1 296 0 125308408 21460 129520 0 0 0 0 16428 32972 0 0 42 58
 1 296 0 125308016 21460 129524 0 0 0 0 17021 33624 0 0 22 77

Regards
Dominik

On 02/21/2011 09:19 AM, Dominik Klein wrote:
- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
Will do, read on.
Just realized I missed this one ... Had better done it right away.
So here goes.
Setup as in first email. 8 Machines, 2 important, 6 not important ones with a throttle of ~10M. group_isolation=1. Each vm dd'ing zeroes.
blktrace -d /dev/sdb -w 30 === sdb === CPU 0: 4769 events, 224 KiB data CPU 1: 28079 events, 1317 KiB data CPU 2: 1179 events, 56 KiB data CPU 3: 5529 events, 260 KiB data CPU 4: 295 events, 14 KiB data CPU 5: 649 events, 31 KiB data CPU 6: 185 events, 9 KiB data CPU 7: 180 events, 9 KiB data CPU 8: 17 events, 1 KiB data CPU 9: 12 events, 1 KiB data CPU 10: 6 events, 1 KiB data CPU 11: 55 events, 3 KiB data CPU 12: 28005 events, 1313 KiB data CPU 13: 1542 events, 73 KiB data CPU 14: 4814 events, 226 KiB data CPU 15: 389 events, 19 KiB data CPU 16: 1545 events, 73 KiB data CPU 17: 119 events, 6 KiB data CPU 18: 3019 events, 142 KiB data CPU 19: 62 events, 3 KiB data CPU 20: 800 events, 38 KiB data CPU 21: 17 events, 1 KiB data CPU 22: 243 events, 12 KiB data CPU 23: 1 events, 1 KiB data Total: 81511 events (dropped 0), 3822 KiB data
Very constant 296 blocked processes in vmstat during this run. But... apparently no data is written at all (see "bo" column).
vmstat 2 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 296 0 125254224 21432 142016 0 0 16 633 181 331 0 0 93 7 0 296 0 125253728 21432 142016 0 0 0 0 17115 33794 0 0 25 75 0 296 0 125254112 21432 142016 0 0 0 0 17084 33721 0 0 25 74 1 296 0 125254352 21440 142012 0 0 0 18 17047 33736 0 0 25 75 0 296 0 125304224 21440 131060 0 0 0 0 17630 33989 0 1 23 76 1 296 0 125306496 21440 130260 0 0 0 0 16810 33401 0 0 20 80 4 296 0 125307208 21440 129856 0 0 0 0 17169 33744 0 0 26 74 0 296 0 125307496 21448 129508 0 0 0 14 17105 33650 0 0 36 64 0 296 0 125307712 21452 129672 0 0 2 1340 17117 33674 0 0 22 78 1 296 0 125307752 21452 129520 0 0 0 0 16875 33438 0 0 29 70 1 296 0 125307776 21452 129520 0 0 0 0 16959 33560 0 0 21 79 1 296 0 125307792 21460 129520 0 0 0 12 16700 33239 0 0 15 85 1 296 0 125307808 21460 129520 0 0 0 0 16750 33274 0 0 25 74 1 296 0 125307808 21460 129520 0 0 0 0 17020 33601 0 0 26 74 1 296 0 125308272 21460 129520 0 0 0 0 17080 33616 0 0 20 80 1 296 0 125308408 21460 129520 0 0 0 0 16428 32972 0 0 42 58 1 296 0 125308016 21460 129524 0 0 0 0 17021 33624 0 0 22 77
While we're on that ... it is impossible for me now to recover from this state without pulling the power plug.

On the VMs' consoles I see messages like
INFO: task (kjournald|flush-254|dd|rs:main|...) blocked for more than 120 seconds.

While the ssh sessions through which the dd was started seem intact (pressing enter gives a new line), it is impossible to cancel the dd command. Logging in on the VMs' console is also impossible.

Opening a new ssh session to the host also does not work. Killing the qemu-kvm processes from a session opened earlier leaves zombie processes. Moving the VMs back to the root cgroup makes no difference either.

Regards
Dominik

On Mon, Feb 21, 2011 at 09:30:32AM +0100, Dominik Klein wrote:
On 02/21/2011 09:19 AM, Dominik Klein wrote:
- Is it possible to capture 10-15 second blktrace on your underlying physical device. That should give me some idea what's happening.
Will do, read on.
Just realized I missed this one ... Had better done it right away.
So here goes.
Setup as in first email. 8 Machines, 2 important, 6 not important ones with a throttle of ~10M. group_isolation=1. Each vm dd'ing zeroes.
blktrace -d /dev/sdb -w 30 === sdb === CPU 0: 4769 events, 224 KiB data CPU 1: 28079 events, 1317 KiB data CPU 2: 1179 events, 56 KiB data CPU 3: 5529 events, 260 KiB data CPU 4: 295 events, 14 KiB data CPU 5: 649 events, 31 KiB data CPU 6: 185 events, 9 KiB data CPU 7: 180 events, 9 KiB data CPU 8: 17 events, 1 KiB data CPU 9: 12 events, 1 KiB data CPU 10: 6 events, 1 KiB data CPU 11: 55 events, 3 KiB data CPU 12: 28005 events, 1313 KiB data CPU 13: 1542 events, 73 KiB data CPU 14: 4814 events, 226 KiB data CPU 15: 389 events, 19 KiB data CPU 16: 1545 events, 73 KiB data CPU 17: 119 events, 6 KiB data CPU 18: 3019 events, 142 KiB data CPU 19: 62 events, 3 KiB data CPU 20: 800 events, 38 KiB data CPU 21: 17 events, 1 KiB data CPU 22: 243 events, 12 KiB data CPU 23: 1 events, 1 KiB data Total: 81511 events (dropped 0), 3822 KiB data
Very constant 296 blocked processes in vmstat during this run. But... apparently no data is written at all (see "bo" column).
Hm..., this sounds bad. If you have put a limit of ~10MB/s then no "bo" is bad. That would explain why your box is not responding and you need to do a power reset.

- I am assuming that you have not put any throttling limits on the root group. Is your system root also on /dev/sdb or on a separate disk altogether?

- This sounds like a bug in the throttling logic. To narrow it down, can you start running "deadline" on the end device? If it still happens, it is more or less in the throttling layer.

- We can also try to remove the dm layers, just create partitions on /dev/sdb and export them as virtio disks to the virtual machines, taking the dm layer out of the picture, and see if it still happens.

- In one of the mails you mentioned that with 1 virtual machine, throttling READs and WRITEs is working for you. So it looks like 1 virtual machine does not hang, but once you launch 8 virtual machines it hangs. Can we try increasing the number of virtual machines gradually and confirm that it happens only if a certain number of virtual machines are launched?

- Can you also paste me the rules you have put on the important and non-important groups? Somehow I suspect that one of the rules has gone horribly wrong, in the sense that it is very low and effectively no virtual machine is making any progress.

- How long does it take to reach this locked state where bo=0?

- You can also try to redirect the blktrace output to blkparse, redirect that to standard output, and capture some output by copy-pasting the last messages.

In the meantime, I will try to launch more machines and see if I can reproduce the issue.

Thanks
Vivek
vmstat 2 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 296 0 125254224 21432 142016 0 0 16 633 181 331 0 0 93 7 0 296 0 125253728 21432 142016 0 0 0 0 17115 33794 0 0 25 75 0 296 0 125254112 21432 142016 0 0 0 0 17084 33721 0 0 25 74 1 296 0 125254352 21440 142012 0 0 0 18 17047 33736 0 0 25 75 0 296 0 125304224 21440 131060 0 0 0 0 17630 33989 0 1 23 76 1 296 0 125306496 21440 130260 0 0 0 0 16810 33401 0 0 20 80 4 296 0 125307208 21440 129856 0 0 0 0 17169 33744 0 0 26 74 0 296 0 125307496 21448 129508 0 0 0 14 17105 33650 0 0 36 64 0 296 0 125307712 21452 129672 0 0 2 1340 17117 33674 0 0 22 78 1 296 0 125307752 21452 129520 0 0 0 0 16875 33438 0 0 29 70 1 296 0 125307776 21452 129520 0 0 0 0 16959 33560 0 0 21 79 1 296 0 125307792 21460 129520 0 0 0 12 16700 33239 0 0 15 85 1 296 0 125307808 21460 129520 0 0 0 0 16750 33274 0 0 25 74 1 296 0 125307808 21460 129520 0 0 0 0 17020 33601 0 0 26 74 1 296 0 125308272 21460 129520 0 0 0 0 17080 33616 0 0 20 80 1 296 0 125308408 21460 129520 0 0 0 0 16428 32972 0 0 42 58 1 296 0 125308016 21460 129524 0 0 0 0 17021 33624 0 0 22 77
While we're on that ... It is impossible for me now to recover from this state without pulling the power plug.
On the VMs console I see messages like INFO: task (kjournald|flush-254|dd|rs:main|...) blocked for more than 120 seconds.
If VMs are completely blocked and not making any progress, it is expected.
While the ssh sessions through which the dd was started seem intact (pressing enter gives a new line), it is impossible to cancel the dd command. Logging in on the VMs console also is impossible.
Opening a new ssh session to the host also does not work. Killing the qemu-kvm processes from a session opened earlier leaves zombie processes. Moving the VMs back to the root cgroup makes no difference either.
Regards Dominik

Hi

To be as verbose and clear as possible, I will write every command I use. This sets everything up for the test:

# setup start
importantvms="kernel1 kernel2"
notimportantvms="kernel3 kernel4 kernel5 kernel6 kernel7 kernel8"

for vm in $importantvms $notimportantvms; do
  virsh domstate $vm|grep -q running || virsh start $vm
done

echo 1 > /sys/block/sdb/queue/iosched/group_isolation

mount -t cgroup -o blkio none /mnt
cd /mnt
mkdir important
mkdir notimportant

for vm in $notimportantvms; do
  cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
  for task in *; do
    /bin/echo $task > /mnt/notimportant/tasks
  done
done

for vm in $importantvms; do
  cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
  for task in *; do
    /bin/echo $task > /mnt/important/tasks
  done
done

#ls -lH /dev/vdisks/kernel[3-8]
#brw-rw---- 1 root root 254, 25 Feb 22 13:42 kernel3
#brw-rw---- 1 root root 254, 26 Feb 22 13:42 kernel4
#brw-rw---- 1 root root 254, 27 Feb 22 13:42 kernel5
#brw-rw---- 1 root root 254, 28 Feb 22 13:42 kernel6
#brw-rw---- 1 root root 254, 29 Feb 22 13:43 kernel7
#brw-rw---- 1 root root 254, 30 Feb 22 13:43 kernel8

for i in $(seq 25 30); do
  /bin/echo 254:$i 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
done
# setup complete
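As an aside, instead of rebooting to clear the old cgroup configuration between runs, a teardown along these lines should work (a sketch; it assumes that writing a rate of 0 removes a throttle rule):

# drop the write limits again
for i in $(seq 25 30); do
  /bin/echo 254:$i 0 > /mnt/notimportant/blkio.throttle.write_bps_device
done
# move all threads back to the root group, then remove the child groups
for task in $(cat /mnt/notimportant/tasks /mnt/important/tasks); do
  /bin/echo $task > /mnt/tasks
done
rmdir /mnt/important /mnt/notimportant
umount /mnt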
Hm..., this sounds bad. If you have put a limit of ~10Mb/s then no "bo" is bad. That would explain that why your box is not responding and you need to do power reset.
- I am assuming that you have not put any throttling limits on root group. Is your system root also on /dev/sdb or on a separate disk altogether.
No throttling on root. Correct.
System root is on sda; the VMs are on sdb.
- This sounds like a bug in throttling logic. To narrow it down can you start running "deadline" on end device. If it still happens, it is more or less in throttling layer.
cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
echo deadline > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler
noop [deadline] cfq

This changes things:

vmstat before test (nothing going on):
 r  b swpd free buff cache si so bi bo in cs us sy id wa
 1  0 0 130231696 17968 69424 0 0 0 0 16573 32834 0 0 100 0

vmstat during test while all 8 vms are writing (2 unthrottled, 6 at 10M):
 r  b swpd free buff cache si so bi bo in cs us sy id wa
 2  9 0 126257984 17968 69424 0 0 8 164462 20114 36751 5 4 52 39

vmstat during test when only throttled vms are still dd'ing:
 2 21 0 124518928 17976 69424 0 0 0 63876 17410 33670 1 1 59 39

Load is 12-ish during the test and the results are as expected: throttled VMs write at approx 10M, unthrottled ones get like 80 each, which sums up to about the 200M max capacity of the device for direct io.
- We can also try to remove dm layers and just create partitions on /dev/sdb and export as virtio disks to virtual machines and take dm layer out of picture and see if it still happens.
Did that. Also happens when scheduler is cfq. Also does not happen when scheduler is deadline. So it does not seem to be a dm issue.
- In one of the mails you mentioned that with 1 virtual machine, throttling READs and WRITEs is working for you. So it looks like 1 virtual machine does not hang, but once you launch 8 virtual machines it hangs. Can we try increasing the number of virtual machines gradually and confirm that it happens only if a certain number of virtual machines are launched?
Back on cfq here.

1 throttled vm: works, load ~4
2 throttled vms: works, load ~6
3 throttled vms: works, load ~9
4 throttled vms: works, load ~12
6 throttled vms: works, load ~20

The number of blocked threads increases with the number of vms dd'ing. At the beginning of each test, the blocked threads number goes really high (4 vms 160, 6 vms 220), but then drops significantly and stays low. So it seems that when only throttled vms are running, the problem does not occur.

1 throttled + 1 unthrottled vm: works, load ~5
2 throttled + 1 unthrottled vm: boom

Constantly 144 blocked threads, bo=0, load increasing to 144. System needs a power reset.

So, thinking about what I did in the initial setup, I re-tested without the

for vm in $importantvms; do
  cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
  for task in *; do
    /bin/echo $task > /mnt/important/tasks
  done
done

since I don't do anything with that "important" cgroup (yet) anyway. It did not make a difference though.
- Can you also paste me the rules you have put on important and non-important groups. Somehow I suspect that some of the rule has gone horribly bad in the sense that it is very low and effectively no virtual machine is making any progress.
See setup commands in the beginning of this email.
- How long does it take to reach in this locked state where bo=0.
It goes there "instantly", right after the dd commands start.
- you can also try to redirect blktrace output to blkparse and redirect it to standard output and see capture some output by copying pasting last messages.
I hope this is what you meant: blktrace -d /dev/sdb -o - | blkparse -i - Output is in attachment. Regards Dominik

On Tue, Feb 22, 2011 at 02:29:31PM +0100, Dominik Klein wrote:
Hi
to be as verbose and clear as possible, I will write every command I use.
Thanks, Dominik, for the detailed explanation and tests.
This sets everything up for the test:
# setup start
importantvms="kernel1 kernel2" notimportantvms="kernel3 kernel4 kernel5 kernel6 kernel7 kernel8"
for vm in $importantvms $notimportantvms; do virsh domstate $vm|grep -q running || virsh start $vm; done
echo 1 > /sys/block/sdb/queue/iosched/group_isolation
mount -t cgroup -o blkio none /mnt cd /mnt mkdir important mkdir notimportant
for vm in $notimportantvms; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/notimportant/tasks done done
for vm in $importantvms; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/important/tasks done done
#ls -lH /dev/vdisks/kernel[3-8] #brw-rw---- 1 root root 254, 25 Feb 22 13:42 kernel3 #brw-rw---- 1 root root 254, 26 Feb 22 13:42 kernel4 #brw-rw---- 1 root root 254, 27 Feb 22 13:42 kernel5 #brw-rw---- 1 root root 254, 28 Feb 22 13:42 kernel6 #brw-rw---- 1 root root 254, 29 Feb 22 13:43 kernel7 #brw-rw---- 1 root root 254, 30 Feb 22 13:43 kernel8
for i in $(seq 25 30); do /bin/echo 254:$i 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device done
# setup complete
Hm..., this sounds bad. If you have put a limit of ~10Mb/s then no "bo" is bad. That would explain that why your box is not responding and you need to do power reset.
- I am assuming that you have not put any throttling limits on root group. Is your system root also on /dev/sdb or on a separate disk altogether.
No throttling on root. Correct.
system root is on sda vms are on sdb
- This sounds like a bug in throttling logic. To narrow it down can you start running "deadline" on end device. If it still happens, it is more or less in throttling layer.
cat /sys/block/sdb/queue/scheduler noop deadline [cfq] echo deadline > /sys/block/sdb/queue/scheduler cat /sys/block/sdb/queue/scheduler noop [deadline] cfq
This changes things:
vmstat before test (nothing going on): r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 130231696 17968 69424 0 0 0 0 16573 32834 0 0 100 0
vmstat during test while all 8 vms are writing (2 unthrottled, 6 at 10M): r b swpd free buff cache si so bi bo in cs us sy id wa 2 9 0 126257984 17968 69424 0 0 8 164462 20114 36751 5 4 52 39
vmstat during test when only throttled vms are still dd'ing 2 21 0 124518928 17976 69424 0 0 0 63876 17410 33670 1 1 59 39
Load is 12-ish during the test and the results are as expected: Throttled VMs write with approx 10M, unthrottled get like 80 each, which sums up to about the 200M max capacity of the device for direct io.
Ok, so deadline works just fine and there are no hangs. So it sounds more like a CFQ issue.
- We can also try to remove dm layers and just create partitions on /dev/sdb and export as virtio disks to virtual machines and take dm layer out of picture and see if it still happens.
Did that. Also happens when scheduler is cfq. Also does not happen when scheduler is deadline. So it does not seem to be a dm issue.
Ok, so dm is not the culprit then.
- In one of the mails you mentioned that with 1 virtual machine, throttling READs and WRITEs is working for you. So it looks like 1 virtual machine does not hang, but once you launch 8 virtual machines it hangs. Can we try increasing the number of virtual machines gradually and confirm that it happens only if a certain number of virtual machines are launched?
Back on cfq here.
1 throttled vm: works, load ~4 2 throttled vms: works, load ~6 3 throttled vms: works, load ~9 4 throttled vms: works, load ~12 6 throttled vms: works, load ~20
The number of blocked threads increases with the number of vms dd'ing.
At the beginning of each test, the blocked threads number goes really high (4 vms 160, 6 vms 220), but then drops significantly and stays low.
So it seems that when only throttled vms are running, the problem does not occur.
1 throttled + 1 unthrottled vm: works, load ~5 2 throttled + 1 unthrottled vm: boom
That's interesting. I will try to run 2 throttled VMs and 1 unthrottled and see if I can hit the same situation.
Constantly 144 blocked threads, bo=0, load increasing to 144. System needs power reset.
So, thinking about what I did in the initial setup, I re-tested without the
for vm in $importantvms; do cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task for task in *; do /bin/echo $task > /mnt/important/tasks done done
since I don't do anything with that "important" cgroup (yet) anyway. It did not make a difference though.
- Can you also paste me the rules you have put on important and non-important groups. Somehow I suspect that some of the rule has gone horribly bad in the sense that it is very low and effectively no virtual machine is making any progress.
See setup commands in the beginning of this email.
- How long does it take to reach in this locked state where bo=0.
It goes there "instantly", right after the dd commands start.
- you can also try to redirect blktrace output to blkparse and redirect it to standard output and see capture some output by copying pasting last messages.
I hope this is what you meant:
blktrace -d /dev/sdb -o - | blkparse -i -
Yes, that's what I meant. A few observations from this trace.

- After a while, all the IO coming in seems to be sleeping on get_request(); "S" signifies that. That means some other threads have already consumed the available request descriptors, and new threads are being put to sleep and will be woken up when request descriptors become available. Not sure why that is happening here.

- CFQ was idling on a queue, the idle timer fired and it scheduled another dispatch. After that there is no request dispatched at all. Something went wrong for sure.

8,16 1 0 55.806298452 0 m N cfq idle timer fired
8,16 1 0 55.806300924 0 m N cfq3564 slice expired t=0
8,16 1 0 55.806303380 0 m N cfq3564 sl_used=2 disp=1 charge=2 iops=0 sect=8
8,16 1 0 55.806303770 0 m N cfq3564 del_from_rr
8,16 1 0 55.806304555 0 m N cfq schedule dispatch

- The following seems odd:

8,16 7 925 55.890489350 3768 Q WS 3407925696 + 480 [qemu-kvm]
8,16 7 926 55.890494793 3768 S WS 3407925696 + 480 [qemu-kvm]
8,16 1 1 1266874889.707163736 0 C R 3390363000 + 48 [0]
8,16 1 0 1266874889.707181482 0 m N cfq3273 complete rqnoidl

The time in the fourth column jumps from 55.x to 1266874889.x. Sounds like some corruption.

- I don't see any throttling messages. They are prefixed by "throtl". So it seems all this IO is happening in the root group; I believe it belongs to the unthrottled VM. So to me it looks like the system reached this bad shape even before the throttled VMs were started.

- So it sounds more and more like a CFQ issue which happens in conjunction with throttling. I will try to reproduce it.

- I need a little more info about how you captured the blktrace. So you started blktrace and then started dd in parallel in all the three VMs, the system immediately froze, and these are the only logs we see on the console?

Thanks
Vivek
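If it helps to narrow this down, the saved blkparse output could be filtered for the two symptoms mentioned above - a sketch (trace.txt stands in for wherever the blkparse output was saved):

# requests put to sleep waiting for request descriptors (action "S" is field 6)
awk '$6 == "S"' trace.txt | tail -20
# the CFQ idle-timer / dispatch messages around the stall
grep -E 'cfq (idle timer fired|schedule dispatch)' trace.txt | tail -20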

- I need a little more info about how you captured the blktrace. So you started blktrace and then started dd in parallel in all the three VMs, the system immediately froze, and these are the only logs we see on the console?
Almost ... The ssh connection through which I start the vmstat and/or blktrace never froze (not before, either). I can use that shell in the situation until I try accessing anything on a block device. So the blktrace was started before dd was started and blktrace was stopped (CTRL+C) while dd was still running. hth Dominik

On Tue, Feb 22, 2011 at 10:24:26AM -0500, Vivek Goyal wrote: [..]
- I don't see any throttling messages. They are prefixed by "throtl". So it seems all this IO is happening in root group. I believe it belongs to unthrottled VM. So to me it looks that system reached in bad shape even before throttled VMs were started.
You are taking trace of /dev/sdb and not /dev/vdisks/kernel3 etc, hence I don't see the throttle messages. So that's fine.
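For a later run, a quick way to confirm that would be to trace one of the throttled logical volumes directly and filter for those messages. A sketch built from the commands already used in this thread; the grep pattern is simply the "throtl" prefix mentioned above:

blktrace -d /dev/vdisks/kernel3 -o - | blkparse -i - | grep throtl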
- So it sounds more and more like a CFQ issue which happens in conjunction with throttling. I will try to reproduce it.
I tried a lot but I can't reproduce the issue. So now I shall have to rely on data from you.
- Need little more info about how did you capture the blktrace. So you started blktrace and then started dd in parallel in all the three VMs and immediately system freezes and these are the only logs we see on console?
Can you please apply attached patch. This just makes CFQ output little more verbose and run the test again and capture the trace.

- Start the trace on /dev/sdb
- Start the dd jobs in virt machines
- Wait for system to hang
- Press CTRL-C
- Make sure there were no lost events otherwise increase the size and number of buffers.

Can you also open tracing in another window and also trace one of the throttled dm deivces, say /dev/disks/kernel3. Following the same procedure as above. So let the two traces run in parallel.

Thanks
Vivek

---
 block/cfq-iosched.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c	2011-02-22 13:23:25.000000000 -0500
+++ linux-2.6/block/cfq-iosched.c	2011-02-22 14:01:21.515363676 -0500
@@ -498,7 +498,7 @@ static inline bool cfq_bio_sync(struct b
 static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
 {
 	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
+		cfq_log(cfqd, "schedule dispatch: busy_queues=%d rq_queued=%d rq_in_driver=%d", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver);
 		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
 	}
 }
@@ -2229,6 +2229,8 @@ static struct cfq_queue *cfq_select_queu
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 
+	cfq_log(cfqd, "select_queue: busy_queues=%d rq_queued=%d rq_in_driver=%d", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver);
+
 	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		goto new_queue;
@@ -2499,8 +2501,10 @@ static int cfq_dispatch_requests(struct
 		return cfq_forced_dispatch(cfqd);
 
 	cfqq = cfq_select_queue(cfqd);
-	if (!cfqq)
+	if (!cfqq) {
+		cfq_log(cfqd, "select: no cfqq selected");
 		return 0;
+	}
 
 	/*
 	 * Dispatch a request from this cfqq, if it is allowed
@@ -3359,7 +3363,7 @@ static void cfq_insert_request(struct re
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
-	cfq_log_cfqq(cfqd, cfqq, "insert_request");
+	cfq_log_cfqq(cfqd, cfqq, "insert_request: busy_queues=%d rq_queued=%d rq_in_driver=%d", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver);
 	cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
 
 	rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);

Hi

so I ran these tests again. No patch applied yet. And - at least once - it worked. I did everything exactly the same way as before. Since the logs are 8 MB, even when best-bzip2'd, and I don't want everybody to have to download these, I uploaded them to an external hoster: http://www.megaupload.com/?d=SWKTC0V4

Traces were created with

blktrace -n 64 -b 16384 -d /dev/sdb -o - | blkparse -i -
blktrace -n 64 -b 16384 -d /dev/vdisks/kernel3 -o - | blkparse -i -
Can you please apply attached patch.
Unfortunately not. Cannot be applied to 2.6.37. I guess your source is newer and I fail to find the places in the file to patch manually.
This just makes CFQ output little more verbose and run the test again and capture the trace.
- Start the trace on /dev/sdb - Start the dd jobs in virt machines - Wait for system to hang - Press CTRL-C - Make sure there were no lost events otherwise increase the size and number of buffers.
Tried that. Unfortunately, even with a max buffer size of 16 M [1], this leaves some Skips. I also tried to increase the number of buffers beyond 64, but that produced oopses. However, I attached kernel3's blktrace of a case where the error occurred. Maybe you can read something from that.
Can you also open tracing in another window and also trace one of the throttled dm deivces, say /dev/disks/kernel3. Following the same procedure as above. So let the two traces run in parallel.
So what next? Regards Dominik [1] http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob_plain;f=... and look for "Invalid buffer"

Hi List

Vivek and I talked yesterday via IRC and here's what we found.

First of all, I need to define the "freeze"- or "hang"-situation in order to avoid misunderstandings. When the "hang" state occurs, this means it is impossible to do any io on the system. vmstat shows 250-300 blocked threads. Therefore, it is not possible to open new ssh connections or log into the server's console. Established ssh connections however, keep working. It is possible to run commands. Just don't touch any files. That immediately hangs the connections.

Okay, now that we got that cleared up ...

During the "hang" situation, it is possible to change from cfq to deadline scheduler for /dev/sdb (the pv for the lvs). This makes the io happen and the system is responsive as usual.

After applying the patch correctly (my mistake), we could see these debug messages from debugfs:

qemu-kvm-3955  [004]  823.360594:  8,16  Q  WS 3408164104 + 480 [qemu-kvm]
qemu-kvm-3955  [004]  823.360594:  8,16  S  WS 3408164104 + 480 [qemu-kvm]
<...>-3252     [016]  823.361146:  8,16  m  N cfq idle timer fired
<...>-3252     [016]  823.361151:  8,16  m  N cfq3683 slice expired t=0
<...>-3252     [016]  823.361154:  8,16  m  N cfq3683 sl_used=2 disp=25 charge=2 iops=0 sect=9944
<...>-3252     [016]  823.361154:  8,16  m  N cfq3683 del_from_rr
<...>-3252     [016]  823.361161:  8,16  m  N cfq schedule dispatch: busy_queues=38 rq_queued=132 rq_in_driver=0

quote Vivek Goyal
cfq has 38 busy queues and 132 requests queued and it tries to schedule a dispatch and that dispatch never happens for some reason and cfq is hung
/quote

So the next idea was to try with 2.6.38-rc6, just in case there is a bug in workqueue logic which got fixed?

And it turns out: With 2.6.38-rc6, the problem does not happen.

I will see whether I can bisect the kernel patches and see which one was the good one. I have to figure out a way to do that, but if I do it and find out, I will keep you posted.

Vivek then asked me to use another 2.6.37 debug patch and re-run the test. Attached are the logs from that run.

Regards
Dominik

On 02/23/2011 02:37 PM, Dominik Klein wrote:
Hi
so I ran these tests again. No patch applied yet. And - at least once - it worked. I did everything exactly the same way as before. Since the logs are 8 MB, even when best-bzip2'd, and I don't want everybody to have to download these, I uploaded them to an external hoster: http://www.megaupload.com/?d=SWKTC0V4
Traces were created with

blktrace -n 64 -b 16384 -d /dev/sdb -o - | blkparse -i -
blktrace -n 64 -b 16384 -d /dev/vdisks/kernel3 -o - | blkparse -i -
Can you please apply attached patch.
Unfortunately not. Cannot be applied to 2.6.37. I guess your source is newer and I fail to find the places in the file to patch manually.
This just makes CFQ output little more verbose and run the test again and capture the trace.
- Start the trace on /dev/sdb - Start the dd jobs in virt machines - Wait for system to hang - Press CTRL-C - Make sure there were no lost events otherwise increase the size and number of buffers.
Tried that. Unfortunately, even with max buffer size of 16 M [1], this leaves some Skips. I also tried to increase the number of buffers over 64, but that produced Oops'es.
However, I attached kernel3's blktrace of a case where the error occured. Maybe you can read something from that.
Can you also open tracing in another window and also trace one of the throttled dm deivces, say /dev/disks/kernel3. Following the same procedure as above. So let the two traces run in parallel.
So what next?
Regards Dominik
[1] http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob_plain;f=... and look for "Invalid buffer"

I will see whether I can bisect the kernel patches and see which one was the good one. I have to figure out a way to do that, but if I do it and find out, I will keep you posted.
So I tried every patch against cfq-iosched.c since January 5th, 2011 (2.6.37 release date) and none of them made things work. I also tried the only existing patch against blk-throttle.c and it also did not make a difference. Unless you have another hint on where to look for the fix, I'd consider this thing done for now. Regards Dominik

fwiw, with 2.6.37.1, it does not work either.

On Thu, Feb 24, 2011 at 10:18:00AM +0100, Dominik Klein wrote:

Hi Dominik,

Thanks for the tests and reports. I checked the latest logs also and I see that cfq has scheduled a work item but that work never gets executed. I never see the trace message which says cfq_kick_queue().

I am ccing it to lkml and tejun to see if he has any suggestions.

Tejun,

I will give you some details about what we have discussed so far.

Dominik is trying blkio throttling feature and trying to throttle some virtual machines. He is using 2.6.37 kernels and once he launches 3 virtual machines he notices that system is kind of frozen. After running some traces we noticed that CFQ has requests but it is not dispatching these to devices any more.

This problem does not show up with deadline scheduler and also goes away with 2.6.38-rc6 kernels.

CFQ seems to have scheduled a work and that work never gets called back, that's why CFQ is frozen. I modified CFQ a bit and captured the following traces.

<...>-3118  [022]  230.001571:  8,16  m  N cfq idle timer fired
<...>-3118  [022]  230.001575:  8,16  m  N cfq3834 slice expired t=0
<...>-3118  [022]  230.001577:  8,16  m  N cfq3834 sl_used=3 disp=37 charge=3 iops=0 sect=16048
<...>-3118  [022]  230.001578:  8,16  m  N cfq3834 del_from_rr
<...>-3118  [022]  230.001581:  8,16  m  N cfq schedule dispatch: busy_queues=29 rq_queued=128 rq_in_driver=0 q=ffff88080a581ba0

<CFQ hangs after this. No traces>

I have put a trace message in cfq_kick_queue(). If work was ever scheduled that trace message should have been present here.

Tejun, I was wondering if you have come across a similar issue in the past which has been fixed in 2.6.38-rc kernels? Or any suggestions regarding how to debug it further.

Thanks
Vivek
First of all, I need to define the "freeze"- or "hang"-situation in order to avoid misunderstandings. When the "hang" state occurs, this means it is impossible to do any io on the system. vmstat shows 250-300 blocked threads. Therefore, it is not possible to open new ssh connections or log into the server's console. Established ssh connections however, keep working. It is possible to run commands. Just don't touch any files. That immediately hangs the connections.
Okay, now that we got that cleared up ...
During the "hang" situation, it is possible to change from cfq to deadline scheduler for /dev/sdb (the pv for the lvs). This makes the io happen and the system is responsive as usual.
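For completeness, that switch is a plain sysfs write (sdb taken from this setup; roughly what was done here, shown as a sketch):

cat /sys/block/sdb/queue/scheduler             # e.g. "noop deadline [cfq]"
echo deadline > /sys/block/sdb/queue/scheduler
echo cfq > /sys/block/sdb/queue/scheduler      # switch back to reproduce the hang again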
After applying the patch correctly (my mistake), we could see these debug messages from debugfs:
qemu-kvm-3955  [004]  823.360594:  8,16  Q  WS 3408164104 + 480 [qemu-kvm]
qemu-kvm-3955  [004]  823.360594:  8,16  S  WS 3408164104 + 480 [qemu-kvm]
<...>-3252     [016]  823.361146:  8,16  m  N cfq idle timer fired
<...>-3252     [016]  823.361151:  8,16  m  N cfq3683 slice expired t=0
<...>-3252     [016]  823.361154:  8,16  m  N cfq3683 sl_used=2 disp=25 charge=2 iops=0 sect=9944
<...>-3252     [016]  823.361154:  8,16  m  N cfq3683 del_from_rr
<...>-3252     [016]  823.361161:  8,16  m  N cfq schedule dispatch: busy_queues=38 rq_queued=132 rq_in_driver=0
quote Vivek Goyal cfq has 38 busy queues and 132 requests queued and it tries to schedule a dispatch and that dispatch never happens for some reason and cfq is hung /quote
So the next idea was to try with 2.6.38-rc6, just in case there is a bug in workqueue logic which got fixed?
And it turns out: With 2.6.38-rc6, the problem does not happen.
I will see whether I can bisect the kernel patches and see which one was the good one. I have to figure out a way to do that, but if I do it and find out, I will keep you posted.
Vivek then asked me to use another 2.6.37 debug patch and re-run the test. Attached are the logs from that run.
Regards Dominik
On 02/23/2011 02:37 PM, Dominik Klein wrote:
Hi
so I ran these tests again. No patch applied yet. And - at least once - it worked. I did everything exactly the same way as before. Since the logs are 8 MB, even when best-bzip2'd, and I don't want everybody to have to download these, I uploaded them to an external hoster: http://www.megaupload.com/?d=SWKTC0V4
Traces were created with

blktrace -n 64 -b 16384 -d /dev/sdb -o - | blkparse -i -
blktrace -n 64 -b 16384 -d /dev/vdisks/kernel3 -o - | blkparse -i -
Can you please apply attached patch.
Unfortunately not. Cannot be applied to 2.6.37. I guess your source is newer and I fail to find the places in the file to patch manually.
This just makes CFQ output little more verbose and run the test again and capture the trace.
- Start the trace on /dev/sdb - Start the dd jobs in virt machines - Wait for system to hang - Press CTRL-C - Make sure there were no lost events otherwise increase the size and number of buffers.
Tried that. Unfortunately, even with max buffer size of 16 M [1], this leaves some Skips. I also tried to increase the number of buffers over 64, but that produced Oops'es.
However, I attached kernel3's blktrace of a case where the error occured. Maybe you can read something from that.
Can you also open tracing in another window and also trace one of the throttled dm deivces, say /dev/disks/kernel3. Following the same procedure as above. So let the two traces run in parallel.
So what next?
Regards Dominik
[1] http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob_plain;f=... and look for "Invalid buffer"
vmstat 2
 0   0  0 130316736 16364 74840  0  0  0      0 16891 33021  0  1 99  0
 8   0  0 129400944 16364 74840  0  0  0     14 17975 33233  4  3 93  0
 0 306  0 127762016 16364 74840  0  0  0 144642 20478 35476 11  6 28 56
 1 306  0 127761176 16364 74840  0  0  0      0 17307 33100  3  4 25 69
 2 306  0 127760904 16364 74840  0  0  0      0 17300 32947  3  4 24 69
 0 306  0 127760768 16364 74840  0  0  0      0 17189 32953  2  3 25 70
 1 306  0 127760752 16364 74840  0  0  0      0 17228 32966  2  3 21 74
 0 306  0 127760752 16364 74840  0  0  0      0 17138 32938  1  4 24 72
 0 306  0 127760736 16364 74840  0  0  0      0 17159 32954  1  4 22 73
 1 306  0 127760736 16364 74840  0  0  0      0 17210 33011  1  4 23 73
 0 306  0 127760736 16364 74840  0  0  0      0 17167 32980  0  4 22 74
 0 306  0 127760736 16364 74840  0  0  0      0 17137 32930  0  4 21 75
 1 306  0 127760352 16364 74840  0  0  0      0 17165 32933  0  2 23 75
 1 306  0 127760472 16364 74840  0  0  0      0 17079 32928  0  0 25 74
 0 306  0 127760560 16364 74840  0  0  0      0 17188 32950  1  2 22 75
 2 306  0 127760560 16364 74840  0  0  0      6 17142 32838  1  4 21 74
 0 306  0 127760560 16364 74840  0  0  0     16 17213 32896  1  4 21 74
 0 306  0 127760560 16364 74840  0  0  0      0 17198 32899  2  4 21 74
 0 306  0 127760560 16364 74840  0  0  0     12 17218 32910  1  4 21 74
 1 306  0 127760544 16364 74840  0  0  0      0 17203 32884  2  3 21 74
 1 306  0 127760552 16364 74840  0  0  0      0 17199 32904  2  4 21 74
 0 306  0 127760560 16364 74840  0  0  0      0 17230 32871  2  4 21 74
 0 306  0 127760560 16364 74840  0  0  0      0 17177 32860  2  4 21 74
 0 306  0 127760640 16364 74840  0  0  0      0 17185 32888  1  4 21 73
 0 306  0 127760640 16364 74840  0  0  0      0 17207 32851  2  4 25 70
 1 306  0 127760640 16364 74840  0  0  0      0 17023 32914  0  4 25 70
 block/blk-core.c    |    1 +
 block/blk.h         |    8 +++++++-
 block/cfq-iosched.c |   11 ++++++++---
 3 files changed, 16 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c	2011-02-22 13:23:25.000000000 -0500
+++ linux-2.6/block/cfq-iosched.c	2011-02-23 11:54:49.488446104 -0500
@@ -498,7 +498,7 @@ static inline bool cfq_bio_sync(struct b
 static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
 {
 	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
+		cfq_log(cfqd, "schedule dispatch: busy_queues=%d rq_queued=%d rq_in_driver=%d q=%p", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver, cfqd->queue);
 		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
 	}
 }
@@ -2229,6 +2229,8 @@ static struct cfq_queue *cfq_select_queu
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 
+	cfq_log(cfqd, "select_queue: busy_queues=%d rq_queued=%d rq_in_driver=%d", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver);
+
 	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		goto new_queue;
@@ -2499,8 +2501,10 @@ static int cfq_dispatch_requests(struct
 		return cfq_forced_dispatch(cfqd);
 
 	cfqq = cfq_select_queue(cfqd);
-	if (!cfqq)
+	if (!cfqq) {
+		cfq_log(cfqd, "select: no cfqq selected");
 		return 0;
+	}
 
 	/*
 	 * Dispatch a request from this cfqq, if it is allowed
@@ -3359,7 +3363,7 @@ static void cfq_insert_request(struct re
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
-	cfq_log_cfqq(cfqd, cfqq, "insert_request");
+	cfq_log_cfqq(cfqd, cfqq, "insert_request: busy_queues=%d rq_queued=%d rq_in_driver=%d", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver);
 	cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
 
 	rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
@@ -3707,6 +3711,7 @@ static void cfq_kick_queue(struct work_s
 	struct request_queue *q = cfqd->queue;
 
 	spin_lock_irq(q->queue_lock);
+	cfq_log(cfqd, "cfq_kick_queue called. busy_queues=%d rq_queued=%d rq_in_driver=%d", cfqd->busy_queues, cfqd->rq_queued, cfqd->rq_in_driver);
 	__blk_run_queue(cfqd->queue);
 	spin_unlock_irq(q->queue_lock);
 }

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2011-02-22 13:23:25.000000000 -0500
+++ linux-2.6/block/blk-core.c	2011-02-23 11:50:34.406700216 -0500
@@ -413,6 +413,7 @@ void __blk_run_queue(struct request_queu
 		queue_flag_clear(QUEUE_FLAG_REENTER, q);
 	} else {
 		queue_flag_set(QUEUE_FLAG_PLUGGED, q);
+		trace_printk("not recursing. Scheduling another dispatch\n");
 		kblockd_schedule_work(q, &q->unplug_work);
 	}
 }

Index: linux-2.6/block/blk.h
===================================================================
--- linux-2.6.orig/block/blk.h	2011-02-18 13:39:47.000000000 -0500
+++ linux-2.6/block/blk.h	2011-02-23 11:53:37.844304978 -0500
@@ -57,6 +57,8 @@ static inline struct request *__elv_next
 {
 	struct request *rq;
 
+	trace_printk("elv_next_request called. q=%p\n", q);
+
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
@@ -68,8 +70,12 @@ static inline struct request *__elv_next
 			return rq;
 		}
 
-		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+		trace_printk("No requests in q->queue_head. Will call elevator. q=%p\n", q);
+
+		if (!q->elevator->ops->elevator_dispatch_fn(q, 0)) {
+			trace_printk("No requests in elevator either q=%p\n", q);
 			return NULL;
+		}
 	}
 }

Hello, On Thu, Feb 24, 2011 at 09:23:03AM -0500, Vivek Goyal wrote:
On Thu, Feb 24, 2011 at 10:18:00AM +0100, Dominik Klein wrote:
Hi Dominik,
Thanks for the tests and reports. I checked the latest logs also and I see that cfq has scheduled a work item but that work never gets executed. I never see the trace message which says cfq_kick_queue().
I am ccing it to lkml and tejun to see if he has any suggestions.
Tejun,
I will give you some details about what we have discussed so far.
Dominik is trying blkio throttling feature and trying to throttle some virtual machines. He is using 2.6.37 kernels and once he launches 3 virtual machines he notices that system is kind of frozen. After running some traces we noticed that CFQ has requests but it is not dispatching these to devices any more.
This problem does not show up with deadline scheduler and also goes away with 2.6.38-rc6 kernels.
Hmmm... Maybe the following commit?

commit 7576958a9d5a4a677ad7dd40901cdbb6c1110c98
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Feb 14 14:04:46 2011 +0100

    workqueue: wake up a worker when a rescuer is leaving a gcwq

    After executing the matching works, a rescuer leaves the gcwq whether
    there are more pending works or not.  This may decrease the concurrency
    level to zero and stall execution until a new work item is queued on
    the gcwq.

    Make rescuer wake up a regular worker when it leaves a gcwq if there
    are more works to execute, so that execution isn't stalled.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Ray Jui <rjui@broadcom.com>
    Cc: stable@kernel.org

-- 
tejun

Hi
Hmmm... Maybe the following commit?
commit 7576958a9d5a4a677ad7dd40901cdbb6c1110c98
Unfortunately not. This does not fix it for me. Vivek introduced me to git bisect and I will try to do that tomorrow to find the changeset. Will report back. Thanks Dominik
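For anyone following along, one way to bisect for a fix rather than a regression is to invert the usual good/bad labels. A sketch only; the tree path and version tags below are illustrative:

cd linux-2.6
git bisect start
git bisect bad v2.6.38-rc1     # "bad" here means: the hang is gone (contains the fix)
git bisect good v2.6.37        # "good" here means: still hangs
# build and boot the suggested commit, run the dd test in the VMs, then mark it:
#   git bisect bad    if the system did NOT hang
#   git bisect good   if it hung
# repeat until git names the first commit after which the hang disappears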

On Thu, Feb 24, 2011 at 03:58:22PM +0100, Dominik Klein wrote:
Hi
Hmmm... Maybe the following commit?
commit 7576958a9d5a4a677ad7dd40901cdbb6c1110c98
Unfortunately not. This does not fix it for me.
Vivek introduced me to git bisect and I will try to do that tomorrow to find the changeset.
Maybe watching what the workqueue is doing using the following trace events could be helpful.

# grep workqueue /sys/kernel/debug/tracing/available_events
workqueue:workqueue_insertion
workqueue:workqueue_execution
workqueue:workqueue_creation
workqueue:workqueue_destruction

Thanks.

-- 
tejun

Maybe watching what the workqueue is doing using the following trace events could be helpful.
# grep workqueue /sys/kernel/debug/tracing/available_events workqueue:workqueue_insertion workqueue:workqueue_execution workqueue:workqueue_creation workqueue:workqueue_destruction
Since I've never done this before, I will tell you how I captured the trace so you know what I did and can see whether I did something wrong and can correct me if necessary.

echo blk > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/block/sdb/trace/enable
echo workqueue_queue_work >> /sys/kernel/debug/tracing/set_event
echo workqueue_activate_work >> /sys/kernel/debug/tracing/set_event
echo workqueue_execute_start >> /sys/kernel/debug/tracing/set_event
echo workqueue_execute_end >> /sys/kernel/debug/tracing/set_event
cat /sys/kernel/debug/tracing/trace_pipe

And that output I gzip'd.

This was taken with 2.6.37 plus the patch you mentioned on a Dell R815 with 2 12 core AMD Opteron 6174 CPUs. If you need any more information, please let me know.

Regards
Dominik

On Fri, Feb 25, 2011 at 08:24:15AM +0100, Dominik Klein wrote:
Maybe watching what the workqueue is doing using the following trace events could be helpful.
# grep workqueue /sys/kernel/debug/tracing/available_events workqueue:workqueue_insertion workqueue:workqueue_execution workqueue:workqueue_creation workqueue:workqueue_destruction
Since I've never done this before, I will tell you how I captured the trace so you know what I did and can see whether I did something wrong and can correct me if necessary.
echo blk > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/block/sdb/trace/enable echo workqueue_queue_work >> /sys/kernel/debug/tracing/set_event echo workqueue_activate_work >> /sys/kernel/debug/tracing/set_event echo workqueue_execute_start >> /sys/kernel/debug/tracing/set_event echo workqueue_execute_end >> /sys/kernel/debug/tracing/set_event cat /sys/kernel/debug/tracing/trace_pipe
And that output i gzip'd.
This was taken with 2.6.37 plus the patch you mentioned on a Dell R815 with 2 12 core AMD Opteron 6174 CPUs. If you need any more information, please let me know.
Hmmm... well, I have no idea what you were trying to do but here are some info which might be helpful.

* queue_work happens when the work item is queued.

* activate_work happens when the work item becomes eligible for execution. e.g. If the workqueue's @max_active is limited and maximum number of work items are already in flight, a new item will only get activated after one of the in flight ones retires.

* execute_start marks the actual starting of execution.

* execute_end marks the end of execution.

So, I would look for the matching work function and then try to follow what happens to it after being scheduled and if it doesn't get executed what's going on with the target workqueue.

Thanks.

-- 
tejun
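In practice that means searching the captured trace for the work function and then for the specific work struct pointer. A sketch, assuming the trace_pipe output was saved to a file; the file name and pointer value are only examples:

grep 'function=blk_throtl_work' trace_pipe.out     # find the queue_work events
grep 'ffff88080f14f270' trace_pipe.out             # follow one work struct through its lifetime
# a healthy item shows queue_work -> activate_work -> execute_start -> execute_end;
# a missing execute_end means that execution never finished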

Hi
This was taken with 2.6.37 plus the patch you mentioned on a Dell R815 with 2 12 core AMD Opteron 6174 CPUs. If you need any more information, please let me know.
Hmmm... well, I have no idea what you were trying to do
Long story short: I have a couple of virtual machines. Some of them have blkio throttle configured, some don't. To simulate whether the throttling works, I start

dd if=/dev/zero of=testfile bs=1M count=1500

in each guest simultaneously. The result is that from that point, no i/o is happening any more. You see the result in the trace.

With 2.6.37 (also tried .1 and .2) it does not work but end up like I documented. With 2.6.38-rc1, it does work. With deadline scheduler, it also works in 2.6.37.

I am in bisect run 2 currently to find the changeset that fixed it. Will let you know as soon as I do.

Regards
Dominik

ps. I am by no means a kernel hacker and none of the rest of your email made any sense to me. Sorry.
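For context, the throttle side of that setup boils down to per-device limits in the blkio cgroup. A rough illustration only; the cgroup path, major:minor numbers and limit below are made up, the real rules were set on the kernel3..kernel8 volumes:

ls -lL /dev/vdisks/kernel3                                                  # note the dm device's major:minor, e.g. 253, 3
echo "253:3 10485760" > /mnt/throttled/blkio.throttle.write_bps_device     # limit writes on that device to 10 MB/s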
but here are some info which might be helpful.
* queue_work happens when the work item is queued.
* activate_work happens when the work item becomes eligible for execution. e.g. If the workqueue's @max_active is limited and maximum number of work items are already in flight, a new item will only get activated after one of the in flight ones retires.
* execute_start marks the actual starting of execution.
* execute_end marks the end of execution.
So, I would look for the matching work function and then try to follow what happens to it after being scheduled and if it doesn't get executed what's going on with the target workqueue.
Thanks.

Hello, On Fri, Feb 25, 2011 at 12:46:16PM +0100, Dominik Klein wrote:
With 2.6.37 (also tried .1 and .2) it does not work but end up like I documented. With 2.6.38-rc1, it does work. With deadline scheduler, it also works in 2.6.37.
Okay, here's the problematic part.

<idle>-0   [013]  1640.975562: workqueue_queue_work: work struct=ffff88080f14f270 function=blk_throtl_work workqueue=ffff88102c8fc700 req_cpu=13 cpu=13
<idle>-0   [013]  1640.975564: workqueue_activate_work: work struct ffff88080f14f270
<...>-477  [013]  1640.975574: workqueue_execute_start: work struct ffff88080f14f270: function blk_throtl_work
<idle>-0   [013]  1641.087450: workqueue_queue_work: work struct=ffff88080f14f270 function=blk_throtl_work workqueue=ffff88102c8fc700 req_cpu=13 cpu=13

The workqueue is per-cpu, so we only need to follow cpu=13 cases. @1640, blk_throtl_work() is queued, activated and starts executing but never finishes. The same work item is never executed more than once at the same time on the same CPU, so when the next work item is queued, it doesn't get activated until the previous execution is complete.

The next thing to do would be finding out why blk_throtl_work() isn't finishing. sysrq-t or /proc/PID/stack should show us where it's stalled.

Thanks.

-- 
tejun

On 02/25/2011 02:18 PM, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 12:46:16PM +0100, Dominik Klein wrote:
With 2.6.37 (also tried .1 and .2) it does not work but end up like I documented. With 2.6.38-rc1, it does work. With deadline scheduler, it also works in 2.6.37.
Okay, here's the problematic part.
<idle>-0 [013] 1640.975562: workqueue_queue_work: work struct=ffff88080f14f270 function=blk_throtl_work workqueue=ffff88102c8fc700 req_cpu=13 cpu=13 <idle>-0 [013] 1640.975564: workqueue_activate_work: work struct ffff88080f14f270 <...>-477 [013] 1640.975574: workqueue_execute_start: work struct ffff88080f14f270: function blk_throtl_work <idle>-0 [013] 1641.087450: workqueue_queue_work: work struct=ffff88080f14f270 function=blk_throtl_work workqueue=ffff88102c8fc700 req_cpu=13 cpu=13
The workqueue is per-cpu, so we only need to follow cpu=13 cases. @1640, blk_throtl_work() is queued, activated and starts executing but never finishes. The same work item is never executed more than once at the same time on the same CPU, so when the next work item is queued, it doesn't get activated until the previous execution is complete.
The next thing to do would be finding out why blk_throtl_work() isn't finishing. sysrq-t or /proc/PID/stack should show us where it's stalled.
Thanks.
See attached logs of another run.

sysctl -w kernel.sysrq=1
echo blk > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/block/sdb/trace/enable
echo workqueue_queue_work >> /sys/kernel/debug/tracing/set_event
echo workqueue_activate_work >> /sys/kernel/debug/tracing/set_event
echo workqueue_execute_start >> /sys/kernel/debug/tracing/set_event
echo workqueue_execute_end >> /sys/kernel/debug/tracing/set_event

That makes attachment trace_pipe5.gz

echo 8 > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger

That makes attachment console.gz

hth
Dominik

Hello, On Fri, Feb 25, 2011 at 03:41:47PM +0100, Dominik Klein wrote:
See attached logs of another run.
sysctl -w kernel.sysrq=1
echo blk > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/block/sdb/trace/enable echo workqueue_queue_work >> /sys/kernel/debug/tracing/set_event echo workqueue_activate_work >> /sys/kernel/debug/tracing/set_event echo workqueue_execute_start >> /sys/kernel/debug/tracing/set_event echo workqueue_execute_end >> /sys/kernel/debug/tracing/set_event
That makes attachment trace_pipe5.gz
echo 8 > /proc/sysrq-trigger echo t > /proc/sysrq-trigger
That makes attachment console.gz
So, the following work item never finished. We can tell that pid 549 started execution from the last line.

<idle>-0   [017]  1497.601733: workqueue_queue_work: work struct=ffff880809f3fe70 function=blk_throtl_work workqueue=ffff88102c8ba700 req_cpu=17 cpu=17
<idle>-0   [017]  1497.601736: workqueue_activate_work: work struct ffff880809f3fe70
<...>-549  [017]  1497.601754: workqueue_execute_start: work struct ffff880809f3fe70: function blk_throtl_work

And the stack trace of pid 549 is...

[ 1522.220046] kworker/17:1  D ffff88202fc53600  0  549  2 0x00000000
[ 1522.220046]  ffff88082c5bd7c0 0000000000000046 ffff88180a822600 ffff88082c578000
[ 1522.220046]  0000000000013600 ffff88080afaffd8 0000000000013600 0000000000013600
[ 1522.220046]  ffff88082c5bda98 ffff88082c5bdaa0 ffff88082c5bd7c0 0000000000013600
[ 1522.220046] Call Trace:
[ 1522.220046]  [<ffffffff810395c6>] ? __wake_up+0x35/0x46
[ 1522.220046]  [<ffffffff81315de3>] ? io_schedule+0x68/0xa7
[ 1522.220046]  [<ffffffff81182168>] ? get_request_wait+0xee/0x17d
[ 1522.220046]  [<ffffffff810604f1>] ? autoremove_wake_function+0x0/0x2a
[ 1522.220046]  [<ffffffff811826b6>] ? __make_request+0x313/0x45d
[ 1522.220046]  [<ffffffff81180ebd>] ? generic_make_request+0x30d/0x385
[ 1522.220046]  [<ffffffff8105cc79>] ? queue_delayed_work_on+0xfc/0x10a
[ 1522.220046]  [<ffffffff8118c607>] ? blk_throtl_work+0x312/0x32b
[ 1522.220046]  [<ffffffff8118c2f5>] ? blk_throtl_work+0x0/0x32b
[ 1522.220046]  [<ffffffff8105b754>] ? process_one_work+0x1d1/0x2ee
[ 1522.220046]  [<ffffffff8105d1e3>] ? worker_thread+0x12d/0x247
[ 1522.220046]  [<ffffffff8105d0b6>] ? worker_thread+0x0/0x247
[ 1522.220046]  [<ffffffff8105d0b6>] ? worker_thread+0x0/0x247
[ 1522.220046]  [<ffffffff8106009f>] ? kthread+0x7a/0x82
[ 1522.220046]  [<ffffffff8100a824>] ? kernel_thread_helper+0x4/0x10
[ 1522.220046]  [<ffffffff81060025>] ? kthread+0x0/0x82
[ 1522.220046]  [<ffffffff8100a820>] ? kernel_thread_helper+0x0/0x10

The '?'s are because the frame pointer is disabled, which means that the stack trace is guesswork. Can you please turn on CONFIG_FRAME_POINTER just to be sure?

But at any rate, it looks like blk_throtl_work() got stuck trying to allocate a request. I don't think workqueue is causing any problem here. It seems like a resource deadlock on request. Vivek, any ideas?

Thanks.

-- 
tejun
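Enabling that option is an ordinary config change and rebuild; a sketch, with the tree path and job count as assumptions:

cd linux-2.6
make menuconfig    # Kernel hacking -> "Compile the kernel with frame pointers" (CONFIG_FRAME_POINTER=y)
make -j24 && make modules_install && make install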

On Fri, Feb 25, 2011 at 02:18:50PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 12:46:16PM +0100, Dominik Klein wrote:
With 2.6.37 (also tried .1 and .2) it does not work but end up like I documented. With 2.6.38-rc1, it does work. With deadline scheduler, it also works in 2.6.37.
Okay, here's the problematic part.
<idle>-0 [013] 1640.975562: workqueue_queue_work: work struct=ffff88080f14f270 function=blk_throtl_work workqueue=ffff88102c8fc700 req_cpu=13 cpu=13 <idle>-0 [013] 1640.975564: workqueue_activate_work: work struct ffff88080f14f270 <...>-477 [013] 1640.975574: workqueue_execute_start: work struct ffff88080f14f270: function blk_throtl_work <idle>-0 [013] 1641.087450: workqueue_queue_work: work struct=ffff88080f14f270 function=blk_throtl_work workqueue=ffff88102c8fc700 req_cpu=13 cpu=13
The workqueue is per-cpu, so we only need to follow cpu=13 cases. @1640, blk_throtl_work() is queued, activated and starts executing but never finishes. The same work item is never executed more than once at the same on the same CPU, so when the next work item is queued, it doesn't get activated until the previous execution is complete.
The next thing to do would be finding out why blk_throtl_work() isn't finishing. sysrq-t or /proc/PID/stack should show us where it's stalled.
Hi Tejun,

blk_throtl_work() calls generic_make_request() to dispatch some bios and I guess blk_throtl_work() has been put to sleep because there are no request descriptors available and CFQ is frozen so no request descriptors get freed hence blk_throtl_work() never finishes.

Following caught my eye.

ksoftirqd/0-3  [000]  1640.983585:  8,16  m  N cfq4810 slice expired t=0
ksoftirqd/0-3  [000]  1640.983588:  8,16  m  N cfq4810 sl_used=2 disp=6 charge=2 iops=0 sect=2080
ksoftirqd/0-3  [000]  1640.983589:  8,16  m  N cfq4810 del_from_rr
ksoftirqd/0-3  [000]  1640.983591:  8,16  m  N cfq schedule dispatch
sshd-3125      [004]  1640.983597: workqueue_queue_work: work struct=ffff88102c3a3110 function=flush_to_ldisc workqueue=ffff88182c834a00 req_cpu=4 cpu=4
sshd-3125      [004]  1640.983598: workqueue_activate_work: work struct ffff88102c3a3110

CFQ tries to schedule a work but there is no associated "workqueue_queue_work" trace. So it looks like that work never got queued.

CFQ calls following.

cfq_log(cfqd, "schedule dispatch");
kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);

We do see the "schedule dispatch" message and kblockd_schedule_work() calls queue_work(). So what happened here? This is strange. I will put one more trace after kblockd_schedule_work() to trace that the function returned.

Thanks
Vivek

Hello, On Fri, Feb 25, 2011 at 09:57:08AM -0500, Vivek Goyal wrote:
blk_throtl_work() calls generic_make_request() to dispatch some bios and I guess blk_throtl_work() has been put to sleep because there are no request descriptors available and CFQ is frozen so no request descriptors get freed hence blk_throtl_work() never finishes.
Following caught my eye.
ksoftirqd/0-3  [000]  1640.983585:  8,16  m  N cfq4810 slice expired t=0
ksoftirqd/0-3  [000]  1640.983588:  8,16  m  N cfq4810 sl_used=2 disp=6 charge=2 iops=0 sect=2080
ksoftirqd/0-3  [000]  1640.983589:  8,16  m  N cfq4810 del_from_rr
ksoftirqd/0-3  [000]  1640.983591:  8,16  m  N cfq schedule dispatch
sshd-3125      [004]  1640.983597: workqueue_queue_work: work struct=ffff88102c3a3110 function=flush_to_ldisc workqueue=ffff88182c834a00 req_cpu=4 cpu=4
sshd-3125      [004]  1640.983598: workqueue_activate_work: work struct ffff88102c3a3110
CFQ tries to schedule a work but there is no associated "workqueue_queue_work" trace. So it looks like that work never got queued.
CFQ calls following.
cfq_log(cfqd, "schedule dispatch");
kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
We do see the "schedule dispatch" message and kblockd_schedule_work() calls queue_work(). So what happened here? This is strange. I will put one more trace after kblockd_schedule_work() to trace that the function returned.
It could be that the unplug work was already queued and in pending state. The second queueing request will be ignored then. So, I think the problem is that blk_throtl_work() occupies kblockd but requires another work item (unplug_work) to make forward progress. In such cases, forward progress cannot be guaranteed. Either blk_throtl_work() or cfq unplug work should use a separate workqueue. Thanks. -- tejun

On Fri, Feb 25, 2011 at 04:03:29PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 09:57:08AM -0500, Vivek Goyal wrote:
blk_throtl_work() calls generic_make_request() to dispatch some bios and I guess blk_throtl_work() has been put to sleep because threre are no request descriptors available and CFQ is frozen so no requests descriptors get freed hence blk_throtl_work() never finishes.
Following caught my eye.
ksoftirqd/0-3 [000] 1640.983585: 8,16 m N cfq4810 slice expired t=0 ksoftirqd/0-3 [000] 1640.983588: 8,16 m N cfq4810 sl_used=2 disp=6 charge=2 iops=0 sect=2080 ksoftirqd/0-3 [000] 1640.983589: 8,16 m N cfq4810 del_from_rr ksoftirqd/0-3 [000] 1640.983591: 8,16 m N cfq schedule dispatch sshd-3125 [004] 1640.983597: workqueue_queue_work: work struct=ffff88102c3a3110 function=flush_to_ldisc workqueue=ffff88182c834a00 req_cpu=4 cpu=4 sshd-3125 [004] 1640.983598: workqueue_activate_work: work struct ffff88102c3a3110
CFQ tries to schedule a work and but there is no associated "workqueue_queue_work" trace. So it looks like that work never got queued.
CFQ calls following.
cfq_log(cfqd, "schedule dispatch"); kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
We do see "schedule dispatch" message and kblockd_schedule_work() calls queue_work(). So what happended here? This is strange. I will put one more trace after kblockd_schedule_work() to trace that function returned.
It could be that the unplug work was already queued and in pending state. The second queueing request will be ignored then. So, I think the problem is that blk_throtl_work() occupies kblockd but requires another work item (unplug_work) to make forward progress. In such cases, forward progress cannot be guaranteed. Either blk_throtl_work() or cfq unplug work should use a separate workqueue.
Ok, that would make sense. So blk_throtl_work() cannot finish as CFQ is not making progress and no request descriptors are being freed, and unplug_work() is not being called because blk_throtl_work() has not finished. So that's a cyclic dependency and I should use a separate work queue for queueing throttle related work. I will write a patch. Thanks Vivek

On Fri, Feb 25, 2011 at 10:11:13AM -0500, Vivek Goyal wrote:
On Fri, Feb 25, 2011 at 04:03:29PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 09:57:08AM -0500, Vivek Goyal wrote:
blk_throtl_work() calls generic_make_request() to dispatch some bios and I guess blk_throtl_work() has been put to sleep because threre are no request descriptors available and CFQ is frozen so no requests descriptors get freed hence blk_throtl_work() never finishes.
Following caught my eye.
ksoftirqd/0-3 [000] 1640.983585: 8,16 m N cfq4810 slice expired t=0 ksoftirqd/0-3 [000] 1640.983588: 8,16 m N cfq4810 sl_used=2 disp=6 charge=2 iops=0 sect=2080 ksoftirqd/0-3 [000] 1640.983589: 8,16 m N cfq4810 del_from_rr ksoftirqd/0-3 [000] 1640.983591: 8,16 m N cfq schedule dispatch sshd-3125 [004] 1640.983597: workqueue_queue_work: work struct=ffff88102c3a3110 function=flush_to_ldisc workqueue=ffff88182c834a00 req_cpu=4 cpu=4 sshd-3125 [004] 1640.983598: workqueue_activate_work: work struct ffff88102c3a3110
CFQ tries to schedule a work and but there is no associated "workqueue_queue_work" trace. So it looks like that work never got queued.
CFQ calls following.
cfq_log(cfqd, "schedule dispatch"); kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
We do see "schedule dispatch" message and kblockd_schedule_work() calls queue_work(). So what happended here? This is strange. I will put one more trace after kblockd_schedule_work() to trace that function returned.
It could be that the unplug work was already queued and in pending state. The second queueing request will be ignored then. So, I think the problem is that blk_throtl_work() occupies kblockd but requires another work item (unplug_work) to make forward progress. In such cases, forward progress cannot be guaranteed. Either blk_throtl_work() or cfq unplug work should use a separate workqueue.
Ok, that would make sense. So blk_throtl_work() can not finish as CFQ is not making progress and no request descriptors are being freed and unplug_work() is not being called because blk_throtl_work() has not finished. So that's cyclic dependency and I should use a separate work queue for queueing throttle related work. I will write a patch.
The only thing unexplained is why the same problem does not happen in 2.6.38-rc kernels. Thanks Vivek

On Fri, Feb 25, 2011 at 10:11:13AM -0500, Vivek Goyal wrote:
On Fri, Feb 25, 2011 at 04:03:29PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 09:57:08AM -0500, Vivek Goyal wrote:
blk_throtl_work() calls generic_make_request() to dispatch some bios and I guess blk_throtl_work() has been put to sleep because threre are no request descriptors available and CFQ is frozen so no requests descriptors get freed hence blk_throtl_work() never finishes.
Following caught my eye.
ksoftirqd/0-3 [000] 1640.983585: 8,16 m N cfq4810 slice expired t=0 ksoftirqd/0-3 [000] 1640.983588: 8,16 m N cfq4810 sl_used=2 disp=6 charge=2 iops=0 sect=2080 ksoftirqd/0-3 [000] 1640.983589: 8,16 m N cfq4810 del_from_rr ksoftirqd/0-3 [000] 1640.983591: 8,16 m N cfq schedule dispatch sshd-3125 [004] 1640.983597: workqueue_queue_work: work struct=ffff88102c3a3110 function=flush_to_ldisc workqueue=ffff88182c834a00 req_cpu=4 cpu=4 sshd-3125 [004] 1640.983598: workqueue_activate_work: work struct ffff88102c3a3110
CFQ tries to schedule a work and but there is no associated "workqueue_queue_work" trace. So it looks like that work never got queued.
CFQ calls following.
cfq_log(cfqd, "schedule dispatch"); kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
We do see "schedule dispatch" message and kblockd_schedule_work() calls queue_work(). So what happended here? This is strange. I will put one more trace after kblockd_schedule_work() to trace that function returned.
It could be that the unplug work was already queued and in pending state. The second queueing request will be ignored then. So, I think the problem is that blk_throtl_work() occupies kblockd but requires another work item (unplug_work) to make forward progress. In such cases, forward progress cannot be guaranteed. Either blk_throtl_work() or cfq unplug work should use a separate workqueue.
Ok, that would make sense. So blk_throtl_work() can not finish as CFQ is not making progress and no request descriptors are being freed and unplug_work() is not being called because blk_throtl_work() has not finished. So that's cyclic dependency and I should use a separate work queue for queueing throttle related work. I will write a patch.
Hi Dominik, Can you please try attached patch and see if fixes the issue. Thanks Vivek o Use a separate workqueue for throttle related work and don't reuse kblockd workqueue as there occurs a cycle dependency in cfq unplug work and throttle dispatch work. Yet-to-be-signed-off-by: Vivek Goyal <vgoyal@redhat.com> --- block/blk-throttle.c | 28 ++++++++++++++++++++-------- include/linux/blkdev.h | 2 -- 2 files changed, 20 insertions(+), 10 deletions(-) Index: linux-2.6/block/blk-throttle.c =================================================================== --- linux-2.6.orig/block/blk-throttle.c 2011-02-21 22:30:39.000000000 -0500 +++ linux-2.6/block/blk-throttle.c 2011-02-25 10:53:51.884672758 -0500 @@ -20,6 +20,10 @@ static int throtl_quantum = 32; /* Throttling is performed over 100ms slice and after that slice is renewed */ static unsigned long throtl_slice = HZ/10; /* 100 ms */ +/* A workqueue to queue throttle related work */ +static struct workqueue_struct *kthrotld_workqueue; +void throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay); + struct throtl_rb_root { struct rb_root rb; struct rb_node *left; @@ -146,6 +150,12 @@ static inline struct throtl_grp *throtl_ return tg; } +int kthrotld_schedule_delayed_work(struct throtl_data *td, + struct delayed_work *dwork, unsigned long delay) +{ + return queue_delayed_work(kthrotld_workqueue, dwork, delay); +} + static void throtl_put_tg(struct throtl_grp *tg) { BUG_ON(atomic_read(&tg->ref) <= 0); @@ -346,10 +356,9 @@ static void throtl_schedule_next_dispatc update_min_dispatch_time(st); if (time_before_eq(st->min_disptime, jiffies)) - throtl_schedule_delayed_work(td->queue, 0); + throtl_schedule_delayed_work(td, 0); else - throtl_schedule_delayed_work(td->queue, - (st->min_disptime - jiffies)); + throtl_schedule_delayed_work(td, (st->min_disptime - jiffies)); } static inline void @@ -809,10 +818,9 @@ void blk_throtl_work(struct work_struct } /* Call with queue lock held */ -void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) +void throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay) { - struct throtl_data *td = q->td; struct delayed_work *dwork = &td->throtl_work; if (total_nr_queued(td) > 0) { @@ -821,12 +829,11 @@ void throtl_schedule_delayed_work(struct * Cancel that and schedule a new one. */ __cancel_delayed_work(dwork); - kblockd_schedule_delayed_work(q, dwork, delay); + kthrotld_schedule_delayed_work(td, dwork, delay); throtl_log(td, "schedule work. 
delay=%lu jiffies=%lu", delay, jiffies); } } -EXPORT_SYMBOL(throtl_schedule_delayed_work); static void throtl_destroy_tg(struct throtl_data *td, struct throtl_grp *tg) @@ -895,7 +902,7 @@ static void throtl_update_blkio_group_co xchg(&tg->limits_changed, true); xchg(&td->limits_changed, true); /* Schedule a work now to process the limit change */ - throtl_schedule_delayed_work(td->queue, 0); + throtl_schedule_delayed_work(td, 0); } /* @@ -1113,6 +1120,11 @@ void blk_throtl_exit(struct request_queu static int __init throtl_init(void) { + kthrotld_workqueue = alloc_workqueue("kthrotld", + WQ_MEM_RECLAIM | WQ_HIGHPRI, 0); + if (!kthrotld_workqueue) + panic("Failed to create kthrotld\n"); + blkio_policy_register(&blkio_policy_throtl); return 0; } Index: linux-2.6/include/linux/blkdev.h =================================================================== --- linux-2.6.orig/include/linux/blkdev.h 2011-02-21 22:30:39.000000000 -0500 +++ linux-2.6/include/linux/blkdev.h 2011-02-25 10:50:50.706137004 -0500 @@ -1136,7 +1136,6 @@ static inline uint64_t rq_io_start_time_ extern int blk_throtl_init(struct request_queue *q); extern void blk_throtl_exit(struct request_queue *q); extern int blk_throtl_bio(struct request_queue *q, struct bio **bio); -extern void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay); #else /* CONFIG_BLK_DEV_THROTTLING */ static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio) { @@ -1145,7 +1144,6 @@ static inline int blk_throtl_bio(struct static inline int blk_throtl_init(struct request_queue *q) { return 0; } static inline int blk_throtl_exit(struct request_queue *q) { return 0; } -static inline void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) {} #endif /* CONFIG_BLK_DEV_THROTTLING */ #define MODULE_ALIAS_BLOCKDEV(major,minor) \

Hello, On Fri, Feb 25, 2011 at 11:03:53AM -0500, Vivek Goyal wrote:
+int kthrotld_schedule_delayed_work(struct throtl_data *td,
+			struct delayed_work *dwork, unsigned long delay)
+{
+	return queue_delayed_work(kthrotld_workqueue, dwork, delay);
+}
+
I don't think wrapping is necessary. Defining and using a workqueue directly should be enough.
@@ -1113,6 +1120,11 @@ void blk_throtl_exit(struct request_queu
 static int __init throtl_init(void)
 {
+	kthrotld_workqueue = alloc_workqueue("kthrotld",
+					WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
And I don't think kthrotld needs to be HIGHPRI. Thanks. -- tejun

On Fri, Feb 25, 2011 at 05:09:03PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 11:03:53AM -0500, Vivek Goyal wrote:
+int kthrotld_schedule_delayed_work(struct throtl_data *td,
+			struct delayed_work *dwork, unsigned long delay)
+{
+	return queue_delayed_work(kthrotld_workqueue, dwork, delay);
+}
+
I don't think wrapping is necessary. Defining and using a workqueue directly should be enough.
@@ -1113,6 +1120,11 @@ void blk_throtl_exit(struct request_queu
 static int __init throtl_init(void)
 {
+	kthrotld_workqueue = alloc_workqueue("kthrotld",
+					WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
And I don't think kthrotld needs to be HIGHPRI.
Ok, regenerating the patch with above change. Anyway I had to as I generated this patch on top of some of my local commits which are not in Linus tree yet. Thanks Vivek

On Fri, Feb 25, 2011 at 05:09:03PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 11:03:53AM -0500, Vivek Goyal wrote:
+int kthrotld_schedule_delayed_work(struct throtl_data *td,
+			struct delayed_work *dwork, unsigned long delay)
+{
+	return queue_delayed_work(kthrotld_workqueue, dwork, delay);
+}
+
I don't think wrapping is necessary. Defining and using a workqueue directly should be enough.
@@ -1113,6 +1120,11 @@ void blk_throtl_exit(struct request_queu
 static int __init throtl_init(void)
 {
+	kthrotld_workqueue = alloc_workqueue("kthrotld",
+					WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
And I don't think kthrotld needs to be HIGHPRI.
Thanks.
Here is the new patch. Dominik, can you please try it and see if it fixes the issue.

Thanks
Vivek

o Use a separate workqueue for throttle related work and don't reuse kblockd
  workqueue as there occurs a cycle dependency in cfq unplug work and throttle
  dispatch work.

Yet-to-be-signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c   |   27 ++++++++++++++++-----------
 include/linux/blkdev.h |    2 --
 2 files changed, 16 insertions(+), 13 deletions(-)

Index: linux-2.6/block/blk-throttle.c
===================================================================
--- linux-2.6.orig/block/blk-throttle.c	2011-02-25 11:11:49.000000000 -0500
+++ linux-2.6/block/blk-throttle.c	2011-02-25 11:21:51.709326299 -0500
@@ -20,6 +20,10 @@ static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 static unsigned long throtl_slice = HZ/10;	/* 100 ms */
 
+/* A workqueue to queue throttle related work */
+static struct workqueue_struct *kthrotld_workqueue;
+void throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay);
+
 struct throtl_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
@@ -345,10 +349,9 @@ static void throtl_schedule_next_dispatc
 	update_min_dispatch_time(st);
 
 	if (time_before_eq(st->min_disptime, jiffies))
-		throtl_schedule_delayed_work(td->queue, 0);
+		throtl_schedule_delayed_work(td, 0);
 	else
-		throtl_schedule_delayed_work(td->queue,
-				(st->min_disptime - jiffies));
+		throtl_schedule_delayed_work(td, (st->min_disptime - jiffies));
 }
 
 static inline void
@@ -815,10 +818,9 @@ void blk_throtl_work(struct work_struct
 }
 
 /* Call with queue lock held */
-void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay)
+void throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay)
 {
-	struct throtl_data *td = q->td;
 	struct delayed_work *dwork = &td->throtl_work;
 
 	if (total_nr_queued(td) > 0) {
@@ -827,12 +829,11 @@ void throtl_schedule_delayed_work(struct
 		 * Cancel that and schedule a new one.
 		 */
 		__cancel_delayed_work(dwork);
-		kblockd_schedule_delayed_work(q, dwork, delay);
+		queue_delayed_work(kthrotld_workqueue, dwork, delay);
 		throtl_log(td, "schedule work. delay=%lu jiffies=%lu", delay, jiffies);
 	}
 }
-EXPORT_SYMBOL(throtl_schedule_delayed_work);
 
 static void
 throtl_destroy_tg(struct throtl_data *td, struct throtl_grp *tg)
@@ -920,7 +921,7 @@ static void throtl_update_blkio_group_re
 	smp_mb__after_atomic_inc();
 
 	/* Schedule a work now to process the limit change */
-	throtl_schedule_delayed_work(td->queue, 0);
+	throtl_schedule_delayed_work(td, 0);
 }
 
 static void throtl_update_blkio_group_write_bps(void *key,
@@ -934,7 +935,7 @@ static void throtl_update_blkio_group_wr
 	smp_mb__before_atomic_inc();
 	atomic_inc(&td->limits_changed);
 	smp_mb__after_atomic_inc();
-	throtl_schedule_delayed_work(td->queue, 0);
+	throtl_schedule_delayed_work(td, 0);
 }
 
 static void throtl_update_blkio_group_read_iops(void *key,
@@ -948,7 +949,7 @@ static void throtl_update_blkio_group_re
 	smp_mb__before_atomic_inc();
 	atomic_inc(&td->limits_changed);
 	smp_mb__after_atomic_inc();
-	throtl_schedule_delayed_work(td->queue, 0);
+	throtl_schedule_delayed_work(td, 0);
 }
 
 static void throtl_update_blkio_group_write_iops(void *key,
@@ -962,7 +963,7 @@ static void throtl_update_blkio_group_wr
 	smp_mb__before_atomic_inc();
 	atomic_inc(&td->limits_changed);
 	smp_mb__after_atomic_inc();
-	throtl_schedule_delayed_work(td->queue, 0);
+	throtl_schedule_delayed_work(td, 0);
 }
 
 void throtl_shutdown_timer_wq(struct request_queue *q)
@@ -1135,6 +1136,10 @@ void blk_throtl_exit(struct request_queu
 
 static int __init throtl_init(void)
 {
+	kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
+	if (!kthrotld_workqueue)
+		panic("Failed to create kthrotld\n");
+
 	blkio_policy_register(&blkio_policy_throtl);
 	return 0;
 }

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2011-02-25 11:11:49.000000000 -0500
+++ linux-2.6/include/linux/blkdev.h	2011-02-25 11:13:03.670455489 -0500
@@ -1136,7 +1136,6 @@ static inline uint64_t rq_io_start_time_
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
 extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
-extern void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay);
 extern void throtl_shutdown_timer_wq(struct request_queue *q);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
@@ -1146,7 +1145,6 @@ static inline int blk_throtl_bio(struct
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
-static inline void throtl_schedule_delayed_work(struct request_queue *q, unsigned long delay) {}
 static inline void throtl_shutdown_timer_wq(struct request_queue *q) {}
 #endif /* CONFIG_BLK_DEV_THROTTLING */
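If the patch applies and the kernel boots, one possible sanity check before re-running the test (my assumption, not part of the patch description): alloc_workqueue() with WQ_MEM_RECLAIM creates a rescuer thread named after the workqueue, so the new queue should be visible in the process list:

ps ax | grep '\[kthrotld\]'    # rescuer thread of the new throttle workqueue
# then restart the VMs, run the dd jobs again and watch "vmstat 2";
# bo should no longer drop to 0 with a few hundred threads stuck in D state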

On Fri, Feb 25, 2011 at 08:24:15AM +0100, Dominik Klein wrote:
Maybe watching what the workqueue is doing using the following trace events could be helpful.
# grep workqueue /sys/kernel/debug/tracing/available_events workqueue:workqueue_insertion workqueue:workqueue_execution workqueue:workqueue_creation workqueue:workqueue_destruction
Since I've never done this before, I will tell you how I captured the trace so you know what I did and can see whether I did something wrong and can correct me if necessary.
echo blk > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/block/sdb/trace/enable echo workqueue_queue_work >> /sys/kernel/debug/tracing/set_event echo workqueue_activate_work >> /sys/kernel/debug/tracing/set_event echo workqueue_execute_start >> /sys/kernel/debug/tracing/set_event echo workqueue_execute_end >> /sys/kernel/debug/tracing/set_event cat /sys/kernel/debug/tracing/trace_pipe
Just an FYI,

If you download trace-cmd from:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

You could do the following:

# trace-cmd record -p blk -e workqueue_queue_work -e workqueue_activate_work -e workqueue_execute_start -e workqueue_execute_end <cmd>

Where <cmd> could be sh -c "echo 1 > /sys/block/sdb/trace/enable; sleep 10"

This will create a trace.dat file that you can read with trace-cmd report

Another way, if you want to do more than just that echo, is to use:

# trace-cmd start -p blk -e workqueue_queue_work -e workqueue_activate_work -e workqueue_execute_start -e workqueue_execute_end
# echo 1 > /sys/block/sdb/trace/enable
# <do anything you want>
# trace-cmd stop
# trace-cmd extract

The start just enables ftrace (like you did with the echos). The extract will create the trace.dat file from the ftrace ring buffers. You could also just cat trace_pipe, which would do the same, or you could do the echos, and then the trace-cmd stop and extract to get the file. It's your choice ;)

You can then gzip the trace.dat file and send it to others that can read it as well, as all the information needed to read the trace is recorded in the file.

You could even send the data over the network instead of writing the trace.dat locally.

On another box:

$ trace-cmd listen -p 12345

Then on the target:

# trace-cmd record -N host:12345 ...

This will send the data to the listening host and the file will be created on the host side.

One added benefit of having the trace.dat file: kernelshark can read it.

-- Steve
And that output I gzip'd.
This was taken with 2.6.37 plus the patch you mentioned, on a Dell R815 with two 12-core AMD Opteron 6174 CPUs. If you need any more information, please let me know.

On Fri, Feb 25, 2011 at 02:53:08PM -0500, Steven Rostedt wrote:
On Fri, Feb 25, 2011 at 08:24:15AM +0100, Dominik Klein wrote:
Maybe watching what the workqueue is doing, using the following trace events, could be helpful.
# grep workqueue /sys/kernel/debug/tracing/available_events
workqueue:workqueue_insertion
workqueue:workqueue_execution
workqueue:workqueue_creation
workqueue:workqueue_destruction
Since I've never done this before, I will tell you how I captured the trace, so you can check what I did and correct me if necessary.
echo blk > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/block/sdb/trace/enable
echo workqueue_queue_work >> /sys/kernel/debug/tracing/set_event
echo workqueue_activate_work >> /sys/kernel/debug/tracing/set_event
echo workqueue_execute_start >> /sys/kernel/debug/tracing/set_event
echo workqueue_execute_end >> /sys/kernel/debug/tracing/set_event
cat /sys/kernel/debug/tracing/trace_pipe
Just an FYI,
If you download trace-cmd from:
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
You could do the following:
# trace-cmd record -p blk -e workqueue_queue_work -e workqueue_activate_work -e workqueue_execute_start -e workqueue_execute_end <cmd>
Where <cmd> could be sh -c "echo 1 > /sys/block/sdb/trace/enable; sleep 10"
This will create a trace.dat file that you can read with trace-cmd report
Another way, if you want to do more than just that echo, is to use:
# trace-cmd start -p blk -e workqueue_queue_work -e workqueue_activate_work -e workqueue_execute_start -e workqueue_execute_end
# echo 1 > /sys/block/sdb/trace/enable
# <do anything you want>
# trace-cmd stop
# trace-cmd extract
The start just enables ftrace (like you did with the echos). The extract will create the trace.dat file from the ftrace ring buffers. You could also just cat trace_pipe, which would do the same, or you could do the echos and then the trace-cmd stop and extract to get the file. It's your choice ;)
You can then gzip the trace.dat file and send it to others that can read it as well, as all the information needed to read the trace is recorded in the file.
You could even send the data over the network instead of writing the trace.dat locally.
On another box:
$ trace-cmd listen -p 12345
Then on the target:
# trace-cmd record -N host:12345 ...
This will send the data to the listening host and the file will be created on the host side.
Thanks Steve. In this case this feature of sending trace data over the network would have helped. We were running into issues where the IO scheduler was freezing, so we could not read anything from disk (including saved traces). Hence we were directing everything to the console and then doing copy-paste.

So sending it over the network would have probably worked even in this case.

Will give trace-cmd a try next time.

Thanks
Vivek

On Fri, 2011-02-25 at 15:18 -0500, Vivek Goyal wrote:
Thanks Steve. In this case this feature of sending trace data over the network would have helped. We were running into issues where the IO scheduler was freezing, so we could not read anything from disk (including saved traces). Hence we were directing everything to the console and then doing copy-paste.
So sending it over network would have probably worked even in this case.
Will give trace-cmd a try next time.
Hi Vivek,

I just pushed out a hack that lets trace-cmd read the blktrace. I found that the blktrace never exported its structure to /debug/tracing/events/ftrace/blktrace/format, so userspace has no real way to know how to parse it. Instead, I wrote a hack that creates this file semi-dynamically, based on the information of other events. I also copied a lot of the blktrace code from the kernel so that it can print out the same format.

You need to install the plugin that is built with trace-cmd. It will be installed automatically if you do a make install; if you do not have root access, just cp the plugin_blk.so into the ~/.trace-cmd/plugins directory (you may need to make that directory yourself). Then when you run trace-cmd report on a file made with the blk tracer, it will give you a nice output.

If you already have a trace.dat file from a previous extract, you don't need to run the trace again. The trace-cmd report will work on that file now.

I'll be heading out to NYC on Monday for the End Users Conf, and this weekend I need to get all my chores done around the house for the missus to let me go ;) Thus, I won't be doing much more till I get back at the end of next week.

-- Steve
participants (5)
- Dominik Klein
- Gui Jianfeng
- Steven Rostedt
- Tejun Heo
- Vivek Goyal