On 02/21/2011 09:19 AM, Dominik Klein wrote:
>> - Is it possible to capture 10-15 second blktrace on your
>> underlying physical device. That should give me some idea what's
>> happening.
>
> Will do, read on.
Just realized I missed this one ... I should have done it right away.
So here goes.
Setup as in the first email: 8 machines, 2 important ones and 6
unimportant ones throttled to ~10 MB/s. group_isolation=1. Each VM
dd'ing zeroes.
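
For reference, a minimal sketch of how such a setup is typically wired
up on a cgroup-v1 blkio hierarchy -- the mount point, the group name
and the 8:16 device numbers below are my assumptions, not values taken
from this host:

  # assumed blkio mount point and device numbers (sdb = 8:16)
  mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

  # CFQ tunable that keeps each cgroup on its own service tree
  echo 1 > /sys/block/sdb/queue/iosched/group_isolation

  # cap writes of the "unimportant" group to ~10 MB/s on sdb
  mkdir /sys/fs/cgroup/blkio/unimportant
  echo "8:16 10485760" > /sys/fs/cgroup/blkio/unimportant/blkio.throttle.write_bps_device

  # put one of the qemu-kvm processes (placeholder PID) into the group
  echo $QEMU_PID > /sys/fs/cgroup/blkio/unimportant/tasks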
blktrace -d /dev/sdb -w 30
=== sdb ===
CPU 0: 4769 events, 224 KiB data
CPU 1: 28079 events, 1317 KiB data
CPU 2: 1179 events, 56 KiB data
CPU 3: 5529 events, 260 KiB data
CPU 4: 295 events, 14 KiB data
CPU 5: 649 events, 31 KiB data
CPU 6: 185 events, 9 KiB data
CPU 7: 180 events, 9 KiB data
CPU 8: 17 events, 1 KiB data
CPU 9: 12 events, 1 KiB data
CPU 10: 6 events, 1 KiB data
CPU 11: 55 events, 3 KiB data
CPU 12: 28005 events, 1313 KiB data
CPU 13: 1542 events, 73 KiB data
CPU 14: 4814 events, 226 KiB data
CPU 15: 389 events, 19 KiB data
CPU 16: 1545 events, 73 KiB data
CPU 17: 119 events, 6 KiB data
CPU 18: 3019 events, 142 KiB data
CPU 19: 62 events, 3 KiB data
CPU 20: 800 events, 38 KiB data
CPU 21: 17 events, 1 KiB data
CPU 22: 243 events, 12 KiB data
CPU 23: 1 events, 1 KiB data
Total: 81511 events (dropped 0), 3822 KiB data
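
(The raw per-CPU trace files are not very readable on their own; the
usual post-processing is blkparse/btt -- the file names below are just
what blktrace produces by default for sdb:)

  # merge the per-CPU sdb.blktrace.* files into a readable event log
  blkparse -i sdb -o sdb.txt

  # or dump a binary stream and let btt summarize queue/dispatch/completion times
  blkparse -i sdb -d sdb.bin -O
  btt -i sdb.bin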
A very constant 296 blocked processes in vmstat during this run. But
apparently no data is written at all (see the "bo" column). A sketch
for inspecting those blocked tasks follows the vmstat output below.
vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r   b swpd      free  buff  cache   si   so    bi    bo    in    cs us sy id wa
 0 296    0 125254224 21432 142016    0    0    16   633   181   331  0  0 93  7
 0 296    0 125253728 21432 142016    0    0     0     0 17115 33794  0  0 25 75
 0 296    0 125254112 21432 142016    0    0     0     0 17084 33721  0  0 25 74
 1 296    0 125254352 21440 142012    0    0     0    18 17047 33736  0  0 25 75
 0 296    0 125304224 21440 131060    0    0     0     0 17630 33989  0  1 23 76
 1 296    0 125306496 21440 130260    0    0     0     0 16810 33401  0  0 20 80
 4 296    0 125307208 21440 129856    0    0     0     0 17169 33744  0  0 26 74
 0 296    0 125307496 21448 129508    0    0     0    14 17105 33650  0  0 36 64
 0 296    0 125307712 21452 129672    0    0     2  1340 17117 33674  0  0 22 78
 1 296    0 125307752 21452 129520    0    0     0     0 16875 33438  0  0 29 70
 1 296    0 125307776 21452 129520    0    0     0     0 16959 33560  0  0 21 79
 1 296    0 125307792 21460 129520    0    0     0    12 16700 33239  0  0 15 85
 1 296    0 125307808 21460 129520    0    0     0     0 16750 33274  0  0 25 74
 1 296    0 125307808 21460 129520    0    0     0     0 17020 33601  0  0 26 74
 1 296    0 125308272 21460 129520    0    0     0     0 17080 33616  0  0 20 80
 1 296    0 125308408 21460 129520    0    0     0     0 16428 32972  0  0 42 58
 1 296    0 125308016 21460 129524    0    0     0     0 17021 33624  0  0 22 77
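
To see where those ~296 blocked (D-state) tasks are actually stuck, one
option -- not done in this run -- is to dump their kernel stacks; the
PID below is just a placeholder for one of the hung dd/qemu-kvm tasks:

  # dump blocked tasks' stacks to the kernel log (needs sysrq enabled)
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200

  # or inspect a single hung task directly (if /proc/<pid>/stack is available)
  cat /proc/12345/stack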
While we're on that ... it is now impossible for me to recover from
this state without pulling the power plug.
On the VMs' consoles I see messages like
INFO: task (kjournald|flush-254|dd|rs:main|...) blocked for more than
120 seconds.
While the ssh sessions through which the dd was started seem intact
(pressing enter gives a new line), it is impossible to cancel the dd
command. Logging in on the VMs' consoles is also impossible.
Opening a new ssh session to the host does not work either. Killing the
qemu-kvm processes from a session opened earlier leaves zombie processes.
Moving the VMs back to the root cgroup makes no difference either.
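
(For completeness, "moving the VMs back to the root cgroup" means
roughly the following -- the mount point is again an assumption, and
even after this the tasks stay blocked here:)

  # reattach a qemu-kvm process (placeholder PID) to the root blkio group
  echo $QEMU_PID > /sys/fs/cgroup/blkio/tasks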
Regards
Dominik