On Mon, Feb 21, 2011 at 09:30:32AM +0100, Dominik Klein wrote:
On 02/21/2011 09:19 AM, Dominik Klein wrote:
>>> - Is it possible to capture 10-15 second blktrace on your underlying
>>> physical device. That should give me some idea what's happening.
>>
>> Will do, read on.
>
> Just realized I missed this one ... Had better done it right away.
>
> So here goes.
>
> Setup as in first email. 8 Machines, 2 important, 6 not important ones
> with a throttle of ~10M. group_isolation=1. Each vm dd'ing zeroes.
>
> blktrace -d /dev/sdb -w 30
> === sdb ===
> CPU 0: 4769 events, 224 KiB data
> CPU 1: 28079 events, 1317 KiB data
> CPU 2: 1179 events, 56 KiB data
> CPU 3: 5529 events, 260 KiB data
> CPU 4: 295 events, 14 KiB data
> CPU 5: 649 events, 31 KiB data
> CPU 6: 185 events, 9 KiB data
> CPU 7: 180 events, 9 KiB data
> CPU 8: 17 events, 1 KiB data
> CPU 9: 12 events, 1 KiB data
> CPU 10: 6 events, 1 KiB data
> CPU 11: 55 events, 3 KiB data
> CPU 12: 28005 events, 1313 KiB data
> CPU 13: 1542 events, 73 KiB data
> CPU 14: 4814 events, 226 KiB data
> CPU 15: 389 events, 19 KiB data
> CPU 16: 1545 events, 73 KiB data
> CPU 17: 119 events, 6 KiB data
> CPU 18: 3019 events, 142 KiB data
> CPU 19: 62 events, 3 KiB data
> CPU 20: 800 events, 38 KiB data
> CPU 21: 17 events, 1 KiB data
> CPU 22: 243 events, 12 KiB data
> CPU 23: 1 events, 1 KiB data
> Total: 81511 events (dropped 0), 3822 KiB data
>
> A very constant 296 blocked processes in vmstat during this run. But...
> apparently no data is written at all (see the "bo" column).
Hm, this sounds bad. If you have put a limit of ~10MB/s, then no
"bo" at all is bad. That would explain why your box stops responding
and you need to do a power reset.
- I am assuming that you have not put any throttling limits on the root
group (see the example commands after this list for how to check). Is your
system root also on /dev/sdb, or on a separate disk altogether?
- This sounds like a bug in the throttling logic. To narrow it down, can you
switch the end device to the "deadline" scheduler (example below)? If it
still happens, the problem is more or less in the throttling layer.
- We can also try to remove the dm layers: just create partitions on
/dev/sdb, export them as virtio disks to the virtual machines, and see if
it still happens with the dm layer out of the picture.
- In one of the mails you mentioned that with 1 virtual machine, throttling
READs and WRITEs works for you. So it looks like 1 virtual machine does not
hang, but once you launch 8 virtual machines it hangs. Can we try increasing
the number of virtual machines gradually and confirm that it only happens
once a certain number of virtual machines is launched?
- Can you also paste the rules you have put on the important and
non-important groups (sample commands below)? I suspect that one of the
rules has gone horribly wrong, in the sense that the limit is very low and
effectively no virtual machine is making any progress.
- How long does it take to reach this locked state where bo=0?
- You can also try piping the blktrace output through blkparse to standard
output and capture some of it by copy-pasting the last messages (example
below).
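
Some example commands for the points above. All of these are only sketches;
the paths and names are assumptions on my side, so adjust them to your setup.

To check that the root group carries no limits (assuming the blkio
controller is mounted at /cgroup/blkio):

  # both files should be empty if no limits are set on the root group
  cat /cgroup/blkio/blkio.throttle.read_bps_device
  cat /cgroup/blkio/blkio.throttle.write_bps_device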
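
For the "deadline" test (assuming the end device is /dev/sdb):

  # the currently active scheduler is shown in square brackets
  cat /sys/block/sdb/queue/scheduler
  echo deadline > /sys/block/sdb/queue/scheduler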
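
For the rules, I mean the contents of the blkio.throttle.* files in each
group. As a rough sketch (the group name "notimportant" and the 8:16
major:minor numbers for /dev/sdb are just guesses from my side):

  # ~10MB/s write limit in bytes/sec; check the device numbers with: ls -l /dev/sdb
  echo "8:16 10485760" > /cgroup/blkio/notimportant/blkio.throttle.write_bps_device
  # dump the rules currently configured in each child group
  grep . /cgroup/blkio/*/blkio.throttle.*_bps_device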
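
For the blktrace capture, something like this streams the parsed trace to
standard output (again assuming /dev/sdb):

  blktrace -d /dev/sdb -o - | blkparse -i -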
In the meantime, I will try to launch more machines and see if I can
reproduce the issue.
Thanks
Vivek
>
> vmstat 2
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r   b swpd      free  buff  cache si so bi   bo    in    cs us sy id wa
>  0 296    0 125254224 21432 142016  0  0 16  633   181   331  0  0 93  7
>  0 296    0 125253728 21432 142016  0  0  0    0 17115 33794  0  0 25 75
>  0 296    0 125254112 21432 142016  0  0  0    0 17084 33721  0  0 25 74
>  1 296    0 125254352 21440 142012  0  0  0   18 17047 33736  0  0 25 75
>  0 296    0 125304224 21440 131060  0  0  0    0 17630 33989  0  1 23 76
>  1 296    0 125306496 21440 130260  0  0  0    0 16810 33401  0  0 20 80
>  4 296    0 125307208 21440 129856  0  0  0    0 17169 33744  0  0 26 74
>  0 296    0 125307496 21448 129508  0  0  0   14 17105 33650  0  0 36 64
>  0 296    0 125307712 21452 129672  0  0  2 1340 17117 33674  0  0 22 78
>  1 296    0 125307752 21452 129520  0  0  0    0 16875 33438  0  0 29 70
>  1 296    0 125307776 21452 129520  0  0  0    0 16959 33560  0  0 21 79
>  1 296    0 125307792 21460 129520  0  0  0   12 16700 33239  0  0 15 85
>  1 296    0 125307808 21460 129520  0  0  0    0 16750 33274  0  0 25 74
>  1 296    0 125307808 21460 129520  0  0  0    0 17020 33601  0  0 26 74
>  1 296    0 125308272 21460 129520  0  0  0    0 17080 33616  0  0 20 80
>  1 296    0 125308408 21460 129520  0  0  0    0 16428 32972  0  0 42 58
>  1 296    0 125308016 21460 129524  0  0  0    0 17021 33624  0  0 22 77
> While we're on that ... It is impossible for me now to recover from this
> state without pulling the power plug.
>
> On the VMs' consoles I see messages like
> INFO: task (kjournald|flush-254|dd|rs:main|...) blocked for more than
> 120 seconds.
If VMs are completely blocked and not making any progress, it is expected.
> While the ssh sessions through which the dd was started seem intact
> (pressing enter gives a new line), it is impossible to cancel the dd
> command. Logging in on the VMs' consoles is also impossible.
>
> Opening a new ssh session to the host also does not work. Killing the
> qemu-kvm processes from a session opened earlier leaves zombie processes.
>
> Moving the VMs back to the root cgroup makes no difference either.
> Regards
> Dominik
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list