
On Fri, Feb 25, 2011 at 10:11:13AM -0500, Vivek Goyal wrote:
On Fri, Feb 25, 2011 at 04:03:29PM +0100, Tejun Heo wrote:
Hello,
On Fri, Feb 25, 2011 at 09:57:08AM -0500, Vivek Goyal wrote:
blk_throtl_work() calls generic_make_request() to dispatch some bios, and I guess blk_throtl_work() has been put to sleep because there are no request descriptors available. CFQ is frozen, so no request descriptors get freed, and hence blk_throtl_work() never finishes.
The following caught my eye.
ksoftirqd/0-3 [000] 1640.983585: 8,16 m N cfq4810 slice expired t=0
ksoftirqd/0-3 [000] 1640.983588: 8,16 m N cfq4810 sl_used=2 disp=6 charge=2 iops=0 sect=2080
ksoftirqd/0-3 [000] 1640.983589: 8,16 m N cfq4810 del_from_rr
ksoftirqd/0-3 [000] 1640.983591: 8,16 m N cfq schedule dispatch
sshd-3125 [004] 1640.983597: workqueue_queue_work: work struct=ffff88102c3a3110 function=flush_to_ldisc workqueue=ffff88182c834a00 req_cpu=4 cpu=4
sshd-3125 [004] 1640.983598: workqueue_activate_work: work struct ffff88102c3a3110
CFQ tries to schedule a work item, but there is no associated "workqueue_queue_work" trace. So it looks like that work never got queued.
CFQ calls the following.
cfq_log(cfqd, "schedule dispatch");
kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
We do see the "schedule dispatch" message, and kblockd_schedule_work() calls queue_work(). So what happened here? This is strange. I will put one more trace after kblockd_schedule_work() to confirm that the function returned.
It could be that the unplug work was already queued and in pending state. The second queueing request will be ignored then. So, I think the problem is that blk_throtl_work() occupies kblockd but requires another work item (unplug_work) to make forward progress. In such cases, forward progress cannot be guaranteed. Either blk_throtl_work() or cfq unplug work should use a separate workqueue.
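As a rough illustration of the "already pending" case described above, here is a minimal user-space sketch. It is not kernel code: struct sim_work, sim_queue_work() and sim_worker_pick() are hypothetical names that only model the pending flag of a work item, on the assumption that queueing an item which is still pending is a no-op that returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a work item: only the pending flag is modelled. */
struct sim_work {
    bool pending;
};

/*
 * Models the queueing contract: the first request marks the item pending
 * and returns true; a request made while it is still pending is ignored
 * and returns false.
 */
static bool sim_queue_work(struct sim_work *w)
{
    if (w->pending)
        return false;   /* already queued: nothing new is enqueued */
    w->pending = true;
    return true;
}

/* A worker clears the pending flag when it picks the item up to run it. */
static void sim_worker_pick(struct sim_work *w)
{
    assert(w->pending); /* a worker only picks up items that were queued */
    w->pending = false;
}
```

In this model, a second sim_queue_work() on the same item returns false until a worker has picked it up, which would be consistent with seeing "schedule dispatch" in the log without a matching queueing event.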
Ok, that would make sense. So blk_throtl_work() cannot finish because CFQ is not making progress and no request descriptors are being freed, and the unplug work is not being run because blk_throtl_work() has not finished. So that's a cyclic dependency, and I should use a separate workqueue for queueing throttle related work. I will write a patch.
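The cyclic dependency can be sketched with a small user-space model. This is only an illustration under a strong simplifying assumption, namely that each queue has exactly one execution context; struct sim_queue and throttle_work_completes() are hypothetical names and not part of the block layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each queue has exactly one execution context. */
struct sim_queue {
    bool worker_blocked;  /* the single worker is asleep inside a work item */
    bool unplug_pending;  /* unplug work is queued here, waiting for the worker */
};

/* Returns true if the throttle work eventually finishes in this model. */
static bool throttle_work_completes(bool separate_queue)
{
    struct sim_queue kblockd = { false, false };
    struct sim_queue throtl_q = { false, false };
    struct sim_queue *throtl_home = separate_queue ? &throtl_q : &kblockd;
    int free_rq = 0;  /* no free request descriptors at the start */

    /* CFQ has queued its unplug work on kblockd. */
    kblockd.unplug_pending = true;

    /* The throttle work starts on its home queue and sleeps for a descriptor. */
    throtl_home->worker_blocked = true;

    /* The unplug work can run only if kblockd's worker is not blocked. */
    if (kblockd.unplug_pending && !kblockd.worker_blocked) {
        kblockd.unplug_pending = false;
        free_rq++;  /* dispatch lets a request complete, freeing a descriptor */
    }

    /* The throttle work wakes up only if a descriptor became available. */
    if (free_rq > 0) {
        free_rq--;
        throtl_home->worker_blocked = false;
    }

    assert(free_rq == 0);  /* any freed descriptor was consumed */
    return !throtl_home->worker_blocked;
}
```

With separate_queue set to false the model stays stuck, because the sleeping throttle work holds the only worker that could run the unplug work. With separate_queue set to true the unplug work runs on kblockd, a descriptor is freed, and the throttle work completes.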
The only thing unexplained is why the same problem does not happen in 2.6.38-rc kernels.

Thanks
Vivek