1. 21 Oct, 2015 1 commit
  2. 10 Jun, 2015 1 commit
    • Jens Axboe's avatar
      cfq-iosched: fix the setting of IOPS mode on SSDs · 0bb97947
      Jens Axboe authored
      A previous commit wanted to make CFQ default to IOPS mode on
      non-rotational storage, however it did so when the queue was
      initialized and the non-rotational flag is only set later on
      in the probe.
      Add an elevator hook that gets called off the add_disk() path,
      at that point we know that feature probing has finished, and
      we can reliably check for the various flags that drivers can
      Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
      Tested-by: default avatarRomain Francoise <romain@orebokech.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  3. 02 Jun, 2015 1 commit
  4. 23 Apr, 2015 1 commit
  5. 04 Dec, 2014 1 commit
  6. 23 Oct, 2014 1 commit
  7. 22 Jun, 2014 1 commit
  8. 11 Jun, 2014 1 commit
  9. 10 Jun, 2014 1 commit
  10. 10 Apr, 2014 1 commit
    • Jens Axboe's avatar
      block: fix regression with block enabled tagging · 360f92c2
      Jens Axboe authored
      Martin reported that his test system would not boot with
      current git, it oopsed with this:
      BUG: unable to handle kernel paging request at ffff88046c6c9e80
      IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
      PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
      Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
      rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
      CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
      Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
      Workqueue: events_unbound async_run_entry_fn
      task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
      RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
      RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
      RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
      RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
      RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
      R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
      R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
      FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
       ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
       ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
       ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
      Call Trace:
       [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
       [<ffffffff81293167>] __blk_run_queue+0x37/0x50
       [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
       [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
       [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
       [<ffffffff813bba35>] scsi_execute+0xe5/0x180
       [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
       [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
       [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
       [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
       [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
       [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
       [<ffffffff8106a1c5>] process_one_work+0x175/0x410
       [<ffffffff8106b373>] worker_thread+0x123/0x400
       [<ffffffff8106b250>] ? manage_workers+0x160/0x160
       [<ffffffff8107104e>] kthread+0xce/0xf0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
      Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
      df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
      58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
      RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
       RSP <ffff880273d03a58>
      CR2: ffff88046c6c9e80
      Martin bisected and found this to be the problem patch;
      	commit 6d113398
      	Author: Jan Kara <jack@suse.cz>
      	Date:   Mon Feb 24 16:39:54 2014 +0100
      	    block: Stop abusing rq->csd.list in blk-softirq
      and the problem was immediately apparent. The patch states that
      it is safe to reuse queuelist at completion time, since it is
      no longer used. However, that is not true if a device is using
      block enabled tagging. If that is the case, then the queuelist
      is reused to keep track of busy tags. If a device also ended
      up using softirq completions, we'd reuse ->queuelist for the
      IPI handling while block tagging was still using it. Boom.
      Fix this by adding a new ipi_list list head, and share the
      memory used with the request hash table. The hash table is
      never used after the request is moved to the dispatch list,
      which happens long before any potential completion of the
      request. Add a new request bit for this, so we don't have
      cases that check rq->hash while it could potentially have
      been reused for the IPI completion.
      Reported-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Tested-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  11. 24 Nov, 2013 1 commit
    • Kent Overstreet's avatar
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet authored
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
  12. 08 Nov, 2013 2 commits
    • Tomoki Sekiyama's avatar
      elevator: acquire q->sysfs_lock in elevator_change() · 7c8a3679
      Tomoki Sekiyama authored
      Add locking of q->sysfs_lock into elevator_change() (an exported function)
      to ensure it is held to protect q->elevator from elevator_init(), even if
      elevator_change() is called from non-sysfs paths.
      sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
      version, as the lock is already taken by elv_iosched_store().
      Signed-off-by: default avatarTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tomoki Sekiyama's avatar
      elevator: Fix a race in elevator switching and md device initialization · eb1c160b
      Tomoki Sekiyama authored
      The soft lockup below happens at the boot time of the system using dm
      multipath and the udev rules to switch scheduler.
      [  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
      [  356.127001] RIP: 0010:[<ffffffff81072a7d>]  [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
      [  356.127001] Call Trace:
      [  356.127001]  [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
      [  356.127001]  [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
      [  356.127001]  [<ffffffff810738b2>] del_timer_sync+0x52/0x60
      [  356.127001]  [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
      [  356.127001]  [<ffffffff812c98df>] elevator_exit+0x2f/0x50
      [  356.127001]  [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
      [  356.127001]  [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
      [  356.127001]  [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
      [  356.127001]  [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
      [  356.127001]  [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
      [  356.127001]  [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
      [  356.127001]  [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
      This is caused by a race between md device initialization by multipathd and
      shell script to switch the scheduler using sysfs.
       - multipathd:
         SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
         -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
          q->elevator = elevator_alloc(q, e); // not yet initialized
       - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
         elevator_switch (in the call trace above)
          struct elevator_queue *old = q->elevator;
          q->elevator = elevator_alloc(q, new_e);
          elevator_exit(old);                 // lockup! (*)
       - multipathd: (cont.)
          err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified
      (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
      while timer->base == NULL. In this case, as timer will never initialized,
      it results in lockup.
      This patch introduces acquisition of q->sysfs_lock around elevator_init()
      into blk_init_allocated_queue(), to provide mutual exclusion between
      initialization of the q->scheduler and switching of the scheduler.
      This should fix this bugzilla:
      https://bugzilla.redhat.com/show_bug.cgi?id=902012Signed-off-by: default avatarTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  13. 11 Sep, 2013 1 commit
  14. 03 Jul, 2013 1 commit
    • Jianpeng Ma's avatar
      elevator: Fix a race in elevator switching · d50235b7
      Jianpeng Ma authored
      There's a race between elevator switching and normal io operation.
          Because the allocation of struct elevator_queue and struct elevator_data
          don't in a atomic operation.So there are have chance to use NULL
          For example:
              Thread A:                               Thread B
              blk_queu_bio                            elevator_switch
              spin_lock_irq(q->queue_block)           elevator_alloc
              elv_merge                               elevator_init_fn
          Because call elevator_alloc, it can't hold queue_lock and the
          ->elevator_data is NULL.So at the same time, threadA call elv_merge and
          nedd some info of elevator_data.So the crash happened.
          Move the elevator_alloc into func elevator_init_fn, it make the
          operations in a atomic operation.
          Using the follow method can easy reproduce this bug
          1:dd if=/dev/sdb of=/dev/null
          2:while true;do echo noop > scheduler;echo deadline > scheduler;done
          The test method also use this method.
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  15. 23 Mar, 2013 1 commit
  16. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
              list_for_each_entry(pos, head, member)
      The hlist ones were greedy and wanted an extra parameter:
              hlist_for_each_entry(tpos, pos, head, member)
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      Besides the semantic patch, there was some manual work required:
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      -T b;
          <+... when != b
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      - b,
      c) S
      for_each_busy_worker(a, c,
      - b,
      d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      -(a, b)
      + sk_for_each_from(a) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d, e) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  17. 23 Jan, 2013 1 commit
    • Tejun Heo's avatar
      block: don't request module during elevator init · 21c3c5d2
      Tejun Heo authored
      Block layer allows selecting an elevator which is built as a module to
      be selected as system default via kernel param "elevator=".  This is
      achieved by automatically invoking request_module() whenever a new
      block device is initialized and the elevator is not available.
      This led to an interesting deadlock problem involving async and module
      init.  Block device probing running off an async job invokes
      request_module().  While the module is being loaded, it performs
      async_synchronize_full() which ends up waiting for the async job which
      is already waiting for request_module() to finish, leading to
      Invoking request_module() from deep in block device init path is
      already nasty in itself.  It seems best to avoid these situations from
      the beginning by moving on-demand module loading out of block init
      The previous patch made sure that the default elevator module is
      loaded early during boot if available.  This patch removes on-demand
      loading of the default elevator from elevator init path.  As the
      module would have been loaded during boot, userland-visible behavior
      difference should be minimal.
      For more details, please refer to the following thread.
      v2: The bool parameter was named @request_module which conflicted with
          request_module().  This built okay w/ CONFIG_MODULES because
          request_module() was defined as a macro.  W/o CONFIG_MODULES, it
          causes build breakage.  Rename the parameter to @try_loading.
          Reported by Fengguang.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Riesen <raa.lkml@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
  18. 18 Jan, 2013 1 commit
    • Tejun Heo's avatar
      init, block: try to load default elevator module early during boot · bb813f4c
      Tejun Heo authored
      This patch adds default module loading and uses it to load the default
      block elevator.  During boot, it's called right after initramfs or
      initrd is made available and right before control is passed to
      userland.  This ensures that as long as the modules are available in
      the usual places in initramfs, initrd or the root filesystem, the
      default modules are loaded as soon as possible.
      This will replace the on-demand elevator module loading from elevator
      init path.
      v2: Fixed build breakage when !CONFIG_BLOCK.  Reported by kbuild test
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Riesen <raa.lkml@gmail.com>
      Cc: Fengguang We <fengguang.wu@intel.com>
  19. 11 Jan, 2013 1 commit
  20. 09 Nov, 2012 1 commit
    • Shaohua Li's avatar
      block: recursive merge requests · bee0393c
      Shaohua Li authored
      In a workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1, a+3,....
      When the requests are flushed to queue, a and a+1 are merged to (a, a+1), a+2
      and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged.
      If we do recursive merge for such interleave access, some workloads throughput
      get improvement. A recent worload I'm checking on is swap, below change
      boostes the throughput around 5% ~ 10%.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  21. 20 Sep, 2012 1 commit
  22. 20 Apr, 2012 1 commit
    • Tejun Heo's avatar
      blkcg: implement per-queue policy activation · a2b1693b
      Tejun Heo authored
      All blkcg policies were assumed to be enabled on all request_queues.
      Due to various implementation obstacles, during the recent blkcg core
      updates, this was temporarily implemented as shooting down all !root
      blkgs on elevator switch and policy [de]registration combined with
      half-broken in-place root blkg updates.  In addition to being buggy
      and racy, this meant losing all blkcg configurations across those
      Now that blkcg is cleaned up enough, this patch replaces the temporary
      implementation with proper per-queue policy activation.  Each blkcg
      policy should call the new blkcg_[de]activate_policy() to enable and
      disable the policy on a specific queue.  blkcg_activate_policy()
      allocates and installs policy data for the policy for all existing
      blkgs.  blkcg_deactivate_policy() does the reverse.  If a policy is
      not enabled for a given queue, blkg printing / config functions skip
      the respective blkg for the queue.
      blkcg_activate_policy() also takes care of root blkg creation, and
      cfq_init_queue() and blk_throtl_init() are updated accordingly.
      This replaces blkcg_bypass_{start|end}() and update_root_blkg_pd()
      unnecessary.  Dropped.
      v2: cfq_init_queue() was returning uninitialized @ret on root_group
          alloc failure if !CONFIG_CFQ_GROUP_IOSCHED.  Fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  23. 06 Mar, 2012 7 commits
    • Tejun Heo's avatar
      block: implement bio_associate_current() · 852c788f
      Tejun Heo authored
      IO scheduling and cgroup are tied to the issuing task via io_context
      and cgroup of %current.  Unfortunately, there are cases where IOs need
      to be routed via a different task which makes scheduling and cgroup
      limit enforcement applied completely incorrectly.
      For example, all bios delayed by blk-throttle end up being issued by a
      delayed work item and get assigned the io_context of the worker task
      which happens to serve the work item and dumped to the default block
      cgroup.  This is double confusing as bios which aren't delayed end up
      in the correct cgroup and makes using blk-throttle and cfq propio
      together impossible.
      Any code which punts IO issuing to another task is affected which is
      getting more and more common (e.g. btrfs).  As both io_context and
      cgroup are firmly tied to task including userland visible APIs to
      manipulate them, it makes a lot of sense to match up tasks to bios.
      This patch implements bio_associate_current() which associates the
      specified bio with %current.  The bio will record the associated ioc
      and blkcg at that point and block layer will use the recorded ones
      regardless of which task actually ends up issuing the bio.  bio
      release puts the associated ioc and blkcg.
      It grabs and remembers ioc and blkcg instead of the task itself
      because task may already be dead by the time the bio is issued making
      ioc and blkcg inaccessible and those are all block layer cares about.
      elevator_set_req_fn() is updated such that the bio elvdata is being
      allocated for is available to the elevator.
      This doesn't update block cgroup policies yet.  Further patches will
      implement the support.
      -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
           rq_ioc() to fix build breakage.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: unify blkg's for blkcg policies · e8989fae
      Tejun Heo authored
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      This patch make blkg's per cgroup-queue and let them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: let blkcg core manage per-queue blkg list and counter · 03aa264a
      Tejun Heo authored
      With the previous patch to move blkg list heads and counters to
      request_queue and blkg, logic to manage them in both policies are
      almost identical and can be moved to blkcg core.
      This patch moves blkg link logic into blkg_lookup_create(), implements
      common blkg unlink code in blkg_destroy(), and updates
      blkg_destory_all() so that it's policy specific and can skip root
      group.  The updated blkg_destroy_all() is now used to both clear queue
      for bypassing and elv switching, and release all blkgs on q exit.
      This patch introduces a race window where policy [de]registration may
      race against queue blkg clearing.  This can only be a problem on cfq
      unload and shouldn't be a real problem in practice (and we have many
      other places where this race already exists).  Future patches will
      remove these unlikely races.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Tejun Heo authored
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that applies to them, as an interim step, all per-policy blkg
      data will be shot down.
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      * blk-throtl dosen't shoot down root_tg.  This is to ease transition.
        Unified blkg will always have persistent root group and not shooting
        down root_tg for now eases transition to that point by avoiding
        having to update td->root_tg and is safe as blk-throtl can never be
      -v2: Vivek pointed out that group list is not guaranteed to be empty
           on return from clear function if it raced cgroup removal and
           lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in limbo state between blkcg and request_queue locks.
           blk-throtl no longer shoots down root_tg to avoid breaking
           Also, Nest queue_lock inside blkio_list_lock not the other way
           around to avoid introduce possible deadlock via blkcg lock.
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: implement blk_queue_bypass_start/end() · d732580b
      Tejun Heo authored
      Rename and extend elv_queisce_start/end() to
      blk_queue_bypass_start/end() which are exported and supports nesting
      via @q->bypass_depth.  Also add blk_queue_bypass() to test bypass
      This will be further extended and used for blkio_group management.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      elevator: make elevator_init_fn() return 0/-errno · b2fab5ac
      Tejun Heo authored
      elevator_ops->elevator_init_fn() has a weird return value.  It returns
      a void * which the caller should assign to q->elevator->elevator_data
      and %NULL return denotes init failure.
      Update such that it returns integer 0/-errno and sets elevator_data
      directly as necessary.
      This makes the interface more conventional and eases further cleanup.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      elevator: clear auxiliary data earlier during elevator switch · 5a5bafdc
      Tejun Heo authored
      Elevator switch tries hard to keep as much as context until new
      elevator is ready so that it can revert to the original state if
      initializing the new elevator fails for some reason.  Unfortunately,
      with more auxiliary contexts to manage, this makes elevator init and
      exit paths too complex and fragile.
      This patch makes elevator_switch() unregister the current elevator and
      flush icq's before start initializing the new one.  As we still keep
      the old elevator itself, the only difference is that we lose icq's on
      rare occassions of switching failure, which isn't critical at all.
      Note that this makes explicit elevator parameter to
      elevator_init_queue() and __elv_register_queue() unnecessary as they
      always can use the current elevator.
      This patch enables block cgroup cleanups.
      -v2: blk_add_trace_msg() prints elevator name from @new_e instead of
           @e->type as the local variable no longer exists.  This caused
           build failure on CONFIG_BLK_DEV_IO_TRACE.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  24. 08 Feb, 2012 1 commit
    • Tejun Heo's avatar
      block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Tejun Heo authored
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  25. 15 Jan, 2012 1 commit
    • Jens Axboe's avatar
      Revert "block: recursive merge requests" · 5d381efb
      Jens Axboe authored
      This reverts commit 27419322.
      We have some problems related to selection of empty queues
      that need to be resolved, evidence so far points to the
      recursive merge logic making either being the cause or at
      least the accelerator for this. So revert it for now, until
      we figure this out.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  26. 16 Dec, 2011 1 commit
    • Shaohua Li's avatar
      block: recursive merge requests · 27419322
      Shaohua Li authored
      In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
      a+3,.... When the requests are flushed to queue, a and a+1 are merged
      to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
      aren't merged.
      With recursive merge below, the workload throughput gets improved 20%
      and context switch drops 60%.
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  27. 13 Dec, 2011 6 commits
    • Tejun Heo's avatar
      block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Tejun Heo authored
      With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
      blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
      with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
      and q, and freeing automatically handled by blk-ioc.  The elevator
      operation only need to perform exit operation specific to the elevator
      - in cfq's case, exiting the cfqq's.
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      Because the q io_cq points to might be freed before RCU callback for
      the io_cq runs, blk-ioc code should remember to which cache the io_cq
      needs to be freed when the io_cq is released.  New field
      io_cq->__rcu_icq_cache is added for this purpose.  As both the new
      field and rcu_head are used only after io_cq is released and the
      q/ioc_node fields aren't, they are put into unions.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move icq cache management to block core · 3d3c2379
      Tejun Heo authored
      Let elevators set ->icq_size and ->icq_align in elevator_type and
      elv_register() and elv_unregister() respectively create and destroy
      kmem_cache for icq.
      * elv_register() now can return failure.  All callers updated.
      * icq caches are automatically named "ELVNAME_io_cq".
      * cfq_slab_setup/kill() are collapsed into cfq_init/exit().
      * While at it, minor indentation change for iosched_cfq.elevator_name
        for consistency.
      This will help moving icq management to block core.  This doesn't
      introduce any functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Tejun Heo authored
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      * Move cfqd->icq_list to request_queue->icq_list
      * Make request explicitly point to icq instead of through elevator
        private data.  ->elevator_private[3] is replaced with sub struct elv
        which contains icq pointer and priv[2].  cfq is updated accordingly.
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      This patch doesn't introduce any functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: remove elevator_queue->ops · 22f746e2
      Tejun Heo authored
      elevator_queue->ops points to the same ops struct ->elevator_type.ops
      is pointing to.  The only effect of caching it in elevator_queue is
      shorter notation - it doesn't save any indirect derefence.
      Relocate elevator_type->list which used only during module init/exit
      to the end of the structure, rename elevator_queue->elevator_type to
      ->type, and replace elevator_queue->ops with elevator_queue->type.ops.
      This doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: reorder elevator switch sequence · f8fc877d
      Tejun Heo authored
      Elevator switch sequence first attached the new elevator, then tried
      registering it (sysfs) and if that failed attached back the old
      elevator.  However, sysfs registration doesn't require the elevator to
      be attached, so there is no reason to do the "detach, attach new,
      register, maybe re-attach old" sequence.  It can just do "register,
      detach, attach".
      * elevator_init_queue() is updated to set ->elevator_data directly and
        return 0 / -errno.  This allows elevator_exit() on an unattached
      * __elv_unregister_queue() which was necessary to unregister
        unattached q is removed in favor of __elv_register_queue() which can
        register unattached q.
      * elevator_attach() becomes a single assignment and obscures more then
        it helps.  Dropped.
      This will help cleaning up io_context handling across elevator switch.
      This patch doesn't introduce visible behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: remove delayed unlink · b9a19208
      Tejun Heo authored
      Now that all cic's are immediately unlinked from both ioc and queue,
      lazy dropping from lookup path and trimming on elevator unregister are
      unnecessary.  Kill them and remove now unused elevator_ops->trim().
      This also leaves call_for_each_cic() without any user.  Removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  28. 19 Oct, 2011 1 commit
    • Tejun Heo's avatar
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo authored
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>