1. 13 Oct, 2018 1 commit
  2. 26 Jun, 2018 1 commit
    • Roman Pen's avatar
      blk-mq: reinit q->tag_set_list entry only after grace period · ba502bf2
      Roman Pen authored
      commit a347c7ad upstream.
      
      It is not allowed to reinit q->tag_set_list list entry while RCU grace
      period has not completed yet, otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile another context frees other queue but with the same set of
      shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed and now task is in
         blk_mq_del_queue_tag_set().
      3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
         tag list in order to find hctx to restart.
      
      Because linked list entry was modified in blk_mq_del_queue_tag_set()
      without proper waiting for a grace period, blk_mq_sched_restart()
      never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.
      
      Fix is simple: reinit list entry after an RCU grace period elapsed.
      
      Fixes: Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba502bf2
  3. 20 Jun, 2018 1 commit
  4. 26 Apr, 2018 1 commit
  5. 12 Apr, 2018 3 commits
    • Ming Lei's avatar
      blk-mq: fix kernel oops in blk_mq_tag_idle() · 8976d64b
      Ming Lei authored
      
      [ Upstream commit 8ab0b7dc ]
      
      HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
      then we need to check it before calling blk_mq_tag_idle(), otherwise
      the following kernel oops can be triggered, so fix it by checking if
      the hw queue is unmapped since it doesn't make sense to idle the tags
      any more after hw queues are unmapped.
      
      [  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
      [  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
      [  440.788359] RIP: 0010:[<ffffffffb730e2b4>]  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
      [  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
      [  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
      [  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
      [  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
      [  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
      [  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
      [  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
      [  440.876169] Call Trace:
      [  440.879818]  [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
      [  440.887051]  [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
      [  440.894465]  [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
      [  440.901881]  [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
      [  440.910068]  [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
      [  440.919026]  [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
      [  440.928079]  [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
      [  440.937126]  [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
      [  440.944517]  [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
      [  440.951607]  [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
      [  440.959760]  [<ffffffffb70b352f>] kthread+0xcf/0xe0
      [  440.966055]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.973715]  [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
      [  440.980586]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
      [  441.011620] RIP  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  441.019301]  RSP <ffff893bf9bcbd10>
      [  441.024052] CR2: 0000000000000008
      Reported-by: default avatarZhang Yi <yizhan@redhat.com>
      Tested-by: default avatarZhang Yi <yizhan@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8976d64b
    • Ming Lei's avatar
      blk-mq: fix race between updating nr_hw_queues and switching io sched · fb1ef85d
      Ming Lei authored
      
      [ Upstream commit fb350e0a ]
      
      In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
      can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
      example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
      freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.
      
      This patch fixes the race be holding q->sysfs_lock.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb1ef85d
    • Ming Lei's avatar
      blk-mq: avoid to map CPU into stale hw queue · eaa07780
      Ming Lei authored
      
      [ Upstream commit 7d4901a9 ]
      
      blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
      previous map isn't cleared yet, and may point to one stale hw queue
      index.
      
      This patch fixes the following issue by clearing the mapping table before
      setting it up in blk_mq_pci_map_queues().
      
      This patches fixes this following issue reported by Zhang Yi:
      
      [  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
      [  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
      [  101.216346] PGD 0 P4D 0
      [  101.219171] Oops: 0000 [#1] SMP
      [  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
      [  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
      [  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
      [  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
      [  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
      [  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
      [  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
      [  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
      [  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
      [  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
      [  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
      [  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
      [  101.387714] Call Trace:
      [  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
      [  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
      [  101.400848]  ? pick_next_task_fair+0x290/0x5f0
      [  101.405807]  ? __switch_to+0x1f5/0x430
      [  101.409988]  ? put_prev_entity+0x2f/0xd0
      [  101.414365]  process_one_work+0x141/0x340
      [  101.418836]  worker_thread+0x47/0x3e0
      [  101.422921]  kthread+0xf5/0x130
      [  101.426424]  ? rescuer_thread+0x380/0x380
      [  101.430896]  ? kthread_associate_blkcg+0x90/0x90
      [  101.436048]  ret_from_fork+0x1f/0x30
      [  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
      [  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
      [  101.468205] CR2: 0000000000000098
      [  101.471907] ---[ end trace 5fe710f98228a3ca ]---
      [  101.482489] Kernel panic - not syncing: Fatal exception
      [  101.488505] Kernel Offset: disabled
      [  101.497752] ---[ end Kernel panic - not syncing: Fatal exception
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eaa07780
  6. 09 Mar, 2018 1 commit
  7. 03 Mar, 2018 1 commit
  8. 11 Sep, 2017 1 commit
    • Jens Axboe's avatar
      block: directly insert blk-mq request from blk_insert_cloned_request() · 157f377b
      Jens Axboe authored
      A NULL pointer crash was reported for the case of having the BFQ IO
      scheduler attached to the underlying blk-mq paths of a DM multipath
      device.  The crash occured in blk_mq_sched_insert_request()'s call to
      e->type->ops.mq.insert_requests().
      
      Paolo Valente correctly summarized why the crash occured with:
      "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
      blk_rq_prep_clone) creates a cloned request without invoking
      e->type->ops.mq.prepare_request for the target elevator e.  The cloned
      request is therefore not initialized for the scheduler, but it is
      however inserted into the scheduler by blk_mq_sched_insert_request."
      
      All said, a request-based DM multipath device's IO scheduler should be
      the only one used -- when the original requests are issued to the
      underlying paths as cloned requests they are inserted directly in the
      underlying dispatch queue(s) rather than through an additional elevator.
      
      But commit bd166ef1 ("blk-mq-sched: add framework for MQ capable IO
      schedulers") switched blk_insert_cloned_request() from using
      blk_mq_insert_request() to blk_mq_sched_insert_request().  Which
      incorrectly added elevator machinery into a call chain that isn't
      supposed to have any.
      
      To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
      that blk_insert_cloned_request() calls to insert the request without
      involving any elevator that may be attached to the cloned request's
      request_queue.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarBart Van Assche <Bart.VanAssche@wdc.com>
      Tested-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      157f377b
  9. 18 Aug, 2017 1 commit
  10. 15 Aug, 2017 1 commit
  11. 10 Aug, 2017 1 commit
  12. 09 Aug, 2017 2 commits
  13. 02 Aug, 2017 1 commit
  14. 01 Aug, 2017 1 commit
    • Jens Axboe's avatar
      blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabled · b7a71e66
      Jens Axboe authored
      We recently had a bug in the IPR SCSI driver, where it would end up
      making the SCSI mid layer run the mq hardware queue with interrupts
      disabled. This isn't legal, since the software queue locking relies
      on never being grabbed from interrupt context. Additionally, drivers
      that set BLK_MQ_F_BLOCKING may schedule from this context.
      
      Add a WARN_ON_ONCE() to catch bad users up front.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b7a71e66
  15. 29 Jul, 2017 1 commit
  16. 03 Jul, 2017 1 commit
  17. 28 Jun, 2017 1 commit
  18. 27 Jun, 2017 2 commits
  19. 22 Jun, 2017 1 commit
  20. 21 Jun, 2017 7 commits
  21. 20 Jun, 2017 3 commits
    • Goldwyn Rodrigues's avatar
      block: return on congested block device · 03a07c92
      Goldwyn Rodrigues authored
      A new bio operation flag REQ_NOWAIT is introduced to identify bio's
      orignating from iocb with IOCB_NOWAIT. This flag indicates
      to return immediately if a request cannot be made instead
      of retrying.
      
      Stacked devices such as md (the ones with make_request_fn hooks)
      currently are not supported because it may block for housekeeping.
      For example, an md can have a part of the device suspended.
      For this reason, only request based devices are supported.
      In the future, this feature will be expanded to stacked devices
      by teaching them how to handle the REQ_NOWAIT flags.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      03a07c92
    • Ingo Molnar's avatar
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar authored
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2055da97
    • Ingo Molnar's avatar
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar authored
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ac6424b9
  22. 18 Jun, 2017 7 commits