1. 21 Jul, 2019 1 commit
    • block, bfq: NULL out the bic when it's no longer valid · 340a4da3
      Douglas Anderson authored
      commit dbc3117d4ca9e17819ac73501e914b8422686750 upstream.
      
      In reboot tests on several devices we were seeing a "use after free"
      when slub_debug or KASAN was enabled.  The kernel complained about:
      
        Unable to handle kernel paging request at virtual address 6b6b6c2b
      
      ...which is a classic sign of use after free under slub_debug.  The
      stack crawl in kgdb looked like:
      
       0  test_bit (addr=<optimized out>, nr=<optimized out>)
       1  bfq_bfqq_busy (bfqq=<optimized out>)
       2  bfq_select_queue (bfqd=<optimized out>)
       3  __bfq_dispatch_request (hctx=<optimized out>)
       4  bfq_dispatch_request (hctx=<optimized out>)
       5  0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
       6  0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
       7  0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
       8  0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
       9  0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
       10 0xc024cff4 in worker_thread (__worker=0xec6d4640)
      
      Digging in kgdb showed that, though the bfqq looked fine,
      bfqq->bic had been freed.
      
      Through further digging, I postulated that perhaps it is illegal to
      access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
      because the "bic" can be freed at some point in time after this call
      is made.  I confirmed that there certainly were cases where the exact
      crashing code path would access the "bic" after bfq_exit_icq() had
      been called.  Specifically, I set "bfqq->bic" to (void *)0x7 and
      saw that the bic was 0x7 at the time of the crash.
      
      To understand a bit more about why this crash was fairly uncommon (I
      saw it only once in a few hundred reboots), you can see that much of
      the time bfq_exit_icq_bfqq() fully frees the bfqq, so nothing can
      access its ->bic anymore.  The only case where it doesn't is when
      bfq_put_queue() sees a reference still held.
      
      However, even in the case when bfqq isn't freed, the crash is still
      rare.  Why?  I tracked what happened to the "bic" after the exit
      routine.  It doesn't get freed right away.  Rather,
      put_io_context_active() eventually calls put_io_context(), which
      queues up the freeing on a workqueue.  The actual freeing then
      happens even later, through call_rcu().  Despite all these delays,
      some extra debugging showed that all the hoops could be jumped
      through in time and the memory could be freed, causing the
      original crash.  Phew!
      
      To make a long story short, assuming it truly is illegal to access an
      icq after the "exit_icq" callback is finished, this patch is needed.
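
      For illustration only, here is a minimal userspace sketch of the
      pattern the fix applies: clear the cached pointer in the exit path
      so that later readers see NULL instead of a dangling pointer. The
      structure and function names below are hypothetical, not the
      actual bfq code.

        #include <assert.h>
        #include <stdlib.h>

        struct icq { int busy; };
        struct bfqq { struct icq *bic; };  /* queue caching a pointer to its icq */

        /* Hypothetical exit path: the icq is about to become invalid. */
        static void exit_icq(struct bfqq *q, struct icq *icq)
        {
            /* The fix: drop the cached pointer so nobody dereferences it later. */
            if (q->bic == icq)
                q->bic = NULL;
            free(icq);  /* the icq memory may now be reused or poisoned */
        }

        static int queue_is_busy(const struct bfqq *q)
        {
            /* Dispatch-side reader: must tolerate a NULL bic after exit. */
            return q->bic ? q->bic->busy : 0;  /* no use-after-free */
        }

        int main(void)
        {
            struct bfqq q;
            struct icq *icq = malloc(sizeof(*icq));

            icq->busy = 1;
            q.bic = icq;
            exit_icq(&q, icq);
            assert(queue_is_busy(&q) == 0);  /* reads NULL, not freed memory */
            return 0;
        }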
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      340a4da3
  2. 03 Jul, 2019 2 commits
  3. 15 Jun, 2019 2 commits
    • block, bfq: increase idling for weight-raised queues · f88e587c
      Paolo Valente authored
      [ Upstream commit 778c02a236a8728bb992de10ed1f12c0be5b7b0e ]
      
      If a sync bfq_queue has a higher weight than some other queue, and
      remains temporarily empty while in service, then, to preserve the
      bandwidth share of the queue, it is necessary to plug I/O dispatching
      until a new request arrives for the queue. In addition, a timeout
      needs to be set, to avoid waiting forever if the process associated
      with the queue has actually finished its I/O.
      
      Even with the above timeout, however, the device is not fed with
      new I/O for a while if the process has finished its I/O. If this
      happens often, then throughput drops and latencies grow. For this
      reason, the timeout is kept rather low: 8 ms is the current default.
      
      Unfortunately, such a low value may cause, on the opposite end, a
      violation of bandwidth guarantees for a process that happens to issue
      new I/O too late. The higher the system load, the higher the
      probability that this happens to some process. This is a problem in
      scenarios where service guarantees matter more than throughput. One
      important case is that of weight-raised queues, which need to be granted a
      very high fraction of the bandwidth.
      
      To address this issue, this commit lower-bounds the plugging timeout
      for weight-raised queues to 20 ms. This simple change provides
      relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
      gnome-terminal starts in 0.6 seconds if there is no other I/O in
      progress, the same application starts in:
      - 0.8 seconds, instead of 1.2 seconds, if ten files are being read
        sequentially in parallel
      - 1 second, instead of 2 seconds, if, in parallel, five files are
        being read sequentially, and five more files are being written
        sequentially
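
      As a rough sketch of the change described above (not the BFQ
      source; the helper name is invented, and only the 8 ms default and
      the 20 ms floor come from the text), the logic amounts to enforcing
      a lower bound on the plugging timeout for weight-raised queues:

        #include <assert.h>
        #include <stdbool.h>

        #define BFQ_DEFAULT_TIMEOUT_MS 8   /* default plugging timeout (see text) */
        #define BFQ_WR_MIN_TIMEOUT_MS  20  /* lower bound for weight-raised queues */

        /* Hypothetical helper: pick the idling (plugging) timeout for a queue. */
        static unsigned int plug_timeout_ms(bool weight_raised)
        {
            unsigned int timeout = BFQ_DEFAULT_TIMEOUT_MS;

            /* Weight-raised queues get more slack, so their bandwidth share
             * holds even if their next request arrives a bit late. */
            if (weight_raised && timeout < BFQ_WR_MIN_TIMEOUT_MS)
                timeout = BFQ_WR_MIN_TIMEOUT_MS;
            return timeout;
        }

        int main(void)
        {
            assert(plug_timeout_ms(false) == 8);
            assert(plug_timeout_ms(true) == 20);
            return 0;
        }
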
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f88e587c
    • blk-mq: move cancel of requeue_work into blk_mq_release · aa331e85
      Ming Lei authored
      [ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ]
      
      While the queue's kobject refcount is held, it is safe for a driver
      to schedule a requeue. However, blk_mq_kick_requeue_list() may
      be called after blk_sync_queue() has finished because of concurrent
      requeue activity, so the requeue work may not be completed when the
      queue is freed, and a kernel oops is triggered.
      
      So move the cancellation of requeue_work into blk_mq_release() to
      avoid the race between requeueing and freeing the queue.
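
      A userspace analogy of the fix, with invented names (a pthread
      standing in for the kernel work item), not the actual blk-mq code:
      deferred work that touches the queue must be cancelled or waited
      for in the final release path, right before the queue is freed,
      rather than only in the earlier sync step.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdlib.h>
        #include <unistd.h>

        struct queue {
            pthread_t requeue_worker;   /* stands in for the requeue work item */
            bool worker_started;
            int pending;                /* state the deferred work touches */
        };

        static void *requeue_fn(void *arg)
        {
            struct queue *q = arg;

            usleep(1000);               /* the deferred work runs "later" */
            q->pending = 0;             /* use-after-free if q were already gone */
            return NULL;
        }

        /* Analogue of kicking the requeue list: may race with queue teardown. */
        static void kick_requeue(struct queue *q)
        {
            pthread_create(&q->requeue_worker, NULL, requeue_fn, q);
            q->worker_started = true;
        }

        /* Analogue of blk_mq_release(): the last stop before the queue is freed. */
        static void queue_release(struct queue *q)
        {
            if (q->worker_started)
                pthread_join(q->requeue_worker, NULL);  /* cancel/flush work here */
            free(q);                                    /* only now is freeing safe */
        }

        int main(void)
        {
            struct queue *q = calloc(1, sizeof(*q));

            q->pending = 1;
            kick_requeue(q);    /* concurrent requeue activity */
            queue_release(q);   /* waits for the worker before freeing */
            return 0;
        }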
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      aa331e85
  4. 31 May, 2019 1 commit
  5. 17 Apr, 2019 1 commit
  6. 20 Feb, 2019 1 commit
    • blk-mq: fix a hung issue when fsync · 883d561c
      Jianchao Wang authored
      [ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ]
      
      Florian reported an I/O hang issue when calling fsync(). It is
      triggered by the following race condition.
      
      data + post flush         a flush
      
      blk_flush_complete_seq
        case REQ_FSEQ_DATA
          blk_flush_queue_rq
          issued to driver      blk_mq_dispatch_rq_list
                                  try to issue a flush req
                                  failed due to NON-NCQ command
                                  .queue_rq return BLK_STS_DEV_RESOURCE
      
      request completion
        req->end_io // doesn't check RESTART
        mq_flush_data_end_io
          case REQ_FSEQ_POSTFLUSH
            blk_kick_flush
              do nothing because previous flush
              has not been completed
           blk_mq_run_hw_queue
                                    insert rq to hctx->dispatch
                                    due to RESTART is still set, do nothing
      
      To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
      with blk_mq_sched_restart to check and clear the RESTART flag.
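
      Conceptually, the difference is between blindly re-running the
      hardware queue and doing an atomic test-and-clear of the RESTART
      bit first, so that a deferred restart actually happens. The sketch
      below is illustrative only; the flag and function names are made
      up and this is not the block-layer implementation.

        #include <assert.h>
        #include <stdatomic.h>
        #include <stdbool.h>

        #define HCTX_RESTART_BIT 0x1u

        struct hctx {
            atomic_uint flags;
            int dispatched;     /* counts how many times the queue was re-run */
        };

        static void run_hw_queue(struct hctx *h)
        {
            h->dispatched++;    /* pretend this re-dispatches hctx->dispatch */
        }

        /* Check and clear RESTART, and re-run the queue only if it was set.
         * Re-running unconditionally (the old behaviour) leaves RESTART set,
         * so the dispatch path keeps skipping the stalled request. */
        static bool sched_restart(struct hctx *h)
        {
            unsigned int old = atomic_fetch_and(&h->flags, ~HCTX_RESTART_BIT);

            if (old & HCTX_RESTART_BIT) {
                run_hw_queue(h);
                return true;
            }
            return false;
        }

        int main(void)
        {
            struct hctx h = { .flags = HCTX_RESTART_BIT, .dispatched = 0 };

            assert(sched_restart(&h));   /* RESTART was set: cleared and re-run */
            assert(h.dispatched == 1);
            assert(!sched_restart(&h));  /* flag already cleared: nothing to do */
            return 0;
        }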
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: Florian Stecker <m19@florianstecker.de>
      Tested-by: Florian Stecker <m19@florianstecker.de>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      883d561c
  7. 29 Dec, 2018 2 commits
  8. 21 Dec, 2018 1 commit
  9. 21 Nov, 2018 1 commit
    • SCSI: fix queue cleanup race before queue initialization is done · a0613957
      Ming Lei authored
      commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.
      
      c2856ae2 ("blk-mq: quiesce queue before freeing queue") has
      already fixed this race, however the implied synchronize_rcu()
      in blk_mq_quiesce_queue() can slow down LUN probing a lot, which
      caused a performance regression.
      
      Then 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      quiesced the queue only when queue initialization is done, to avoid
      the unnecessary synchronize_rcu(), because it is common to see lots
      of nonexistent LUNs that need to be probed.
      
      However, it turns out that it isn't safe to quiesce the queue only
      when queue initialization is done: when one SCSI command completes,
      the issuer of the command can be woken up immediately, and the SCSI
      device may then be removed while the run queue in scsi_end_request()
      is still in progress, which can cause a kernel panic.
      
      In Red Hat QE lab, there are several reports about this kind of kernel
      panic triggered during kernel booting.
      
      This patch addresses the issue by grabbing a queue usage counter
      while a request is freed and during the subsequent run of the queue.
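
      A compressed sketch of the idea, with hypothetical names rather
      than the actual SCSI/blk-mq code: take a reference on the queue
      before the request is freed and the queue is run again, and drop
      it afterwards, so that concurrent cleanup cannot complete in the
      middle of the sequence.

        #include <assert.h>
        #include <stdatomic.h>
        #include <stdbool.h>

        struct queue {
            atomic_int usage;   /* queue usage counter */
            bool dead;          /* set once cleanup has fully completed */
        };

        static void queue_usage_get(struct queue *q)
        {
            atomic_fetch_add(&q->usage, 1);
        }

        static void queue_usage_put(struct queue *q)
        {
            if (atomic_fetch_sub(&q->usage, 1) == 1)
                q->dead = true;     /* last user gone: cleanup may finish */
        }

        static void run_queue(struct queue *q)
        {
            assert(!q->dead);       /* must never run a torn-down queue */
        }

        /* end-of-request path: free the request, then run the queue again. */
        static void end_request_and_run(struct queue *q)
        {
            queue_usage_get(q);     /* the fix: pin the queue for the whole sequence */
            /* ... the request is freed here ... */
            run_queue(q);           /* safe: cleanup cannot complete while pinned */
            queue_usage_put(q);
        }

        int main(void)
        {
            struct queue q = { .usage = 1, .dead = false };  /* initial reference */

            end_request_and_run(&q);
            queue_usage_put(&q);    /* drop the initial reference: cleanup finishes */
            assert(q.dead);
            return 0;
        }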
      
      Fixes: 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Cc: stable <stable@vger.kernel.org>
      Cc: jianchao.wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0613957
  10. 13 Nov, 2018 1 commit
    • block, bfq: correctly charge and reset entity service in all cases · d4d73f1f
      Paolo Valente authored
      [ Upstream commit cbeb869a3d1110450186b738199963c5e68c2a71 ]
      
      BFQ schedules entities (which represent either per-process queues or
      groups of queues) as a function of their timestamps. In particular, as
      a function of their (virtual) finish times. The finish time of an
      entity is computed as a function of the budget assigned to the entity,
      assuming, tentatively, that the entity, once in service, will receive
      an amount of service equal to its budget. Then, when the entity is
      expired because it has finished being served, this finish time is updated
      as a function of the actual service received by the entity. This
      allows the entity to be correctly charged with only the service
      received, and then to be correctly re-scheduled.
      
      Yet an entity may also receive service while not being the entity
      in service (in the scheduling environment of its parent entity), for
      several reasons. If the entity remains with no backlog while receiving
      this 'unofficial' service, then it is expired. Also on such an
      expiration, the finish time of the entity should be updated to account
      for only the service actually received by the entity. Unfortunately,
      such an update is not performed for an entity expiring without being
      the entity in service.
      
      In a similar vein, the service counter of the entity in service is
      reset when the entity is expired, so that it is ready for the next
      service cycle. This reset should also be performed when an entity
      is expired because it remains empty after receiving service while
      not being the entity in service. But in this case the reset is not
      performed.
      
      This commit performs the above update of the finish time and reset of
      the service received, also for an entity expiring while not being the
      entity in service.
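
      The following toy example uses a simplified finish-time formula
      (finish = start + service / weight; the real BFQ arithmetic is
      fixed-point and more involved) to show why the finish time must be
      recomputed with the service actually received, and the service
      counter reset, when an entity expires early:

        #include <assert.h>

        /* Simplified fair-queueing timestamps: finish = start + service / weight.
         * This is only a model; the principle is that the finish time must
         * reflect the service actually received. */
        struct entity {
            unsigned long start;    /* virtual start time */
            unsigned long finish;   /* virtual finish time */
            unsigned long budget;   /* service tentatively assigned */
            unsigned long service;  /* service actually received */
            unsigned long weight;
        };

        /* At activation: tentatively assume the whole budget will be consumed. */
        static void charge_budget(struct entity *e)
        {
            e->finish = e->start + e->budget / e->weight;
        }

        /* At expiration: charge only the service really received, and reset the
         * service counter so it is ready for the next service cycle. */
        static void charge_received_service(struct entity *e)
        {
            e->finish = e->start + e->service / e->weight;
            e->service = 0;
        }

        int main(void)
        {
            struct entity e = {
                .start = 100, .budget = 80, .service = 20, .weight = 2,
            };

            charge_budget(&e);
            assert(e.finish == 140);                    /* 100 + 80 / 2 */
            charge_received_service(&e);
            assert(e.finish == 110 && e.service == 0);  /* 100 + 20 / 2 */
            return 0;
        }
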
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4d73f1f
  11. 13 Oct, 2018 1 commit
  12. 26 Sep, 2018 3 commits
  13. 19 Sep, 2018 4 commits
  14. 15 Sep, 2018 2 commits
    • cfq: Suppress compiler warnings about comparisons · 9dd38052
      Bart Van Assche authored
      [ Upstream commit f7ecb1b1 ]
      
      This patch does not change any functionality but avoids that gcc
      reports the following warnings when building with W=1:
      
      block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_slice_idle_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_group_idle_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_low_latency_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
       USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
       USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
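
      A standalone reproducer of this class of warning, unrelated to the
      actual cfq patch: comparing an unsigned value against a lower bound
      of 0 is tautological, and -Wtype-limits flags it. One common,
      behaviour-preserving way to avoid it, shown here only as an
      example (not necessarily what the cfq change does), is to do the
      range check in a wider signed type before converting.

        /* Compile with: gcc -Wtype-limits example.c */
        #include <stdio.h>
        #include <stdlib.h>

        #define MIN_VAL 0
        #define MAX_VAL 100

        /* Warns: "comparison of unsigned expression < 0 is always false",
         * because an unsigned value can never be below a lower bound of 0. */
        static unsigned int clamp_warns(unsigned int v)
        {
            if (v < MIN_VAL)    /* always false for MIN_VAL == 0 -> -Wtype-limits */
                v = MIN_VAL;
            if (v > MAX_VAL)
                v = MAX_VAL;
            return v;
        }

        /* One behaviour-preserving way to avoid the warning: do the range
         * check in a wider signed type, then convert. */
        static unsigned int clamp_quiet(long long v)
        {
            if (v < MIN_VAL)
                v = MIN_VAL;
            if (v > MAX_VAL)
                v = MAX_VAL;
            return (unsigned int)v;
        }

        int main(int argc, char **argv)
        {
            long long in = argc > 1 ? atoll(argv[1]) : -5;

            printf("%u %u\n", clamp_warns((unsigned int)in), clamp_quiet(in));
            return 0;
        }
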
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9dd38052
    • block: bvec_nr_vecs() returns value for wrong slab · d67c7c9d
      Greg Edwards authored
      [ Upstream commit d6c02a9b ]
      
      In commit ed996a52 ("block: simplify and cleanup bvec pool
      handling"), the value of the slab index is incremented by one in
      bvec_alloc() after the allocation is done, to indicate that an index value of
      0 does not need to be freed later.
      
      bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
      value.  Decrement idx before performing the lookup.
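
      A toy illustration of the off-by-one, with invented slab sizes and
      names rather than the real bio code: when the allocator returns
      slab index + 1 so that 0 can mean "nothing to free", every later
      table lookup has to subtract 1 again.

        #include <assert.h>

        /* Sizes of the hypothetical bvec slabs, indexed by slab number. */
        static const int slab_nr_vecs[] = { 1, 4, 16, 64, 128, 256 };

        /* Allocator convention: return slab index + 1, so that a stored value
         * of 0 can mean "no slab allocated, nothing to free later". */
        static unsigned int alloc_idx(unsigned int slab)
        {
            return slab + 1;
        }

        /* Buggy lookup: forgets the +1 bias and reports the wrong slab's size. */
        static int nr_vecs_buggy(unsigned int idx)
        {
            return slab_nr_vecs[idx];
        }

        /* Fixed lookup: decrement idx before indexing the table. */
        static int nr_vecs_fixed(unsigned int idx)
        {
            return slab_nr_vecs[idx - 1];
        }

        int main(void)
        {
            unsigned int idx = alloc_idx(2);    /* slab 2 holds 16 vecs */

            assert(nr_vecs_buggy(idx) == 64);   /* wrong slab reported */
            assert(nr_vecs_fixed(idx) == 16);   /* correct */
            return 0;
        }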
      
      Fixes: ed996a52 ("block: simplify and cleanup bvec pool handling")
      Signed-off-by: Greg Edwards <gedwards@ddn.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d67c7c9d
  15. 09 Sep, 2018 3 commits
  16. 24 Aug, 2018 1 commit
  17. 17 Aug, 2018 1 commit
    • block, bfq: fix wrong init of saved start time for weight raising · 1a2d9921
      Paolo Valente authored
      commit 4baa8bb1 upstream.
      
      This commit fixes a bug that causes bfq to fail to guarantee a high
      responsiveness on some drives, if there is heavy random read+write I/O
      in the background. More precisely, such a failure allowed this bug to
      be found [1], but the bug may well cause other yet unreported
      anomalies.
      
      BFQ raises the weight of the bfq_queues associated with soft real-time
      applications, to privilege the I/O, and thus reduce latency, for these
      applications. This mechanism is named soft-real-time weight raising in
      BFQ. A soft real-time period may happen to be nested into an
      interactive weight raising period, i.e., it may happen that, when a
      bfq_queue switches to a soft real-time weight-raised state, the
      bfq_queue is already being weight-raised because deemed interactive
      too. In this case, BFQ saves in a special variable
      wr_start_at_switch_to_srt, the time instant when the interactive
      weight-raising period started for the bfq_queue, i.e., the time
      instant when BFQ started to deem the bfq_queue interactive. This value
      is then used to check whether the interactive weight-raising period
      would still be in progress when the soft real-time weight-raising
      period ends.  If so, interactive weight raising is restored for the
      bfq_queue. This restore is useful, in particular, because it prevents
      bfq_queues from losing their interactive weight raising prematurely,
      as a consequence of spurious, short-lived soft real-time
      weight-raising periods caused by wrong detections as soft real-time.
      
      If, instead, a bfq_queue switches to soft-real-time weight raising
      while it *is not* already in an interactive weight-raising period,
      then the variable wr_start_at_switch_to_srt has no meaning during the
      following soft real-time weight-raising period. Unfortunately the
      handling of this case is wrong in BFQ: not only is the variable not
      flagged as meaningless, but it is also set to the time when the
      switch to soft real-time weight-raising occurs. This may cause an
      interactive weight-raising period to be considered mistakenly as still
      in progress, and thus a spurious interactive weight-raising period to
      start for the bfq_queue, at the end of the soft-real-time
      weight-raising period. In particular the spurious interactive
      weight-raising period will be considered as still in progress, if the
      soft-real-time weight-raising period does not last very long. The
      bfq_queue will then be wrongly privileged and, if I/O bound, will
      unjustly steal bandwidth from truly interactive or soft real-time
      bfq_queues, harming responsiveness and low latency.
      
      This commit fixes this issue by just setting wr_start_at_switch_to_srt
      to minus infinity (farthest past time instant according to jiffies
      macros): when the soft-real-time weight-raising period ends, certainly
      no interactive weight-raising period will be considered as still in
      progress.
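
      To see why a "minus infinity" start time works, here is a small
      sketch of jiffies-style wrap-safe time comparison; the macro
      mirrors the kernel's time_after(), while the sentinel value and the
      helper names are only illustrative, not the BFQ code. A
      weight-raising period whose start is pushed as far into the past as
      the comparison allows can never be seen as still in progress.

        #include <assert.h>
        #include <limits.h>

        /* Wrap-safe "a is after b", in the style of the kernel's time_after(). */
        #define time_after(a, b)  ((long)((b) - (a)) < 0)

        /* Farthest representable past instant relative to "now": any period
         * that started there has certainly ended by now. */
        static unsigned long smallest_from(unsigned long now)
        {
            return now - LONG_MAX;
        }

        /* Is a weight-raising period that started at wr_start and lasts
         * wr_duration still in progress at time now? */
        static int wr_in_progress(unsigned long now, unsigned long wr_start,
                                  unsigned long wr_duration)
        {
            return time_after(wr_start + wr_duration, now);
        }

        int main(void)
        {
            unsigned long now = 1000000;

            /* Bug pattern: a start time wrongly set to "now" makes the period
             * look alive for its whole duration. */
            assert(wr_in_progress(now, now, 5000));

            /* Fix pattern: a minus-infinity start time never looks in progress. */
            assert(!wr_in_progress(now, smallest_from(now), 5000));
            return 0;
        }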
      
      [1] Background I/O Type: Random - Background I/O mix: Reads and writes
      - Application to start: LibreOffice Writer in
      http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop
      
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a2d9921
  18. 03 Aug, 2018 3 commits
  19. 22 Jul, 2018 1 commit
  20. 11 Jul, 2018 2 commits
  21. 03 Jul, 2018 1 commit
  22. 26 Jun, 2018 1 commit
    • blk-mq: reinit q->tag_set_list entry only after grace period · ba502bf2
      Roman Pen authored
      commit a347c7ad upstream.
      
      It is not allowed to reinitialize the q->tag_set_list list entry
      before an RCU grace period has completed; otherwise the following
      soft lockup in blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile, another context frees another queue, one that uses the
      same set of shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed, and a task is now in
         blk_mq_del_queue_tag_set().
      3. Another CPU is in blk_mq_sched_restart() and loops over all queues
         in the tag list in order to find an hctx to restart.
      
      Because the linked list entry was modified in blk_mq_del_queue_tag_set()
      without properly waiting for a grace period, blk_mq_sched_restart()
      never ends, spinning in list_for_each_entry_rcu_rr(); hence the soft lockup.
      
      The fix is simple: reinit the list entry only after an RCU grace period has elapsed.
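
      A single-threaded sketch of the failure mode, using a toy intrusive
      list rather than the blk-mq code: once the removed entry is
      re-initialized to point to itself, a reader that still holds a
      pointer to it can never walk back to the list head, which is the
      endless loop seen in list_for_each_entry_rcu_rr(). Delaying the
      re-init until after the grace period guarantees no reader still
      holds such a pointer.

        #include <stdio.h>

        struct list_head { struct list_head *next, *prev; };

        static void list_init(struct list_head *h) { h->next = h->prev = h; }

        static void list_add_tail(struct list_head *n, struct list_head *h)
        {
            n->prev = h->prev;
            n->next = h;
            h->prev->next = n;
            h->prev = n;
        }

        static void list_del(struct list_head *n)
        {
            n->prev->next = n->next;
            n->next->prev = n->prev;
        }

        int main(void)
        {
            struct list_head head, a, b;
            struct list_head *cursor;
            int steps = 0;

            list_init(&head);
            list_add_tail(&a, &head);
            list_add_tail(&b, &head);

            /* A reader saved a cursor pointing at b (as an RCU reader might). */
            cursor = &b;

            /* Writer removes b and, prematurely, re-initializes it to self-loop. */
            list_del(&b);
            list_init(&b);      /* the bug: done before readers have finished */

            /* The stale reader keeps walking until it reaches the head.  Because
             * b now points to itself, it would spin forever; the loop is bounded
             * here only to demonstrate that it never terminates on its own. */
            while (cursor != &head && steps < 10) {
                cursor = cursor->next;
                steps++;
            }
            printf("reader stuck: %s after %d steps\n",
                   cursor == &head ? "no" : "yes", steps);
            return 0;
        }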
      
      Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba502bf2
  23. 20 Jun, 2018 3 commits
  24. 16 Jun, 2018 1 commit
    • blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers · 1cbd5ece
      Bart Van Assche authored
      commit 327ea4ad upstream.
      
      Avoid complaints similar to the following appearing in the kernel
      log if the number of zones is sufficiently large:
      
        fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
        Call Trace:
        dump_stack+0x63/0x88
        warn_alloc+0xf5/0x190
        __alloc_pages_slowpath+0x8f0/0xb0d
        __alloc_pages_nodemask+0x242/0x260
        alloc_pages_current+0x6a/0xb0
        kmalloc_order+0x18/0x50
        kmalloc_order_trace+0x26/0xb0
        __kmalloc+0x20e/0x220
        blkdev_report_zones_ioctl+0xa5/0x1a0
        blkdev_ioctl+0x1ba/0x930
        block_ioctl+0x41/0x50
        do_vfs_ioctl+0xaa/0x610
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x79/0x1b0
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
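
      For context on the order:9 failure, a small back-of-the-envelope
      calculation (the 64-byte per-zone descriptor size and 4 KiB page
      size are assumptions, not taken from the report): the report buffer
      is one physically contiguous allocation, so its order grows with
      the zone count and soon exceeds what a fragmented system can
      satisfy, which is the need that vmalloc() removes.

        #include <stdio.h>

        #define PAGE_SIZE      4096UL   /* assumed page size */
        #define ZONE_DESC_SIZE 64UL     /* assumed bytes per zone descriptor */

        /* Smallest n such that (2^n) * PAGE_SIZE >= bytes: the allocation order. */
        static unsigned int alloc_order(unsigned long bytes)
        {
            unsigned long pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
            unsigned int order = 0;

            while ((1UL << order) < pages)
                order++;
            return order;
        }

        int main(void)
        {
            unsigned long nr_zones[] = { 256, 8192, 32768 };

            for (unsigned int i = 0; i < 3; i++) {
                unsigned long bytes = nr_zones[i] * ZONE_DESC_SIZE;

                printf("%6lu zones -> %8lu bytes -> order %u\n",
                       nr_zones[i], bytes, alloc_order(bytes));
            }
            /* e.g. 32768 zones * 64 B = 2 MiB = 512 pages = order 9, matching
             * the order:9 allocation failure reported above. */
            return 0;
        }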
      
      Fixes: 3ed05a98 ("blk-zoned: implement ioctls")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
      Cc: Damien Le Moal <damien.lemoal@hgst.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cbd5ece