1. 13 Nov, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: correctly charge and reset entity service in all cases · d4d73f1f
      Paolo Valente authored
      [ Upstream commit cbeb869a ]
      
      BFQ schedules entities (which represent either per-process queues or
      groups of queues) as a function of their timestamps. In particular, as
      a function of their (virtual) finish times. The finish time of an
      entity is computed as a function of the budget assigned to the entity,
      assuming, tentatively, that the entity, once in service, will receive
      an amount of service equal to its budget. Then, when the entity is
      expired because it finishes to be served, this finish time is updated
      as a function of the actual service received by the entity. This
      allows the entity to be correctly charged with only the service
      received, and then to be correctly re-scheduled.
      
      Yet an entity may receive service also while not being the entity in
      service (in the scheduling environment of its parent entity), for
      several reasons. If the entity remains with no backlog while receiving
      this 'unofficial' service, then it is expired. Also on such an
      expiration, the finish time of the entity should be updated to account
      for only the service actually received by the entity. Unfortunately,
      such an update is not performed for an entity expiring without being
      the entity in service.
      
      In a similar vein, the service counter of the entity in service is
      reset when the entity is expired, to be ready to be used for next
      service cycle. This reset too should be performed also in case an
      entity is expired because it remains empty after receiving service
      while not being the entity in service. But in this case the reset is
      not performed.
      
      This commit performs the above update of the finish time and reset of
      the service received, also for an entity expiring while not being the
      entity in service.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4d73f1f
  2. 13 Oct, 2018 1 commit
  3. 26 Sep, 2018 3 commits
  4. 19 Sep, 2018 4 commits
  5. 15 Sep, 2018 2 commits
    • Bart Van Assche's avatar
      cfq: Suppress compiler warnings about comparisons · 9dd38052
      Bart Van Assche authored
      [ Upstream commit f7ecb1b1 ]
      
      This patch does not change any functionality but avoids that gcc
      reports the following warnings when building with W=1:
      
      block/cfq-iosched.c: In function ?cfq_back_seek_max_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_slice_idle_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_group_idle_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_low_latency_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?:
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
       USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_group_idle_us_store?:
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
       USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9dd38052
    • Greg Edwards's avatar
      block: bvec_nr_vecs() returns value for wrong slab · d67c7c9d
      Greg Edwards authored
      [ Upstream commit d6c02a9b ]
      
      In commit ed996a52 ("block: simplify and cleanup bvec pool
      handling"), the value of the slab index is incremented by one in
      bvec_alloc() after the allocation is done to indicate an index value of
      0 does not need to be later freed.
      
      bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
      value.  Decrement idx before performing the lookup.
      
      Fixes: ed996a52 ("block: simplify and cleanup bvec pool handling")
      Signed-off-by: default avatarGreg Edwards <gedwards@ddn.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d67c7c9d
  6. 09 Sep, 2018 3 commits
  7. 24 Aug, 2018 1 commit
  8. 17 Aug, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: fix wrong init of saved start time for weight raising · 1a2d9921
      Paolo Valente authored
      commit 4baa8bb1 upstream.
      
      This commit fixes a bug that causes bfq to fail to guarantee a high
      responsiveness on some drives, if there is heavy random read+write I/O
      in the background. More precisely, such a failure allowed this bug to
      be found [1], but the bug may well cause other yet unreported
      anomalies.
      
      BFQ raises the weight of the bfq_queues associated with soft real-time
      applications, to privilege the I/O, and thus reduce latency, for these
      applications. This mechanism is named soft-real-time weight raising in
      BFQ. A soft real-time period may happen to be nested into an
      interactive weight raising period, i.e., it may happen that, when a
      bfq_queue switches to a soft real-time weight-raised state, the
      bfq_queue is already being weight-raised because deemed interactive
      too. In this case, BFQ saves in a special variable
      wr_start_at_switch_to_srt, the time instant when the interactive
      weight-raising period started for the bfq_queue, i.e., the time
      instant when BFQ started to deem the bfq_queue interactive. This value
      is then used to check whether the interactive weight-raising period
      would still be in progress when the soft real-time weight-raising
      period ends.  If so, interactive weight raising is restored for the
      bfq_queue. This restore is useful, in particular, because it prevents
      bfq_queues from losing their interactive weight raising prematurely,
      as a consequence of spurious, short-lived soft real-time
      weight-raising periods caused by wrong detections as soft real-time.
      
      If, instead, a bfq_queue switches to soft-real-time weight raising
      while it *is not* already in an interactive weight-raising period,
      then the variable wr_start_at_switch_to_srt has no meaning during the
      following soft real-time weight-raising period. Unfortunately the
      handling of this case is wrong in BFQ: not only the variable is not
      flagged somehow as meaningless, but it is also set to the time when
      the switch to soft real-time weight-raising occurs. This may cause an
      interactive weight-raising period to be considered mistakenly as still
      in progress, and thus a spurious interactive weight-raising period to
      start for the bfq_queue, at the end of the soft-real-time
      weight-raising period. In particular the spurious interactive
      weight-raising period will be considered as still in progress, if the
      soft-real-time weight-raising period does not last very long. The
      bfq_queue will then be wrongly privileged and, if I/O bound, will
      unjustly steal bandwidth to truly interactive or soft real-time
      bfq_queues, harming responsiveness and low latency.
      
      This commit fixes this issue by just setting wr_start_at_switch_to_srt
      to minus infinity (farthest past time instant according to jiffies
      macros): when the soft-real-time weight-raising period ends, certainly
      no interactive weight-raising period will be considered as still in
      progress.
      
      [1] Background I/O Type: Random - Background I/O mix: Reads and writes
      - Application to start: LibreOffice Writer in
      http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-LaptopSigned-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: default avatarMirko Montanari <mirkomontanari91@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a2d9921
  9. 03 Aug, 2018 3 commits
  10. 22 Jul, 2018 1 commit
  11. 11 Jul, 2018 2 commits
  12. 03 Jul, 2018 1 commit
  13. 26 Jun, 2018 1 commit
    • Roman Pen's avatar
      blk-mq: reinit q->tag_set_list entry only after grace period · ba502bf2
      Roman Pen authored
      commit a347c7ad upstream.
      
      It is not allowed to reinit q->tag_set_list list entry while RCU grace
      period has not completed yet, otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile another context frees other queue but with the same set of
      shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed and now task is in
         blk_mq_del_queue_tag_set().
      3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
         tag list in order to find hctx to restart.
      
      Because linked list entry was modified in blk_mq_del_queue_tag_set()
      without proper waiting for a grace period, blk_mq_sched_restart()
      never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.
      
      Fix is simple: reinit list entry after an RCU grace period elapsed.
      
      Fixes: Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba502bf2
  14. 20 Jun, 2018 3 commits
  15. 16 Jun, 2018 1 commit
    • Bart Van Assche's avatar
      blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers · 1cbd5ece
      Bart Van Assche authored
      commit 327ea4ad upstream.
      
      Avoid that complaints similar to the following appear in the kernel log
      if the number of zones is sufficiently large:
      
        fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
        Call Trace:
        dump_stack+0x63/0x88
        warn_alloc+0xf5/0x190
        __alloc_pages_slowpath+0x8f0/0xb0d
        __alloc_pages_nodemask+0x242/0x260
        alloc_pages_current+0x6a/0xb0
        kmalloc_order+0x18/0x50
        kmalloc_order_trace+0x26/0xb0
        __kmalloc+0x20e/0x220
        blkdev_report_zones_ioctl+0xa5/0x1a0
        blkdev_ioctl+0x1ba/0x930
        block_ioctl+0x41/0x50
        do_vfs_ioctl+0xaa/0x610
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x79/0x1b0
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 3ed05a98 ("blk-zoned: implement ioctls")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
      Cc: Damien Le Moal <damien.lemoal@hgst.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cbd5ece
  16. 30 May, 2018 1 commit
  17. 01 May, 2018 1 commit
  18. 26 Apr, 2018 4 commits
    • Jens Axboe's avatar
      blk-mq: fix discard merge with scheduler attached · 73027d80
      Jens Axboe authored
      
      [ Upstream commit 445251d0 ]
      
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        nvme_queue_rq+0x40/0xa00
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        blk_mq_dispatch_rq_list+0x7b/0x4a0
        ? deadline_remove_request+0x49/0xb0
        blk_mq_do_dispatch_sched+0x4f/0xc0
        blk_mq_sched_dispatch_requests+0x106/0x170
        __blk_mq_run_hw_queue+0x53/0xa0
        __blk_mq_delay_run_hw_queue+0x83/0xa0
        blk_mq_run_hw_queue+0x6c/0xd0
        blk_mq_sched_insert_request+0x96/0x140
        __blk_mq_try_issue_directly+0x3d/0x190
        blk_mq_try_issue_directly+0x30/0x70
        blk_mq_make_request+0x1a4/0x6a0
        generic_make_request+0xfd/0x2f0
        ? submit_bio+0x5c/0x110
        submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        submit_bio_wait+0x43/0x60
        ext4_process_freed_data+0x1cd/0x440
        ? account_page_dirtied+0xe2/0x1a0
        ext4_journal_commit_callback+0x4a/0xc0
        jbd2_journal_commit_transaction+0x17e2/0x19e0
        ? kjournald2+0xb0/0x250
        kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
        ret_from_fork+0x1f/0x30
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      
      There are a few issues here:
      
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      73027d80
    • Eryu Guan's avatar
      blk-mq-debugfs: don't allow write on attributes with seq_operations set · f25ba4f6
      Eryu Guan authored
      
      [ Upstream commit 6b136a24 ]
      
      Attributes that only implement .seq_ops are read-only, any write to
      them should be rejected. But currently kernel would crash when
      writing to such debugfs entries, e.g.
      
      chmod +w /sys/kernel/debug/block/<dev>/requeue_list
      echo 0 > /sys/kernel/debug/block/<dev>/requeue_list
      chmod -w /sys/kernel/debug/block/<dev>/requeue_list
      
      Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
      such attributes.
      
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarEryu Guan <eguan@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f25ba4f6
    • Goldwyn Rodrigues's avatar
      block: Set BIO_TRACE_COMPLETION on new bio during split · c6c6e38a
      Goldwyn Rodrigues authored
      
      [ Upstream commit 20d59023 ]
      
      We inadvertently set it again on the source bio, but we need
      to set it on the new split bio instead.
      
      Fixes: fbbaf700 ("block: trace completion of all bios.")
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6c6e38a
    • Ming Lei's avatar
      blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · a1dfcb01
      Ming Lei authored
      
      [ Upstream commit 7df938fb ]
      
      We know this WARN_ON is harmless and in reality it may be trigged,
      so convert it to printk() and dump_stack() to avoid to confusing
      people.
      
      Also add comment about two releated races here.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1dfcb01
  19. 19 Apr, 2018 1 commit
  20. 12 Apr, 2018 4 commits
    • Paolo Valente's avatar
      block, bfq: put async queues for root bfq groups too · effbffc9
      Paolo Valente authored
      
      [ Upstream commit 52257ffb ]
      
      For each pair [device for which bfq is selected as I/O scheduler,
      group in blkio/io], bfq maintains a corresponding bfq group. Each such
      bfq group contains a set of async queues, with each async queue
      created on demand, i.e., when some I/O request arrives for it.  On
      creation, an async queue gets an extra reference, to make sure that
      the queue is not freed as long as its bfq group exists.  Accordingly,
      to allow the queue to be freed after the group exited, this extra
      reference must released on group exit.
      
      The above holds also for a bfq root group, i.e., for the bfq group
      corresponding to the root blkio/io root for a given device. Yet, by
      mistake, the references to the existing async queues of a root group
      are not released when the latter exits. This causes a memory leak when
      the instance of bfq for a given device exits. In a similar vein,
      bfqg_stats_xfer_dead is not executed for a root group.
      
      This commit fixes bfq_pd_offline so that the latter executes the above
      missing operations for a root group too.
      Reported-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reported-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarDavide Ferrari <davideferrari8@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      effbffc9
    • Ming Lei's avatar
      blk-mq: fix kernel oops in blk_mq_tag_idle() · 8976d64b
      Ming Lei authored
      
      [ Upstream commit 8ab0b7dc ]
      
      HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
      then we need to check it before calling blk_mq_tag_idle(), otherwise
      the following kernel oops can be triggered, so fix it by checking if
      the hw queue is unmapped since it doesn't make sense to idle the tags
      any more after hw queues are unmapped.
      
      [  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
      [  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
      [  440.788359] RIP: 0010:[<ffffffffb730e2b4>]  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
      [  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
      [  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
      [  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
      [  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
      [  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
      [  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
      [  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
      [  440.876169] Call Trace:
      [  440.879818]  [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
      [  440.887051]  [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
      [  440.894465]  [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
      [  440.901881]  [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
      [  440.910068]  [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
      [  440.919026]  [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
      [  440.928079]  [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
      [  440.937126]  [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
      [  440.944517]  [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
      [  440.951607]  [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
      [  440.959760]  [<ffffffffb70b352f>] kthread+0xcf/0xe0
      [  440.966055]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.973715]  [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
      [  440.980586]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
      [  441.011620] RIP  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  441.019301]  RSP <ffff893bf9bcbd10>
      [  441.024052] CR2: 0000000000000008
      Reported-by: default avatarZhang Yi <yizhan@redhat.com>
      Tested-by: default avatarZhang Yi <yizhan@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8976d64b
    • Ming Lei's avatar
      blk-mq: fix race between updating nr_hw_queues and switching io sched · fb1ef85d
      Ming Lei authored
      
      [ Upstream commit fb350e0a ]
      
      In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
      can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
      example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
      freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.
      
      This patch fixes the race be holding q->sysfs_lock.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb1ef85d
    • Ming Lei's avatar
      blk-mq: avoid to map CPU into stale hw queue · eaa07780
      Ming Lei authored
      
      [ Upstream commit 7d4901a9 ]
      
      blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
      previous map isn't cleared yet, and may point to one stale hw queue
      index.
      
      This patch fixes the following issue by clearing the mapping table before
      setting it up in blk_mq_pci_map_queues().
      
      This patches fixes this following issue reported by Zhang Yi:
      
      [  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
      [  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
      [  101.216346] PGD 0 P4D 0
      [  101.219171] Oops: 0000 [#1] SMP
      [  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
      [  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
      [  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
      [  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
      [  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
      [  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
      [  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
      [  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
      [  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
      [  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
      [  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
      [  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
      [  101.387714] Call Trace:
      [  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
      [  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
      [  101.400848]  ? pick_next_task_fair+0x290/0x5f0
      [  101.405807]  ? __switch_to+0x1f5/0x430
      [  101.409988]  ? put_prev_entity+0x2f/0xd0
      [  101.414365]  process_one_work+0x141/0x340
      [  101.418836]  worker_thread+0x47/0x3e0
      [  101.422921]  kthread+0xf5/0x130
      [  101.426424]  ? rescuer_thread+0x380/0x380
      [  101.430896]  ? kthread_associate_blkcg+0x90/0x90
      [  101.436048]  ret_from_fork+0x1f/0x30
      [  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
      [  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
      [  101.468205] CR2: 0000000000000098
      [  101.471907] ---[ end trace 5fe710f98228a3ca ]---
      [  101.482489] Kernel panic - not syncing: Fatal exception
      [  101.488505] Kernel Offset: disabled
      [  101.497752] ---[ end Kernel panic - not syncing: Fatal exception
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eaa07780
  21. 08 Apr, 2018 1 commit