1. 13 Oct, 2018 1 commit
  2. 26 Sep, 2018 3 commits
  3. 19 Sep, 2018 4 commits
  4. 15 Sep, 2018 2 commits
    • Bart Van Assche's avatar
      cfq: Suppress compiler warnings about comparisons · 9dd38052
      Bart Van Assche authored
      [ Upstream commit f7ecb1b109da1006a08d5675debe60990e824432 ]
      
      This patch does not change any functionality but avoids that gcc
      reports the following warnings when building with W=1:
      
      block/cfq-iosched.c: In function ?cfq_back_seek_max_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_slice_idle_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_group_idle_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_low_latency_store?:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION?
       STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?:
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
       USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ?cfq_group_idle_us_store?:
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION?
       USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      Signed-off-by: 's avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9dd38052
    • Greg Edwards's avatar
      block: bvec_nr_vecs() returns value for wrong slab · d67c7c9d
      Greg Edwards authored
      [ Upstream commit d6c02a9beb67f13d5f14f23e72fa9981e8b84477 ]
      
      In commit ed996a52 ("block: simplify and cleanup bvec pool
      handling"), the value of the slab index is incremented by one in
      bvec_alloc() after the allocation is done to indicate an index value of
      0 does not need to be later freed.
      
      bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
      value.  Decrement idx before performing the lookup.
      
      Fixes: ed996a52 ("block: simplify and cleanup bvec pool handling")
      Signed-off-by: 's avatarGreg Edwards <gedwards@ddn.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d67c7c9d
  5. 09 Sep, 2018 3 commits
  6. 24 Aug, 2018 1 commit
  7. 17 Aug, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: fix wrong init of saved start time for weight raising · 1a2d9921
      Paolo Valente authored
      commit 4baa8bb13f41307f3eb62fe91f93a1a798ebef53 upstream.
      
      This commit fixes a bug that causes bfq to fail to guarantee a high
      responsiveness on some drives, if there is heavy random read+write I/O
      in the background. More precisely, such a failure allowed this bug to
      be found [1], but the bug may well cause other yet unreported
      anomalies.
      
      BFQ raises the weight of the bfq_queues associated with soft real-time
      applications, to privilege the I/O, and thus reduce latency, for these
      applications. This mechanism is named soft-real-time weight raising in
      BFQ. A soft real-time period may happen to be nested into an
      interactive weight raising period, i.e., it may happen that, when a
      bfq_queue switches to a soft real-time weight-raised state, the
      bfq_queue is already being weight-raised because deemed interactive
      too. In this case, BFQ saves in a special variable
      wr_start_at_switch_to_srt, the time instant when the interactive
      weight-raising period started for the bfq_queue, i.e., the time
      instant when BFQ started to deem the bfq_queue interactive. This value
      is then used to check whether the interactive weight-raising period
      would still be in progress when the soft real-time weight-raising
      period ends.  If so, interactive weight raising is restored for the
      bfq_queue. This restore is useful, in particular, because it prevents
      bfq_queues from losing their interactive weight raising prematurely,
      as a consequence of spurious, short-lived soft real-time
      weight-raising periods caused by wrong detections as soft real-time.
      
      If, instead, a bfq_queue switches to soft-real-time weight raising
      while it *is not* already in an interactive weight-raising period,
      then the variable wr_start_at_switch_to_srt has no meaning during the
      following soft real-time weight-raising period. Unfortunately the
      handling of this case is wrong in BFQ: not only the variable is not
      flagged somehow as meaningless, but it is also set to the time when
      the switch to soft real-time weight-raising occurs. This may cause an
      interactive weight-raising period to be considered mistakenly as still
      in progress, and thus a spurious interactive weight-raising period to
      start for the bfq_queue, at the end of the soft-real-time
      weight-raising period. In particular the spurious interactive
      weight-raising period will be considered as still in progress, if the
      soft-real-time weight-raising period does not last very long. The
      bfq_queue will then be wrongly privileged and, if I/O bound, will
      unjustly steal bandwidth to truly interactive or soft real-time
      bfq_queues, harming responsiveness and low latency.
      
      This commit fixes this issue by just setting wr_start_at_switch_to_srt
      to minus infinity (farthest past time instant according to jiffies
      macros): when the soft-real-time weight-raising period ends, certainly
      no interactive weight-raising period will be considered as still in
      progress.
      
      [1] Background I/O Type: Random - Background I/O mix: Reads and writes
      - Application to start: LibreOffice Writer in
      http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-LaptopSigned-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: 's avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: 's avatarMirko Montanari <mirkomontanari91@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a2d9921
  8. 03 Aug, 2018 3 commits
  9. 22 Jul, 2018 1 commit
    • Alan Jenkins's avatar
      block: do not use interruptible wait anywhere · aa6be396
      Alan Jenkins authored
      commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.
      
      When blk_queue_enter() waits for a queue to unfreeze, or unset the
      PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
      
      The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
      ("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
      device is resumed asynchronously, i.e. after un-freezing userspace tasks.
      
      So that commit exposed the bug as a regression in v4.15.  A mysterious
      SIGBUS (or -EIO) sometimes happened during the time the device was being
      resumed.  Most frequently, there was no kernel log message, and we saw Xorg
      or Xwayland killed by SIGBUS.[1]
      
      [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979
      
      Without this fix, I get an IO error in this test:
      
      # dd if=/dev/sda of=/dev/null iflag=direct & \
        while killall -SIGUSR1 dd; do sleep 0.1; done & \
        echo mem > /sys/power/state ; \
        sleep 5; killall dd  # stop after 5 seconds
      
      The interruptible wait was added to blk_queue_enter in
      commit 3ef28e83 ("block: generic request_queue reference counting").
      Before then, the interruptible wait was only in blk-mq, but I don't think
      it could ever have been correct.
      Reviewed-by: 's avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarAlan Jenkins <alan.christopher.jenkins@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aa6be396
  10. 11 Jul, 2018 2 commits
  11. 03 Jul, 2018 1 commit
  12. 26 Jun, 2018 1 commit
    • Roman Pen's avatar
      blk-mq: reinit q->tag_set_list entry only after grace period · ba502bf2
      Roman Pen authored
      commit a347c7ad8edf4c5685154f3fdc3c12fc1db800ba upstream.
      
      It is not allowed to reinit q->tag_set_list list entry while RCU grace
      period has not completed yet, otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile another context frees other queue but with the same set of
      shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed and now task is in
         blk_mq_del_queue_tag_set().
      3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
         tag list in order to find hctx to restart.
      
      Because linked list entry was modified in blk_mq_del_queue_tag_set()
      without proper waiting for a grace period, blk_mq_sched_restart()
      never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.
      
      Fix is simple: reinit list entry after an RCU grace period elapsed.
      
      Fixes: Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: 's avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: 's avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: 's avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: 's avatarRoman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba502bf2
  13. 20 Jun, 2018 3 commits
  14. 16 Jun, 2018 1 commit
    • Bart Van Assche's avatar
      blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers · 1cbd5ece
      Bart Van Assche authored
      commit 327ea4adcfa37194739f1ec7c70568944d292281 upstream.
      
      Avoid that complaints similar to the following appear in the kernel log
      if the number of zones is sufficiently large:
      
        fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
        Call Trace:
        dump_stack+0x63/0x88
        warn_alloc+0xf5/0x190
        __alloc_pages_slowpath+0x8f0/0xb0d
        __alloc_pages_nodemask+0x242/0x260
        alloc_pages_current+0x6a/0xb0
        kmalloc_order+0x18/0x50
        kmalloc_order_trace+0x26/0xb0
        __kmalloc+0x20e/0x220
        blkdev_report_zones_ioctl+0xa5/0x1a0
        blkdev_ioctl+0x1ba/0x930
        block_ioctl+0x41/0x50
        do_vfs_ioctl+0xaa/0x610
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x79/0x1b0
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 3ed05a98 ("blk-zoned: implement ioctls")
      Signed-off-by: 's avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
      Cc: Damien Le Moal <damien.lemoal@hgst.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cbd5ece
  15. 30 May, 2018 1 commit
  16. 01 May, 2018 1 commit
  17. 26 Apr, 2018 4 commits
    • Jens Axboe's avatar
      blk-mq: fix discard merge with scheduler attached · 73027d80
      Jens Axboe authored
      
      [ Upstream commit 445251d0f4d329aa061f323546cd6388a3bb7ab5 ]
      
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        nvme_queue_rq+0x40/0xa00
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        blk_mq_dispatch_rq_list+0x7b/0x4a0
        ? deadline_remove_request+0x49/0xb0
        blk_mq_do_dispatch_sched+0x4f/0xc0
        blk_mq_sched_dispatch_requests+0x106/0x170
        __blk_mq_run_hw_queue+0x53/0xa0
        __blk_mq_delay_run_hw_queue+0x83/0xa0
        blk_mq_run_hw_queue+0x6c/0xd0
        blk_mq_sched_insert_request+0x96/0x140
        __blk_mq_try_issue_directly+0x3d/0x190
        blk_mq_try_issue_directly+0x30/0x70
        blk_mq_make_request+0x1a4/0x6a0
        generic_make_request+0xfd/0x2f0
        ? submit_bio+0x5c/0x110
        submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        submit_bio_wait+0x43/0x60
        ext4_process_freed_data+0x1cd/0x440
        ? account_page_dirtied+0xe2/0x1a0
        ext4_journal_commit_callback+0x4a/0xc0
        jbd2_journal_commit_transaction+0x17e2/0x19e0
        ? kjournald2+0xb0/0x250
        kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
        ret_from_fork+0x1f/0x30
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      
      There are a few issues here:
      
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      73027d80
    • Eryu Guan's avatar
      blk-mq-debugfs: don't allow write on attributes with seq_operations set · f25ba4f6
      Eryu Guan authored
      
      [ Upstream commit 6b136a24b05c81a24e0b648a4bd938bcd0c4f69e ]
      
      Attributes that only implement .seq_ops are read-only, any write to
      them should be rejected. But currently kernel would crash when
      writing to such debugfs entries, e.g.
      
      chmod +w /sys/kernel/debug/block/<dev>/requeue_list
      echo 0 > /sys/kernel/debug/block/<dev>/requeue_list
      chmod -w /sys/kernel/debug/block/<dev>/requeue_list
      
      Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
      such attributes.
      
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: 's avatarEryu Guan <eguan@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f25ba4f6
    • Goldwyn Rodrigues's avatar
      block: Set BIO_TRACE_COMPLETION on new bio during split · c6c6e38a
      Goldwyn Rodrigues authored
      
      [ Upstream commit 20d59023c5ec4426284af492808bcea1f39787ef ]
      
      We inadvertently set it again on the source bio, but we need
      to set it on the new split bio instead.
      
      Fixes: fbbaf700 ("block: trace completion of all bios.")
      Signed-off-by: 's avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6c6e38a
    • Ming Lei's avatar
      blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · a1dfcb01
      Ming Lei authored
      
      [ Upstream commit 7df938fbc4ee641e70e05002ac67c24b19e86e74 ]
      
      We know this WARN_ON is harmless and in reality it may be trigged,
      so convert it to printk() and dump_stack() to avoid to confusing
      people.
      
      Also add comment about two releated races here.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: 's avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1dfcb01
  18. 19 Apr, 2018 1 commit
  19. 12 Apr, 2018 4 commits
    • Paolo Valente's avatar
      block, bfq: put async queues for root bfq groups too · effbffc9
      Paolo Valente authored
      
      [ Upstream commit 52257ffbfcaf58d247b13fb148e27ed17c33e526 ]
      
      For each pair [device for which bfq is selected as I/O scheduler,
      group in blkio/io], bfq maintains a corresponding bfq group. Each such
      bfq group contains a set of async queues, with each async queue
      created on demand, i.e., when some I/O request arrives for it.  On
      creation, an async queue gets an extra reference, to make sure that
      the queue is not freed as long as its bfq group exists.  Accordingly,
      to allow the queue to be freed after the group exited, this extra
      reference must released on group exit.
      
      The above holds also for a bfq root group, i.e., for the bfq group
      corresponding to the root blkio/io root for a given device. Yet, by
      mistake, the references to the existing async queues of a root group
      are not released when the latter exits. This causes a memory leak when
      the instance of bfq for a given device exits. In a similar vein,
      bfqg_stats_xfer_dead is not executed for a root group.
      
      This commit fixes bfq_pd_offline so that the latter executes the above
      missing operations for a root group too.
      Reported-by: 's avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Reported-by: 's avatarGuoqing Jiang <gqjiang@suse.com>
      Tested-by: 's avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: 's avatarDavide Ferrari <davideferrari8@gmail.com>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      effbffc9
    • Ming Lei's avatar
      blk-mq: fix kernel oops in blk_mq_tag_idle() · 8976d64b
      Ming Lei authored
      
      [ Upstream commit 8ab0b7dc73e1b3e2987d42554b2bff503f692772 ]
      
      HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
      then we need to check it before calling blk_mq_tag_idle(), otherwise
      the following kernel oops can be triggered, so fix it by checking if
      the hw queue is unmapped since it doesn't make sense to idle the tags
      any more after hw queues are unmapped.
      
      [  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
      [  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
      [  440.788359] RIP: 0010:[<ffffffffb730e2b4>]  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
      [  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
      [  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
      [  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
      [  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
      [  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
      [  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
      [  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
      [  440.876169] Call Trace:
      [  440.879818]  [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
      [  440.887051]  [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
      [  440.894465]  [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
      [  440.901881]  [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
      [  440.910068]  [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
      [  440.919026]  [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
      [  440.928079]  [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
      [  440.937126]  [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
      [  440.944517]  [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
      [  440.951607]  [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
      [  440.959760]  [<ffffffffb70b352f>] kthread+0xcf/0xe0
      [  440.966055]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.973715]  [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
      [  440.980586]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
      [  441.011620] RIP  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  441.019301]  RSP <ffff893bf9bcbd10>
      [  441.024052] CR2: 0000000000000008
      Reported-by: 's avatarZhang Yi <yizhan@redhat.com>
      Tested-by: 's avatarZhang Yi <yizhan@redhat.com>
      Signed-off-by: 's avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8976d64b
    • Ming Lei's avatar
      blk-mq: fix race between updating nr_hw_queues and switching io sched · fb1ef85d
      Ming Lei authored
      
      [ Upstream commit fb350e0ad99359768e1e80b4784692031ec340e4 ]
      
      In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
      can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
      example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
      freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.
      
      This patch fixes the race be holding q->sysfs_lock.
      Reviewed-by: 's avatarChristoph Hellwig <hch@lst.de>
      Reported-by: 's avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: 's avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: 's avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb1ef85d
    • Ming Lei's avatar
      blk-mq: avoid to map CPU into stale hw queue · eaa07780
      Ming Lei authored
      
      [ Upstream commit 7d4901a90d02500c8011472a060f9b2e60e6e605 ]
      
      blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
      previous map isn't cleared yet, and may point to one stale hw queue
      index.
      
      This patch fixes the following issue by clearing the mapping table before
      setting it up in blk_mq_pci_map_queues().
      
      This patches fixes this following issue reported by Zhang Yi:
      
      [  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
      [  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
      [  101.216346] PGD 0 P4D 0
      [  101.219171] Oops: 0000 [#1] SMP
      [  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
      [  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
      [  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
      [  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
      [  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
      [  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
      [  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
      [  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
      [  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
      [  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
      [  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
      [  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
      [  101.387714] Call Trace:
      [  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
      [  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
      [  101.400848]  ? pick_next_task_fair+0x290/0x5f0
      [  101.405807]  ? __switch_to+0x1f5/0x430
      [  101.409988]  ? put_prev_entity+0x2f/0xd0
      [  101.414365]  process_one_work+0x141/0x340
      [  101.418836]  worker_thread+0x47/0x3e0
      [  101.422921]  kthread+0xf5/0x130
      [  101.426424]  ? rescuer_thread+0x380/0x380
      [  101.430896]  ? kthread_associate_blkcg+0x90/0x90
      [  101.436048]  ret_from_fork+0x1f/0x30
      [  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
      [  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
      [  101.468205] CR2: 0000000000000098
      [  101.471907] ---[ end trace 5fe710f98228a3ca ]---
      [  101.482489] Kernel panic - not syncing: Fatal exception
      [  101.488505] Kernel Offset: disabled
      [  101.497752] ---[ end Kernel panic - not syncing: Fatal exception
      Reviewed-by: 's avatarChristoph Hellwig <hch@lst.de>
      Suggested-by: 's avatarChristoph Hellwig <hch@lst.de>
      Reported-by: 's avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: 's avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: 's avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eaa07780
  20. 08 Apr, 2018 2 commits