1. 19 Sep, 2018 4 commits
  2. 15 Sep, 2018 1 commit
  3. 25 Jul, 2018 1 commit
  4. 24 Apr, 2018 1 commit
    • block/mq: fix potential deadlock during cpu hotplug · 8d7f1fde
      Wanpeng Li authored
      commit 51d638b1 upstream.
      
      This can be triggered by hot-unplugging one cpu.
      
      ======================================================
       [ INFO: possible circular locking dependency detected ]
       4.11.0+ #17 Not tainted
       -------------------------------------------------------
       step_after_susp/2640 is trying to acquire lock:
        (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110
      
       but task is already holding lock:
        (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              get_online_cpus+0x64/0x80
              blk_mq_init_allocated_queue+0x3a0/0x4e0
              blk_mq_init_queue+0x3a/0x60
              loop_add+0xe5/0x280
              loop_init+0x124/0x177
              do_one_initcall+0x53/0x1c0
              kernel_init_freeable+0x1e3/0x27f
              kernel_init+0xe/0x100
              ret_from_fork+0x31/0x40
      
       -> #0 (all_q_mutex){+.+...}:
              __lock_acquire+0x189a/0x18a0
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              blk_mq_queue_reinit_work+0x18/0x110
              blk_mq_queue_reinit_dead+0x1c/0x20
              cpuhp_invoke_callback+0x1f2/0x810
              cpuhp_down_callbacks+0x42/0x80
              _cpu_down+0xb2/0xe0
              freeze_secondary_cpus+0xb6/0x390
              suspend_devices_and_enter+0x3b3/0xa40
              pm_suspend+0x129/0x490
              state_store+0x82/0xf0
              kobj_attr_store+0xf/0x20
              sysfs_kf_write+0x45/0x60
              kernfs_fop_write+0x135/0x1c0
              __vfs_write+0x37/0x160
              vfs_write+0xcd/0x1d0
              SyS_write+0x58/0xc0
              do_syscall_64+0x8f/0x710
              return_from_SYSCALL_64+0x0/0x7a
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(cpu_hotplug.lock);
                                      lock(all_q_mutex);
                                      lock(cpu_hotplug.lock);
         lock(all_q_mutex);
      
        *** DEADLOCK ***
      
       8 locks held by step_after_susp/2640:
        #0:  (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
        #1:  (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
        #2:  (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
        #3:  (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
        #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
        #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
        #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
        #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       stack backtrace:
       CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
       Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
       Call Trace:
        dump_stack+0x99/0xce
        print_circular_bug+0x1fa/0x270
        __lock_acquire+0x189a/0x18a0
        lock_acquire+0x11c/0x230
        ? lock_acquire+0x11c/0x230
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? blk_mq_queue_reinit_work+0x18/0x110
        __mutex_lock+0x92/0x990
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? kmem_cache_free+0x2cb/0x330
        ? anon_transport_class_unregister+0x20/0x20
        ? blk_mq_queue_reinit_work+0x110/0x110
        mutex_lock_nested+0x1b/0x20
        ? mutex_lock_nested+0x1b/0x20
        blk_mq_queue_reinit_work+0x18/0x110
        blk_mq_queue_reinit_dead+0x1c/0x20
        cpuhp_invoke_callback+0x1f2/0x810
        ? __flow_cache_shrink+0x160/0x160
        cpuhp_down_callbacks+0x42/0x80
        _cpu_down+0xb2/0xe0
        freeze_secondary_cpus+0xb6/0x390
        suspend_devices_and_enter+0x3b3/0xa40
        ? rcu_read_lock_sched_held+0x79/0x80
        pm_suspend+0x129/0x490
        state_store+0x82/0xf0
        kobj_attr_store+0xf/0x20
        sysfs_kf_write+0x45/0x60
        kernfs_fop_write+0x135/0x1c0
        __vfs_write+0x37/0x160
        ? rcu_read_lock_sched_held+0x79/0x80
        ? rcu_sync_lockdep_assert+0x2f/0x60
        ? __sb_start_write+0xd9/0x1c0
        ? vfs_write+0x1ad/0x1d0
        vfs_write+0xcd/0x1d0
        SyS_write+0x58/0xc0
        ? rcu_read_lock_sched_held+0x79/0x80
        do_syscall_64+0x8f/0x710
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      The cpu hotplug path holds cpu_hotplug.lock and then reinits all existing
      blk-mq queues under all_q_mutex; however, blk_mq_init_allocated_queue()
      takes these two locks in the inverse order. This is due to commit eabe0659
      (block/mq: Cure cpu hotplug lock inversion), which fixes a cpu hotplug lock
      inversion caused by the hotplug rework; that rework is still work in
      progress, lives in a -tip branch, and mainline cannot yet trigger that
      splat. The commit breaks Linus's tree in the merge window, so this patch
      reverts the lock order and avoids the splat in Linus's tree.
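      
      Schematically, the restored ordering in blk_mq_init_allocated_queue()
      matches the hotplug path (hotplug lock first, then all_q_mutex). A sketch
      of the resulting order, not the literal diff:
      
          get_online_cpus();
          mutex_lock(&all_q_mutex);
          /* map software to hardware queues, add the queue to all_q_list */
          mutex_unlock(&all_q_mutex);
          put_online_cpus();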
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Cc: Thierry Escande <thierry.escande@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d7f1fde
  5. 13 Apr, 2018 5 commits
    • blk-mq: fix kernel oops in blk_mq_tag_idle() · 3f4e2419
      Ming Lei authored
      
      [ Upstream commit 8ab0b7dc ]
      
      HW queues may be unmapped in some cases, such as in blk_mq_update_nr_hw_queues(),
      so we need to check for that before calling blk_mq_tag_idle(); otherwise the
      following kernel oops can be triggered. Fix it by checking whether the hw
      queue is unmapped, since it makes no sense to idle the tags once the hw
      queues are unmapped.
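      
      A sketch of the kind of guard described above, assuming
      blk_mq_hw_queue_mapped() as the mapped-check helper of that era:
      
          /* only idle the tags of a hw queue that is still mapped */
          if (blk_mq_hw_queue_mapped(hctx))
                  blk_mq_tag_idle(hctx);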
      
      [  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
      [  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
      [  440.788359] RIP: 0010:[<ffffffffb730e2b4>]  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
      [  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
      [  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
      [  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
      [  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
      [  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
      [  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
      [  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
      [  440.876169] Call Trace:
      [  440.879818]  [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
      [  440.887051]  [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
      [  440.894465]  [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
      [  440.901881]  [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
      [  440.910068]  [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
      [  440.919026]  [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
      [  440.928079]  [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
      [  440.937126]  [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
      [  440.944517]  [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
      [  440.951607]  [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
      [  440.959760]  [<ffffffffb70b352f>] kthread+0xcf/0xe0
      [  440.966055]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.973715]  [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
      [  440.980586]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
      [  441.011620] RIP  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  441.019301]  RSP <ffff893bf9bcbd10>
      [  441.024052] CR2: 0000000000000008
      Reported-by: Zhang Yi <yizhan@redhat.com>
      Tested-by: Zhang Yi <yizhan@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f4e2419
    • bio-integrity: Do not allocate integrity context for bio w/o data · 7f851311
      Dmitry Monakhov authored
      
      [ Upstream commit 3116a23b ]
      
      If a bio has no data, such as one from blkdev_issue_flush(),
      then we have nothing to protect.
      
      This patch prevents a BUG_ON like the following:
      
      kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
      kernel BUG at mm/slab.c:2773!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: bcache
      CPU: 0 PID: 4428 Comm: xfs_io Tainted: G        W       4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
      Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
      task: ffff880137786440 task.stack: ffffc90000ba8000
      RIP: 0010:kfree_debugcheck+0x25/0x2a
      RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
      RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
      RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
      R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
      FS:  00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
      Call Trace:
       kfree+0xc8/0x1b3
       bio_integrity_free+0xc3/0x16b
       bio_free+0x25/0x66
       bio_put+0x14/0x26
       blkdev_issue_flush+0x7a/0x85
       blkdev_fsync+0x35/0x42
       vfs_fsync_range+0x8e/0x9f
       vfs_fsync+0x1c/0x1e
       do_fsync+0x31/0x4a
       SyS_fsync+0x10/0x14
       entry_SYSCALL_64_fastpath+0x1f/0xc2
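      
      The crash path above is reachable from user space simply by fsync()ing a
      block device that was opened directly, on a device with an integrity
      profile (blkdev_fsync() ends up in blkdev_issue_flush()). A minimal repro
      sketch, with the device path as a placeholder:
      
          /* flush-repro.c: fsync a raw block device to exercise blkdev_issue_flush() */
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
      
          int main(void)
          {
                  int fd = open("/dev/sdX", O_WRONLY);    /* placeholder device */
      
                  if (fd < 0) {
                          perror("open");
                          return 1;
                  }
                  if (fsync(fd) < 0)    /* -> blkdev_fsync() -> blkdev_issue_flush() */
                          perror("fsync");
                  close(fd);
                  return 0;
          }
      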
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f851311
    • blk-mq: fix race between updating nr_hw_queues and switching io sched · 3bab65f2
      Ming Lei authored
      
      [ Upstream commit fb350e0a ]
      
      In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
      can be allocated and q->nr_hw_queues is used, so a race is inevitable; for
      example, blk_mq_init_sched() may trigger a use-after-free on an hctx that
      is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.
      
      This patch fixes the race by holding q->sysfs_lock.
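      
      Schematically, the affected section in each path is wrapped in
      q->sysfs_lock (a sketch of the pattern, not the literal hunks):
      
          mutex_lock(&q->sysfs_lock);
          /* read q->nr_hw_queues and set up / tear down sched tags per hctx */
          mutex_unlock(&q->sysfs_lock);
      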
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bab65f2
    • block: fix an error code in add_partition() · e6687645
      Dan Carpenter authored
      
      [ Upstream commit 7bd897cf ]
      
      We don't set an error code on this path.  It means that we return NULL
      instead of an error pointer and the caller does a NULL dereference.
      
      Fixes: 6d1d8050 ("block, partition: add partition_meta_info to hd_struct")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6687645
    • blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op · 2005c4f3
      Wen Xiong authored
      
      [ Upstream commit f36ea50c ]
      
      When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op returns
      "Input/output error". It looks like the block layer splits the bio after
      calling bio_integrity_prep(bio). This patch fixes the issue.
      
      Below is how we debug this issue:
      (1) format nvme to 4K block size with type 2 DIF
      (2) dd with block size bigger than 1024k, oflag=direct
      dd: error writing '/dev/nvme0n1': Input/output error
      
      We added some debug code to the nvme device driver. It showed us that the
      first op and the second op have the same bi and pi addresses, which is not
      correct.
      
      1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
      
      2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
              AT=0x0 & RT=0x605  ==> This op fails, as do the subsequent 5 retries.
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
      
      With the fix, it showed us that both the first op and the second op have
      correct bi and pi addresses.
      
      1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
      	0x00002828
      2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605
      	Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
      	0x00003028
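      
      Schematically, the fix makes the submission path split the bio before
      preparing the integrity payload, so each resulting bio gets its own
      protection information. A sketch of the ordering (helper signatures are
      the pre-4.13 ones and may differ on other kernels):
      
          blk_queue_split(q, &bio, q->bio_split);
      
          if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
                  bio_io_error(bio);
                  return BLK_QC_T_NONE;
          }
      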
      Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2005c4f3
  6. 08 Apr, 2018 2 commits
  7. 24 Mar, 2018 1 commit
    • block/mq: Cure cpu hotplug lock inversion · 18dd7b96
      Peter Zijlstra authored
      
      [ Upstream commit eabe0659 ]
      
      By poking at /debug/sched_features I triggered the following splat:
      
       [] ======================================================
       [] WARNING: possible circular locking dependency detected
       [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
       [] ------------------------------------------------------
       [] bash/2109 is trying to acquire lock:
       []  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
       []
       [] but task is already holding lock:
       []  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] which lock already depends on the new lock.
       []
       []
       [] the existing dependency chain (in reverse order) is:
       []
       [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
       []        lock_acquire+0x100/0x210
       []        down_write+0x28/0x60
       []        start_creating+0x5e/0xf0
       []        debugfs_create_dir+0x13/0x110
       []        blk_mq_debugfs_register+0x21/0x70
       []        blk_mq_register_dev+0x64/0xd0
       []        blk_register_queue+0x6a/0x170
       []        device_add_disk+0x22d/0x440
       []        loop_add+0x1f3/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
       []
       [] -> #1 (all_q_mutex){+.+.+.}:
       []        lock_acquire+0x100/0x210
       []        __mutex_lock+0x6c/0x960
       []        mutex_lock_nested+0x1b/0x20
       []        blk_mq_init_allocated_queue+0x37c/0x4e0
       []        blk_mq_init_queue+0x3a/0x60
       []        loop_add+0xe5/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
      
       []  *** DEADLOCK ***
       []
       [] 3 locks held by bash/2109:
       []  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
       []  #1:  (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
       []  #2:  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] stack backtrace:
       [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
       [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       [] Call Trace:
      
       []  lock_acquire+0x100/0x210
       []  get_online_cpus+0x2a/0x90
       []  static_key_slow_dec+0x1b/0x50
       []  static_key_disable+0x20/0x30
       []  sched_feat_write+0x131/0x170
       []  full_proxy_write+0x97/0xd0
       []  __vfs_write+0x28/0x120
       []  vfs_write+0xb5/0x1a0
       []  SyS_write+0x49/0xa0
       []  entry_SYSCALL_64_fastpath+0x23/0xc2
      
      This is because of the cpu hotplug lock rework. Break the chain at #1
      by reversing the lock acquisition order. This way i_mutex_key#4 no
      longer depends on cpu_hotplug_lock and things are good.
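      
      Schematically, the cure reverses the order in blk_mq_init_allocated_queue()
      so the hotplug lock is taken inside all_q_mutex. A sketch, not the literal
      diff (note that the 8d7f1fde backport listed above later reverts this
      ordering):
      
          mutex_lock(&all_q_mutex);
          get_online_cpus();
          /* map software to hardware queues, add the queue to all_q_list */
          put_online_cpus();
          mutex_unlock(&all_q_mutex);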
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      18dd7b96
  8. 22 Mar, 2018 2 commits
  9. 25 Feb, 2018 1 commit
  10. 20 Dec, 2017 2 commits
  11. 14 Dec, 2017 2 commits
    • block: wake up all tasks blocked in get_request() · 1a5a4c6e
      Ming Lei authored
      
      [ Upstream commit 34d9715a ]
      
      Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
      blk_freeze_queue() and wait for q->q_usage_counter to become zero. But
      if there are tasks blocked in get_request(), q->q_usage_counter can
      never become zero. So we have to wake up all these tasks in
      blk_set_queue_dying() first.
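      
      A sketch of the wake-up loop in blk_set_queue_dying() after the change
      (field and helper names as of that era):
      
          struct request_list *rl;
      
          spin_lock_irq(q->queue_lock);
          blk_queue_for_each_rl(rl, q) {
                  if (rl->rq_pool) {
                          /* wake every sleeper, not just one per list */
                          wake_up_all(&rl->wait[BLK_RW_SYNC]);
                          wake_up_all(&rl->wait[BLK_RW_ASYNC]);
                  }
          }
          spin_unlock_irq(q->queue_lock);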
      
      Fixes: 3ef28e83 ("block: generic request_queue reference counting")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a5a4c6e
    • blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() · bc885917
      Ming Lei authored
      
      [ Upstream commit 737f98cf ]
      
      Both q->mq_kobj and the sw queues' kobjects should be initialized only
      once, instead of in each add_disk context.
      
      This patch also removes the clearing of ctx in blk_mq_init_cpu_queues(),
      because the percpu allocator already zeroes the allocated memory.
      
      This patch fixes one issue[1] reported by Omar.
      
      [1] kernel warning when doing unbind/bind on one scsi-mq device
      
      [   19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
      [   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
      [   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
      [   19.350920] Workqueue: events_unbound async_run_entry_fn
      [   19.350920] Call Trace:
      [   19.350920]  dump_stack+0x63/0x83
      [   19.350920]  kobject_init+0x77/0x90
      [   19.350920]  blk_mq_register_dev+0x40/0x130
      [   19.350920]  blk_register_queue+0xb6/0x190
      [   19.350920]  device_add_disk+0x1ec/0x4b0
      [   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
      [   19.350920]  async_run_entry_fn+0x48/0x150
      [   19.350920]  process_one_work+0x1d0/0x480
      [   19.350920]  worker_thread+0x48/0x4e0
      [   19.350920]  kthread+0x101/0x140
      [   19.350920]  ? process_one_work+0x480/0x480
      [   19.350920]  ? kthread_create_on_node+0x60/0x60
      [   19.350920]  ret_from_fork+0x2c/0x40
      
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc885917
  12. 30 Nov, 2017 1 commit
  13. 21 Oct, 2017 1 commit
    • Revert "bsg-lib: don't free job in bsg_prepare_job" · ebbd5ac4
      Greg Kroah-Hartman authored
      This reverts commit eb4375e1 which was
      commit f507b54d upstream.
      
      Ben reports:
      	That function doesn't exist here (it was introduced in 4.13).
      	Instead, this backport has modified bsg_create_job(), creating a
      	leak.  Please revert this on the 3.18, 4.4 and 4.9 stable
      	branches.
      
      So I'm dropping it from here.
      Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ebbd5ac4
  14. 18 Oct, 2017 3 commits
  15. 08 Oct, 2017 1 commit
  16. 05 Oct, 2017 1 commit
  17. 27 Sep, 2017 1 commit
    • block: Relax a check in blk_start_queue() · 120ec1e4
      Bart Van Assche authored
      commit 4ddd56b0 upstream.
      
      Calling blk_start_queue() from interrupt context with the queue
      lock held and without disabling IRQs, as the skd driver does, is
      safe. This patch prevents loading the skd driver from triggering the
      following warning:
      
      WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
      RIP: 0010:blk_start_queue+0x84/0xa0
      Call Trace:
       skd_unquiesce_dev+0x12a/0x1d0 [skd]
       skd_complete_internal+0x1e7/0x5a0 [skd]
       skd_complete_other+0xc2/0xd0 [skd]
       skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
       skd_isr+0x14f/0x180 [skd]
       irq_forced_thread_fn+0x2a/0x70
       irq_thread+0x144/0x1a0
       kthread+0x125/0x140
       ret_from_fork+0x2a/0x40
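      
      One way to relax the check along the lines described above (a sketch, not
      necessarily the literal change) is to warn only when blk_start_queue() is
      called neither from interrupt context nor with interrupts disabled:
      
          WARN_ON_ONCE(!in_interrupt() && !irqs_disabled());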
      
      Fixes: commit a038e253 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Andrew Morton <akpm@osdl.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      120ec1e4
  18. 25 Aug, 2017 1 commit
  19. 17 Jun, 2017 1 commit
  20. 14 Jun, 2017 1 commit
    • cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode · 08229c11
      Hou Tao authored
      commit 5be6b756 upstream.
      
      When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
      as the delay of the cfq_group's vdisktime if other cfq_groups are already
      present.
      
      When cfq is under iops mode, commit 9a7f38c4 ("cfq-iosched: Convert
      from jiffies to nanoseconds") could result in a large iops delay and
      lead to an abnormal io schedule delay for the added cfq_group. To fix
      it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
      when iops mode is enabled.
      
      Despite having the same value, the delay of a cfq_queue in idle class
      and the delay of cfq_group are different things, so I define two new
      macros for the delay of a cfq_group under time-slice mode and iops mode.
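      
      A sketch of the two group-delay values described above (macro names are
      illustrative, not necessarily those used in the patch):
      
          /* time-slice mode: vdisktime is in nanoseconds */
          #define CFQ_SLICE_MODE_GROUP_DELAY   (NSEC_PER_SEC / 5)
          /* iops mode: vdisktime counts ios, so keep the old HZ-based delay */
          #define CFQ_IOPS_MODE_GROUP_DELAY    (HZ / 5)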
      
      Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds")
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      08229c11
  21. 20 May, 2017 1 commit
  22. 14 May, 2017 1 commit
    • block: get rid of blk_integrity_revalidate() · 6a762074
      Ilya Dryomov authored
      commit 19b7ccf8 upstream.
      
      Commit 25520d55 ("block: Inline blk_integrity in struct gendisk")
      introduced blk_integrity_revalidate(), which seems to assume ownership
      of the stable pages flag and unilaterally clears it if no blk_integrity
      profile is registered:
      
          if (bi->profile)
                  disk->queue->backing_dev_info->capabilities |=
                          BDI_CAP_STABLE_WRITES;
          else
                  disk->queue->backing_dev_info->capabilities &=
                          ~BDI_CAP_STABLE_WRITES;
      
      It's called from revalidate_disk() and rescan_partitions(), making it
      impossible to enable stable pages for drivers that support partitions
      and don't use blk_integrity: while the call in revalidate_disk() can be
      trivially worked around (see zram, which doesn't support partitions and
      hence gets away with zram_revalidate_disk()), rescan_partitions() can
      be triggered from userspace at any time.  This breaks rbd, where the
      ceph messenger is responsible for generating/verifying CRCs.
      
      Since blk_integrity_{un,}register() "must" be used for (un)registering
      the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
      setting there.  This way drivers that call blk_integrity_register() and
      use integrity infrastructure won't interfere with drivers that don't
      but still want stable pages.
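      
      Schematically, after the change the flag is only touched on explicit
      (un)registration (mirroring the snippet above; on pre-4.11 backports the
      bdi is embedded in the queue, so the field access differs slightly):
      
          void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
          {
                  /* ... existing registration work ... */
                  disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
          }
      
          void blk_integrity_unregister(struct gendisk *disk)
          {
                  disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES;
                  /* ... existing unregistration work ... */
          }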
      
      Fixes: 25520d55 ("block: Inline blk_integrity in struct gendisk")
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a762074
  23. 18 Apr, 2017 1 commit
    • blk-mq: Avoid memory reclaim when remapping queues · d7045cbf
      Gabriel Krisman Bertazi authored
      commit 36e1f3d1 upstream.
      
      While stressing memory and IO and changing SMT settings at the same time,
      we were able to consistently trigger deadlocks in the mm system, which
      froze the entire machine.
      
      I think that under memory stress conditions, the large allocations
      performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
      waiting on the block layer remapping completion, thus deadlocking the
      system.  The trace below was collected after the machine stalled,
      waiting for the hotplug event completion.
      
      The simplest fix for this is to make allocations in this path
      non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
      issue anymore.
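      
      Schematically, the allocations on this path drop GFP_KERNEL in favour of
      GFP_NOIO so they cannot recurse into block/filesystem reclaim (an
      illustrative call, not the literal hunk):
      
          tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
                                   GFP_NOIO, set->numa_node);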
      
      This should apply on top of Jens's for-next branch cleanly.
      
      Changes since v1:
        - Use GFP_NOIO instead of GFP_NOWAIT.
      
       Call Trace:
      [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
      [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
      [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
      [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
      [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
      [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
      [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
      [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
      [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
      [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
      [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
      [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
      [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
      [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
      [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
      [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
      [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
      [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
      [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
      [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
      [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
      [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
      [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
      [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
      [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
      [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
      [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
      [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
      [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
      [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
      [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
      [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
      [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
      [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
      [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
      [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
      [c000000f0160be30] [c000000000009204] system_call+0x38/0xec
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7045cbf
  24. 08 Apr, 2017 2 commits
    • blk: Ensure users for current->bio_list can see the full list. · 5959cded
      NeilBrown authored
      commit f5fe1b51 upstream.
      
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Cc: Jack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5959cded
    • blk: improve order of bio handling in generic_make_request() · d5986e00
      NeilBrown authored
      commit 79bd9959 upstream.
      
      To avoid recursion on the kernel stack when stacked block devices
      are in use, generic_make_request() will, when called recursively,
      queue new requests for later handling.  They will be handled when the
      make_request_fn for the current bio completes.
      
      If any bios are submitted by a make_request_fn, these will ultimately
      be handled sequentially.  If the handling of one of those generates
      further requests, they will be added to the end of the queue.
      
      This strict first-in-first-out behaviour can lead to deadlocks in
      various ways, normally because a request might need to wait for a
      previous request to the same device to complete.  This can happen when
      they share a mempool, and can happen due to interdependencies
      particular to the device.  Both md and dm have examples where this happens.
      
      These deadlocks can be eradicated by more selective ordering of bios.
      Specifically by handling them in depth-first order.  That is: when the
      handling of one bio generates one or more further bios, they are
      handled immediately after the parent, before any siblings of the
      parent.  That way, when generic_make_request() calls make_request_fn
      for some particular device, we can be certain that all previously
      submitted requests for that device have been completely handled and are
      not waiting for anything in the queue of requests maintained in
      generic_make_request().
      
      An easy way to achieve this would be to use a last-in-first-out stack
      instead of a queue.  However this will change the order of consecutive
      bios submitted by a make_request_fn, which could have unexpected consequences.
      Instead we take a slightly more complex approach.
      A fresh queue is created for each call to a make_request_fn.  After it completes,
      any bios for a different device are placed on the front of the main queue, followed
      by any bios for the same device, followed by all bios that were already on
      the queue before the make_request_fn was called.
      This provides the depth-first approach without reordering bios on the same level.
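      
      A sketch of that reordering step inside the generic_make_request() loop
      (variable names are illustrative, following the description above):
      
          struct bio_list lower, same;
      
          bio_list_init(&lower);
          bio_list_init(&same);
          /* sort the bios just submitted by ->make_request_fn() into those for
           * a lower-level device and those for the same device */
          while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) {
                  if (q == bdev_get_queue(bio->bi_bdev))
                          bio_list_add(&same, bio);
                  else
                          bio_list_add(&lower, bio);
          }
          /* lowest level first, then this level, then what was already queued */
          bio_list_merge(&bio_list_on_stack, &lower);
          bio_list_merge(&bio_list_on_stack, &same);
          bio_list_merge(&bio_list_on_stack, &remainder);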
      
      This, by itself, is not enough to remove all deadlocks.  It just makes
      it possible for drivers to take the extra step required themselves.
      
      To avoid deadlocks, drivers must never risk waiting for a request
      after submitting one to generic_make_request.  This includes never
      allocating from a mempool twice in one call to a make_request_fn.
      
      A common pattern in drivers is to call bio_split() in a loop, handling
      the first part and then looping around to possibly split the next part.
      Instead, a driver that finds it needs to split a bio should queue
      (with generic_make_request) the second part, handle the first part,
      and then return.  The new code in generic_make_request will ensure the
      requests to underlying bios are processed first, then the second bio
      that was split off.  If it splits again, the same process happens.  In
      each case one bio will be completely handled before the next one is attempted.
      
      With this in place, it should be possible to disable the
      punt_bios_to_recover() recovery thread for many block devices, and
      eventually it may be possible to remove it completely.
      
      Ref: http://www.spinics.net/lists/raid/msg54680.html
      Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
      Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Cc: Jack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5986e00
  25. 30 Mar, 2017 1 commit
    • blk-mq: don't complete un-started request in timeout handler · 21d17f1b
      Ming Lei authored
      commit 95a49603 upstream.
      
      When iterating busy requests in the timeout handler, if the STARTED
      flag of a request isn't set, that means the request is still being
      processed in the block layer or driver and hasn't been submitted to
      the hardware yet.
      
      In the current implementation of blk_mq_check_expired(), if the
      request queue becomes dying, un-started requests are completed/freed
      immediately. This is wrong, and can cause rq corruption or double
      allocation[1][2] when doing I/O and removing & resetting an NVMe
      device at the same time.
      
      This patch fixes several issues reported by Yi Zhang.
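      
      A sketch of the guard this change puts in blk_mq_check_expired(): an
      un-started request is now left alone instead of being completed when the
      queue is dying (flag and field names as of that era):
      
          if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
                  return;   /* not issued to hardware yet; don't touch it */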
      
      [1]. oops log 1
      [  581.789754] ------------[ cut here ]------------
      [  581.789758] kernel BUG at block/blk-mq.c:374!
      [  581.789760] invalid opcode: 0000 [#1] SMP
      [  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
      irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
      intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
      intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
      pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
      sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
      crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
      dm_region_hash dm_log dm_mod
      [  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
      [  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  581.789804] Workqueue: kblockd blk_mq_timeout_work
      [  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
      [  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
      [  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
      [  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
      [  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
      [  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
      [  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
      [  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
      [  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
      [  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
      [  581.789820] Call Trace:
      [  581.789825]  blk_mq_check_expired+0x76/0x80
      [  581.789828]  bt_iter+0x45/0x50
      [  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
      [  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789840]  ? __switch_to+0x140/0x450
      [  581.789841]  blk_mq_timeout_work+0x88/0x170
      [  581.789845]  process_one_work+0x165/0x410
      [  581.789847]  worker_thread+0x137/0x4c0
      [  581.789851]  kthread+0x101/0x140
      [  581.789853]  ? rescuer_thread+0x3b0/0x3b0
      [  581.789855]  ? kthread_park+0x90/0x90
      [  581.789860]  ret_from_fork+0x2c/0x40
      [  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
      8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
      0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
      [  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
      [  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
      
      [2]. oops log2
      [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857373] PGD 0
      [ 6984.857374]
      [ 6984.857376] Oops: 0000 [#1] SMP
      [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
      irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
      iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
      mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
      acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
      sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
      libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
      dm_region_hash dm_log dm_mod
      [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
      4.10.0-2.el7.bz1420297.x86_64 #1
      [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
      [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
      [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
      [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
      [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
      [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
      [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
      [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
      [ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
      [ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
      [ 6984.857443] Call Trace:
      [ 6984.857451]  ? mempool_free+0x2b/0x80
      [ 6984.857455]  ? bio_free+0x4e/0x60
      [ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
      [ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
      [ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
      [ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
      [ 6984.857473]  process_one_work+0x165/0x410
      [ 6984.857475]  worker_thread+0x137/0x4c0
      [ 6984.857478]  kthread+0x101/0x140
      [ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
      [ 6984.857481]  ? kthread_park+0x90/0x90
      [ 6984.857489]  ret_from_fork+0x2c/0x40
      [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
      89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
      8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
      [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
      [ 6984.857512] CR2: 0000000000000010
      [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
      Reported-by: Yi Zhang <yizhan@redhat.com>
      Tested-by: Yi Zhang <yizhan@redhat.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      21d17f1b
  26. 22 Mar, 2017 1 commit
    • block: allow WRITE_SAME commands with the SG_IO ioctl · 61a153d0
      Mauricio Faria de Oliveira authored
      [ Upstream commit 25cdb645 ]
      
      The WRITE_SAME commands are not present in the blk_default_cmd_filter
      write_ok list, and thus fail with -EPERM when the SG_IO ioctl() is
      executed without the CAP_SYS_RAWIO capability (e.g., by unprivileged users).
      [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]
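      
      Schematically, the change adds the WRITE SAME opcodes to the write_ok
      bitmap built by the default command filter (sketch; the exact opcode set
      is whatever the patch adds):
      
          __set_bit(WRITE_SAME, filter->write_ok);       /* WRITE SAME(10), 0x41 */
          __set_bit(WRITE_SAME_16, filter->write_ok);    /* WRITE SAME(16), 0x93 */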
      
      The problem can be reproduced with the sg_write_same command
      
        # sg_write_same --num 1 --xferlen 512 /dev/sda
        #
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
          Write same: pass through os error: Operation not permitted
        #
      
      For comparison, the WRITE_VERIFY command does not observe this problem,
      since it is in that list:
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
        #
      
      So, this patch adds the WRITE_SAME commands to the list, in order
      for the SG_IO ioctl to finish successfully:
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
        #
      
      That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
      (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]),
      which employ the SG_IO ioctl() and run as an unprivileged user (libvirt-qemu).
      
      In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
      which are translated to write-same calls in the guest kernel, and then into
      SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:
      
        [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
        [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
        [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
        [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
        [...] blk_update_request: I/O error, dev sda, sector 17096824
      
      Links:
      [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
      [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Brahadambal Srinivasan <latha@linux.vnet.ibm.com>
      Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61a153d0