1. 15 Dec, 2014 1 commit
  2. 11 Dec, 2014 1 commit
    • bio: modify __bio_add_page() to accept pages that don't start a new segment · fcbf6a08
      Maurizio Lombardi authored
      The original behaviour is to refuse to add a new page if the maximum
      number of segments has been reached, regardless of whether the page
      being added can be merged into the last segment.
      Unfortunately, when the system runs under heavy memory fragmentation
      conditions, a driver may try to add multiple pages to the last segment.
      The original code won't accept them and EBUSY will be reported to
      This patch modifies the function so it refuses to add a page only in case
      the latter starts a new segment and the maximum number of segments has
      already been reached.
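      The acceptance rule described above can be sketched in userspace C. This
      is a minimal model, not the kernel's __bio_add_page(): struct bio_model,
      may_add_page() and the 'mergeable' flag are hypothetical names introduced
      only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the real function consults. */
struct bio_model {
	unsigned int nr_segments;   /* segments already in the bio */
	unsigned int max_segments;  /* limit on segments */
};

/*
 * Returns true if the page may be added. 'mergeable' says whether the page
 * can be merged into the last segment, i.e. it does not start a new one.
 */
static bool may_add_page(struct bio_model *bio, bool mergeable)
{
	assert(bio->nr_segments <= bio->max_segments);

	if (mergeable)
		return true;            /* segment count is unchanged */
	if (bio->nr_segments >= bio->max_segments)
		return false;           /* a new segment would exceed the limit */
	bio->nr_segments++;             /* the page opens a new segment */
	return true;
}
```

      With the limit already reached, a mergeable page is still accepted in
      this model, while a page that starts a new segment is refused.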
      The bug can be easily reproduced with the st driver:
      2) modprobe st buffer_kbs=1024
      3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
         dd: error writing `/dev/st0': Device or resource busy
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Jet Chen <jet.chen@intel.com>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 10 Dec, 2014 1 commit
    • blk-mq: Fix uninitialized kobject at CPU hotplugging · 06a41a99
      Takashi Iwai authored
      When a CPU is hotplugged, the current blk-mq spews a warning like:
        kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
        CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
         0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
         ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
         ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
        Call Trace:
         [<ffffffff81005306>] dump_trace+0x86/0x330
         [<ffffffff81005644>] show_stack_log_lvl+0x94/0x170
         [<ffffffff81006d21>] show_stack+0x21/0x50
         [<ffffffff81605f07>] dump_stack+0x41/0x51
         [<ffffffff8132c7a0>] kobject_add+0xa0/0xb0
         [<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0
         [<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60
         [<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190
         [<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70
         [<ffffffff8105fd23>] cpu_notify+0x23/0x50
         [<ffffffff81060037>] _cpu_up+0x157/0x170
         [<ffffffff810600d9>] cpu_up+0x89/0xb0
         [<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80
         [<ffffffff814323cd>] device_online+0x5d/0xa0
         [<ffffffff81432485>] online_store+0x75/0x80
         [<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150
         [<ffffffff811c5532>] vfs_write+0xb2/0x1f0
         [<ffffffff811c5f42>] SyS_write+0x42/0xb0
         [<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b
         [<00007f0132fb24e0>] 0x7f0132fb24e0
      This is indeed because of an uninitialized kobject for blk_mq_ctx.
      The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but that
      function loops over hctx_for_each_ctx(), i.e. it initializes kobjects
      only for online CPUs.  Thus, when a CPU is hotplugged, the ctx for the
      newly onlined CPU is registered without initialization.
      This patch fixes the issue by initializing all the ctx kobjects
      belonging to each queue.
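      The difference between the two initialization strategies can be modelled
      in a few lines of userspace C. Everything here (struct ctx_model, the
      kobj_ready flag, the function names, NR_POSSIBLE_CPUS) is a hypothetical
      stand-in for illustration and not the blk-mq sysfs code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_POSSIBLE_CPUS 4   /* hypothetical machine size */

struct ctx_model {
	bool online;      /* CPU is currently online */
	bool kobj_ready;  /* stands in for an initialized kobject */
};

/* Old behaviour as described above: only contexts of online CPUs are set up. */
static void init_online_only(struct ctx_model ctx[], int n)
{
	assert(n > 0);
	for (int i = 0; i < n; i++)
		if (ctx[i].online)
			ctx[i].kobj_ready = true;
}

/* Behaviour after the fix: every context is set up, online or not. */
static void init_all(struct ctx_model ctx[], int n)
{
	assert(n > 0);
	for (int i = 0; i < n; i++)
		ctx[i].kobj_ready = true;
}

/* Registering a context whose kobject was never initialized is the bug. */
static bool can_register(const struct ctx_model *c)
{
	return c->kobj_ready;
}
```

      In this model, a CPU that comes online after init_online_only() has run
      still has kobj_ready == false, which corresponds to the warning above.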
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 09 Dec, 2014 6 commits
    • blk-mq: Use all available hardware queues · 959f5f5b
      Bart Van Assche authored
      Suppose that a system has two CPU sockets, three cores per socket,
      that it does not support hyperthreading and that four hardware
      queues are provided by a block driver. With the current algorithm
      this will lead to the following assignment of CPU cores to hardware
      queues:
        HWQ 0: 0 1
        HWQ 1: 2 3
        HWQ 2: 4 5
        HWQ 3: (none)
      This patch changes the queue assignment into:
        HWQ 0: 0 1
        HWQ 1: 2
        HWQ 2: 3 4
        HWQ 3: 5
      In other words, this patch has the following three effects:
      - All four hardware queues are used instead of only three.
      - CPU cores are spread more evenly over hardware queues. For the
        above example the range of the number of CPU cores associated
        with a single HWQ is reduced from [0..2] to [1..2].
      - If the number of HWQ's is a multiple of the number of CPU sockets
        it is now guaranteed that all CPU cores associated with a single
        HWQ reside on the same CPU socket.
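      Both assignments in the example can be reproduced with simple integer
      arithmetic. The two functions below are a sketch chosen because they
      yield exactly the two tables above for 6 CPUs and 4 queues; that they
      match the kernel's mapping code, and the function names themselves, are
      assumptions. The sketch ignores sockets and sibling threads.

```c
#include <assert.h>

/* Reproduces the first table: ceil(nr_cpus / nr_queues) CPUs per queue. */
static unsigned int old_queue_index(unsigned int nr_cpus,
				    unsigned int nr_queues, unsigned int cpu)
{
	assert(nr_cpus > 0 && nr_queues > 0 && cpu < nr_cpus);
	return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
}

/* Reproduces the second table: scale the CPU index proportionally. */
static unsigned int new_queue_index(unsigned int nr_cpus,
				    unsigned int nr_queues, unsigned int cpu)
{
	assert(nr_cpus > 0 && nr_queues > 0 && cpu < nr_cpus);
	return cpu * nr_queues / nr_cpus;
}
```

      For nr_cpus = 6 and nr_queues = 4 the old formula maps CPUs 0..5 to
      queues 0 0 1 1 2 2, leaving queue 3 unused, and the new one maps them to
      0 0 1 2 2 3.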
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Micro-optimize bt_get() · 52f7eb94
      Bart Van Assche authored
      Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
      calls into a single call.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Fix a race between bt_clear_tag() and bt_get() · c38d185d
      Bart Van Assche authored
      What we need is the following two guarantees:
      * Any thread that observes the effect of the test_and_set_bit() by
        __bt_get_word() also observes the preceding addition of 'current'
        to the appropriate wait list. This is guaranteed by the semantics
        of the spin_unlock() operation performed by prepare_to_wait().
        Hence the conversion of test_and_set_bit_lock() into
      * The wait lists are examined by bt_clear_tag() after the tag bit has
        been cleared. clear_bit_unlock() guarantees that any thread that
        observes that the bit has been cleared also observes the store
        operations preceding clear_bit_unlock(). However,
        clear_bit_unlock() does not prevent the wait lists from being
        examined before the tag bit is cleared. Hence the addition of a
        memory barrier between clear_bit() and the wait list examination.
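      The second guarantee, ordering the bit clear before the wait list check,
      can be illustrated with C11 atomics in userspace. This is only an
      analogy under stated assumptions: struct tag_model, clear_tag() and the
      'waiters' counter are hypothetical, and the sequentially consistent
      fence stands in for the kernel memory barrier the text mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct tag_model {
	atomic_ulong word;    /* one bit per tag, set means in use */
	atomic_int waiters;   /* stands in for a non-empty wait list */
};

/* Releases a tag; returns true if a waiter should be woken. */
static bool clear_tag(struct tag_model *t, unsigned int tag)
{
	assert(tag < 8 * sizeof(unsigned long));

	/* Clear the tag bit. On its own this does not order later loads. */
	atomic_fetch_and_explicit(&t->word, ~(1UL << tag),
				  memory_order_relaxed);

	/* Full fence: the clear must not be reordered after the check below. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Only now look at the waiters. */
	return atomic_load_explicit(&t->waiters, memory_order_relaxed) > 0;
}
```

      Without the fence, the load of 'waiters' could be satisfied before the
      clear becomes visible, so a thread that queued itself just before
      seeing the bit still set might never be woken. The waiting side needs
      matching ordering of its own for this to hold.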
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Avoid that __bt_get_word() wraps multiple times · 9e98e9d7
      Bart Van Assche authored
      A second wrap-around can occur in __bt_get_word() under the following
      sequence: the function is called with last_tag != 0; the first
      find_next_zero_bit() fails; after wrap-around the test_and_set_bit()
      call fails and find_next_zero_bit() succeeds; the next
      test_and_set_bit() call fails; and find_next_zero_bit() subsequently
      does not find a zero bit. Avoid this by introducing an additional
      local variable.
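      The idea of limiting the search to a single wrap-around with an extra
      local variable can be sketched as follows. This is a single-threaded
      userspace model with hypothetical names (get_free_bit(), 'wrapped'), not
      the kernel function: here claiming a bit cannot fail, whereas in the
      kernel a concurrent thread may win the bit, which is what makes repeated
      wrapping possible.

```c
#include <assert.h>

/*
 * Find and claim a zero bit among the low 'depth' bits of *word, starting
 * the search at 'start'. Returns the bit index, or -1 if no bit is free.
 */
static int get_free_bit(unsigned long *word, unsigned int depth,
			unsigned int start)
{
	unsigned int pos = start;
	int wrapped = (start == 0);   /* a scan from 0 already covers all bits */

	assert(depth > 0 && depth <= 8 * sizeof(unsigned long) && start < depth);

	for (;;) {
		/* Skip over bits that are already set. */
		while (pos < depth && (*word & (1UL << pos)))
			pos++;
		if (pos < depth) {
			*word |= 1UL << pos;   /* claim the free bit */
			return (int)pos;
		}
		if (wrapped)
			return -1;             /* already wrapped once: give up */
		wrapped = 1;                   /* allow exactly one wrap-around */
		pos = 0;
	}
}
```

      The 'wrapped' flag is what bounds the loop: once the scan has restarted
      from bit 0, a second failure ends the search instead of wrapping again.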
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Fix a use-after-free · 45a9c9d9
      Bart Van Assche authored
      blk-mq users are allowed to free the memory request_queue.tag_set
      points at after blk_cleanup_queue() has finished but before
      blk_release_queue() has started. This can happen e.g. in the SCSI
      core, which embeds the tag_set structure in a SCSI
      host structure. The SCSI host structure is freed by
      scsi_host_dev_release(). This function is called after
      blk_cleanup_queue() finished but can be called before
      This means that it is not safe to access request_queue.tag_set from
      inside blk_release_queue(). Hence remove the blk_sync_queue() call
      from blk_release_queue(). This call is not necessary - outstanding
      requests must have finished before blk_release_queue() is
      called. Additionally, move the blk_mq_free_queue() call from
      blk_release_queue() to blk_cleanup_queue() so that
      request_queue.tag_set is not accessed after it has been freed.
      This patch prevents the following kernel oops from being triggered
      when deleting a SCSI host for which scsi-mq was enabled:
      Call Trace:
       [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
       [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
       [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
       [<ffffffff8124d654>] blk_release_queue+0x84/0xd0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff81245895>] blk_put_queue+0x15/0x20
       [<ffffffff8125c409>] disk_release+0x99/0xd0
       [<ffffffff8133d056>] device_release+0x36/0xb0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff8125a78a>] put_disk+0x1a/0x20
       [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
       [<ffffffff811d56a0>] blkdev_put+0x50/0x160
       [<ffffffff81199eb4>] kill_block_super+0x44/0x70
       [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
       [<ffffffff8119a87e>] deactivate_super+0x4e/0x70
       [<ffffffff811b9833>] cleanup_mnt+0x43/0x90
       [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
       [<ffffffff8107252c>] task_work_run+0xac/0xe0
       [<ffffffff81002c01>] do_notify_resume+0x61/0xa0
       [<ffffffff814d2c58>] int_signal+0x12/0x17
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: prevent unmapped hw queue from being scheduled · 19c66e59
      Ming Lei authored
      When a hardware queue has no mapped software queues, it
      should not be scheduled. Otherwise a WARNING or OOPS
      can be triggered.
      The blk_mq_hw_queue_mapped() helper is introduced to fix
      the problem.
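      A guard of this kind can be sketched in userspace C. The struct and
      function names below are hypothetical, and the check covers only what
      the message states (no mapped software queues); the real helper may
      test more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hw_queue_model {
	unsigned int nr_ctx;   /* software queues mapped to this hardware queue */
};

/* True if at least one software queue maps to this hardware queue. */
static bool hw_queue_mapped(const struct hw_queue_model *hctx)
{
	assert(hctx != NULL);
	return hctx->nr_ctx > 0;
}

/* Returns true if the queue was actually run. */
static bool run_hw_queue(const struct hw_queue_model *hctx)
{
	if (!hw_queue_mapped(hctx))
		return false;   /* nothing maps here: skip scheduling */
	/* ... dispatch work would go here ... */
	return true;
}
```

      Placing the check at the start of every scheduling entry point keeps an
      unmapped queue from ever reaching the dispatch path.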
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 08 Dec, 2014 2 commits
  6. 04 Dec, 2014 1 commit
  7. 02 Dec, 2014 1 commit
    • block: fix regression where bio_integrity_process uses wrong bio_vec iterator · 594416a7
      Darrick J. Wong authored
      bio integrity handling is broken on a system with LVM layered atop a
      DIF/DIX SCSI drive because device mapper clones the bio, modifies the
      clone, and sends the clone to the lower layers for processing.
      However, the clone bio has bi_vcnt == 0, which means that when the sd
      driver calls bio_integrity_process to attach DIX data, the
      bio_for_each_segment_all() call (which uses bi_vcnt) returns immediately
      and random garbage is sent to the disk on a disk write.  The disk of
      course returns an error.
      Therefore, teach bio_integrity_process() to use bio_for_each_segment()
      to iterate the bio_vecs, since the per-bio iterator tracks which
      bio_vecs are associated with that particular bio.  The integrity
      handling code is effectively part of the "driver" (it's not the bio
      owner), so it must use the correct iterator function.
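      Why the choice of iterator matters for a cloned bio can be shown with a
      toy model. The struct and function names are hypothetical, and the
      per-bio iterator is reduced to a simple count; the real iterator tracks
      more state than this.

```c
#include <assert.h>
#include <stddef.h>

struct bio_model {
	unsigned int vcnt;        /* vectors owned; 0 in a clone, as described above */
	unsigned int iter_count;  /* per-bio iterator: vectors left to process */
};

/* Owner-style walk: driven by vcnt, so it visits nothing on a clone. */
static unsigned int visit_all(const struct bio_model *bio)
{
	unsigned int visited = 0;

	assert(bio != NULL);
	for (unsigned int i = 0; i < bio->vcnt; i++)
		visited++;
	return visited;
}

/* Driver-style walk: driven by the per-bio iterator. */
static unsigned int visit_by_iter(const struct bio_model *bio)
{
	unsigned int visited = 0;

	assert(bio != NULL);
	for (unsigned int i = 0; i < bio->iter_count; i++)
		visited++;
	return visited;
}
```

      For a clone with vcnt == 0 and three vectors described by its iterator,
      visit_all() processes none of them while visit_by_iter() processes all
      three, which mirrors the failure and the fix described above.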
      v2: Fix a compiler warning about abandoned local variables.  This
      patch supersedes "block: bio_integrity_process uses wrong bio_vec
      iterator".  Patch applies against 3.18-rc6.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 01 Dec, 2014 1 commit
  9. 24 Nov, 2014 5 commits
  10. 19 Nov, 2014 1 commit
  11. 17 Nov, 2014 2 commits
  12. 12 Nov, 2014 2 commits
  13. 11 Nov, 2014 3 commits
  14. 10 Nov, 2014 1 commit
  15. 04 Nov, 2014 1 commit
  16. 31 Oct, 2014 1 commit
  17. 29 Oct, 2014 2 commits
    • blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag · e167dfb5
      Jens Axboe authored
      Drivers can now tell blk-mq if they take advantage of the deferred
      issue through 'last' or not. If they do, don't do queue-direct
      for sync IO. This is a preparation patch for the nvme conversion.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: add a 'list' parameter to ->queue_rq() · 74c45052
      Jens Axboe authored
      Since we have the notion of a 'last' request in a chain, we can use
      this to have the hardware optimize the issuing of requests. Add
      a list_head parameter to queue_rq that the driver can use to
      temporarily store hw commands for issue when 'last' is true. If we
      are doing a chain of requests, pass in a NULL list for the first
      request to force issue of that immediately, then batch the remainder
      for deferred issue until the last request has been sent.
      Instead of adding yet another argument to the hot ->queue_rq path,
      encapsulate the passed arguments in a blk_mq_queue_data structure.
      This is passed as a constant, and has been tested as faster than
      passing 4 (or even 3) args through ->queue_rq. Update drivers for
      the new ->queue_rq() prototype. There are no functional changes
      in this patch for drivers - if they don't use the passed in list,
      then they will just queue requests individually like before.
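      How a driver might use the bundled arguments to batch commands can be
      sketched in userspace C. All types and names here are hypothetical
      models; in particular the 'pending' counter stands in for the list_head
      that the text describes, and queue_rq_model() is not a real driver hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct request_model { int id; };

/* Bundles the per-call arguments into one structure passed by pointer. */
struct queue_data_model {
	struct request_model *rq;
	int *pending;   /* stand-in for the list; NULL means issue immediately */
	bool last;      /* true for the final request of a chain */
};

/* Returns how many commands were sent to the hardware by this call. */
static int queue_rq_model(const struct queue_data_model *bd)
{
	assert(bd != NULL && bd->rq != NULL);

	if (bd->pending == NULL)
		return 1;            /* no list: issue this request now */

	(*bd->pending)++;            /* defer: park the command */
	if (!bd->last)
		return 0;

	int issued = *bd->pending;   /* last in chain: flush everything parked */
	*bd->pending = 0;
	return issued;
}
```

      In this model the first request of a chain is passed with a NULL list
      and goes out at once, and the remaining ones accumulate until the call
      with last set flushes them together.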
      Signed-off-by: Jens Axboe <axboe@fb.com>
  18. 23 Oct, 2014 2 commits
    • block: fix wrong error return in elevator_init() · d32f6b57
      Sudip Mukherjee authored
      While compiling, the integer err was reported as a set but unused
      variable. elevator_init_fn can be either cfq_init_queue,
      deadline_init_queue or noop_init_queue.
      All three of these functions return -ENOMEM if they fail to
      allocate the queue, so we should return the error code rather than
      always returning 0.
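      The before and after behaviour can be modelled in userspace C. The
      function names and the init_fn type are hypothetical stand-ins, not the
      elevator code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*init_fn)(void);   /* stand-in for an elevator init hook */

static int init_ok(void)  { return 0; }
static int init_oom(void) { return -ENOMEM; }

/* Before: the result is stored but ignored, so failure looks like success. */
static int elevator_init_old(init_fn fn)
{
	int err;

	assert(fn != NULL);
	err = fn();
	(void)err;      /* set but unused, as the compiler warned */
	return 0;
}

/* After: the error code from the init hook is propagated to the caller. */
static int elevator_init_new(init_fn fn)
{
	assert(fn != NULL);
	return fn();
}
```

      With a hook that fails, the old variant reports 0 and the new one
      reports -ENOMEM, so callers can react to the allocation failure.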
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND · 84ce0f0e
      Jan Kara authored
      When sg_scsi_ioctl() fails to prepare the request for submission in
      blk_rq_map_kern(), we jump to a label where we just end up copying a
      (luckily zeroed-out) kernel buffer to userspace instead of reporting
      the error. Fix the problem by jumping to the right label.
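      The effect of jumping to the right label can be shown with a small
      userspace model. The function, its parameters and the error value used
      are hypothetical; the sketch only demonstrates that the failure path
      must bypass the copy-out step and return the error.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * map_fails simulates the request-mapping step failing.
 * *copied_out records whether the zeroed buffer was copied to the caller.
 */
static int send_command_model(int map_fails, char *user_buf, size_t len,
			      int *copied_out)
{
	char kbuf[16] = {0};
	int err = 0;

	assert(len <= sizeof(kbuf));
	*copied_out = 0;

	if (map_fails) {
		err = -ENOMEM;          /* hypothetical error value */
		goto error;             /* skip the copy-out path entirely */
	}

	memcpy(user_buf, kbuf, len);    /* success path: hand data back */
	*copied_out = 1;
error:
	return err;
}
```

      On failure the caller's buffer is left untouched and the error code is
      returned, instead of a zeroed buffer being copied out as if the command
      had succeeded.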
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-scsi@vger.kernel.org
      CC: stable@vger.kernel.org
      Coverity-id: 1226871
      Signed-off-by: Jan Kara <jack@suse.cz>
      Fixed up the, now unused, out label.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 22 Oct, 2014 1 commit
  20. 21 Oct, 2014 1 commit
    • block: remove artifical max_hw_sectors cap · 34b48db6
      Christoph Hellwig authored
      Set max_sectors to the value the driver provides as hardware limit by
      default.  Linux has had proper I/O throttling for a long time and no
      longer relies on an artificially small maximum I/O size.  By not limiting
      the I/O size by default we remove an annoying tuning step required for
      most Linux installations.
      Note that both the user and, if absolutely required, the driver can still
      impose a limit for FS requests below max_hw_sectors_kb.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  21. 13 Oct, 2014 4 commits