1. 22 Nov, 2017 1 commit
  2. 17 Nov, 2017 1 commit
  3. 26 Oct, 2017 1 commit
    • Byungchul Park's avatar
      block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion() · e319e1fb
      Byungchul Park authored
      Darrick posted the following warning and Dave Chinner analyzed it:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.14.0-rc1-fixes #1 Tainted: G        W
      > ------------------------------------------------------
      > loop0/31693 is trying to acquire lock:
      >  (&(&ip->i_mmaplock)->mr_lock){++++}, at: [<ffffffffa00f1b0c>] xfs_ilock+0x23c/0x330 [xfs]
      >
      > but now in release context of a crosslock acquired at the following:
      >  ((complete)&ret.event){+.+.}, at: [<ffffffff81326c1f>] submit_bio_wait+0x7f/0xb0
      >
      > which lock already depends on the new lock.
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #2 ((complete)&ret.event){+.+.}:
      >        lock_acquire+0xab/0x200
      >        wait_for_completion_io+0x4e/0x1a0
      >        submit_bio_wait+0x7f/0xb0
      >        blkdev_issue_zeroout+0x71/0xa0
      >        xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
      >        xfs_bmapi_write+0x374/0x11f0 [xfs]
      >        xfs_iomap_write_direct+0x2ac/0x430 [xfs]
      >        xfs_file_iomap_begin+0x20d/0xd50 [xfs]
      >        iomap_apply+0x43/0xe0
      >        dax_iomap_rw+0x89/0xf0
      >        xfs_file_dax_write+0xcc/0x220 [xfs]
      >        xfs_file_write_iter+0xf0/0x130 [xfs]
      >        __vfs_write+0xd9/0x150
      >        vfs_write+0xc8/0x1c0
      >        SyS_write+0x45/0xa0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #1 (&xfs_nondir_ilock_class){++++}:
      >        lock_acquire+0xab/0x200
      >        down_write_nested+0x4a/0xb0
      >        xfs_ilock+0x263/0x330 [xfs]
      >        xfs_setattr_size+0x152/0x370 [xfs]
      >        xfs_vn_setattr+0x6b/0x90 [xfs]
      >        notify_change+0x27d/0x3f0
      >        do_truncate+0x5b/0x90
      >        path_openat+0x237/0xa90
      >        do_filp_open+0x8a/0xf0
      >        do_sys_open+0x11c/0x1f0
      >        entry_SYSCALL_64_fastpath+0x1f/0xbe
      >
      > -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
      >        up_write+0x1c/0x40
      >        xfs_iunlock+0x1d0/0x310 [xfs]
      >        xfs_file_fallocate+0x8a/0x310 [xfs]
      >        loop_queue_work+0xb7/0x8d0
      >        kthread_worker_fn+0xb9/0x1f0
      >
      > Chain exists of:
      >   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
      >
      >  Possible unsafe locking scenario by crosslock:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(&xfs_nondir_ilock_class);
      >   lock((complete)&ret.event);
      >                                lock(&(&ip->i_mmaplock)->mr_lock);
      >                                unlock((complete)&ret.event);
      >
      >                *** DEADLOCK ***
      
      The warning is a false positive, caused by the fact that all
      wait_for_completion()s in submit_bio_wait() are waiting with the same
      lock class.
      
      However, some bios have nothing to do with others, for example in the case
      of loop devices, there's no direct connection between the bios of an upper
      device and the bios of a lower device(=loop device).
      
      The safest way to assign different lock classes to different devices is
      to do it for each gendisk. In other words, this patch assigns a
      lockdep_map per gendisk and uses it when initializing completion in
      submit_bio_wait().
      Analyzed-by: default avatarDave Chinner <david@fromorbit.com>
      Reported-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarByungchul Park <byungchul.park@lge.com>
      Reviewed-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e319e1fb
  4. 25 Oct, 2017 1 commit
  5. 16 Oct, 2017 1 commit
  6. 11 Oct, 2017 14 commits
  7. 06 Oct, 2017 1 commit
  8. 26 Sep, 2017 1 commit
  9. 25 Aug, 2017 1 commit
  10. 23 Aug, 2017 1 commit
    • Christoph Hellwig's avatar
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      74d46992
  11. 09 Aug, 2017 1 commit
  12. 02 Aug, 2017 1 commit
  13. 10 Jul, 2017 1 commit
    • Shaohua Li's avatar
      block: call bio_uninit in bio_endio · b222dd2f
      Shaohua Li authored
      bio_free isn't a good place to free cgroup info. There are a
      lot of cases bio is allocated in special way (for example, in stack) and
      never gets called by bio_put hence bio_free, we are leaking memory. This
      patch moves the free to bio endio, which should be called anyway. The
      bio_uninit call in bio_free is kept, in case the bio never gets called
      bio endio.
      
      This assumes ->bi_end_io() doesn't access cgroup info, which seems true
      in my audit.
      
      This along with Christoph's integrity patch should fix the memory leak
      issue.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b222dd2f
  14. 03 Jul, 2017 3 commits
  15. 28 Jun, 2017 1 commit
    • Jens Axboe's avatar
      block: provide bio_uninit() free freeing integrity/task associations · 9ae3b3f5
      Jens Axboe authored
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      code doesn't free memory allocated by bio_integrity_alloc. The patch
      fixes the issue. HTX has been run with +60 hours without failure."
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() and export it, so that we can use
      it to free this data. Note that this is a minimal fix for this issue.
      Any current user of bio's that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some drivers.
      We will fix those in a more comprehensive patch for 4.13. This also
      means that the commit marked as being fixed by this isn't the real
      culprit, it's just the most obvious one out there.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: default avatarWen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9ae3b3f5
  16. 27 Jun, 2017 1 commit
  17. 18 Jun, 2017 3 commits
  18. 16 Jun, 2017 1 commit
  19. 09 Jun, 2017 1 commit
  20. 11 Apr, 2017 1 commit
  21. 07 Apr, 2017 1 commit
    • NeilBrown's avatar
      block: trace completion of all bios. · fbbaf700
      NeilBrown authored
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handle directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this was can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bi_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      fbbaf700
  22. 28 Mar, 2017 1 commit
    • Shaohua Li's avatar
      blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li authored
      A cgroup gets assigned a low limit, but the cgroup could never dispatch
      enough IO to cross the low limit. In such case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to low limit. This is unfair for other cgroups. We should
      treat the cgroup idle and upgrade the state machine to lower state.
      
      We also have a downgrade logic. If the state machine upgrades because of
      cgroup idle (real idle), the state machine will downgrade soon as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
      when queue gets upgraded to lower state, other cgroups could dispatch
      more IO and this cgroup can't dispatch enough IO, so the cgroup is below
      its low limit and looks like idle (fake idle). In this case, the queue
      should downgrade soon. The key to determine if we should do downgrade is
      to detect if cgroup is truely idle.
      
      Unfortunately it's very hard to determine if a cgroup is real idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, eg,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the latter patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, should not a big deal.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9e234eea
  23. 25 Mar, 2017 1 commit