1. 04 Aug, 2016 1 commit
    • Dan Williams's avatar
      block: fix bdi vs gendisk lifetime mismatch · df08c32c
      Dan Williams authored
      The name for a bdi of a gendisk is derived from the gendisk's devt.
      However, since the gendisk is destroyed before the bdi it leaves a
      window where a new gendisk could dynamically reuse the same devt while a
      bdi with the same name is still live.  Arrange for the bdi to hold a
      reference against its "owner" disk device while it is registered.
      Otherwise we can hit sysfs duplicate name collisions like the following:
      
       WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
       sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'
      
       Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
        0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
        ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
        0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
       Call Trace:
        [<ffffffff8134caec>] dump_stack+0x63/0x87
        [<ffffffff8108c351>] __warn+0xd1/0xf0
        [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
        [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
        [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
        [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
        [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
        [<ffffffff8134ff55>] kobject_add+0x75/0xd0
        [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
        [<ffffffff8148b0a5>] device_add+0x125/0x610
        [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
        [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
        [<ffffffff811b775c>] bdi_register+0x8c/0x180
        [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
        [<ffffffff813317f5>] add_disk+0x175/0x4a0
      
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarYi Zhang <yizhan@redhat.com>
      Tested-by: default avatarYi Zhang <yizhan@redhat.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      
      Fixed up missing 0 return in bdi_register_owner().
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      df08c32c
  2. 04 Apr, 2016 1 commit
  3. 12 Oct, 2015 1 commit
    • Tejun Heo's avatar
      writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo authored
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds a RCU protected @bdi->wb_list which lists all wb's
      beloinging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarArtem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b817525a
  4. 02 Jul, 2015 1 commit
    • Tejun Heo's avatar
      writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8
      Tejun Heo authored
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      To fix the bug, bdi destruction will be updated to not wait for cong's
      to drain, which naturally means that cong's may outlive the associated
      bdi.  This is fine for non-root cong's but is problematic for the root
      cong's which are embedded in their bdi's as they may end up getting
      dereferenced after the containing bdi's are freed.
      
      This patch makes root cong's behave the same as non-root cong's.  They
      are no longer embedded in their bdi's but allocated separately during
      bdi initialization, indexed and reference counted the same way.
      
      * As cong handling is the same for all wb's, wb->congested
        initialization is moved into wb_init().
      
      * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
        bdi->wb_congested is now a pointer pointing to the root cong
        allocated during bdi init and minimal refcnting operations are
        implemented.
      
      * The above makes root wb init paths diverge depending on
        CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().
      
      This patch in itself shouldn't cause any consequential behavior
      differences but prepares for the actual fix.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681Tested-by: default avatarJon Christopherson <jon@jons.org>
      
      Added <linux/slab.h> include to backing-dev.h for kfree() definition.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a13f35e8
  5. 02 Jun, 2015 11 commits
    • Tejun Heo's avatar
      writeback: disassociate inodes from dying bdi_writebacks · e8a7abf5
      Tejun Heo authored
      For the purpose of foreign inode detection, wb's (bdi_writeback's) are
      identified by the associated memcg ID.  As we create a separate wb for
      each memcg, this is enough to identify the active wb's; however, when
      blkcg is enabled or disabled higher up in the hierarchy, the mapping
      between memcg and blkcg changes which in turn creates a new wb to
      service the new mapping.  The old wb is unlinked from index and
      released after all references are drained.  The foreign inode
      detection logic can't detect this condition because both the old and
      new wb's point to the same memcg and thus never decides to move inodes
      attached to the old wb to the new one.
      
      This patch adds logic to initiate switching immediately in
      wbc_attach_and_unlock_inode() if the associated wb is dying.  We can
      make the usual foreign detection logic to distinguish the different
      wb's mapped to the memcg but the dying wb is never gonna be in active
      service again and there's no point in tracking the usage history and
      reaching the switch verdict after enough data points are collected.
      It's already known that the wb has to be switched.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e8a7abf5
    • Tejun Heo's avatar
      writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f
      Tejun Heo authored
      Currently, majority of cgroup writeback support including all the
      above functions are implemented in include/linux/backing-dev.h and
      mm/backing-dev.c; however, the portion closely related to writeback
      logic implemented in include/linux/writeback.h and mm/page-writeback.c
      will expand to support foreign writeback detection and correction.
      
      This patch moves wb[_try]_get() and wb_put() to
      include/linux/backing-dev-defs.h so that they can be used from
      writeback.h and inode_{attach|detach}_wb() to writeback.h and
      page-writeback.c.
      
      This is pure reorganization and doesn't introduce any functional
      changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      21c6321f
    • Tejun Heo's avatar
      writeback: implement memcg wb_domain · 841710aa
      Tejun Heo authored
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportinately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      841710aa
    • Tejun Heo's avatar
      writeback: implement bdi_wait_for_completion() · cc395d7f
      Tejun Heo authored
      If the completion of a wb_writeback_work can be waited upon by setting
      its ->done to a struct completion and waiting on it; however, for
      cgroup writeback support, it's necessary to issue multiple work items
      to multiple bdi_writebacks and wait for the completion of all.
      
      This patch implements wb_completion which can wait for multiple work
      items and replaces the struct completion with it.  It can be defined
      using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
      waited for by wb_wait_for_completion().
      
      Nobody currently issues multiple work items and this patch doesn't
      introduce any behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      cc395d7f
    • Tejun Heo's avatar
      writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65
      Tejun Heo authored
      bdi_has_dirty_io() used to only reflect whether the root wb
      (bdi_writeback) has dirty inodes.  For cgroup writeback support, it
      needs to take all active wb's into account.  If any wb on the bdi has
      dirty inodes, bdi_has_dirty_io() should return true.
      
      To achieve that, as inode_wb_list_{move|del}_locked() now keep track
      of the dirty state transition of each wb, the number of dirty wbs can
      be counted in the bdi; however, bdi is already aggregating
      wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
      there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
      dip below 1.  bdi_has_dirty_io() can simply test whether
      bdi->tot_write_bandwidth is zero or not.
      
      While this bumps the value of wb->avg_write_bandwidth to one when it
      used to be zero, this shouldn't cause any meaningful behavior
      difference.
      
      bdi_has_dirty_io() is made an inline function which tests whether
      ->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
      value are added to inode_wb_list_{move|del}_locked().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      95a46c65
    • Tejun Heo's avatar
      writeback: implement backing_dev_info->tot_write_bandwidth · 766a9d6e
      Tejun Heo authored
      cgroup writeback support needs to keep track of the sum of
      avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
      distribute write workload.  This patch adds bdi->tot_write_bandwidth
      and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
      and wb_update_write_bandwidth() to adjust it as wb's gain and lose
      dirty inodes and its avg_write_bandwidth gets updated.
      
      As the update events are not synchronized with each other,
      bdi->tot_write_bandwidth is an atomic_long_t.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      766a9d6e
    • Tejun Heo's avatar
      writeback: implement WB_has_dirty_io wb_state flag · d6c10f1f
      Tejun Heo authored
      Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
      has any dirty inode by testing all three IO lists on each invocation
      without actively keeping track.  For cgroup writeback support, a
      single bdi will host multiple wb's each of which will host dirty
      inodes separately and we'll need to make bdi_has_dirty_io(), which
      currently only represents the root wb, aggregate has_dirty_io from all
      member wb's, which requires tracking transitions in has_dirty_io state
      on each wb.
      
      This patch introduces inode_wb_list_{move|del}_locked() to consolidate
      IO list operations leaving queue_io() the only other function which
      directly manipulates IO lists (via move_expired_inodes()).  All three
      functions are updated to call wb_io_lists_[de]populated() which keep
      track of whether the wb has dirty inodes or not and record it using
      the new WB_has_dirty_io flag.  inode_wb_list_moved_locked()'s return
      value indicates whether the wb had no dirty inodes before.
      
      mark_inode_dirty() is restructured so that the return value of
      inode_wb_list_move_locked() can be used for deciding whether to wake
      up the wb.
      
      While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
      These functions were returning 0 and 1 before.  Also, add a comment
      explaining the synchronization of wb_state flags.
      
      v2: Updated to accommodate b_dirty_time.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d6c10f1f
    • Tejun Heo's avatar
      writeback: make congestion functions per bdi_writeback · ec8a6f26
      Tejun Heo authored
      Currently, all congestion functions take bdi (backing_dev_info) and
      always operate on the root wb (bdi->wb) and the congestion state from
      the block layer is propagated only for the root blkcg.  This patch
      introduces {set|clear}_wb_congested() and wb_congested() which take a
      bdi_writeback_congested and bdi_writeback respectively.  The bdi
      counteparts are now wrappers invoking the wb based functions on
      @bdi->wb.
      
      While converting clear_bdi_congested() to clear_wb_congested(), the
      local variable declaration order between @wqh and @bit is swapped for
      cosmetic reason.
      
      This patch just adds the new wb based functions.  The following
      patches will apply them.
      
      v2: Updated for bdi_writeback_congested.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ec8a6f26
    • Tejun Heo's avatar
      writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo authored
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      52ebea74
    • Tejun Heo's avatar
      bdi: separate out congested state into a separate struct · 4aa9c692
      Tejun Heo authored
      Currently, a wb's (bdi_writeback) congestion state is carried in its
      ->state field; however, cgroup writeback support will require multiple
      wb's sharing the same congestion state.  This patch separates out
      congestion state into its own struct - struct bdi_writeback_congested.
      A new field wb field, wb_congested, points to its associated congested
      struct.  The default wb, bdi->wb, always points to bdi->wb_congested.
      
      While this patch adds a layer of indirection, it doesn't introduce any
      behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      4aa9c692
    • Tejun Heo's avatar
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo authored
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      66114cad