1. 19 Nov, 2017 2 commits
  2. 06 Oct, 2017 1 commit
  3. 11 Sep, 2017 1 commit
  4. 20 Apr, 2017 5 commits
  5. 23 Mar, 2017 6 commits
    • bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() · b1c51afc
      Jan Kara authored
      Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
      from bdi_unregister() which is not necessarily called from bdi_destroy()
      and thus the name is somewhat misleading.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • bdi: Do not wait for cgwbs release in bdi_unregister() · 4514451e
      Jan Kara authored
      Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
      (called from bdi_unregister()). That is, however, unnecessary now that
      cgwb->bdi is a proper refcounted reference (so the bdi cannot get
      released before all cgwbs are released) and cgwb_bdi_destroy() shuts
      down writeback directly.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() · 5318ce7d
      Jan Kara authored
      Until now we waited for all cgwbs to get freed in cgwb_bdi_destroy(),
      which also meant that writeback had been shut down on them. Since this
      wait is going away, shut down writeback on cgwbs directly from
      cgwb_bdi_destroy() to avoid live writeback structures after
      bdi_unregister() has finished. To make that safe with concurrent
      shutdown from cgwb_release_workfn(), we also have to make sure
      wb_shutdown() returns only after the bdi_writeback structure is really
      shut down.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • bdi: Unify bdi->wb_list handling for root wb_writeback · e8cb72b3
      Jan Kara authored
      Currently root wb_writeback structure is added to bdi->wb_list in
      bdi_init() and never removed. That is different from all other
      wb_writeback structures which get added to the list when created and
      removed from it before wb_shutdown().
      
      So move list addition of root bdi_writeback to bdi_register() and list
      removal of all wb_writeback structures to wb_shutdown(). That way a
      wb_writeback structure is on bdi->wb_list if and only if it can handle
      writeback and it will make it easier for us to handle shutdown of all
      wb_writeback structures in bdi_unregister().
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • bdi: Make wb->bdi a proper reference · 810df54a
      Jan Kara authored
      Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
      structures except for the one embedded inside struct backing_dev_info.
      That will allow us to simplify bdi unregistration.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
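
The idea of a "proper refcounted reference" from wb to bdi can be illustrated with a small userspace model. This is only a sketch: the `sketch_*` names and the `freed` flag are hypothetical stand-ins, not the kernel's structures or helpers. It shows why a parent cannot be released while a child still points at it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct backing_dev_info. */
struct sketch_bdi {
    int refcnt;   /* number of live references to this bdi */
    int freed;    /* set when the last reference is dropped (models the free) */
};

/* Hypothetical stand-in for a non-root struct bdi_writeback. */
struct sketch_wb {
    struct sketch_bdi *bdi;   /* counted reference, not a bare back-pointer */
};

static void sketch_bdi_get(struct sketch_bdi *bdi)
{
    bdi->refcnt++;
}

static void sketch_bdi_put(struct sketch_bdi *bdi)
{
    assert(bdi->refcnt > 0);
    if (--bdi->refcnt == 0)
        bdi->freed = 1;
}

/* Creating a wb pins the bdi for the lifetime of the wb. */
static struct sketch_wb *sketch_wb_create(struct sketch_bdi *bdi)
{
    struct sketch_wb *wb = malloc(sizeof(*wb));

    if (!wb)
        return NULL;
    sketch_bdi_get(bdi);
    wb->bdi = bdi;
    return wb;
}

/* Releasing the wb drops the reference it held on the bdi. */
static void sketch_wb_release(struct sketch_wb *wb)
{
    struct sketch_bdi *bdi = wb->bdi;

    free(wb);
    sketch_bdi_put(bdi);
}
```

With this scheme the owner of the bdi can drop its own reference early, and the bdi is only marked freed once the last wb has been released.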
    • bdi: Mark congested->bdi as internal · b7d680d7
      Jan Kara authored
      The congested->bdi pointer is used only to be able to remove the congested
      structure from bdi->cgwb_congested_tree on structure release. Moreover,
      the pointer can become NULL when we unregister the bdi. Rename the field
      to __bdi and add a comment to make it more explicit that this is internal
      to the memcg writeback code and that people should not use the field, as
      such use is likely to be race-prone.
      
      We do not bother with converting congested->bdi to a proper refcounted
      reference. It would be slightly ugly to special-case bdi->wb.congested to
      avoid what is effectively a cyclic reference of the bdi to itself, and the
      reference gets cleared from bdi_unregister(), making it impossible to
      reference a freed bdi.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 08 Mar, 2017 2 commits
    • bdi: Fix use-after-free in wb_congested_put() · df23de55
      Jan Kara authored
      bdi_writeback_congested structures get created for each blkcg and bdi
      regardless of whether the bdi is registered or not. When they are created
      in an unregistered bdi and the request queue (and thus the bdi) is then
      destroyed while a blkg still holds a reference to the
      bdi_writeback_congested structure, this structure will be referencing a
      freed bdi and the last wb_congested_put() will try to remove the structure
      from the already freed bdi.
      
      With commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()", SCSI started to destroy bdis without calling
      bdi_unregister() first (previously it was calling bdi_unregister() even
      for unregistered bdis) and thus the code detaching
      bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
      started hitting this use-after-free bug. It is enough to boot a KVM
      instance with virtio-scsi device to trigger this behavior.
      
      Fix the problem by detaching bdi_writeback_congested structures in
      bdi_exit() instead of bdi_unregister(). This is also more logical as
      they can get attached to a bdi regardless of whether it ever got
      registered or not.
      
      Fixes: 165a5e22
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Allow bdi re-registration · b6f8fec4
      Jan Kara authored
      SCSI can call device_add_disk() several times for one request queue when
      a device is unbound and bound, creating a new gendisk each time. This will
      lead to bdi being repeatedly registered and unregistered. This was not a
      big problem until commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()" since bdi was only registered repeatedly (bdi_register()
      handles repeated calls fine, only we ended up leaking reference to
      gendisk due to overwriting bdi->owner) but unregistered only in
      blk_cleanup_queue() which didn't get called repeatedly. After
      165a5e22 we were doing correct bdi_register() - bdi_unregister()
      cycles however bdi_unregister() is not prepared for it. So make sure
      bdi_unregister() cleans up bdi in such a way that it is prepared for
      a possible following bdi_register() call.
      
      An easy way to provoke this behavior is to enable
      CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
      scsi disk which immediately hangs without this fix.
      
      Fixes: 165a5e22
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 23 Feb, 2017 1 commit
  8. 08 Feb, 2017 1 commit
    • block: fix double-free in the failure path of cgwb_bdi_init() · 5f478e4e
      Tejun Heo authored
      When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested
      at bdi->wb_congested.  cgwb_bdi_init() allocates it with kzalloc() and
      doesn't do further initialization.  This usually works fine as the
      reference count gets bumped to 1 by wb_init() and the put from
      wb_exit() releases it.
      
      However, when wb_init() fails, it puts the wb base ref automatically
      freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
      ends up trying to free the same pointer the second time causing a
      double-free.
      
      Fix it by explicitly initializing the refcnt to 1 and putting the base
      ref from cgwb_bdi_destroy().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: Jens Axboe <axboe@fb.com>
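
The double-free described above can be modelled in userspace by counting how often an object would be released. This is a simplified sketch under stated assumptions: the `sketch_*` functions are hypothetical, `free_calls` stands in for the actual free, and the real kernel code paths are more involved.

```c
#include <assert.h>

/* Hypothetical stand-in for the root bdi_writeback_congested. */
struct sketch_congested {
    int refcnt;
    int free_calls;   /* counts releases; more than one means a double free */
};

static void sketch_get(struct sketch_congested *c)
{
    c->refcnt++;
}

static void sketch_put(struct sketch_congested *c)
{
    if (--c->refcnt == 0)
        c->free_calls++;
}

/* Models wb_init(): takes a reference and drops it again on failure. */
static int sketch_wb_init(struct sketch_congested *c, int fail)
{
    sketch_get(c);
    if (fail) {
        sketch_put(c);
        return -1;
    }
    return 0;
}

/* Old scheme: refcnt starts at 0 and the error path frees directly. */
static int sketch_init_old(struct sketch_congested *c, int fail)
{
    c->refcnt = 0;
    c->free_calls = 0;
    if (sketch_wb_init(c, fail) < 0) {
        c->free_calls++;   /* models the explicit kfree() in the error path */
        return -1;
    }
    return 0;
}

/* Fixed scheme: a base reference of 1 that is only ever released by put. */
static int sketch_init_fixed(struct sketch_congested *c, int fail)
{
    c->refcnt = 1;
    c->free_calls = 0;
    if (sketch_wb_init(c, fail) < 0) {
        sketch_put(c);     /* drop the base ref instead of freeing directly */
        return -1;
    }
    return 0;
}
```

In the old scheme a failing init releases the object twice (once through the put, once explicitly); in the fixed scheme every path releases it exactly once, because the base reference keeps the object alive until it is explicitly put.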
  9. 02 Feb, 2017 1 commit
  10. 08 Nov, 2016 1 commit
  11. 04 Aug, 2016 1 commit
    • block: fix bdi vs gendisk lifetime mismatch · df08c32c
      Dan Williams authored
      The name for a bdi of a gendisk is derived from the gendisk's devt.
      However, since the gendisk is destroyed before the bdi it leaves a
      window where a new gendisk could dynamically reuse the same devt while a
      bdi with the same name is still live.  Arrange for the bdi to hold a
      reference against its "owner" disk device while it is registered.
      Otherwise we can hit sysfs duplicate name collisions like the following:
      
       WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
       sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'
      
       Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
        0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
        ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
        0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
       Call Trace:
        [<ffffffff8134caec>] dump_stack+0x63/0x87
        [<ffffffff8108c351>] __warn+0xd1/0xf0
        [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
        [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
        [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
        [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
        [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
        [<ffffffff8134ff55>] kobject_add+0x75/0xd0
        [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
        [<ffffffff8148b0a5>] device_add+0x125/0x610
        [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
        [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
        [<ffffffff811b775c>] bdi_register+0x8c/0x180
        [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
        [<ffffffff813317f5>] add_disk+0x175/0x4a0
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Yi Zhang <yizhan@redhat.com>
      Tested-by: Yi Zhang <yizhan@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      
      Fixed up missing 0 return in bdi_register_owner().
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 28 Jul, 2016 1 commit
    • mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman authored
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate, with the potential corner case being
         that highmem pages have to be scanned and reclaimed to free lowmem slab
         pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 21 May, 2016 1 commit
    • mm: throttle on IO only when there are too many dirty and writeback pages · ede37713
      Michal Hocko authored
      wait_iff_congested has been used to throttle allocator before it retried
      another round of direct reclaim to allow the writeback to make some
      progress and prevent reclaim from looping over dirty/writeback pages
      without making any progress.
      
      We used to do congestion_wait before commit 0e093d99 ("writeback: do
      not sleep on the congestion queue if there are no congested BDIs or if
      significant congestion is not being encountered in the current zone")
      but that led to undesirable stalls and sleeping for the full timeout
      even when the BDI wasn't congested.  Hence wait_iff_congested was used
      instead.
      
      But it seems that even wait_iff_congested doesn't work as expected.  We
      might have a small file LRU list with all pages dirty/writeback and yet
      the bdi is not congested so this is just a cond_resched in the end and
      can end up triggering premature OOM.
      
      This patch replaces the unconditional wait_iff_congested by
      congestion_wait which is executed only if we _know_ that the last round
      of direct reclaim didn't make any progress and dirty+writeback pages are
      more than a half of the reclaimable pages on the zone which might be
      usable for our target allocation.  This shouldn't reintroduce stalls
      fixed by 0e093d99 because congestion_wait is called only when we are
      getting hopeless when sleeping is a better choice than OOM with many
      pages under IO.
      
      We have to preserve logic introduced by commit 373ccbe5 ("mm,
      vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
      progress") into the __alloc_pages_slowpath now that wait_iff_congested
      is not used anymore.  As the only remaining user of wait_iff_congested
      is shrink_inactive_list we can remove the WQ specific short sleep from
      wait_iff_congested because the sleep is needed to be done only once in
      the allocation retry cycle.
      
      [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
      Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
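
The throttling condition described in this commit message can be written as a small predicate. This is a sketch, assuming page counts are passed in as plain numbers; `sketch_should_wait_for_io` is a hypothetical name and the kernel reads these values from zone statistics rather than from arguments.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the allocator should sleep waiting for IO before retrying.
 * Sleep only when the last round of direct reclaim made no progress and
 * dirty plus writeback pages are more than half of the reclaimable pages.
 */
static bool sketch_should_wait_for_io(bool made_progress,
                                      unsigned long dirty,
                                      unsigned long writeback,
                                      unsigned long reclaimable)
{
    if (made_progress)
        return false;

    /* "more than a half" expressed without division */
    return 2 * (dirty + writeback) > reclaimable;
}
```

For example, with 100 reclaimable pages and no reclaim progress, 30 dirty plus 30 writeback pages trigger the wait, while 20 plus 20 do not.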
  14. 31 Mar, 2016 1 commit
  15. 17 Mar, 2016 1 commit
  16. 12 Feb, 2016 1 commit
  17. 06 Feb, 2016 1 commit
    • mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress · 564e81a5
      Tetsuo Handa authored
      Jan Stancek has reported that the system occasionally hangs after the
      "oom01" testcase from LTP triggers OOM.  Judging from the fact that there
      is a kworker thread doing memory allocation and that the values of "Node 0
      Normal free:" and "Node 0 Normal:" differ when hanging, vmstat is not
      up-to-date for some reason.
      
      Commit 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory
      reclaim doesn't make any progress") was meant to force the kworker
      thread to take a short sleep, but it mistakenly used
      schedule_timeout(1).  We missed that schedule_timeout() in state
      TASK_RUNNING doesn't do anything.
      
      Fix it by using schedule_timeout_uninterruptible(1) which forces the
      kworker thread to take a short sleep in order to make sure that vmstat
      is up-to-date.
      
      Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 15 Jan, 2016 1 commit
  19. 12 Dec, 2015 1 commit
    • mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 373ccbe5
      Michal Hocko authored
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by an internal dependency of direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stale numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies on the worker having to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let the wq concurrency code know that we are in trouble, so we have to do
      a short sleep.  To avoid reintroducing the issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep to worker threads, which are
      the ones of interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 07 Nov, 2015 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
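
The flag split described in this commit message can be sketched with plain bit masks. The `SK_*` names, the bit positions and the composition of `SK_ATOMIC_ALLOC` are illustrative assumptions, not the kernel's definitions. `sketch_allow_blocking()` mirrors the idea behind gfpflags_allow_blocking(): a caller may block only if it is willing to enter direct reclaim.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sketch_gfp_t;

/* Bit positions are illustrative only, not the kernel's actual values. */
#define SK_GFP_HIGH            (1u << 0)  /* high priority reserve */
#define SK_GFP_ATOMIC          (1u << 1)  /* truly atomic, no alternative */
#define SK_GFP_DIRECT_RECLAIM  (1u << 2)  /* may sleep in direct reclaim */
#define SK_GFP_KSWAPD_RECLAIM  (1u << 3)  /* may wake kswapd */

/* The old "wait" meaning: direct reclaim plus waking kswapd. */
#define SK_GFP_WAIT  (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Assumed composition of an atomic request for this sketch. */
#define SK_ATOMIC_ALLOC  (SK_GFP_HIGH | SK_GFP_ATOMIC | SK_GFP_KSWAPD_RECLAIM)

/* A caller may block only if it is willing to enter direct reclaim. */
static bool sketch_allow_blocking(sketch_gfp_t flags)
{
    return (flags & SK_GFP_DIRECT_RECLAIM) != 0;
}

static bool sketch_wakes_kswapd(sketch_gfp_t flags)
{
    return (flags & SK_GFP_KSWAPD_RECLAIM) != 0;
}

/*
 * A mempool-backed caller clears direct reclaim but keeps the kswapd
 * wakeup, so it never sleeps yet background reclaim still gets started.
 */
static sketch_gfp_t sketch_mempool_flags(sketch_gfp_t flags)
{
    return flags & ~SK_GFP_DIRECT_RECLAIM;
}
```

Testing a single combined "wait" bit can no longer answer "may this caller sleep?", which is why the message recommends a helper that checks the direct-reclaim bit specifically.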
  21. 21 Oct, 2015 1 commit
  22. 15 Oct, 2015 1 commit
    • block: don't release bdi while request_queue has live references · b02176f3
      Tejun Heo authored
      bdi's are initialized in two steps, bdi_init() and bdi_register(), but
      destroyed in a single step by bdi_destroy() which, for a bdi embedded
      in a request_queue, is called during blk_cleanup_queue() which makes
      the queue invisible and starts the draining of remaining usages.
      
      A request_queue's user can access the congestion state of the embedded
      bdi as long as it holds a reference to the queue.  As such, it may
      access the congested state of a queue which finished
      blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
      Because the congested state was embedded in backing_dev_info which in
      turn is embedded in request_queue, accessing the congested state after
      bdi_destroy() was called was fine.  The bdi was destroyed but the
      memory region for the congested state remained accessible till the
      queue got released.
      
      a13f35e8 ("writeback: don't embed root bdi_writeback_congested in
      bdi_writeback") changed the situation.  Now, the root congested state
      which is expected to be pinned while request_queue remains accessible
      is separately reference counted and the base ref is put during
      bdi_destroy().  This means that the root congested state may go away
      prematurely while the queue is between bdi_destroy() and
      blk_cleanup_queue(), which was detected by Andrey's KASAN tests.
      
      The root cause of this problem is that bdi doesn't distinguish the two
      steps of destruction, unregistration and release, and now the root
      congested state actually requires a separate release step.  To fix the
      issue, this patch separates out bdi_unregister() and bdi_exit() from
      bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
      and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
      simple wrapper calling the two steps back-to-back.
      
      While at it, the prototype of bdi_destroy() is moved right below
      bdi_setup_and_register() so that the counterpart operations are
      located together.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
      Cc: stable@vger.kernel.org # v4.2+
      Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com>
      Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
      Reviewed-by: Jan Kara <jack@suse.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
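
The two-step teardown described above (unregister first, release later) can be modelled as a small state sketch. The `sketch_*` functions and the two boolean fields are hypothetical simplifications; they only illustrate that the congested state stays valid between the two steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of a bdi's lifetime state. */
struct sketch_bdi {
    bool registered;        /* visible to writeback and sysfs */
    bool congested_valid;   /* root congested state is still pinned */
};

static void sketch_bdi_init(struct sketch_bdi *bdi)
{
    bdi->registered = false;
    bdi->congested_valid = true;
}

static void sketch_bdi_register(struct sketch_bdi *bdi)
{
    bdi->registered = true;
}

/* Step 1: in the commit, called from blk_cleanup_queue(). */
static void sketch_bdi_unregister(struct sketch_bdi *bdi)
{
    bdi->registered = false;
}

/* Step 2: in the commit, called from blk_release_queue(). */
static void sketch_bdi_exit(struct sketch_bdi *bdi)
{
    bdi->congested_valid = false;
}

/* Convenience wrapper that runs both steps back-to-back. */
static void sketch_bdi_destroy(struct sketch_bdi *bdi)
{
    sketch_bdi_unregister(bdi);
    sketch_bdi_exit(bdi);
}
```

Between `sketch_bdi_unregister()` and `sketch_bdi_exit()` the bdi is no longer registered, but a holder of a queue reference can still read the congested state safely.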
  23. 12 Oct, 2015 1 commit
    • writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo authored
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds a RCU protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
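
The difference between walking only the active tree and walking a list of all wb's can be shown with a toy model. This is a sketch under simplifying assumptions: an array stands in for both the tree and the RCU list, and the `sketch_*` names and fields are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bdi_writeback. */
struct sketch_wb {
    bool in_tree;               /* indexed as active; cleared when dying starts */
    bool on_list;               /* linked from creation until final release */
    unsigned int dirty_inodes;  /* dirty inodes still hosted by this wb */
};

/* Old iteration: only wb's still indexed in the active tree are visited. */
static unsigned int sketch_flush_via_tree(const struct sketch_wb *wbs, size_t n)
{
    unsigned int flushed = 0;

    for (size_t i = 0; i < n; i++)
        if (wbs[i].in_tree)
            flushed += wbs[i].dirty_inodes;
    return flushed;
}

/* New iteration: every wb that has not been released yet is visited. */
static unsigned int sketch_flush_via_list(const struct sketch_wb *wbs, size_t n)
{
    unsigned int flushed = 0;

    for (size_t i = 0; i < n; i++)
        if (wbs[i].on_list)
            flushed += wbs[i].dirty_inodes;
    return flushed;
}
```

A dying wb has `in_tree == false` but `on_list == true`, so only the list-based walk accounts for its dirty inodes, which is the sync(2) gap the commit closes.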
  24. 18 Aug, 2015 2 commits
    • blkcg: rename subsystem name from blkio to io · c165b3e3
      Tejun Heo authored
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      interface.
      
      As the first step, this patch renames the external visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • inode: rename i_wb_list to i_io_list · c7f54084
      Dave Chinner authored
      There's a small consistency problem between the inode and writeback
      naming. Writeback calls the "for IO" inode queues b_io and
      b_more_io, but the inode calls these the "writeback list" or
      i_wb_list. This makes it hard to add a new "under writeback" list to
      the inode, or call it an "under IO" list on the bdi because either
      way we'll have writeback on IO and IO on writeback and it'll just be
      confusing. I'm getting confused just writing this!
      
      So, rename the inode "for IO" list variable to i_io_list so we can
      add a new "writeback list" in a subsequent patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      c7f54084
  25. 02 Jul, 2015 2 commits
    • Tejun Heo's avatar
      writeback: don't drain bdi_writeback_congested on bdi destruction · a20135ff
      Tejun Heo authored
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      This patch fixes the bug by updating bdi destruction to not wait for
      cong's to drain.  A cong is unlinked from bdi->cgwb_congested_tree on
      bdi destruction regardless of its reference count as the bdi may go
      away at any point after destruction.  wb_congested_put() checks whether
      the cong is already unlinked on release.
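
      The unlink-on-destruction scheme described above can be sketched
      as a small single-threaded userspace model.  This is only an
      illustration under stated assumptions, not the kernel code: every
      model_* name is hypothetical, a linked list stands in for
      bdi->cgwb_congested_tree, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; names are illustrative, not the kernel's. */
struct model_bdi;

struct model_cong {
	int refcnt;
	struct model_bdi *bdi;		/* NULL once unlinked from the bdi */
	struct model_cong *next;	/* stand-in for the rb-tree linkage */
};

struct model_bdi {
	struct model_cong *congs;	/* stand-in for cgwb_congested_tree */
};

/* Allocate a cong with one reference and index it on the bdi. */
struct model_cong *model_cong_create(struct model_bdi *bdi)
{
	struct model_cong *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->refcnt = 1;
	c->bdi = bdi;
	c->next = bdi->congs;
	bdi->congs = c;
	return c;
}

/* Drop a reference; returns true if the cong was freed. */
bool model_cong_put(struct model_cong *c)
{
	if (--c->refcnt > 0)
		return false;

	/* Only touch the bdi if destruction has not unlinked us already. */
	if (c->bdi) {
		struct model_cong **pp = &c->bdi->congs;

		while (*pp && *pp != c)
			pp = &(*pp)->next;
		if (*pp)
			*pp = c->next;
	}
	free(c);
	return true;
}

/* Unlink every cong regardless of refcount; do not wait for them to drain. */
void model_bdi_destroy(struct model_bdi *bdi)
{
	struct model_cong *c = bdi->congs;

	while (c) {
		struct model_cong *next = c->next;

		c->bdi = NULL;
		c->next = NULL;
		c = next;
	}
	bdi->congs = NULL;
}
```

      In this model a cong that outlives its bdi is still freed by its
      last put, and that put never dereferences the bdi because the
      back pointer was cleared at destruction time.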
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
      Fixes: 52ebea74 ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
      Tested-by: Jon Christopherson <jon@jons.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a20135ff
    • Tejun Heo's avatar
      writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8
      Tejun Heo authored
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      To fix the bug, bdi destruction will be updated to not wait for cong's
      to drain, which naturally means that cong's may outlive the associated
      bdi.  This is fine for non-root cong's but is problematic for the root
      cong's which are embedded in their bdi's as they may end up getting
      dereferenced after the containing bdi's are freed.
      
      This patch makes root cong's behave the same as non-root cong's.  They
      are no longer embedded in their bdi's but allocated separately during
      bdi initialization, indexed and reference counted the same way.
      
      * As cong handling is the same for all wb's, wb->congested
        initialization is moved into wb_init().
      
      * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
        bdi->wb_congested is now a pointer pointing to the root cong
        allocated during bdi init and minimal refcnting operations are
        implemented.
      
      * The above makes root wb init paths diverge depending on
        CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().
      
      This patch in itself shouldn't cause any consequential behavior
      differences but prepares for the actual fix.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
      Tested-by: Jon Christopherson <jon@jons.org>
      
      Added <linux/slab.h> include to backing-dev.h for kfree() definition.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a13f35e8
  26. 04 Jun, 2015 1 commit
  27. 02 Jun, 2015 1 commit
    • Tejun Heo's avatar
      writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f
      Tejun Heo authored
      Currently, the majority of cgroup writeback support, including all the
      above functions, is implemented in include/linux/backing-dev.h and
      mm/backing-dev.c; however, the portion closely related to writeback
      logic, implemented in include/linux/writeback.h and mm/page-writeback.c,
      will expand to support foreign writeback detection and correction.
      
      This patch moves wb[_try]_get() and wb_put() to
      include/linux/backing-dev-defs.h so that they can be used from
      writeback.h and inode_{attach|detach}_wb() to writeback.h and
      page-writeback.c.
      
      This is pure reorganization and doesn't introduce any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      21c6321f