1. 29 Feb, 2012 1 commit
  2. 01 Feb, 2012 1 commit
  3. 08 Jan, 2012 1 commit
  4. 04 Jan, 2012 1 commit
  5. 18 Dec, 2011 2 commits
  6. 29 Nov, 2011 1 commit
  7. 21 Nov, 2011 1 commit
    • Tejun Heo's avatar
      freezer: implement and use kthread_freezable_should_stop() · 8a32c441
      Tejun Heo authored
      Writeback and thinkpad_acpi have been using thaw_process() to prevent
      deadlock between the freezer and kthread_stop(); unfortunately, this
      is inherently racy - nothing prevents freezing from happening between
      thaw_process() and kthread_stop().
      
      This patch implements kthread_freezable_should_stop() which enters
      refrigerator if necessary but is guaranteed to return if
      kthread_stop() is invoked.  Both thaw_process() users are converted to
      use the new function.
      
      Note that this deadlock condition exists for many of freezable
      kthreads.  They need to be converted to use the new should_stop or
      freezable workqueue.
      
      Tested with synthetic test case.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarHenrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      8a32c441
  8. 30 Oct, 2011 2 commits
  9. 03 Oct, 2011 2 commits
    • Wu Fengguang's avatar
      writeback: per-bdi background threshold · b00949aa
      Wu Fengguang authored
      One thing puzzled me is that in JBOD case, the per-disk writeout
      performance is smaller than the corresponding single-disk case even
      when they have comparable bdi_thresh. Tracing shows find that in single
      disk case, bdi_writeback is always kept high while in JBOD case, it
      could drop low from time to time and correspondingly bdi_reclaimable
      could sometimes rush high.
      
      The fix is to watch bdi_reclaimable and kick background writeback as
      soon as it goes high. This resembles the global background threshold
      but in per-bdi manner. The trick is, as long as bdi_reclaimable does
      not go high, bdi_writeback naturally won't go low because
      bdi_reclaimable+bdi_writeback ~= bdi_thresh.
      
      With less fluctuated writeback pages, JBOD performance is observed to
      increase noticeably in various cases.
      
      vmstat:nr_written values before/after patch:
      
        3.1.0-rc4-wo-underrun+      3.1.0-rc4-bgthresh3+  
      ------------------------  ------------------------  
                     125596480       +25.9%    158179363  JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
                      61790815      +110.4%    130032231  JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
                      58853546        -0.1%     58823828  JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
                     110159811       +24.7%    137355377  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                      69544762       +10.8%     77080047  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                      50644862        +0.5%     50890006  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                      42677090       +28.0%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                      47491324       +13.3%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                      52548986        +0.9%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                      26783091       +36.8%     36650248  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                      35526347       +14.0%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                      44670723        -1.1%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                     127996037       +22.4%    156719990  JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
                      57518856        +3.8%     59677625  JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
                      51919909       +12.2%     58269894  JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
                      86410514       +79.0%    154660433  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                      40132519       +38.6%     55617893  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                      48423248        +7.5%     52042927  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
                     206041046       +44.1%    296846536  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                      72312903       -19.4%     58272885  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                      50635672        -0.5%     50384787  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                      68308534      +115.7%    147324758  JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
                      57882933       +14.5%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                      52183472       +12.8%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                      53788956       +94.2%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                      44493342       +35.5%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                      42641209       +18.9%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      b00949aa
    • Wu Fengguang's avatar
      writeback: add bg_threshold parameter to __bdi_update_bandwidth() · af6a3113
      Wu Fengguang authored
      No behavior change.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      af6a3113
  10. 31 Jul, 2011 1 commit
  11. 24 Jul, 2011 1 commit
    • Wu Fengguang's avatar
      writeback: don't busy retry writeback on new/freeing inodes · fcc5c222
      Wu Fengguang authored
      Fix a system hang bug introduced by commit b7a2441f ("writeback:
      remove writeback_control.more_io") and e8dfc305 ("writeback: elevate
      queue_io() into wb_writeback()") easily reproducible with high memory
      pressure and lots of file creation/deletions, for example, a kernel
      build in limited memory.
      
      It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE 
      state, the flusher will get stuck busy retrying that inode, never
      releasing wb->list_lock. The lock in turn blocks all kinds of other
      tasks when they are trying to grab it.
      
      As put by Jan, it's a safe change regarding data integrity. I_FREEING or
      I_WILL_FREE inodes are written back by iput_final() and it is reclaim
      code that is responsible for eventually removing them. So writeback code
      can safely ignore them. I_NEW inodes should move out of this state when
      they are fully set up and in the writeback round following that, we will
      consider them for writeback. So the change makes sense.                                                         
      
      CC: Jan Kara <jack@suse.cz> 
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      fcc5c222
  12. 20 Jul, 2011 1 commit
    • Dave Chinner's avatar
      superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6
      Dave Chinner authored
      The per-sb shrinker has the same requirement as the writeback
      threads of ensuring that the superblock is usable and pinned for the
      time it takes to run the work. Both need to take a passive reference
      to the sb, take a read lock on the s_umount lock and then only
      continue if an unmount is not in progress.
      
      pin_sb_for_writeback() does this exactly, so move it to fs/super.c
      and rename it to grab_super_passive() and exporting it via
      fs/internal.h for all the VFS code to be able to use.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      12ad3ab6
  13. 10 Jul, 2011 4 commits
    • Wu Fengguang's avatar
      writeback: scale IO chunk size up to half device bandwidth · 1a12d8bd
      Wu Fengguang authored
      Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
      concern of not holding I_SYNC for too long.  (At least, that was the
      comment previously.)  This doesn't make sense now because the only
      time we wait for I_SYNC is if we are calling sync or fsync, and in
      that case we need to write out all of the data anyway.  Previously
      there may have been other code paths that waited on I_SYNC, but not
      any more.					    -- Theodore Ts'o
      
      So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages
      will adapt to as large as the storage device can write within 500ms.
      
      XFS is observed to do IO completions in a batch, and the batch size is
      equal to the write chunk size. To avoid dirty pages to suddenly drop
      out of balance_dirty_pages()'s dirty control scope and create large
      fluctuations, the chunk size is also limited to half the control scope.
      
      The balance_dirty_pages() control scrope is
      
      	[(background_thresh + dirty_thresh) / 2, dirty_thresh]
      
      which is by default [15%, 20%] of global dirty pages, whose range size
      is dirty_thresh / DIRTY_FULL_SCOPE.
      
      The adpative write chunk size will be rounded to the nearest 4MB
      boundary.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=13930
      
      CC: Theodore Ts'o <tytso@mit.edu>
      CC: Dave Chinner <david@fromorbit.com>
      CC: Chris Mason <chris.mason@oracle.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      1a12d8bd
    • Wu Fengguang's avatar
      writeback: introduce smoothed global dirty limit · c42843f2
      Wu Fengguang authored
      The start of a heavy weight application (ie. KVM) may instantly knock
      down determine_dirtyable_memory() if the swap is not enabled or full.
      global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
      dirty thresholds that are _much_ lower than the global/bdi dirty pages.
      
      balance_dirty_pages() will then heavily throttle all dirtiers including
      the light ones, until the dirty pages drop below the new dirty thresholds.
      During this _deep_ dirty-exceeded state, the system may appear rather
      unresponsive to the users.
      
      About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
      threshold to heavy dirtiers than light ones, and the dirty pages will
      be throttled around the heavy dirtiers' dirty threshold and reasonably
      below the light dirtiers' dirty threshold. In this state, only the heavy
      dirtiers will be throttled and the dirty pages are carefully controlled
      to not exceed the light dirtiers' dirty threshold. However if the
      threshold itself suddenly drops below the number of dirty pages, the
      light dirtiers will get heavily throttled.
      
      So introduce global_dirty_limit for tracking the global dirty threshold
      with policies
      
      - follow downwards slowly
      - follow up in one shot
      
      global_dirty_limit can effectively mask out the impact of sudden drop of
      dirtyable memory. It will be used in the next patch for two new type of
      dirty limits. Note that the new dirty limits are not going to avoid
      throttling the light dirtiers, but could limit their sleep time to 200ms.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      c42843f2
    • Wu Fengguang's avatar
      writeback: bdi write bandwidth estimation · e98be2d5
      Wu Fengguang authored
      The estimation value will start from 100MB/s and adapt to the real
      bandwidth in seconds.
      
      It tries to update the bandwidth only when disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth will be reflecting how fast the device can
      writeout when _fully utilized_, and won't drop to 0 when it goes idle.
      The value will remain constant at disk idle time. At busy write time, if
      not considering fluctuations, it will also remain high unless be knocked
      down by possible concurrent reads that compete for the disk time and
      bandwidth with async writes.
      
      The estimation is not done purely in the flusher because there is no
      guarantee for write_cache_pages() to return timely to update bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective for filtering
      out sudden spikes, however may be a little biased in long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable, because it's not possible to get
      the real bandwidth for the instance at all, due to large fluctuations.
      
      The NFS commits can be as large as seconds worth of data. One XFS
      completion may be as large as half second worth of data if we are going
      to increase the write chunk to half second worth of data. In ext4,
      fluctuations with time period of around 5 seconds is observed. And there
      is another pattern of irregular periods of up to 20 seconds on SSD tests.
      
      That's why we are not only doing the estimation at 200ms intervals, but
      also averaging them over a period of 3 seconds and then go further to do
      another level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e98be2d5
    • Wu Fengguang's avatar
      writeback: make writeback_control.nr_to_write straight · d46db3d5
      Wu Fengguang authored
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abuse it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      remained quota in wbc->nr_to_write.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Proposed-by: default avatarChristoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  14. 08 Jun, 2011 12 commits
    • Wu Fengguang's avatar
      writeback: trace event writeback_queue_io · e84d0a4f
      Wu Fengguang authored
      Note that it adds a little overheads to account the moved/enqueued
      inodes from b_dirty to b_io. The "moved" accounting may be later used to
      limit the number of inodes that can be moved in one shot, in order to
      keep spinlock hold time under control.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e84d0a4f
    • Wu Fengguang's avatar
      writeback: trace event writeback_single_inode · 251d6a47
      Wu Fengguang authored
      It is valuable to know how the dirty inodes are iterated and their IO size.
      
      "writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
      
      - "state" reflects inode->i_state at the end of writeback_single_inode()
      - "index" reflects mapping->writeback_index after the ->writepages() call
      - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
      - "wrote" is the number of pages actually written
      
      v2: add trace event writeback_single_inode_requeue as proposed by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      251d6a47
    • Wu Fengguang's avatar
      writeback: remove writeback_control.more_io · b7a2441f
      Wu Fengguang authored
      When wbc.more_io was first introduced, it indicates whether there are
      at least one superblock whose s_more_io contains more IO work. Now with
      the per-bdi writeback, it can be replaced with a simple b_more_io test.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      b7a2441f
    • Wu Fengguang's avatar
      writeback: avoid extra sync work at enqueue time · e185dda8
      Wu Fengguang authored
      This removes writeback_control.wb_start and does more straightforward
      sync livelock prevention by setting .older_than_this to prevent extra
      inodes from being enqueued in the first place.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e185dda8
    • Wu Fengguang's avatar
      writeback: elevate queue_io() into wb_writeback() · e8dfc305
      Wu Fengguang authored
      Code refactor for more logical code layout.
      No behavior change.
      
      - remove the mis-named __writeback_inodes_sb()
      
      - wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
        before calling __writeback_inodes_wb()
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e8dfc305
    • Christoph Hellwig's avatar
      writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig authored
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      f758eeab
    • Wu Fengguang's avatar
      writeback: refill b_io iff empty · 424b351f
      Wu Fengguang authored
      There is no point to carry different refill policies between for_kupdate
      and other type of works. Use a consistent "refill b_io iff empty" policy
      which can guarantee fairness in an easy to understand way.
      
      A b_io refill will setup a _fixed_ work set with all currently eligible
      inodes and start a new round of walk through b_io. The "fixed" work set
      means no new inodes will be added to the work set during the walk.
      Only when a complete walk over b_io is done, new inodes that are
      eligible at the time will be enqueued and the walk be started over.
      
      This procedure provides fairness among the inodes because it guarantees
      each inode to be synced once and only once at each round. So all inodes
      will be free from starvations.
      
      This change relies on wb_writeback() to keep retrying as long as we made
      some progress on cleaning some pages and/or inodes. Without that ability,
      the old logic on background works relies on aggressively queuing all
      eligible inodes into b_io at every time. But that's not a guarantee.
      
      The below test script completes a slightly faster now:
      
                   2.6.39-rc3	  2.6.39-rc3-dyn-expire+
      ------------------------------------------------
      all elapsed     256.043      252.367
      stddev           24.381       12.530
      
      tar elapsed      30.097       28.808
      dd  elapsed      13.214       11.782
      
      	#!/bin/zsh
      
      	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/
      
      	umount /dev/sda7
      	mkfs.xfs -f /dev/sda7
      	mount /dev/sda7 /fs
      
      	echo 3 > /proc/sys/vm/drop_caches
      
      	tic=$(cat /proc/uptime|cut -d' ' -f2)
      
      	cd /fs
      	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
      	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &
      
      	wait
      	sync
      	tac=$(cat /proc/uptime|cut -d' ' -f2)
      	echo elapsed: $((tac - tic))
      
      It maintains roughly the same small vs. large file writeout shares, and
      offers large files better chances to be written in nice 4M chunks.
      
      Analyzes from Dave Chinner in great details:
      
      Let's say we have lots of inodes with 100 dirty pages being created,
      and one large writeback going on. We expire 8 new inodes for every
      1024 pages we write back.
      
      With the old code, we do:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      
      	writeback  8 small inodes 800 pages
      		   1 large inode 224 pages -> b_more_io
      
      	b_more_io (large inode) -> b_io (8s, 1l)
      	8 newly expired inodes -> b_io (8s, 1l, 8s)
      	.....
      
      Your new code:
      
      	b_more_io (large inode) -> b_io (1l)
      	8 newly expired inodes -> b_io (1l, 8s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 8s)
      	writeback  8 small inodes 800 pages
      
      	b_io empty: (1800 pages written)
      		b_more_io (large inode) -> b_io (1l)
      		14 newly expired inodes -> b_io (1l, 14s)
      
      	writeback  large inode 1024 pages -> b_more_io
      	(b_io == 14s)
      	writeback  10 small inodes 1000 pages
      		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
      	writeback  5 small inodes 500 pages
      	b_io empty: (2548 pages written)
      		b_more_io (large inode) -> b_io (1l, 1s(24))
      		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
      	......
      
      Rough progression of pages written at b_io refill:
      
      Old code:
      
      	total	large file	% of writeback
      	1024	224		21.9% (fixed)
      
      New code:
      	total	large file	% of writeback
      	1800	1024		~55%
      	2550	1024		~40%
      	3050	1024		~33%
      	3500	1024		~29%
      	3950	1024		~26%
      	4250	1024		~24%
      	4500	1024		~22.7%
      	4700	1024		~21.7%
      	4800	1024		~21.3%
      	4800	1024		~21.3%
      	(pretty much steady state from here)
      
      Ok, so the steady state is reached with a similar percentage of
      writeback to the large file as the existing code. Ok, that's good,
      but providing some evidence that is doesn't change the shared of
      writeback to the large should be in the commit message ;)
      
      The other advantage to this is that we always write 1024 page chunks
      to the large file, rather than smaller "whatever remains" chunks.
      
      CC: Jan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      424b351f
    • Wu Fengguang's avatar
      writeback: the kupdate expire timestamp should be a moving target · ba9aa839
      Wu Fengguang authored
      Dynamically compute the dirty expire timestamp at queue_io() time.
      
      writeback_control.older_than_this used to be determined at entrance to
      the kupdate writeback work. This _static_ timestamp may go stale if the
      kupdate work runs on and on. The flusher may then stuck with some old
      busy inodes, never considering newly expired inodes thereafter.
      
      This has two possible problems:
      
      - It is unfair for a large dirty inode to delay (for a long time) the
        writeback of small dirty inodes.
      
      - As time goes by, the large and busy dirty inode may contain only
        _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
        delaying the expired dirty pages to the end of LRU lists, triggering
        the evil pageout(). Nevertheless this patch merely addresses part
        of the problem.
      
      v2: keep policy changes inside wb_writeback() and keep the
      wbc.older_than_this visibility as suggested by Dave.
      
      CC: Dave Chinner <david@fromorbit.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarItaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      ba9aa839
    • Wu Fengguang's avatar
      writeback: try more writeback as long as something was written · e6fb6da2
      Wu Fengguang authored
      writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
      they only populate possibly a subset of eligible inodes into b_io at
      entrance time. When the queued set of inodes are all synced, they just
      return, possibly with all queued inode pages written but still
      wbc.nr_to_write > 0.
      
      For kupdate and background writeback, there may be more eligible inodes
      sitting in b_dirty when the current set of b_io inodes are completed. So
      it is necessary to try another round of writeback as long as we made some
      progress in this round. When there are no more eligible inodes, no more
      inodes will be enqueued in queue_io(), hence nothing could/will be
      synced and we may safely bail.
      
      For example, imagine 100 inodes
      
              i0, i1, i2, ..., i90, i91, i99
      
      At queue_io() time, i90-i99 happen to be expired and moved to s_io for
      IO. When finished successfully, if their total size is less than
      MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
      quit the background work (w/o this patch) while it's still over
      background threshold. This will be a fairly normal/frequent case I guess.
      
      Now that we do tagged sync and update inode->dirtied_when after the sync,
      this change won't livelock sync(1).  I actually tried to write 1 page
      per 1ms with this command
      
      	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
      
      and do sync(1) at the same time. The sync completes quickly on ext4,
      xfs, btrfs.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      e6fb6da2
    • Wu Fengguang's avatar
      writeback: introduce writeback_control.inodes_written · cb9bd115
      Wu Fengguang authored
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happen to be metadata-only dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but also mis-interpreted as "no progress".
      
      So introduce writeback_control.inodes_written to count the inodes get
      cleaned from VFS POV.  A non-zero value means there are some progress on
      writeback, in which case more writeback can be tried.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      cb9bd115
    • Wu Fengguang's avatar
      writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb
      Wu Fengguang authored
      Explicitly update .dirtied_when on synced inodes, so that they are no
      longer considered for writeback in the next round.
      
      It can prevent both of the following livelock schemes:
      
      - while true; do echo data >> f; done
      - while true; do touch f;        done (in theory)
      
      The exact livelock condition is, during sync(1):
      
      (1) no new inodes are dirtied
      (2) an inode being actively dirtied
      
      On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
      When finished, it will be redirty_tail()ed because it's still dirty
      and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
      on condition (1). The sync work will then revisit it on the next
      queue_io() and find it eligible again because its old ->dirtied_when
      predates the sync work start time.
      
      We'll do more aggressive "keep writeback as long as we wrote something"
      logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
      b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
      no longer be enough to stop sync livelock.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      94c3dcbb
    • Wu Fengguang's avatar
      writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      it may due to some "redirty_tail() after pages_skipped" effect which
      is by no means a guarantee for _all_ the file systems.
      
      Note that writeback_inodes_sb() is called by not only sync(), they are
      treated the same because the other callers also need livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      6e6938b6
  15. 27 May, 2011 1 commit
    • Christoph Hellwig's avatar
      fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Christoph Hellwig authored
      Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally if it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove incorrect comments that ->dirty_inode can't block.  That
      has been changed a long time ago, and many implementations rely on it.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      aa385729
  16. 31 Mar, 2011 1 commit
  17. 25 Mar, 2011 4 commits
    • Dave Chinner's avatar
      fs: pull inode->i_lock up out of writeback_single_inode · 0f1b1fd8
      Dave Chinner authored
      First thing we do in writeback_single_inode() is take the i_lock and
      the last thing we do is drop it. A caller already holds the i_lock,
      so pull the i_lock out of writeback_single_inode() to reduce the
      round trips on this lock during inode writeback.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      0f1b1fd8
    • Dave Chinner's avatar
      fs: move i_wb_list out from under inode_lock · a66979ab
      Dave Chinner authored
      Protect the inode writeback list with a new global lock
      inode_wb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      a66979ab
    • Dave Chinner's avatar
      fs: move i_sb_list out from under inode_lock · 55fa6091
      Dave Chinner authored
      Protect the per-sb inode list with a new global lock
      inode_sb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      55fa6091
    • Dave Chinner's avatar
      fs: protect inode->i_state with inode->i_lock · 250df6ed
      Dave Chinner authored
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      250df6ed
  18. 14 Jan, 2011 3 commits