  1. 06 Feb, 2019 1 commit
    • fs: don't scan the inode cache before SB_BORN is set · 16925957
      Dave Chinner authored
      commit 79f546a696bff2590169fb5684e23d65f4d9f591 upstream.
      
      We recently had an oops reported on a 4.14 kernel in
      xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
      and so the m_perag_tree lookup walked into lala land.  It produces
      an oops down this path during the failed mount:
      
        radix_tree_gang_lookup_tag+0xc4/0x130
        xfs_perag_get_tag+0x37/0xf0
        xfs_reclaim_inodes_count+0x32/0x40
        xfs_fs_nr_cached_objects+0x11/0x20
        super_cache_count+0x35/0xc0
        shrink_slab.part.66+0xb1/0x370
        shrink_node+0x7e/0x1a0
        try_to_free_pages+0x199/0x470
        __alloc_pages_slowpath+0x3a1/0xd20
        __alloc_pages_nodemask+0x1c3/0x200
        cache_grow_begin+0x20b/0x2e0
        fallback_alloc+0x160/0x200
        kmem_cache_alloc+0x111/0x4e0
      
      The problem is that the superblock shrinker is running before the
      filesystem structures it depends on have been fully set up. i.e.
      the shrinker is registered in sget(), before ->fill_super() has been
      called, and the shrinker can call into the filesystem before
      fill_super() does its setup work. Essentially we are exposed to
      both use-after-free and use-before-initialisation bugs here.
      
      To fix this, add a check for the SB_BORN flag in super_cache_count.
      In general, this flag is not set until ->fs_mount() completes
      successfully, so we know that it is set after the filesystem
      setup has completed. This matches the trylock_super() behaviour
      which will not let super_cache_scan() run if SB_BORN is not set, and
      hence will prevent the superblock shrinker from entering the
      filesystem while it is being set up or after it has failed setup
      and is being torn down.
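
The check described above can be sketched in user-space C. This is a toy model under stated assumptions, not the kernel code: `toy_super_block`, `toy_super_cache_count()` and `toy_count_from_fs_info()` are hypothetical names, and the memory-ordering details a real kernel check would need are left out.

```c
#include <stddef.h>

/* Illustrative flag bit for this sketch only. */
#define SB_BORN (1u << 29)

/* Hypothetical stand-in for struct super_block. */
struct toy_super_block {
    unsigned int s_flags;
    void *s_fs_info;   /* filled in by the filesystem during setup */
    long (*nr_cached_objects)(struct toy_super_block *sb);
};

/* Example callback that dereferences s_fs_info, as a filesystem would. */
static long toy_count_from_fs_info(struct toy_super_block *sb)
{
    return *(long *)sb->s_fs_info;
}

/* Refuse to call into the filesystem until the superblock is born. */
static long toy_super_cache_count(struct toy_super_block *sb)
{
    if (!(sb->s_flags & SB_BORN))
        return 0;      /* setup not finished, or mount failed */
    if (sb->nr_cached_objects)
        return sb->nr_cached_objects(sb);
    return 0;
}
```

With the flag clear, the callback is never reached, so a NULL or stale `s_fs_info` is never dereferenced.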
      
      Cc: stable@kernel.org
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 03 Mar, 2018 1 commit
  3. 28 Oct, 2016 1 commit
  4. 09 Mar, 2016 1 commit
  5. 17 Aug, 2015 2 commits
  6. 15 Aug, 2015 4 commits
    • change sb_writers to use percpu_rw_semaphore · 8129ed29
      Oleg Nesterov authored
      We can remove everything from struct sb_writers except frozen
      and add the array of percpu_rw_semaphore's instead.
      
      This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
      it for get_super_thawed(). We will probably remove it later.
      
      This change tries to address the following problems:
      
      	- Firstly, __sb_start_write() looks simply buggy. It does
      	  __sb_end_write() if it sees ->frozen, but if it migrates
      	  to another CPU before percpu_counter_dec(), sb_wait_write()
      	  can wrongly succeed if there is another task which holds
      	  the same "semaphore": sb_wait_write() can miss the result
      	  of the previous percpu_counter_inc() but see the result
      	  of this percpu_counter_dec().
      
      	- As Dave Hansen reports, it is suboptimal. The trivial
      	  microbenchmark that writes to a tmpfs file in a loop runs
      	  12% faster if we change this code to rely on RCU and kill
      	  the memory barriers.
      
      	- This code doesn't look simple. It would be better to rely
      	  on the generic locking code.
      
      	  According to Dave, this change adds the same performance
      	  improvement.
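
The first problem in the list above can be replayed deterministically. The sketch below is not kernel code: it is a single-threaded replay of one possible interleaving, with plain array slots standing in for per-CPU counters, and `replay_race()` and `struct replay_result` are hypothetical names.

```c
#define NR_TOY_CPUS 3

struct replay_result {
    long observed;  /* sum computed by the non-atomic reader */
    long actual;    /* true sum once every step has happened */
};

/* Replay: holder H on CPU 1, writer W incs on CPU 0 and decs on CPU 2. */
static struct replay_result replay_race(void)
{
    long counter[NR_TOY_CPUS] = { 0, 0, 0 };
    struct replay_result r = { 0, 0 };
    int cpu;

    counter[1] += 1;           /* H holds the "semaphore" (inc on CPU 1) */

    r.observed += counter[0];  /* reader scans CPU 0 first and sees 0 */

    counter[0] += 1;           /* W incs on CPU 0, then sees ->frozen */
    counter[2] -= 1;           /* W migrated to CPU 2 before its dec */

    r.observed += counter[1];  /* reader sees H's inc */
    r.observed += counter[2];  /* reader sees W's dec but missed W's inc */

    for (cpu = 0; cpu < NR_TOY_CPUS; cpu++)
        r.actual += counter[cpu];
    return r;
}
```

The reader computes 0 and would conclude there are no writers, while the true sum is 1 because H still holds the lock.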
      
      Note: with this change both freeze_super() and thaw_super() will do
      synchronize_sched_expedited() 3 times. This is just ugly. But:
      
      	- This will be "fixed" by the rcu_sync changes we are going
      	  to merge. After that freeze_super()->percpu_down_write()
      	  will use synchronize_sched(), and thaw_super() won't use
      	  synchronize() at all.
      
      	  This doesn't need any changes in fs/super.c.
      
      	- Once we merge rcu_sync changes, we can also change super.c
      	  so that all sb_writers->rw_sem's will share the single ->rss
      	  in struct sb_writers, then freeze_super() will need only one
      	  synchronize_sched().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • shift percpu_counter_destroy() into destroy_super_work() · 853b39a7
      Oleg Nesterov authored
      Of course, this patch is ugly as hell. It will be (partially)
      reverted later. We add it to ensure that other WIP changes in
      percpu_rw_semaphore won't break fs/super.c.
      
      We do not even need this change right now, percpu_free_rwsem()
      is fine in atomic context. But we are going to change this, it
      will be might_sleep() after we merge the rcu_sync() patches.
      
      And even after that we do not really need destroy_super_work(),
      we will kill it in any case. Instead, destroy_super_rcu() should
      just check that rss->cb_state == CB_IDLE and do call_rcu() again
      in the (very unlikely) case this is not true.
      
      So this is just the temporary kludge which helps us to avoid the
      conflicts with the changes which will be (hopefully) routed via
      rcu tree.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • document rwsem_release() in sb_wait_write() · 0e28e01f
      Oleg Nesterov authored
      Not only do we need to avoid the warning from lockdep_sys_exit(), the
      caller of freeze_super() can never release this lock. Another thread
      can do this, so there is another reason for rwsem_release().
      
      Plus the comment should explain why we have to fool lockdep.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • fix the broken lockdep logic in __sb_start_write() · f4b554af
      Oleg Nesterov authored
      1. wait_event(frozen < level) without rwsem_acquire_read() is just
         wrong from lockdep perspective. If we are going to deadlock
         because the caller is buggy, lockdep can't detect this problem.
      
      2. __sb_start_write() can race with thaw_super() + freeze_super(),
         and after "goto retry" the 2nd acquire_freeze_lock() is wrong.
      
      3. The "tell lockdep we are doing trylock" hack doesn't look nice.
      
         I think this is correct, but this logic should be more explicit.
         Yes, the recursive read_lock() is fine if we hold the lock on a
         higher level. But we do not need to fool lockdep. If we can not
         deadlock in this case then try-lock must not fail and we can use
         wait == F throughout this code.
      
      Note: as Dave Chinner explains, the "trylock" hack and the fat comment
      can probably be removed. But this needs a separate change and it will
      be trivial: just kill __sb_start_write() and rename do_sb_start_write()
      back to __sb_start_write().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
  7. 01 Jul, 2015 1 commit
  8. 14 Apr, 2015 1 commit
    • cleancache: remove limit on the number of cleancache enabled filesystems · 3cb29d11
      Vladimir Davydov authored
      The limit equals 32 and is imposed by the number of entries in the
      fs_poolid_map and shared_fs_poolid_map.  Nowadays it is insufficient,
      because with containers on board a Linux host can have hundreds of
      active fs mounts.
      
      These maps were introduced by commit 49a9ab81 ("mm: cleancache:
      lazy initialization to allow tmem backends to build/run as modules") in
      order to allow compiling cleancache drivers as modules.  Real pool ids
      are stored in these maps while super_block->cleancache_poolid points to
      an entry in the map, so that on cleancache registration we can walk over
      all (if there are <= 32 of them, of course) cleancache-enabled super
      blocks and assign real pool ids.
      
      Actually, there is absolutely no need for these maps, because we can
      iterate over all super blocks immediately using iterate_supers.  This is
      not racy, because cleancache_init_ops is called from mount_fs with
      super_block->s_umount held for writing, while iterate_supers takes this
      semaphore for reading, so if we call iterate_supers after setting
      cleancache_ops, all super blocks that had been created before
      cleancache_register_ops was called will be assigned pool ids by the
      action function of iterate_supers while all newer super blocks will
      receive it in cleancache_init_fs.
      
      This patch therefore removes the maps and hence the artificial limit on
      the number of cleancache enabled filesystems.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 22 Feb, 2015 1 commit
    • trylock_super(): replacement for grab_super_passive() · eb6ef3df
      Konstantin Khlebnikov authored
      I've noticed significant locking contention in memory reclaimer around
      sb_lock inside grab_super_passive(). Grab_super_passive() is called from
      two places: in icache/dcache shrinkers (function super_cache_scan) and
      from writeback (function __writeback_inodes_wb). Both are required for
      progress in memory allocator.
      
      Grab_super_passive() acquires sb_lock to increment sb->s_count and check
      sb->s_instances. It seems sb->s_umount locked for read is enough here:
      super-block deactivation always runs under sb->s_umount locked for write.
      Protecting super-block itself isn't a problem: in super_cache_scan() sb
      is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
      are still active. Inside writeback super-block comes from inode from bdi
      writeback list under wb->list_lock.
      
      This patch removes locking sb_lock and checks s_instances under s_umount:
      generic_shutdown_super() unlinks it under sb->s_umount locked for write.
      New variant is called trylock_super() and since it only locks semaphore,
      callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
      they're done.
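
The locking pattern described above can be sketched with a toy read/write semaphore. This is a single-threaded model under stated assumptions, not the kernel implementation: `toy_rwsem`, `toy_sb` and the `toy_*` functions are hypothetical, and the fields `on_instances_list` and `born` stand in for the checks made under s_umount.

```c
#include <stdbool.h>

/* Toy rwsem: just enough state to model a read trylock. */
struct toy_rwsem {
    int readers;
    bool writer;
};

static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->writer)
        return false;  /* e.g. umount holds the lock for write */
    sem->readers++;
    return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
    sem->readers--;
}

/* Hypothetical stand-in for struct super_block. */
struct toy_sb {
    struct toy_rwsem s_umount;
    bool on_instances_list;  /* models "s_instances is still hashed" */
    bool born;               /* models the born flag being set */
};

/* On success the caller holds s_umount for read and must release it. */
static bool toy_trylock_super(struct toy_sb *sb)
{
    if (toy_down_read_trylock(&sb->s_umount)) {
        if (sb->on_instances_list && sb->born)
            return true;
        toy_up_read(&sb->s_umount);  /* sb is unusable: back out */
    }
    return false;
}
```

Because only the semaphore is taken, a successful caller in this model pairs the call with `toy_up_read(&sb->s_umount)`, mirroring the `up_read(&sb->s_umount)` requirement stated above.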
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 13 Feb, 2015 5 commits
    • fs: shrinker: always scan at least one object of each type · 49e7e7ff
      Vladimir Davydov authored
      In super_cache_scan() we divide the number of objects of particular type
      by the total number of objects in order to distribute pressure among them.  As a
      result, in some corner cases we can get nr_to_scan=0 even if there are
      some objects to reclaim, e.g.  dentries=1, inodes=1, fs_objects=1,
      nr_to_scan=1/3=0.
      
      This is unacceptable for per memcg kmem accounting, because this means
      that some objects may never get reclaimed after memcg death, preventing it
      from being freed.
      
      This patch therefore assures that super_cache_scan() will scan at least
      one object of each type if any.
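
The rounding problem is easy to reproduce with integer arithmetic. The sketch below uses hypothetical helper names and shows one possible way to guarantee progress; the exact adjustment made in the patch may differ.

```c
/* Rounded-down share of nr_to_scan for one object type. */
static unsigned long proportional_share(unsigned long nr_to_scan,
                                        unsigned long objects,
                                        unsigned long total)
{
    if (total == 0)
        return 0;
    /* Overflow of the product is ignored in this sketch. */
    return nr_to_scan * objects / total;
}

/* Scan one more than the rounded-down share whenever objects exist. */
static unsigned long share_at_least_one(unsigned long nr_to_scan,
                                        unsigned long objects,
                                        unsigned long total)
{
    if (objects == 0)
        return 0;
    return proportional_share(nr_to_scan, objects, total) + 1;
}
```

With dentries=1, inodes=1, fs_objects=1 and nr_to_scan=1, `proportional_share(1, 1, 3)` is 0 for every type, while `share_at_least_one(1, 1, 3)` is 1.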
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: make shrinker memcg aware · 2acb60a0
      Vladimir Davydov authored
      Now, to make any list_lru-based shrinker memcg aware we should only
      initialize its list_lru as memcg aware.  Let's do it for the general FS
      shrinker (super_block::s_shrink).
      
      There are other FS-specific shrinkers that use list_lru for storing
      objects, such as XFS and GFS2 dquot cache shrinkers, but since they
      reclaim objects that are shared among different cgroups, there is no point
      making them memcg aware.  It's a big question whether we should account
      them to memcg at all.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • list_lru: organize all list_lrus to list · c0a5b560
      Vladimir Davydov authored
      To make list_lru memcg aware, we need all list_lrus to be kept on a list
      protected by a mutex, so that we could sleep while walking over the
      list.
      
      Therefore after this change list_lru_destroy may sleep.  Fortunately,
      there is only one user that calls it from an atomic context - it's
      put_super - and we can easily fix it by calling list_lru_destroy before
      put_super in destroy_locked_super - anyway we no longer need lrus by
      that time.
      
      Another point that should be noted is that list_lru_destroy is allowed
      to be called on an uninitialized zeroed-out object, in which case it is
      a no-op.  Before this patch this was guaranteed by kfree, but now we
      need an explicit check there.
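
A minimal user-space sketch of the explicit check, assuming a hypothetical `sketch_lru` type whose only resource is a heap-allocated per-node array (the real structure and its locking are not modelled here):

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct list_lru. */
struct sketch_lru {
    long *node;  /* per-node counters; NULL when uninitialised */
};

static int sketch_lru_init(struct sketch_lru *lru, size_t nr_nodes)
{
    lru->node = calloc(nr_nodes, sizeof(*lru->node));
    return lru->node ? 0 : -1;
}

static void sketch_lru_destroy(struct sketch_lru *lru)
{
    /* Zeroed-out or already destroyed object: nothing to do. */
    if (!lru->node)
        return;
    /* A real implementation would also unlink from the global list here. */
    free(lru->node);
    lru->node = NULL;
}
```

The early return matters once destroy does more than free memory: any extra teardown step would otherwise run on an object that was never set up.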
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: consolidate {nr,free}_cached_objects args in shrink_control · 4101b624
      Vladimir Davydov authored
      We are going to make FS shrinkers memcg-aware.  To achieve that, we will
      have to pass the memcg to scan to the nr_cached_objects and
      free_cached_objects VFS methods, which currently take only the NUMA node
      to scan.  Since the shrink_control structure already holds the node, and
      the memcg to scan will be added to it when we introduce memcg-aware
      vmscan, let us consolidate the methods' arguments in this structure to
      keep things clean.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • list_lru: introduce list_lru_shrink_{count,walk} · 503c358c
      Vladimir Davydov authored
      Kmem accounting of memcg is unusable now, because it lacks slab shrinker
      support.  That means when we hit the limit we will get ENOMEM w/o any
      chance to recover.  What we should do then is to call shrink_slab, which
      would reclaim old inode/dentry caches from this cgroup.  This is what
      this patch set is intended to do.
      
      Basically, it does two things.  First, it introduces the notion of
      per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
      cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
      passed the memory cgroup to scan from in shrink_control->memcg.  For
      such shrinkers shrink_slab iterates over the whole cgroup subtree under
      the target cgroup and calls the shrinker for each kmem-active memory
      cgroup.
      
      Secondly, this patch set makes the list_lru structure per-memcg.  It's
      done transparently to list_lru users - everything they have to do is to
      tell list_lru_init that they want memcg-aware list_lru.  Then the
      list_lru will automatically distribute objects among per-memcg lists
      based on which cgroup the object is accounted to.  This way to make FS
      shrinkers (icache, dcache) memcg-aware we only need to make them use
      memcg-aware list_lru, and this is what this patch set does.
      
      As before, this patch set only enables per-memcg kmem reclaim when the
      pressure goes from memory.limit, not from memory.kmem.limit.  Handling
      memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
      it is still unclear whether we will have this knob in the unified
      hierarchy.
      
      This patch (of 9):
      
      NUMA aware slab shrinkers use the list_lru structure to distribute
      objects coming from different NUMA nodes to different lists.  Whenever
      such a shrinker needs to count or scan objects from a particular node,
      it issues commands like this:
      
              count = list_lru_count_node(lru, sc->nid);
              freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                         isolate_arg, &sc->nr_to_scan);
      
      where sc is an instance of the shrink_control structure passed to it
      from vmscan.
      
      To simplify this, let's add special list_lru functions to be used by
      shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
      consolidate the nid and nr_to_scan arguments in the shrink_control
      structure.
      
      This will also allow us to avoid patching shrinkers that use list_lru
      when we make shrink_slab() per-memcg - all we will have to do is extend
      the shrink_control structure to include the target memcg and make
      list_lru_shrink_{count,walk} handle this appropriately.
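
The wrapper idea can be sketched with toy types. Everything prefixed `toy_` is hypothetical, and the "walk" simply drops items instead of calling an isolate callback. The point is only that the shrinker passes the whole control structure rather than unpacking nid and nr_to_scan itself.

```c
#define TOY_MAX_NODES 4

/* Toy list_lru: just an item count per NUMA node. */
struct toy_list_lru {
    unsigned long nr_items[TOY_MAX_NODES];
};

/* Toy shrink_control carrying the node and the scan budget. */
struct toy_shrink_control {
    int nid;
    unsigned long nr_to_scan;
};

static unsigned long toy_list_lru_count_node(const struct toy_list_lru *lru,
                                             int nid)
{
    return lru->nr_items[nid];
}

/* Drop up to *nr_to_walk items from one node; returns how many were freed. */
static unsigned long toy_list_lru_walk_node(struct toy_list_lru *lru, int nid,
                                            unsigned long *nr_to_walk)
{
    unsigned long freed = lru->nr_items[nid] < *nr_to_walk ?
                          lru->nr_items[nid] : *nr_to_walk;

    lru->nr_items[nid] -= freed;
    *nr_to_walk -= freed;
    return freed;
}

/* Wrappers: callers hand over sc and never touch nid/nr_to_scan directly. */
static unsigned long toy_list_lru_shrink_count(const struct toy_list_lru *lru,
                                               const struct toy_shrink_control *sc)
{
    return toy_list_lru_count_node(lru, sc->nid);
}

static unsigned long toy_list_lru_shrink_walk(struct toy_list_lru *lru,
                                              struct toy_shrink_control *sc)
{
    return toy_list_lru_walk_node(lru, sc->nid, &sc->nr_to_scan);
}
```

If a memcg field were later added to the control structure, only the two wrappers would need to change, which is the benefit the message describes.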
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 02 Feb, 2015 1 commit
  12. 26 Jan, 2015 1 commit
  13. 20 Jan, 2015 1 commit
  14. 10 Nov, 2014 2 commits
  15. 09 Oct, 2014 1 commit
  16. 08 Sep, 2014 1 commit
    • percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo authored
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
  17. 07 Aug, 2014 3 commits
  18. 15 Jul, 2014 1 commit
  19. 04 Jun, 2014 2 commits
    • fs/superblock: avoid locking counting inodes and dentries before reclaiming them · d23da150
      Tim Chen authored
      Remove the call to grab_super_passive() from super_cache_count().  The
      call becomes a scalability bottleneck when multiple threads are doing
      memory reclamation, e.g.  when we are doing a large amount of file reads
      and the page cache is under pressure.  The cached objects are quickly
      reclaimed down to 0 and we abort the cache_scan() reclaim, but counting
      still creates a log jam acquiring the sb_lock.
      
      We are holding the shrinker_rwsem which ensures the safety of call to
      list_lru_count_node() and s_op->nr_cached_objects.  The shrinker is
      unregistered now before ->kill_sb() so the operation is safe when we are
      doing unmount.
      
      The impact will depend heavily on the machine and the workload but for a
      small machine using postmark tuned to use 4xRAM size the results were
      
                                        3.15.0-rc5            3.15.0-rc5
                                           vanilla         shrinker-v1r1
      Ops/sec Transactions         21.00 (  0.00%)       24.00 ( 14.29%)
      Ops/sec FilesCreate          39.00 (  0.00%)       44.00 ( 12.82%)
      Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
      Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
      Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
      Ops/sec DataRead/MB          25.97 (  0.00%)       29.10 ( 12.05%)
      Ops/sec DataWrite/MB         49.99 (  0.00%)       56.02 ( 12.06%)
      
      ffsb running in a configuration that is meant to simulate a mail server showed
      
                                       3.15.0-rc5             3.15.0-rc5
                                          vanilla          shrinker-v1r1
      Ops/sec readall           9402.63 (  0.00%)      9567.97 (  1.76%)
      Ops/sec create            4695.45 (  0.00%)      4735.00 (  0.84%)
      Ops/sec delete             173.72 (  0.00%)       179.83 (  3.52%)
      Ops/sec Transactions     14271.80 (  0.00%)     14482.81 (  1.48%)
      Ops/sec Read                37.00 (  0.00%)        37.60 (  1.62%)
      Ops/sec Write               18.20 (  0.00%)        18.30 (  0.55%)
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/superblock: unregister sb shrinker before ->kill_sb() · 28f2cd4f
      Dave Chinner authored
      This series is aimed at regressions noticed during reclaim activity.  The
      first two patches are shrinker patches that were posted ages ago but never
      merged for reasons that are unclear to me.  I'm posting them again to see
      if there was a reason they were dropped or if they just got lost.  Dave?
      Tim?  The last patch adjusts proportional reclaim.  Yuanhan Liu, can you
      retest the vm scalability test cases on a larger machine?  Hugh, does this
      work for you on the memcg test cases?
      
      Based on ext4, I get the following results but unfortunately my larger
      test machines are all unavailable so this is based on a relatively small
      machine.
      
      postmark
                                        3.15.0-rc5            3.15.0-rc5
                                           vanilla       proportion-v1r4
      Ops/sec Transactions         21.00 (  0.00%)       25.00 ( 19.05%)
      Ops/sec FilesCreate          39.00 (  0.00%)       45.00 ( 15.38%)
      Ops/sec CreateTransact       10.00 (  0.00%)       12.00 ( 20.00%)
      Ops/sec FilesDeleted       6202.00 (  0.00%)     6202.00 (  0.00%)
      Ops/sec DeleteTransact       11.00 (  0.00%)       12.00 (  9.09%)
      Ops/sec DataRead/MB          25.97 (  0.00%)       30.02 ( 15.59%)
      Ops/sec DataWrite/MB         49.99 (  0.00%)       57.78 ( 15.58%)
      
      ffsb (mail server simulator)
                                       3.15.0-rc5             3.15.0-rc5
                                          vanilla        proportion-v1r4
      Ops/sec readall           9402.63 (  0.00%)      9805.74 (  4.29%)
      Ops/sec create            4695.45 (  0.00%)      4781.39 (  1.83%)
      Ops/sec delete             173.72 (  0.00%)       177.23 (  2.02%)
      Ops/sec Transactions     14271.80 (  0.00%)     14764.37 (  3.45%)
      Ops/sec Read                37.00 (  0.00%)        38.50 (  4.05%)
      Ops/sec Write               18.20 (  0.00%)        18.50 (  1.65%)
      
      dd of a large file
                                      3.15.0-rc5            3.15.0-rc5
                                         vanilla       proportion-v1r4
      WallTime DownloadTar       75.00 (  0.00%)       61.00 ( 18.67%)
      WallTime DD               423.00 (  0.00%)      401.00 (  5.20%)
      WallTime Delete             2.00 (  0.00%)        5.00 (-150.00%)
      
      stutter (times mmap latency during large amounts of IO)
      
                                  3.15.0-rc5            3.15.0-rc5
                                     vanilla       proportion-v1r4
      Unit >5ms Delays  80252.0000 (  0.00%)  81523.0000 ( -1.58%)
      Unit Mmap min         8.2118 (  0.00%)      8.3206 ( -1.33%)
      Unit Mmap mean       17.4614 (  0.00%)     17.2868 (  1.00%)
      Unit Mmap stddev     24.9059 (  0.00%)     34.6771 (-39.23%)
      Unit Mmap max      2811.6433 (  0.00%)   2645.1398 (  5.92%)
      Unit Mmap 90%        20.5098 (  0.00%)     18.3105 ( 10.72%)
      Unit Mmap 93%        22.9180 (  0.00%)     20.1751 ( 11.97%)
      Unit Mmap 95%        25.2114 (  0.00%)     22.4988 ( 10.76%)
      Unit Mmap 99%        46.1430 (  0.00%)     43.5952 (  5.52%)
      Unit Ideal  Tput     85.2623 (  0.00%)     78.8906 (  7.47%)
      Unit Tput min        44.0666 (  0.00%)     43.9609 (  0.24%)
      Unit Tput mean       45.5646 (  0.00%)     45.2009 (  0.80%)
      Unit Tput stddev      0.9318 (  0.00%)      1.1084 (-18.95%)
      Unit Tput max        46.7375 (  0.00%)     46.7539 ( -0.04%)
      
      This patch (of 3):
      
      We would like to unregister the sb shrinker before ->kill_sb().  This will
      allow cached objects to be counted without call to grab_super_passive() to
      update ref count on sb.  We want to avoid locking during memory
      reclamation especially when we are skipping the memory reclaim when we are
      out of cached objects.
      
      This is safe because grab_super_passive does a try-lock on the
      sb->s_umount now, and so if we are in the unmount process, it won't ever
      block.  That means what used to be a deadlock and races we were avoiding
      by using grab_super_passive() is now:
      
              shrinker                        umount
      
              down_read(shrinker_rwsem)
                                              down_write(sb->s_umount)
                                              shrinker_unregister
                                                down_write(shrinker_rwsem)
                                                  <blocks>
              grab_super_passive(sb)
                down_read_trylock(sb->s_umount)
                  <fails>
              <shrinker aborts>
              ....
              <shrinkers finish running>
              up_read(shrinker_rwsem)
                                                <unblocks>
                                                <removes shrinker>
                                                up_write(shrinker_rwsem)
                                              ->kill_sb()
                                              ....
      
      So it is safe to deregister the shrinker before ->kill_sb().
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 16 Apr, 2014 1 commit
  21. 13 Mar, 2014 1 commit
    • fs: push sync_filesystem() down to the file system's remount_fs() · 02b9984d
      Theodore Ts'o authored
      Previously, the no-op "mount -o remount /dev/xxx" operation when the
      file system is already mounted read-write causes an implied,
      unconditional syncfs().  This seems pretty stupid, and it's certainly
      not documented or guaranteed to do this, nor is it particularly useful,
      except in the case where the file system was mounted rw and is getting
      remounted read-only.
      
      However, it's possible that there might be some file systems that are
      actually depending on this behavior.  In most file systems, it's
      probably fine to only call sync_filesystem() when transitioning from
      read-write to read-only, and there are some file systems where this is
      not needed at all (for example, for a pseudo-filesystem or something
      like romfs).
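
As an illustration of the last paragraph, here is a sketch of a remount handler that syncs only on the read-write to read-only transition. All names (`toy_fs`, `toy_remount_fs()`, `TOY_RDONLY`) are hypothetical, and this models a policy a file system could adopt rather than what any particular file system does.

```c
#include <stdbool.h>

#define TOY_RDONLY 1u  /* stand-in for the read-only mount flag */

struct toy_fs {
    unsigned int flags;
    int sync_calls;    /* counts how often the toy sync ran */
};

static void toy_sync_filesystem(struct toy_fs *fs)
{
    fs->sync_calls++;  /* a real file system would flush dirty data here */
}

/* Sync only when going from read-write to read-only. */
static int toy_remount_fs(struct toy_fs *fs, unsigned int new_flags)
{
    bool was_rw = (fs->flags & TOY_RDONLY) == 0;
    bool will_be_ro = (new_flags & TOY_RDONLY) != 0;

    if (was_rw && will_be_ro)
        toy_sync_filesystem(fs);
    fs->flags = new_flags;
    return 0;
}
```

A no-op remount of a read-write mount leaves `sync_calls` untouched in this model, which is the behaviour the message argues is sufficient for most file systems.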
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Jan Kara <jack@suse.cz>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Anders Larsen <al@alarsen.net>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: xfs@oss.sgi.com
      Cc: linux-btrfs@vger.kernel.org
      Cc: linux-cifs@vger.kernel.org
      Cc: samba-technical@lists.samba.org
      Cc: codalist@coda.cs.cmu.edu
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-f2fs-devel@lists.sourceforge.net
      Cc: fuse-devel@lists.sourceforge.net
      Cc: cluster-devel@redhat.com
      Cc: linux-mtd@lists.infradead.org
      Cc: jfs-discussion@lists.sourceforge.net
      Cc: linux-nfs@vger.kernel.org
      Cc: linux-nilfs@vger.kernel.org
      Cc: linux-ntfs-dev@lists.sourceforge.net
      Cc: ocfs2-devel@oss.oracle.com
      Cc: reiserfs-devel@vger.kernel.org
  22. 31 Jan, 2014 1 commit
    • fs/super.c: sync ro remount after blocking writers · 807612db
      Andrew Ruder authored
      Move sync_filesystem() after sb_prepare_remount_readonly().  If writers
      sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly()
      it can cause inodes to be dirtied and writeback to occur well after
      sys_mount() has completed successfully.
      
      This was spotted by corrupted ubifs filesystems on reboot, but it appears
      that it can cause issues with any filesystem using writeback.
      
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Richard Weinberger <richard@nod.at>
      Co-authored-by: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  23. 22 Jan, 2014 1 commit
  24. 09 Nov, 2013 1 commit
  25. 25 Oct, 2013 2 commits
  26. 01 Oct, 2013 1 commit
    • fs/super.c: fix lru_list leak for real · c2d22ecd
      Al Viro authored
      Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong;
      the right place is destroy_super().  As it is, we leak them if sget()
      decides that new superblock it has allocated (and never shown to
      anybody) isn't needed and should be freed.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  27. 10 Sep, 2013 1 commit