1. 18 Oct, 2018 1 commit
    • Roman Gushchin's avatar
      dcache: account external names as indirectly reclaimable memory · 6d794237
      Roman Gushchin authored
      commit f1782c9b upstream.
      I received a report about suspicious growth of unreclaimable slabs on
      some machines.  I've found that it happens on machines with low memory
      pressure, and these unreclaimable slabs are external names attached to
      External names are allocated using generic kmalloc() function, so they
      are accounted as unreclaimable.  But they are held by dentries, which
      are reclaimable, and they will be reclaimed under the memory pressure.
      In particular, this breaks MemAvailable calculation, as it doesn't take
      unreclaimable slabs into account.  This leads to a silly situation, when
      a machine is almost idle, has no memory pressure and therefore has a big
      dentry cache.  And the resulting MemAvailable is too low to start a new
      To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
      used to track the amount of memory, consumed by external names.  The
      counter is increased in the dentry allocation path, if an external name
      structure is allocated; and it's decreased in the dentry freeing path.
      To reproduce the problem I've used the following Python script:
        import os
        for iter in range (0, 10000000):
                name = ("/some_long_name_%d" % iter) + "_" * 220
            except Exception:
      Without this patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7811688 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    2753052 kB
      With the patch:
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7809516 kB
        $ python indirect.py
        $ cat /proc/meminfo | grep MemAvailable
        MemAvailable:    7749144 kB
      [guro@fb.com: fix indirectly reclaimable memory accounting for CONFIG_SLOB]
        Link: http://lkml.kernel.org/r/20180312194140.19517-1-guro@fb.com
      [guro@fb.com: fix indirectly reclaimable memory accounting]
        Link: http://lkml.kernel.org/r/20180313125701.7955-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20180305133743.12746-5-guro@fb.comSigned-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 15 Sep, 2018 1 commit
  3. 15 Aug, 2018 2 commits
    • Al Viro's avatar
      make sure that __dentry_kill() always invalidates d_seq, unhashed or not · d5426a38
      Al Viro authored
      commit 4c0d7cd5 upstream.
      RCU pathwalk relies upon the assumption that anything that changes
      ->d_inode of a dentry will invalidate its ->d_seq.  That's almost
      true - the one exception is that the final dput() of already unhashed
      dentry does *not* touch ->d_seq at all.  Unhashing does, though,
      so for anything we'd found by RCU dcache lookup we are fine.
      Unfortunately, we can *start* with an unhashed dentry or jump into
      We could try and be careful in the (few) places where that could
      happen.  Or we could just make the final dput() invalidate the damn
      thing, unhashed or not.  The latter is much simpler and easier to
      backport, so let's do it that way.
      Reported-by: default avatar"Dae R. Jeong" <threeearcat@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Al Viro's avatar
      root dentries need RCU-delayed freeing · abfc0ec6
      Al Viro authored
      commit 90bad5e0 upstream.
      Since mountpoint crossing can happen without leaving lazy mode,
      root dentries do need the same protection against having their
      memory freed without RCU delay as everything else in the tree.
      It's partially hidden by RCU delay between detaching from the
      mount tree and dropping the vfsmount reference, but the starting
      point of pathwalk can be on an already detached mount, in which
      case umount-caused RCU delay has already passed by the time the
      lazy pathwalk grabs rcu_read_lock().  If the starting point
      happens to be at the root of that vfsmount *and* that vfsmount
      covers the entire filesystem, we get trouble.
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 30 May, 2018 3 commits
  5. 12 Apr, 2018 1 commit
    • NeilBrown's avatar
      VFS: close race between getcwd() and d_move() · db470ce8
      NeilBrown authored
      [ Upstream commit 61647823 ]
      d_move() will call __d_drop() and then __d_rehash()
      on the dentry being moved.  This creates a small window
      when the dentry appears to be unhashed.  Many tests
      of d_unhashed() are made under ->d_lock and so are safe
      from racing with this window, but some aren't.
      In particular, getcwd() calls d_unlinked() (which calls
      d_unhashed()) without d_lock protection, so it can race.
      This races has been seen in practice with lustre, which uses d_move() as
      part of name lookup.  See:
      It could race with a regular rename(), and result in ENOENT instead
      of either the 'before' or 'after' name.
      The race can be demonstrated with a simple program which
      has two threads, one renaming a directory back and forth
      while another calls getcwd() within that directory: it should never
      fail, but does.  See:
      We could fix this race by taking d_lock and rechecking when
      d_unhashed() reports true.  Alternately when can remove the window,
      which is the approach this patch takes.
      ___d_drop() is introduce which does *not* clear d_hash.pprev
      so the dentry still appears to be hashed.  __d_drop() calls
      ___d_drop(), then clears d_hash.pprev.
      __d_move() now uses ___d_drop() and only clears d_hash.pprev
      when not rehashing.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 21 Mar, 2018 1 commit
    • Al Viro's avatar
      lock_parent() needs to recheck if dentry got __dentry_kill'ed under it · b071bce3
      Al Viro authored
      commit 3b821409 upstream.
      In case when dentry passed to lock_parent() is protected from freeing only
      by the fact that it's on a shrink list and trylock of parent fails, we
      could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
      between unlocking dentry and locking presumed parent.  We need to recheck
      that dentry is alive once we lock both it and parent *and* postpone
      rcu_read_unlock() until after that point.  Otherwise we could return
      a pointer to struct dentry that already is rcu-scheduled for freeing, with
      ->d_lock held on it; caller's subsequent attempt to unlock it can end
      up with memory corruption.
      Cc: stable@vger.kernel.org # 3.12+, counting backports
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 22 Feb, 2018 1 commit
  8. 25 Dec, 2017 1 commit
  9. 10 Jul, 2017 1 commit
  10. 08 Jul, 2017 1 commit
    • Al Viro's avatar
      dentry name snapshots · 49d31c2f
      Al Viro authored
      take_dentry_name_snapshot() takes a safe snapshot of dentry name;
      if the name is a short one, it gets copied into caller-supplied
      structure, otherwise an extra reference to external name is grabbed
      (those are never modified).  In either case the pointer to stable
      string is stored into the same structure.
      dentry must be held by the caller of take_dentry_name_snapshot(),
      but may be freely dropped afterwards - the snapshot will stay
      until destroyed by release_dentry_name_snapshot().
      Intended use:
      	struct name_snapshot s;
      	take_dentry_name_snapshot(&s, dentry);
      	access s.name
      Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
      to pass down with event.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  11. 06 Jul, 2017 2 commits
    • Pavel Tatashin's avatar
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin authored
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • David Howells's avatar
      VFS: Provide empty name qstr · cdf01226
      David Howells authored
      Provide an empty name (ie. "") qstr for general use.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  12. 30 Jun, 2017 1 commit
  13. 15 Jun, 2017 1 commit
    • Al Viro's avatar
      Hang/soft lockup in d_invalidate with simultaneous calls · 81be24d2
      Al Viro authored
      It's not hard to trigger a bunch of d_invalidate() on the same
      dentry in parallel.  They end up fighting each other - any
      dentry picked for removal by one will be skipped by the rest
      and we'll go for the next iteration through the entire
      subtree, even if everything is being skipped.  Morevoer, we
      immediately go back to scanning the subtree.  The only thing
      we really need is to dissolve all mounts in the subtree and
      as soon as we've nothing left to do, we can just unhash the
      dentry and bugger off.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  14. 03 May, 2017 1 commit
    • Josef Bacik's avatar
      fs: don't set *REFERENCED on single use objects · 563f4001
      Josef Bacik authored
      By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
      inode we create.  This is problematic as this means that it takes two
      trips through the LRU for any of these objects to be reclaimed,
      regardless of their actual lifetime.  With enough pressure from these
      caches we can easily evict our working set from page cache with single
      use objects.  So instead only set *REFERENCED if we've already been
      added to the LRU list.  This means that we've been touched since the
      first time we were accessed, and so more likely to need to hang out in
      To illustrate this issue I wrote the following scripts
      on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
      I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
      creates a new file system and creates 2 6.5gib files in order to take up
      13gib of the 16gib of ram with pagecache.  Then it runs a test program
      that reads these 2 files in a loop, and keeps track of how often it has
      to read bytes for each loop.  On an ideal system with no pressure we
      should have to read 0 bytes indefinitely.  The second thing this script
      does is start a fs_mark job that creates a ton of 0 length files,
      putting pressure on the system with slab only allocations.  On exit the
      script prints out how many bytes were read by the read-file program.
      The results are as follows
      Without patch:
      /mnt/btrfs-test/reads/file1: total read during loops 27262988288
      /mnt/btrfs-test/reads/file2: total read during loops 27262976000
      With patch:
      /mnt/btrfs-test/reads/file2: total read during loops 18640457728
      /mnt/btrfs-test/reads/file1: total read during loops 9565376512
      This patch results in a 50% reduction of the amount of pages evicted
      from our working set.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  15. 10 Jan, 2017 1 commit
    • Eric W. Biederman's avatar
      mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman authored
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire.  At which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen on the same time.
      Kristen Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisioned by another thread.
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      I have taken Al's suggested patch moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and have replaced
      new_mountpoint with get_mountpoint a function that does the hash table
      lookup and addition under the mount_lock.   The introduction of get_mounptoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Suggested-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  16. 24 Dec, 2016 1 commit
  17. 04 Dec, 2016 2 commits
  18. 31 Jul, 2016 1 commit
  19. 29 Jul, 2016 2 commits
  20. 24 Jul, 2016 2 commits
    • Wei Fang's avatar
      fs/dcache.c: avoid soft-lockup in dput() · 47be6184
      Wei Fang authored
      We triggered soft-lockup under stress test which
      open/access/write/close one file concurrently on more than
      five different CPUs:
      WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
      [<ffffffc0003986f8>] dput+0x100/0x298
      [<ffffffc00038c2dc>] terminate_walk+0x4c/0x60
      [<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8
      [<ffffffc00038f780>] filename_lookup+0x38/0xf0
      [<ffffffc000391180>] user_path_at_empty+0x78/0xd0
      [<ffffffc0003911f4>] user_path_at+0x1c/0x28
      [<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230
      ->d_lock trylock may failed many times because of concurrently
      operations, and dput() may execute a long time.
      Fix this by replacing cpu_relax() with cond_resched().
      dput() used to be sleepable, so make it sleepable again
      should be safe.
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarWei Fang <fangwei1@huawei.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Miklos Szeredi's avatar
      vfs: new d_init method · 285b102d
      Miklos Szeredi authored
      Allow filesystem to initialize dentry at allocation time.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  21. 21 Jul, 2016 1 commit
  22. 01 Jul, 2016 2 commits
  23. 30 Jun, 2016 1 commit
    • Miklos Szeredi's avatar
      vfs: merge .d_select_inode() into .d_real() · 2d902671
      Miklos Szeredi authored
      The two methods essentially do the same: find the real dentry/inode
      belonging to an overlay dentry.  The difference is in the usage:
      vfs_open() uses ->d_select_inode() and expects the function to perform
      copy-up if necessary based on the open flags argument.
      file_dentry() uses ->d_real() passing in the overlay dentry as well as the
      underlying inode.
      vfs_rename() uses ->d_select_inode() but passes zero flags.  ->d_real()
      with a zero inode would have worked just as well here.
      This patch merges the functionality of ->d_select_inode() into ->d_real()
      by adding an 'open_flags' argument to the latter.
      [Al Viro] Make the signature of d_real() match that of ->d_real() again.
      And constify the inode argument, while we are at it.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  24. 20 Jun, 2016 1 commit
  25. 11 Jun, 2016 2 commits
    • George Spelvin's avatar
      fs/dcache.c: Save one 32-bit multiply in dcache lookup · 703b5faf
      George Spelvin authored
      Noe that we're mixing in the parent pointer earlier, we
      don't need to use hash_32() to mix its bits.  Instead, we can
      just take the msbits of the hash value directly.
      For those applications which use the partial_name_hash(),
      move the multiply to end_name_hash.
      Signed-off-by: default avatarGeorge Spelvin <linux@sciencehorizons.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds's avatar
      vfs: make the string hashes salt the hash · 8387ff25
      Linus Torvalds authored
      We always mixed in the parent pointer into the dentry name hash, but we
      did it late at lookup time.  It turns out that we can simplify that
      lookup-time action by salting the hash with the parent pointer early
      instead of late.
      A few other users of our string hashes also wanted to mix in their own
      pointers into the hash, and those are updated to use the same mechanism.
      Hash users that don't have any particular initial salt can just use the
      NULL pointer as a no-salt.
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  26. 10 Jun, 2016 1 commit
    • Al Viro's avatar
      much milder d_walk() race · ba65dc5e
      Al Viro authored
      d_walk() relies upon the tree not getting rearranged under it without
      rename_lock being touched.  And we do grab rename_lock around the
      places that change the tree topology.  Unfortunately, branch reordering
      is just as bad from d_walk() POV and we have two places that do it
      without touching rename_lock - one in handling of cursors (for ramfs-style
      directories) and another in autofs.  autofs one is a separate story; this
      commit deals with the cursors.
      	* mark cursor dentries explicitly at allocation time
      	* make __dentry_kill() leave ->d_child.next pointing to the next
      non-cursor sibling, making sure that it won't be moved around unnoticed
      before the parent is relocked on ascend-to-parent path in d_walk().
      	* make d_walk() skip cursors explicitly; strictly speaking it's
      not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
      but it makes analysis easier.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  27. 08 Jun, 2016 1 commit
    • Al Viro's avatar
      fix d_walk()/non-delayed __d_free() race · 3d56c25e
      Al Viro authored
      Ascend-to-parent logics in d_walk() depends on all encountered child
      dentries not getting freed without an RCU delay.  Unfortunately, in
      quite a few cases it is not true, with hard-to-hit oopsable race as
      the result.
      Fortunately, the fix is simiple; right now the rule is "if it ever
      been hashed, freeing must be delayed" and changing it to "if it
      ever had a parent, freeing must be delayed" closes that hole and
      covers all cases the old rule used to cover.  Moreover, pipes and
      sockets remain _not_ covered, so we do not introduce RCU delay in
      the cases which are the reason for having that delay conditional
      in the first place.
      Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  28. 30 May, 2016 1 commit
    • Al Viro's avatar
      unify dentry_iput() and dentry_unlink_inode() · 550dce01
      Al Viro authored
      There is a lot of duplication between dentry_unlink_inode() and dentry_iput().
      The only real difference is that dentry_unlink_inode() bumps ->d_seq and
      dentry_iput() doesn't.  The argument of the latter is known to have been
      unhashed, so anybody who might've found it in RCU lookup would already be
      doomed to a ->d_seq mismatch.  And we want to avoid pointless smp_rmb() there.
      This patch makes dentry_unlink_inode() bump ->d_seq only for hashed dentries.
      It's safe (d_delete() calls that sucker only if we are holding the only
      reference to dentry, so rehash is not going to happen) and it allows
      to use dentry_unlink_inode() in __dentry_kill() and get rid of dentry_iput().
      The interesting question here is profiling; it *is* a hot path, and extra
      conditional jumps in there might or might not be painful.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  29. 29 May, 2016 1 commit
  30. 28 May, 2016 1 commit
    • George Spelvin's avatar
      fs/namei.c: Add hashlen_string() function · fcfd2fbf
      George Spelvin authored
      We'd like to make more use of the highly-optimized dcache hash functions
      throughout the kernel, rather than have every subsystem create its own,
      and a function that hashes basic null-terminated strings is required
      for that.
      (The name is to emphasize that it returns both hash and length.)
      It's actually useful in the dcache itself, specifically d_alloc_name().
      Other uses in the next patch.
      full_name_hash() is also tweaked to make it more generally useful:
      1) Take a "char *" rather than "unsigned char *" argument, to
         be consistent with hash_name().
      2) Handle zero-length inputs.  If we want more callers, we don't want
         to make them worry about corner cases.
      Signed-off-by: default avatarGeorge Spelvin <linux@sciencehorizons.net>
  31. 02 May, 2016 1 commit