1. 08 Jul, 2017 1 commit
    • dentry name snapshots · 49d31c2f
      Al Viro authored
      take_dentry_name_snapshot() takes a safe snapshot of dentry name;
      if the name is a short one, it gets copied into caller-supplied
      structure, otherwise an extra reference to external name is grabbed
      (those are never modified).  In either case the pointer to stable
      string is stored into the same structure.
      
      dentry must be held by the caller of take_dentry_name_snapshot(),
      but may be freely dropped afterwards - the snapshot will stay
      until destroyed by release_dentry_name_snapshot().
      
      Intended use:
      	struct name_snapshot s;
      
      	take_dentry_name_snapshot(&s, dentry);
      	...
      	access s.name
      	...
      	release_dentry_name_snapshot(&s);
      
      Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
      to pass down with event.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      49d31c2f
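The snapshot rule described above (copy short names, take a reference on long ones) can be modelled in userspace. This is only a minimal sketch: the `toy_*` names, the 32-byte inline limit, and the plain integer refcount are all assumptions made for illustration, and it ignores the locking the kernel needs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_INLINE_LEN 32        /* hypothetical inline-name capacity */

/* Long names live in a refcounted buffer that is never modified. */
struct toy_ext_name {
    int  refcount;
    char name[];
};

/* Toy stand-in for a dentry: the name is either inline or external. */
struct toy_dentry {
    char                 iname[TOY_INLINE_LEN];
    struct toy_ext_name *ext;    /* NULL when the name fits inline */
};

struct toy_snapshot {
    const char          *name;   /* always points at a stable string */
    char                 inline_name[TOY_INLINE_LEN];
    struct toy_ext_name *ext;    /* reference held, or NULL */
};

static void toy_set_name(struct toy_dentry *d, const char *name)
{
    size_t len = strlen(name);

    d->ext = NULL;
    if (len < TOY_INLINE_LEN) {
        memcpy(d->iname, name, len + 1);
        return;
    }
    d->ext = malloc(sizeof(*d->ext) + len + 1);
    assert(d->ext);
    d->ext->refcount = 1;
    memcpy(d->ext->name, name, len + 1);
}

/* Drop the dentry's own hold on its name (models the dentry going away). */
static void toy_drop_name(struct toy_dentry *d)
{
    if (d->ext && --d->ext->refcount == 0)
        free(d->ext);
    d->ext = NULL;
    d->iname[0] = '\0';
}

static void toy_take_snapshot(struct toy_snapshot *s, struct toy_dentry *d)
{
    if (d->ext) {                /* long name: share the buffer */
        d->ext->refcount++;
        s->ext = d->ext;
        s->name = d->ext->name;
    } else {                     /* short name: copy it */
        strcpy(s->inline_name, d->iname);
        s->ext = NULL;
        s->name = s->inline_name;
    }
}

static void toy_release_snapshot(struct toy_snapshot *s)
{
    if (s->ext && --s->ext->refcount == 0)
        free(s->ext);
    s->ext = NULL;
    s->name = NULL;
}
```

In both cases `s.name` stays valid after the dentry's own name is dropped, which is the property the commit message relies on.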
  2. 30 Jun, 2017 1 commit
  3. 15 Jun, 2017 1 commit
    • Hang/soft lockup in d_invalidate with simultaneous calls · 81be24d2
      Al Viro authored
      It's not hard to trigger a bunch of d_invalidate() on the same
      dentry in parallel.  They end up fighting each other - any
      dentry picked for removal by one will be skipped by the rest
      and we'll go for the next iteration through the entire
      subtree, even if everything is being skipped.  Moreover, we
      immediately go back to scanning the subtree.  The only thing
      we really need is to dissolve all mounts in the subtree and
      as soon as we've nothing left to do, we can just unhash the
      dentry and bugger off.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      81be24d2
  4. 03 May, 2017 1 commit
    • fs: don't set *REFERENCED on single use objects · 563f4001
      Josef Bacik authored
      By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
      inode we create.  This is problematic as this means that it takes two
      trips through the LRU for any of these objects to be reclaimed,
      regardless of their actual lifetime.  With enough pressure from these
      caches we can easily evict our working set from page cache with single
      use objects.  So instead only set *REFERENCED if we've already been
      added to the LRU list.  This means that we've been touched since the
      first time we were accessed, and so more likely to need to hang out in
      cache.
      
      To illustrate this issue I wrote the following scripts
      
      https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure
      
      on my test box.  It is a single socket 4 core CPU with 16 GiB of RAM and
      I tested on an Intel 2 TiB NVMe drive.  The cache-pressure.sh script
      creates a new file system and creates two 6.5 GiB files in order to take up
      13 GiB of the 16 GiB of RAM with pagecache.  Then it runs a test program
      that reads these 2 files in a loop, and keeps track of how often it has
      to read bytes for each loop.  On an ideal system with no pressure we
      should have to read 0 bytes indefinitely.  The second thing this script
      does is start a fs_mark job that creates a ton of 0 length files,
      putting pressure on the system with slab only allocations.  On exit the
      script prints out how many bytes were read by the read-file program.
      The results are as follows
      
      Without patch:
      /mnt/btrfs-test/reads/file1: total read during loops 27262988288
      /mnt/btrfs-test/reads/file2: total read during loops 27262976000
      
      With patch:
      /mnt/btrfs-test/reads/file2: total read during loops 18640457728
      /mnt/btrfs-test/reads/file1: total read during loops 9565376512
      
      This patch results in a roughly 50% reduction in the number of pages evicted
      from our working set.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      563f4001
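The effect of the *REFERENCED change on reclaim can be illustrated with a small second-chance model. This is a sketch under stated assumptions: the `toy_*` names and flag values are hypothetical, and a single object stands in for a whole LRU list.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ON_LRU     0x1u   /* hypothetical flag values */
#define TOY_REFERENCED 0x2u

struct toy_obj {
    unsigned flags;
};

/* Called when the last user lets go of the object. */
static void toy_last_put(struct toy_obj *o)
{
    if (!(o->flags & TOY_ON_LRU))
        o->flags |= TOY_ON_LRU;        /* first use: no REFERENCED yet */
    else
        o->flags |= TOY_REFERENCED;    /* touched again while on the LRU */
}

/* One reclaim pass over the object; returns true if it was evicted. */
static bool toy_reclaim_pass(struct toy_obj *o)
{
    assert(o->flags & TOY_ON_LRU);
    if (o->flags & TOY_REFERENCED) {
        o->flags &= ~TOY_REFERENCED;   /* second chance: keep it for now */
        return false;
    }
    o->flags &= ~TOY_ON_LRU;           /* evict */
    return true;
}
```

An object used once is evicted on the first pass, while one that was touched again after landing on the LRU survives one extra pass.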
  5. 10 Jan, 2017 1 commit
    • mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman authored
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire.  At which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen at the same time.
      
      Krister Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisoned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch, moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and replaced
      new_mountpoint with get_mountpoint, a function that does the hash table
      lookup and addition under the mount_lock.  The introduction of get_mountpoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: Krister Johansen <kjlx@templeofstupid.com>
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      3895dbf8
  6. 24 Dec, 2016 1 commit
  7. 04 Dec, 2016 2 commits
  8. 31 Jul, 2016 1 commit
  9. 29 Jul, 2016 2 commits
  10. 24 Jul, 2016 2 commits
    • fs/dcache.c: avoid soft-lockup in dput() · 47be6184
      Wei Fang authored
      We triggered a soft lockup under a stress test that
      opens/accesses/writes/closes one file concurrently on more than
      five different CPUs:
      
      WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
      ...
      [<ffffffc0003986f8>] dput+0x100/0x298
      [<ffffffc00038c2dc>] terminate_walk+0x4c/0x60
      [<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8
      [<ffffffc00038f780>] filename_lookup+0x38/0xf0
      [<ffffffc000391180>] user_path_at_empty+0x78/0xd0
      [<ffffffc0003911f4>] user_path_at+0x1c/0x28
      [<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230
      
      The ->d_lock trylock may fail many times because of concurrent
      operations, so dput() may run for a long time.
      
      Fix this by replacing cpu_relax() with cond_resched().
      dput() used to be sleepable, so making it sleepable again
      should be safe.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Wei Fang <fangwei1@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      47be6184
    • vfs: new d_init method · 285b102d
      Miklos Szeredi authored
      Allow filesystem to initialize dentry at allocation time.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      285b102d
  11. 21 Jul, 2016 1 commit
  12. 01 Jul, 2016 2 commits
  13. 30 Jun, 2016 1 commit
    • vfs: merge .d_select_inode() into .d_real() · 2d902671
      Miklos Szeredi authored
      The two methods essentially do the same: find the real dentry/inode
      belonging to an overlay dentry.  The difference is in the usage:
      
      vfs_open() uses ->d_select_inode() and expects the function to perform
      copy-up if necessary based on the open flags argument.
      
      file_dentry() uses ->d_real() passing in the overlay dentry as well as the
      underlying inode.
      
      vfs_rename() uses ->d_select_inode() but passes zero flags.  ->d_real()
      with a zero inode would have worked just as well here.
      
      This patch merges the functionality of ->d_select_inode() into ->d_real()
      by adding an 'open_flags' argument to the latter.
      
      [Al Viro] Make the signature of d_real() match that of ->d_real() again.
      And constify the inode argument, while we are at it.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      2d902671
  14. 20 Jun, 2016 1 commit
  15. 11 Jun, 2016 2 commits
    • fs/dcache.c: Save one 32-bit multiply in dcache lookup · 703b5faf
      George Spelvin authored
      Now that we're mixing in the parent pointer earlier, we
      don't need to use hash_32() to mix its bits.  Instead, we can
      just take the msbits of the hash value directly.
      
      For those applications which use the partial_name_hash(),
      move the multiply to end_name_hash.
      Signed-off-by: George Spelvin <linux@sciencehorizons.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      703b5faf
    • vfs: make the string hashes salt the hash · 8387ff25
      Linus Torvalds authored
      We always mixed in the parent pointer into the dentry name hash, but we
      did it late at lookup time.  It turns out that we can simplify that
      lookup-time action by salting the hash with the parent pointer early
      instead of late.
      
      A few other users of our string hashes also wanted to mix in their own
      pointers into the hash, and those are updated to use the same mechanism.
      
      Hash users that don't have any particular initial salt can just use the
      NULL pointer as a no-salt.
      
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8387ff25
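The two hashing commits above (salting early, then taking the top bits of the final hash) can be sketched as follows. This is a simplified byte-at-a-time model, not the kernel's word-at-a-time implementation: the `toy_*` names, the 8-bit table size, and the multiplier are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_BITS 8   /* hypothetical table size: 256 buckets */

/* Start from the salt (for example the parent pointer) instead of 0. */
static uint32_t toy_init_hash(const void *salt)
{
    return (uint32_t)(uintptr_t)salt;
}

/* Fold one character into the running hash. */
static uint32_t toy_partial_hash(uint32_t c, uint32_t prev)
{
    return (prev + (c << 4) + (c >> 4)) * 11u;
}

/* The single multiplicative mix happens once, at the end. */
static uint32_t toy_end_hash(uint32_t hash)
{
    return hash * 0x61C88647u;
}

static uint32_t toy_name_hash(const void *salt, const char *name, size_t len)
{
    uint32_t hash = toy_init_hash(salt);

    for (size_t i = 0; i < len; i++)
        hash = toy_partial_hash((unsigned char)name[i], hash);
    return toy_end_hash(hash);
}

/* The salt is already mixed in, so the bucket is just the top bits. */
static unsigned toy_bucket(uint32_t hash)
{
    return hash >> (32 - TOY_HASH_BITS);
}
```

A caller with no particular salt passes `NULL`, matching the "no-salt" convention mentioned in the commit message.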
  16. 10 Jun, 2016 1 commit
    • much milder d_walk() race · ba65dc5e
      Al Viro authored
      d_walk() relies upon the tree not getting rearranged under it without
      rename_lock being touched.  And we do grab rename_lock around the
      places that change the tree topology.  Unfortunately, branch reordering
      is just as bad from d_walk() POV and we have two places that do it
      without touching rename_lock - one in handling of cursors (for ramfs-style
      directories) and another in autofs.  autofs one is a separate story; this
      commit deals with the cursors.
      	* mark cursor dentries explicitly at allocation time
      	* make __dentry_kill() leave ->d_child.next pointing to the next
      non-cursor sibling, making sure that it won't be moved around unnoticed
      before the parent is relocked on ascend-to-parent path in d_walk().
      	* make d_walk() skip cursors explicitly; strictly speaking it's
      not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
      but it makes analysis easier.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ba65dc5e
  17. 08 Jun, 2016 1 commit
    • fix d_walk()/non-delayed __d_free() race · 3d56c25e
      Al Viro authored
      Ascend-to-parent logics in d_walk() depends on all encountered child
      dentries not getting freed without an RCU delay.  Unfortunately, in
      quite a few cases it is not true, with hard-to-hit oopsable race as
      the result.
      
      Fortunately, the fix is simple; right now the rule is "if it has ever
      been hashed, freeing must be delayed" and changing it to "if it
      ever had a parent, freeing must be delayed" closes that hole and
      covers all cases the old rule used to cover.  Moreover, pipes and
      sockets remain _not_ covered, so we do not introduce RCU delay in
      the cases which are the reason for having that delay conditional
      in the first place.
      
      Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3d56c25e
  18. 30 May, 2016 1 commit
    • unify dentry_iput() and dentry_unlink_inode() · 550dce01
      Al Viro authored
      There is a lot of duplication between dentry_unlink_inode() and dentry_iput().
      The only real difference is that dentry_unlink_inode() bumps ->d_seq and
      dentry_iput() doesn't.  The argument of the latter is known to have been
      unhashed, so anybody who might've found it in RCU lookup would already be
      doomed to a ->d_seq mismatch.  And we want to avoid pointless smp_rmb() there.
      
      This patch makes dentry_unlink_inode() bump ->d_seq only for hashed dentries.
      It's safe (d_delete() calls that sucker only if we are holding the only
      reference to dentry, so rehash is not going to happen) and it allows
      to use dentry_unlink_inode() in __dentry_kill() and get rid of dentry_iput().
      
      The interesting question here is profiling; it *is* a hot path, and extra
      conditional jumps in there might or might not be painful.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      550dce01
  19. 29 May, 2016 1 commit
  20. 28 May, 2016 1 commit
    • fs/namei.c: Add hashlen_string() function · fcfd2fbf
      George Spelvin authored
      We'd like to make more use of the highly-optimized dcache hash functions
      throughout the kernel, rather than have every subsystem create its own,
      and a function that hashes basic null-terminated strings is required
      for that.
      
      (The name is to emphasize that it returns both hash and length.)
      
      It's actually useful in the dcache itself, specifically d_alloc_name().
      Other uses in the next patch.
      
      full_name_hash() is also tweaked to make it more generally useful:
      1) Take a "char *" rather than "unsigned char *" argument, to
         be consistent with hash_name().
      2) Handle zero-length inputs.  If we want more callers, we don't want
         to make them worry about corner cases.
      Signed-off-by: George Spelvin <linux@sciencehorizons.net>
      fcfd2fbf
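The "returns both hash and length" idea can be sketched by packing the two values into one 64-bit result. This is a minimal model: the `toy_*` names and the byte-at-a-time hash step are assumptions for illustration, not the optimized kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 32-bit hash and a 32-bit length into one 64-bit value. */
static uint64_t toy_hashlen_create(uint32_t hash, uint32_t len)
{
    return ((uint64_t)len << 32) | hash;
}

static uint32_t toy_hashlen_hash(uint64_t hashlen)
{
    return (uint32_t)hashlen;
}

static uint32_t toy_hashlen_len(uint64_t hashlen)
{
    return (uint32_t)(hashlen >> 32);
}

/* Hash a NUL-terminated string in one pass, yielding hash and length. */
static uint64_t toy_hashlen_string(const char *name)
{
    uint32_t hash = 0, len = 0;

    for (; name[len]; len++) {
        uint32_t c = (unsigned char)name[len];

        hash = (hash + (c << 4) + (c >> 4)) * 11u;
    }
    return toy_hashlen_create(hash, len);
}
```

Because the loop stops at the terminator, a zero-length input needs no special case: it yields hash 0 and length 0.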
  21. 02 May, 2016 7 commits
    • parallel lookups: actual switch to rwsem · 9902af79
      Al Viro authored
      ta-da!
      
      The main issue is the lack of down_write_killable(), so the places
      like readdir.c switched to plain inode_lock(); once killable
      variants of rwsem primitives appear, that'll be dealt with.
      
      lockdep side also might need more work
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9902af79
    • parallel lookups machinery, part 4 (and last) · d9171b93
      Al Viro authored
      If we *do* run into an in-lookup match, we need to wait for it to
      cease being in-lookup.  Fortunately, we do have unused space in
      in-lookup dentries - d_lru is never looked at until it stops being
      in-lookup.
      
      So we can stash a pointer to wait_queue_head from stack frame of
      the caller of ->lookup().  Some precautions are needed while
      waiting, but it's not that hard - we do hold a reference to dentry
      we are waiting for, so it can't go away.  If it's found to be
      in-lookup the wait_queue_head is still alive and will remain so
      at least while ->d_lock is held.  Moreover, the condition we
      are waiting for becomes true at the same point where everything
      on that wq gets woken up, so we can just add ourselves to the
      queue once.
      
      d_alloc_parallel() gets a pointer to wait_queue_head_t from its
      caller; lookup_slow() adjusted, d_add_ci() taught to use
      d_alloc_parallel() if the dentry passed to it happens to be
      in-lookup one (i.e. if it's been called from the parallel lookup).
      
      That's pretty much it - all that remains is to switch ->i_mutex
      to rwsem and have lookup_slow() take it shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d9171b93
    • parallel lookups machinery, part 3 · 94bdd655
      Al Viro authored
      We will need to be able to check if there is an in-lookup
      dentry with matching parent/name.  Right now it's impossible,
      but as soon as we start locking directories shared such beasts
      will appear.
      
      Add a secondary hash for locating those.  Hash chains go through
      the same space where d_alias will be once it's not in-lookup anymore.
      Search is done under the same bitlock we use for modifications -
      with the primary hash we can rely on d_rehash() into the wrong
      chain being the worst that could happen, but here the pointers are
      buggered once it's removed from the chain.  On the other hand,
      the chains are not going to be long and normally we'll end up
      adding to the chain anyway.  That allows us to avoid bothering with
      ->d_lock when doing the comparisons - everything is stable until
      removed from chain.
      
      New helper: d_alloc_parallel().  Right now it allocates, verifies
      that no hashed and in-lookup matches exist and adds to in-lookup
      hash.
      
      Returns ERR_PTR() for error, hashed match (in the unlikely case it's
      been found) or new dentry.  In-lookup matches trigger BUG() for
      now; that will change in the next commit when we introduce waiting
      for ongoing lookup to finish.  Note that in-lookup matches won't be
      possible until we actually go for shared locking.
      
      lookup_slow() switched to use of d_alloc_parallel().
      
      Again, these commits are separated only for making it easier to
      review.  All this machinery will start doing something useful only
      when we go for shared locking; it's just that the combination is
      too large for my taste.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      94bdd655
    • parallel lookups machinery, part 2 · 84e710da
      Al Viro authored
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, but for __d_add() it would be rather inconvenient to
      grab; we could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      84e710da
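The even/odd counter protocol described above (cmpxchg from even to odd, release store from odd to even) can be sketched with C11 atomics. The `toy_*` names are hypothetical, and the reader helpers only illustrate the "read the counter, look up, recheck" sequence from the message; they are not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Per-directory counter: even means idle, odd means an update is running. */
struct toy_dir {
    atomic_uint seq;
};

/* Writer entry: move the counter from even to odd, retrying on contention. */
static unsigned toy_start_add(struct toy_dir *dir)
{
    for (;;) {
        unsigned n = atomic_load_explicit(&dir->seq, memory_order_relaxed);

        if (!(n & 1) &&
            atomic_compare_exchange_weak_explicit(&dir->seq, &n, n + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return n;
    }
}

/* Writer exit: publish the update by moving from odd to the next even value. */
static void toy_end_add(struct toy_dir *dir, unsigned n)
{
    atomic_store_explicit(&dir->seq, n + 2, memory_order_release);
}

/* Reader: sample the counter, rounded down to even, before the lookups. */
static unsigned toy_read_begin(struct toy_dir *dir)
{
    return atomic_load_explicit(&dir->seq, memory_order_acquire) & ~1u;
}

/* Reader: true if an update started or finished since toy_read_begin(). */
static bool toy_read_retry(struct toy_dir *dir, unsigned start)
{
    return atomic_load_explicit(&dir->seq, memory_order_acquire) != start;
}
```

Serializing writers on the counter itself avoids needing a separate spinlock, which is the space constraint the commit message points to.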
    • beginning of transition to parallel lookups - marking in-lookup dentries · 85c7f810
      Al Viro authored
      marked as such when (would be) parallel lookup is about to pass them
      to actual ->lookup(); unmarked when
      	* __d_add() is about to make it hashed, positive or not.
      	* __d_move() (from d_splice_alias(), directly or via
      __d_unalias()) puts a preexisting dentry in its place
      	* in caller of ->lookup() if it has escaped all of the
      above.  Bug (WARN_ON, actually) if it reaches the final dput()
      or d_instantiate() while still marked such.
      
      As the result, we are guaranteed that for as long as the flag is
      set, dentry will
      	* remain negative unhashed with positive refcount
      	* never have its ->d_alias looked at
      	* never have its ->d_lru looked at
      	* never have its ->d_parent and ->d_name changed
      
      Right now we have at most one such for any given parent directory.
      With parallel lookups that restriction will weaken to
      	* only exist when parent is locked shared
      	* at most one with given (parent,name) pair (comparison of
      names is according to ->d_compare())
      	* only exist when there's no hashed dentry with the same
      (parent,name)
      
      Transition will take the next several commits; unfortunately, we'll
      only be able to switch to rwsem at the end of this series.  The
      reason for not making it a single patch is to simplify review.
      
      New primitives: d_in_lookup() (a predicate checking if dentry is in
      the in-lookup state) and d_lookup_done() (tells the system that
      we are done with lookup and if it's still marked as in-lookup, it
      should cease to be such).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      85c7f810
    • __d_add(): don't drop/regain ->d_lock · 0568d705
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0568d705
  22. 28 Mar, 2016 1 commit
  23. 26 Mar, 2016 1 commit
    • fs: add file_dentry() · d101a125
      Miklos Szeredi authored
      This series fixes bugs in nfs and ext4 due to 4bacc9c9 ("overlayfs:
      Make f_path always point to the overlay and f_inode to the underlay").
      
      Regular files opened on overlayfs will result in the file being opened on
      the underlying filesystem, while f_path points to the overlayfs
      mount/dentry.
      
      This confuses filesystems which get the dentry from struct file and assume
      it's theirs.
      
      Add a new helper, file_dentry() [*], to get the filesystem's own dentry
      from the file.  This checks file->f_path.dentry->d_flags against
      DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not
      set (this is the common, non-overlayfs case).
      
      In the uncommon case it will call into overlayfs's ->d_real() to get the
      underlying dentry, matching file_inode(file).
      
      The reason we need to check against the inode is that if the file is copied
      up while being open, d_real() would return the upper dentry, while the open
      file comes from the lower dentry.
      
      [*] If possible, it's better simply to use file_inode() instead.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: <stable@vger.kernel.org> # v4.2
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Daniel Axtens <dja@axtens.net>
      d101a125
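The flag check and the inode-matching step described above can be modelled in userspace. This is a sketch under stated assumptions: the `sk_*` types, the `SK_OP_REAL` flag, and the toy overlay callback are hypothetical stand-ins, not the kernel or overlayfs definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SK_OP_REAL 0x1u   /* hypothetical stand-in for DCACHE_OP_REAL */

struct sk_inode {
    int ino;
};

struct sk_dentry;
typedef struct sk_dentry *(*sk_d_real_t)(struct sk_dentry *,
                                         const struct sk_inode *);

struct sk_dentry {
    unsigned          flags;
    sk_d_real_t       d_real;   /* only set when SK_OP_REAL is set */
    struct sk_inode  *inode;
    struct sk_dentry *lower;    /* overlay dentries only */
    struct sk_dentry *upper;    /* overlay dentries only, set after copy-up */
};

struct sk_file {
    struct sk_dentry *path_dentry;   /* what f_path.dentry would hold */
    struct sk_inode  *inode;         /* what file_inode() would return */
};

/* Toy overlay callback: pick the layer whose inode matches the open file. */
static struct sk_dentry *sk_overlay_d_real(struct sk_dentry *d,
                                           const struct sk_inode *inode)
{
    if (d->upper && d->upper->inode == inode)
        return d->upper;
    if (d->lower && d->lower->inode == inode)
        return d->lower;
    return d;
}

/* Common case: no flag, return the dentry itself; otherwise ask the fs. */
static struct sk_dentry *sk_file_dentry(const struct sk_file *f)
{
    struct sk_dentry *d = f->path_dentry;

    if (d->flags & SK_OP_REAL)
        return d->d_real(d, f->inode);
    return d;
}
```

Matching on the inode is what keeps a file opened before copy-up tied to the lower dentry even once an upper dentry exists.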
  24. 14 Mar, 2016 5 commits
  25. 29 Feb, 2016 1 commit