1. 22 Jun, 2019 1 commit
  2. 17 Jul, 2018 1 commit
    • Linus Torvalds's avatar
      Fix up non-directory creation in SGID directories · 298243a5
      Linus Torvalds authored
      commit 0fa3ecd8 upstream.
      sgid directories have special semantics, making newly created files in
      the directory belong to the group of the directory, and newly created
      subdirectories will also become sgid.  This is historically used for
      group-shared directories.
      But group directories writable by non-group members should not imply
      that such non-group members can magically join the group, so make sure
      to clear the sgid bit on non-directories for non-members (but remember
      that sgid without group execute means "mandatory locking", just to
      confuse things even more).
      Reported-by: default avatarJann Horn <jannh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 08 Jul, 2018 1 commit
  4. 09 Sep, 2017 1 commit
  5. 05 Sep, 2017 1 commit
  6. 01 Sep, 2017 1 commit
    • Darrick J. Wong's avatar
      xfs: evict all inodes involved with log redo item · 799ea9e9
      Darrick J. Wong authored
      When we introduced the bmap redo log items, we set MS_ACTIVE on the
      mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
      from being truncated prematurely during log recovery.  This also had the
      effect of putting linked inodes on the lru instead of evicting them.
      Unfortunately, we neglected to find all those unreferenced lru inodes
      and evict them after finishing log recovery, which means that we leak
      them if anything goes wrong in the rest of xfs_mountfs, because the lru
      is only cleaned out on unmount.
      Therefore, evict unreferenced inodes in the lru list immediately
      after clearing MS_ACTIVE.
      Fixes: 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Cc: viro@ZenIV.linux.org.uk
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
  7. 06 Jul, 2017 1 commit
    • Pavel Tatashin's avatar
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin authored
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  8. 30 Jun, 2017 1 commit
  9. 27 Jun, 2017 1 commit
    • Jens Axboe's avatar
      fs: add fcntl() interface for setting/getting write life time hints · c75b1d94
      Jens Axboe authored
      Define a set of write life time hints:
      RWH_WRITE_LIFE_NOT_SET	No hint information set
      RWH_WRITE_LIFE_NONE	No hints about write life time
      RWH_WRITE_LIFE_SHORT	Data written has a short life time
      RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
      RWH_WRITE_LIFE_LONG	Data written has a long life time
      RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time
      The intent is for these values to be relative to each other, no
      absolute meaning should be attached to these flag names.
      Add an fcntl interface for querying these flags, and also for
      setting them as well:
      F_GET_RW_HINT		Returns the read/write hint set on the
      			underlying inode.
      F_SET_RW_HINT		Set one of the above write hints on the
      			underlying inode.
      F_GET_FILE_RW_HINT	Returns the read/write hint set on the
      			file descriptor.
      F_SET_FILE_RW_HINT	Set one of the above write hints on the
      			file descriptor.
      The user passes in a 64-bit pointer to get/set these values, and
      the interface returns 0/-1 on success/error.
      Sample program testing/implementing basic setting/getting of write
      hints is below.
      Add support for storing the write life time hint in the inode flags
      and in struct file as well, and pass them to the kiocb flags. If
      both a file and its corresponding inode has a write hint, then we
      use the one in the file, if available. The file hint can be used
      for sync/direct IO, for buffered writeback only the inode hint
      is available.
      This is in preparation for utilizing these hints in the block layer,
      to guide on-media data placement.
       * writehint.c: get or set an inode write hint
       #include <stdio.h>
       #include <fcntl.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <stdbool.h>
       #include <inttypes.h>
       #ifndef F_GET_RW_HINT
       #define F_LINUX_SPECIFIC_BASE	1024
       #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
       #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
      static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
      int main(int argc, char *argv[])
      	uint64_t hint;
      	int fd, ret;
      	if (argc < 2) {
      		fprintf(stderr, "%s: file <hint>\n", argv[0]);
      		return 1;
      	fd = open(argv[1], O_RDONLY);
      	if (fd < 0) {
      		return 2;
      	if (argc > 2) {
      		hint = atoi(argv[2]);
      		ret = fcntl(fd, F_SET_RW_HINT, &hint);
      		if (ret < 0) {
      			perror("fcntl: F_SET_RW_HINT");
      			return 4;
      	ret = fcntl(fd, F_GET_RW_HINT, &hint);
      	if (ret < 0) {
      		perror("fcntl: F_GET_RW_HINT");
      		return 3;
      	printf("%s: hint %s\n", argv[1], str[hint]);
      	return 0;
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  10. 20 Jun, 2017 1 commit
  11. 09 May, 2017 1 commit
  12. 03 May, 2017 1 commit
    • Josef Bacik's avatar
      fs: don't set *REFERENCED on single use objects · 563f4001
      Josef Bacik authored
      By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
      inode we create.  This is problematic as this means that it takes two
      trips through the LRU for any of these objects to be reclaimed,
      regardless of their actual lifetime.  With enough pressure from these
      caches we can easily evict our working set from page cache with single
      use objects.  So instead only set *REFERENCED if we've already been
      added to the LRU list.  This means that we've been touched since the
      first time we were accessed, and so more likely to need to hang out in
      To illustrate this issue I wrote the following scripts
      on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
      I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
      creates a new file system and creates 2 6.5gib files in order to take up
      13gib of the 16gib of ram with pagecache.  Then it runs a test program
      that reads these 2 files in a loop, and keeps track of how often it has
      to read bytes for each loop.  On an ideal system with no pressure we
      should have to read 0 bytes indefinitely.  The second thing this script
      does is start a fs_mark job that creates a ton of 0 length files,
      putting pressure on the system with slab only allocations.  On exit the
      script prints out how many bytes were read by the read-file program.
      The results are as follows
      Without patch:
      /mnt/btrfs-test/reads/file1: total read during loops 27262988288
      /mnt/btrfs-test/reads/file2: total read during loops 27262976000
      With patch:
      /mnt/btrfs-test/reads/file2: total read during loops 18640457728
      /mnt/btrfs-test/reads/file1: total read during loops 9565376512
      This patch results in a 50% reduction of the amount of pages evicted
      from our working set.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  13. 10 Apr, 2017 2 commits
    • Jan Kara's avatar
      fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83
      Jan Kara authored
      Currently we free fsnotify_mark_connector structure only when inode /
      vfsmount is getting freed. This can however impose noticeable memory
      overhead when marks get attached to inodes only temporarily. So free the
      connector structure once the last mark is detached from the object.
      Since notification infrastructure can be working with the connector
      under the protection of fsnotify_mark_srcu, we have to be careful and
      free the fsnotify_mark_connector only after SRCU period passes.
      Reviewed-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
    • Jan Kara's avatar
      fsnotify: Move mark list head from object into dedicated structure · 9dd813c1
      Jan Kara authored
      Currently notification marks are attached to object (inode or vfsmnt) by
      a hlist_head in the object. The list is also protected by a spinlock in
      the object. So while there is any mark attached to the list of marks,
      the object must be pinned in memory (and thus e.g. last iput() deleting
      inode cannot happen). Also for list iteration in fsnotify() to work, we
      must hold fsnotify_mark_srcu lock so that mark itself and
      mark->obj_list.next cannot get freed. Thus we are required to wait for
      response to fanotify events from userspace process with
      fsnotify_mark_srcu lock held. That causes issues when userspace process
      is buggy and does not reply to some event - basically the whole
      notification subsystem gets eventually stuck.
      So to be able to drop fsnotify_mark_srcu lock while waiting for
      response, we have to pin the mark in memory and make sure it stays in
      the object list (as removing the mark waiting for response could lead to
      lost notification events for groups later in the list). However we don't
      want inode reclaim to block on such mark as that would lead to system
      just locking up elsewhere.
      This commit is the first in the series that paves way towards solving
      these conflicting lifetime needs. Instead of anchoring the list of marks
      directly in the object, we anchor it in a dedicated structure
      (fsnotify_mark_connector) and just point to that structure from the
      object. The following commits will also add spinlock protecting the list
      and object pointer to the structure.
      Reviewed-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
  14. 08 Oct, 2016 1 commit
  15. 28 Sep, 2016 2 commits
    • Deepa Dinamani's avatar
      fs: Replace current_fs_time() with current_time() · c2050a45
      Deepa Dinamani authored
      current_fs_time() uses struct super_block* as an argument.
      As per Linus's suggestion, this is changed to take struct
      inode* as a parameter instead. This is because the function
      is primarily meant for vfs inode timestamps.
      Also the function was renamed as per Arnd's suggestion.
      Change all calls to current_fs_time() to use the new
      current_time() function instead. current_fs_time() will be
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Deepa Dinamani's avatar
      vfs: Add current_time() api · 3cd88666
      Deepa Dinamani authored
      current_fs_time() is used for inode timestamps.
      Change the signature of the function to take inode pointer
      instead of superblock as per Linus's suggestion.
      Also, move the api under vfs as per the discussion on the
      thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's
      suggestion on the thread, changing the function name.
      current_fs_time() will be deleted after all the references
      to it are replaced by current_time().
      There was a bug reported by kbuild test bot with the change
      as some of the calls to current_time() were made before the
      super_block was initialized. Catch these accidental assignments
      as timespec_trunc() does for wrong granularities. This allows
      for the function to work right even in these circumstances.
      But, adds a warning to make the user aware of the bug.
      A coccinelle script was used to identify all the current
      .alloc_inode super_block callbacks that updated inode timestamps.
      proc filesystem was the only one that was modifying inode times
      as part of this callback. The series includes a patch to fix that.
      Note that timespec_trunc() will also be moved to fs/inode.c
      in a separate patch when this will need to be revamped for
      bounds checking purposes.
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  16. 16 Sep, 2016 1 commit
    • Miklos Szeredi's avatar
      vfs: update ovl inode before relatime check · 598e3c8f
      Miklos Szeredi authored
      On overlayfs relatime_need_update() needs inode times to be correct on
      overlay inode.  But i_mtime and i_ctime are updated by filesystem code on
      underlying inode only, so they will be out-of-date on the overlay inode.
      This patch copies the times from the underlying inode if needed.  This
      can't be done if called from RCU lookup (link following) but link m/ctime
      are not updated by fs, so this is all right.
      This patch doesn't change functionality for anything but overlayfs.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  17. 03 Aug, 2016 2 commits
  18. 02 Aug, 2016 1 commit
  19. 26 Jul, 2016 1 commit
  20. 05 Jul, 2016 1 commit
    • Eric W. Biederman's avatar
      vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Eric W. Biederman authored
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      Which leads to a rule that is very simple to implement and understand
      inodes whose i_uid or i_gid is not valid may not be written.
      In dealing with access times this means treat those inodes as if the
      inode flag S_NOATIME was set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directories sgid bit is set but the
      directories gid is not mapped.
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includeds
      the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  21. 04 Jul, 2016 1 commit
    • Al Viro's avatar
      iget_locked et.al.: make sure we don't return bad inodes · 2864f301
      Al Viro authored
      If one thread does iget_locked(), proceeds to try and set
      the new inode up and fails, inode will be unhashed and dropped.
      However, another thread doing ilookup/iget_locked in the middle
      of that would end up finding a half-set-up inode, grabbing
      a reference, waiting for it to come unlocked and getting the
      resulting bad inode.  It's a race (if that ilookup had been
      called just after the failure of setup attempt it wouldn't
      have found the sucker at all), particularly unpleasant in
      cases when failure is transient/caller-dependent/etc.
      While it can be dealt with in the callers, there's no reason
      not to handle it in fs/inode.c primitives, especially since
      the cost is trivial.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  22. 02 May, 2016 2 commits
    • Al Viro's avatar
      parallel lookups: actual switch to rwsem · 9902af79
      Al Viro authored
      The main issue is the lack of down_write_killable(), so the places
      like readdir.c switched to plain inode_lock(); once killable
      variants of rwsem primitives appear, that'll be dealt with.
      lockdep side also might need more work
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Al Viro's avatar
      parallel lookups machinery, part 2 · 84e710da
      Al Viro authored
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, for __d_add() it would be rather inconvenient to
      grab; we could do that (using lock_parent()), but...
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  23. 31 Mar, 2016 1 commit
    • Andreas Gruenbacher's avatar
      posix_acl: Inode acl caching fixes · b8a7a3a6
      Andreas Gruenbacher authored
      When get_acl() is called for an inode whose ACL is not cached yet, the
      get_acl inode operation is called to fetch the ACL from the filesystem.
      The inode operation is responsible for updating the cached acl with
      set_cached_acl().  This is done without locking at the VFS level, so
      another task can call set_cached_acl() or forget_cached_acl() before the
      get_acl inode operation gets to calling set_cached_acl(), and then
      get_acl's call to set_cached_acl() results in caching an outdate ACL.
      Prevent this from happening by setting the cached ACL pointer to a
      task-specific sentinel value before calling the get_acl inode operation.
      Move the responsibility for updating the cached ACL from the get_acl
      inode operations to get_acl().  There, only set the cached ACL if the
      sentinel value hasn't changed.
      The sentinel values are chosen to have odd values.  Likewise, the value
      of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
      an even value (ACLs are aligned in memory).  This allows to distinguish
      uncached ACLs values from ACL objects.
      In addition, switch from guarding inode->i_acl and inode->i_default_acl
      upates by the inode->i_lock spinlock to using xchg() and cmpxchg().
      Filesystems that do not want ACLs returned from their get_acl inode
      operations to be cached must call forget_cached_acl() to prevent the VFS
      from doing so.
      (Patch written by Al Viro and Andreas Gruenbacher.)
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  24. 16 Feb, 2016 1 commit
  25. 23 Jan, 2016 1 commit
    • Ross Zwisler's avatar
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler authored
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  26. 22 Jan, 2016 1 commit
    • Al Viro's avatar
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  27. 15 Jan, 2016 1 commit
    • Vladimir Davydov's avatar
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov authored
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  28. 08 Jan, 2016 1 commit
  29. 09 Dec, 2015 1 commit
    • Al Viro's avatar
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro authored
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  30. 11 Nov, 2015 1 commit
  31. 09 Nov, 2015 1 commit
  32. 18 Aug, 2015 2 commits
    • Josef Bacik's avatar
      inode: don't softlockup when evicting inodes · ac05fbb4
      Josef Bacik authored
      On a box with a lot of ram (148gb) I can make the box softlockup after running
      an fs_mark job that creates hundreds of millions of empty files.  This is
      because we never generate enough memory pressure to keep the number of inodes on
      our unused list low, so when we go to unmount we have to evict ~100 million
      inodes.  This makes one processor a very unhappy person, so add a cond_resched()
      in dispose_list() and if we need a resched when processing the s_inodes list do
      that and run dispose_list() on what we've currently culled.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
    • Dave Chinner's avatar
      inode: rename i_wb_list to i_io_list · c7f54084
      Dave Chinner authored
      There's a small consistency problem between the inode and writeback
      naming. Writeback calls the "for IO" inode queues b_io and
      b_more_io, but the inode calls these the "writeback list" or
      i_wb_list. This makes it hard to an new "under writeback" list to
      the inode, or call it an "under IO" list on the bdi because either
      way we'll have writeback on IO and IO on writeback and it'll just be
      confusing. I'm getting confused just writing this!
      So, rename the inode "for IO" list variable to i_io_list so we can
      add a new "writeback list" in a subsequent patch.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarDave Chinner <dchinner@redhat.com>
  33. 17 Aug, 2015 1 commit
  34. 01 Jul, 2015 1 commit
    • Carlos Maiolino's avatar
      vfs: avoid creation of inode number 0 in get_next_ino · 2adc376c
      Carlos Maiolino authored
      currently, get_next_ino() is able to create inodes with inode number = 0.
      This have a bad impact in the filesystems relying in this function to generate
      inode numbers.
      While there is no problem at all in having inodes with number 0, userspace tools
      which handle file management tasks can have problems handling these files, like
      for example, the impossiblity of users to delete these files, since glibc will
      ignore them. So, I believe the best way is kernel to avoid creating them.
      This problem has been raised previously, but the old thread didn't have any
      other update for a year+, and I've seen too many users hitting the same issue
      regarding the impossibility to delete files while using filesystems relying on
      this function. So, I'm starting the thread again, with the same patch
      that I believe is enough to address this problem.
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  35. 23 Jun, 2015 1 commit
    • Jan Kara's avatar
      fs: Call security_ops->inode_killpriv on truncate · 45f147a1
      Jan Kara authored
      Comment in include/linux/security.h says that ->inode_killpriv() should
      be called when setuid bit is being removed and that similar security
      labels (in fact this applies only to file capabilities) should be
      removed at this time as well. However we don't call ->inode_killpriv()
      when we remove suid bit on truncate.
      We fix the problem by calling ->inode_need_killpriv() and subsequently
      ->inode_killpriv() on truncate the same way as we do it on file write.
      After this patch there's only one user of should_remove_suid() - ocfs2 -
      and indeed it's buggy because it doesn't call ->inode_killpriv() on
      write. However fixing it is difficult because of special locking
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>