1. 30 May, 2018 1 commit
    • Dave Chinner's avatar
      fs: don't scan the inode cache before SB_BORN is set · b9659ff3
      Dave Chinner authored
      commit 79f546a6 upstream.
      
      We recently had an oops reported on a 4.14 kernel in
      xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
      and so the m_perag_tree lookup walked into lala land.  It produces
      an oops down this path during the failed mount:
      
        radix_tree_gang_lookup_tag+0xc4/0x130
        xfs_perag_get_tag+0x37/0xf0
        xfs_reclaim_inodes_count+0x32/0x40
        xfs_fs_nr_cached_objects+0x11/0x20
        super_cache_count+0x35/0xc0
        shrink_slab.part.66+0xb1/0x370
        shrink_node+0x7e/0x1a0
        try_to_free_pages+0x199/0x470
        __alloc_pages_slowpath+0x3a1/0xd20
        __alloc_pages_nodemask+0x1c3/0x200
        cache_grow_begin+0x20b/0x2e0
        fallback_alloc+0x160/0x200
        kmem_cache_alloc+0x111/0x4e0
      
      The problem is that the superblock shrinker is running before the
      filesystem structures it depends on have been fully set up. i.e.
      the shrinker is registered in sget(), before ->fill_super() has been
      called, and the shrinker can call into the filesystem before
      fill_super() does it's setup work. Essentially we are exposed to
      both use-after-free and use-before-initialisation bugs here.
      
      To fix this, add a check for the SB_BORN flag in super_cache_count.
      In general, this flag is not set until ->fs_mount() completes
      successfully, so we know that it is set after the filesystem
      setup has completed. This matches the trylock_super() behaviour
      which will not let super_cache_scan() run if SB_BORN is not set, and
      hence will not allow the superblock shrinker from entering the
      filesystem while it is being set up or after it has failed setup
      and is being torn down.
      
      Cc: stable@kernel.org
      Signed-Off-By: 's avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9659ff3
  2. 03 Mar, 2018 1 commit
  3. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: 's avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: 's avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  4. 17 Aug, 2017 1 commit
  5. 17 Jul, 2017 2 commits
    • David Howells's avatar
      VFS: Differentiate mount flags (MS_*) from internal superblock flags · e462ec50
      David Howells authored
      Differentiate the MS_* flags passed to mount(2) from the internal flags set
      in the super_block's s_flags.  s_flags are now called SB_*, with the names
      and the values for the moment mirroring the MS_* flags that they're
      equivalent to.
      
      In this patch, just the headers are altered and some kernel code where
      blind automated conversion isn't necessarily correct.
      
      Note that this shows up some interesting issues:
      
       (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
           MNT_NODEV) without passing this on to the filesystem, but some
           filesystems set such flags anyway.
      
       (2) The ->remount_fs() methods of some filesystems adjust the *flags
           argument by setting MS_* flags in it, such as MS_NOATIME - but these
           flags are then scrubbed by do_remount_sb() (only the occupants of
           MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
           MS_I_VERSION and MS_LAZYTIME)
      
      I'm not sure what's the best way to solve all these cases.
      Suggested-by: 's avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: 's avatarDavid Howells <dhowells@redhat.com>
      e462ec50
    • David Howells's avatar
      VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells authored
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: 's avatarDavid Howells <dhowells@redhat.com>
      bc98a42c
  6. 11 Jul, 2017 1 commit
  7. 06 Jul, 2017 1 commit
  8. 20 Apr, 2017 4 commits
  9. 02 Feb, 2017 1 commit
  10. 01 Feb, 2017 1 commit
    • Eric W. Biederman's avatar
      fs: Better permission checking for submounts · 93faccbb
      Eric W. Biederman authored
      To support unprivileged users mounting filesystems two permission
      checks have to be performed: a test to see if the user allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.
      
      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems, to any user who
      happens to stumble across the their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works.  It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately using override_creds messes up the filesystems
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take appropriate
      action.
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds as that
      has proven problemantic.
      
      do_mount is modified to always remove the new MS_SUBMOUNT flag so
      that we know userspace will never by able to specify it.
      
      autofs4 is modified to stop using current_real_cred that was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: 's avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      93faccbb
  11. 30 Nov, 2016 1 commit
    • Jan Kara's avatar
      quota: Remove dqonoff_mutex · c3b00446
      Jan Kara authored
      The only places that were grabbing dqonoff_mutex are functions turning
      quotas on and off and these are properly serialized using s_umount
      semaphore. Remove dqonoff_mutex.
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      c3b00446
  12. 23 Nov, 2016 1 commit
  13. 15 Oct, 2016 2 commits
  14. 26 Jul, 2016 1 commit
  15. 23 Jun, 2016 5 commits
    • Eric W. Biederman's avatar
      userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag · cc50a07a
      Eric W. Biederman authored
      Now that SB_I_NODEV controls the nodev behavior devpts can just clear
      this flag during mount.  Simplifying the code and making it easier
      to audit how the code works.  While still preserving the invariant
      that s_iflags is only modified during mount.
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      cc50a07a
    • Eric W. Biederman's avatar
      userns: Remove implicit MNT_NODEV fragility. · 67690f93
      Eric W. Biederman authored
      Replace the implict setting of MNT_NODEV on mounts that happen with
      just user namespace permissions with an implicit setting of SB_I_NODEV
      in s_iflags.  The visibility of the implicit MNT_NODEV has caused
      problems in the past.
      
      With this change the fragile case where an implicit MNT_NODEV needs to
      be preserved in do_remount is removed.  Using SB_I_NODEV is much less
      fragile as s_iflags are set during the original mount and never
      changed.
      
      In do_new_mount with the implicit setting of MNT_NODEV gone, the only
      code that can affect mnt_flags is fs_fully_visible so simplify the if
      statement and reduce the indentation of the code to make that clear.
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      67690f93
    • Eric W. Biederman's avatar
      mnt: Move the FS_USERNS_MOUNT check into sget_userns · a001e74c
      Eric W. Biederman authored
      Allowing a filesystem to be mounted by other than root in the initial
      user namespace is a filesystem property not a mount namespace property
      and as such should be checked in filesystem specific code.  Move the
      FS_USERNS_MOUNT test into super.c:sget_userns().
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      a001e74c
    • Eric W. Biederman's avatar
      fs: Add user namespace member to struct super_block · 6e4eab57
      Eric W. Biederman authored
      Start marking filesystems with a user namespace owner, s_user_ns.  In
      this change this is only used for permission checks of who may mount a
      filesystem.  Ultimately s_user_ns will be used for translating ids and
      checking capabilities for filesystems mounted from user namespaces.
      
      The default policy for setting s_user_ns is implemented in sget(),
      which arranges for s_user_ns to be set to current_user_ns() and to
      ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
      user_ns.
      
      The guts of sget are split out into another function sget_userns().
      The function sget_userns calls alloc_super with the specified user
      namespace or it verifies the existing superblock that was found
      has the expected user namespace, and fails with EBUSY when it is not.
      This failing prevents users with the wrong privileges mounting a
      filesystem.
      
      The reason for the split of sget_userns from sget is that in some
      cases such as mount_ns and kernfs_mount_ns a different policy for
      permission checking of mounts and setting s_user_ns is necessary, and
      the existence of sget_userns() allows those policies to be
      implemented.
      
      The helper mount_ns is expected to be used for filesystems such as
      proc and mqueuefs which present per namespace information.  The
      function mount_ns is modified to call sget_userns instead of sget to
      ensure the user namespace owner of the namespace whose information is
      presented by the filesystem is used on the superblock.
      
      For sysfs and cgroup the appropriate permission checks are already in
      place, and kernfs_mount_ns is modified to call sget_userns so that
      the init_user_ns is the only user namespace used.
      
      For the cgroup filesystem cgroup namespace mounts are bind mounts of a
      subset of the full cgroup filesystem and as such s_user_ns must be the
      same for all of them as there is only a single superblock.
      
      Mounts of sysfs that vary based on the network namespace could in principle
      change s_user_ns but it keeps the analysis and implementation of kernfs
      simpler if that is not supported, and at present there appear to be no
      benefits from supporting a different s_user_ns on any sysfs mount.
      
      Getting the details of setting s_user_ns correct has been
      a long process.  Thanks to Pavel Tikhorirorv who spotted a leak
      in sget_userns.  Thanks to Seth Forshee who has kept the work alive.
      
      Thanks-to: Seth Forshee <seth.forshee@canonical.com>
      Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatarEric W. Biederman <ebiederm@xmission.com>
      6e4eab57
    • Eric W. Biederman's avatar
      vfs: Pass data, ns, and ns->userns to mount_ns · d91ee87d
      Eric W. Biederman authored
      Today what is normally called data (the mount options) is not passed
      to fill_super through mount_ns.
      
      Pass the mount options and the namespace separately to mount_ns so
      that filesystems such as proc that have mount options, can use
      mount_ns.
      
      Pass the user namespace to mount_ns so that the standard permission
      check that verifies the mounter has permissions over the namespace can
      be performed in mount_ns instead of in each filesystems .mount method.
      Thus removing the duplication between mqueuefs and proc in terms of
      permission checks.  The extra permission check does not currently
      affect the rpc_pipefs filesystem and the nfsd filesystem as those
      filesystems do not currently allow unprivileged mounts.  Without
      unpvileged mounts it is guaranteed that the caller has already passed
      capable(CAP_SYS_ADMIN) which guarantees extra permission check will
      pass.
      
      Update rpc_pipefs and the nfsd filesystem to ensure that the network
      namespace reference is always taken in fill_super and always put in kill_sb
      so that the logic is simpler and so that errors originating inside of
      fill_super do not cause a network namespace leak.
      Acked-by: 's avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      d91ee87d
  16. 18 Apr, 2016 1 commit
  17. 03 Mar, 2016 1 commit
    • Tejun Heo's avatar
      writeback: flush inode cgroup wb switches instead of pinning super_block · a1a0e23e
      Tejun Heo authored
      If cgroup writeback is in use, inodes can be scheduled for
      asynchronous wb switching.  Before 5ff8eaac ("writeback: keep
      superblock pinned during cgroup writeback association switches"), this
      could race with umount leading to super_block being destroyed while
      inodes are pinned for wb switching.  5ff8eaac fixed it by bumping
      s_active while wb switches are in flight; however, this allowed
      in-flight wb switches to make umounts asynchronous when the userland
      expected synchronosity - e.g. fsck immediately following umount may
      fail because the device is still busy.
      
      This patch removes the problematic super_block pinning and instead
      makes generic_shutdown_super() flush in-flight wb switches.  wb
      switches are now executed on a dedicated isw_wq so that they can be
      flushed and isw_nr_in_flight keeps track of the number of in-flight wb
      switches so that flushing can be avoided in most cases.
      
      v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
          check in inode_switch_wbs() as Jan an Al suggested.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Reported-by: 's avatarTahsin Erdogan <tahsin@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: 5ff8eaac ("writeback: keep superblock pinned during cgroup writeback association switches")
      Cc: stable@vger.kernel.org #v4.5
      Reviewed-by: 's avatarJan Kara <jack@suse.cz>
      Tested-by: 's avatarTahsin Erdogan <tahsin@google.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      a1a0e23e
  18. 06 Jan, 2016 1 commit
  19. 08 Dec, 2015 1 commit
  20. 17 Aug, 2015 2 commits
  21. 15 Aug, 2015 4 commits
    • Oleg Nesterov's avatar
      change sb_writers to use percpu_rw_semaphore · 8129ed29
      Oleg Nesterov authored
      We can remove everything from struct sb_writers except frozen
      and add the array of percpu_rw_semaphore's instead.
      
      This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
      it for get_super_thawed(). We will probably remove it later.
      
      This change tries to address the following problems:
      
      	- Firstly, __sb_start_write() looks simply buggy. It does
      	  __sb_end_write() if it sees ->frozen, but if it migrates
      	  to another CPU before percpu_counter_dec(), sb_wait_write()
      	  can wrongly succeed if there is another task which holds
      	  the same "semaphore": sb_wait_write() can miss the result
      	  of the previous percpu_counter_inc() but see the result
      	  of this percpu_counter_dec().
      
      	- As Dave Hansen reports, it is suboptimal. The trivial
      	  microbenchmark that writes to a tmpfs file in a loop runs
      	  12% faster if we change this code to rely on RCU and kill
      	  the memory barriers.
      
      	- This code doesn't look simple. It would be better to rely
      	  on the generic locking code.
      
      	  According to Dave, this change adds the same performance
      	  improvement.
      
      Note: with this change both freeze_super() and thaw_super() will do
      synchronize_sched_expedited() 3 times. This is just ugly. But:
      
      	- This will be "fixed" by the rcu_sync changes we are going
      	  to merge. After that freeze_super()->percpu_down_write()
      	  will use synchronize_sched(), and thaw_super() won't use
      	  synchronize() at all.
      
      	  This doesn't need any changes in fs/super.c.
      
      	- Once we merge rcu_sync changes, we can also change super.c
      	  so that all wb_write->rw_sem's will share the single ->rss
      	  in struct sb_writes, then freeze_super() will need only one
      	  synchronize_sched().
      Signed-off-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: 's avatarJan Kara <jack@suse.com>
      8129ed29
    • Oleg Nesterov's avatar
      shift percpu_counter_destroy() into destroy_super_work() · 853b39a7
      Oleg Nesterov authored
      Of course, this patch is ugly as hell. It will be (partially)
      reverted later. We add it to ensure that other WIP changes in
      percpu_rw_semaphore won't break fs/super.c.
      
      We do not even need this change right now, percpu_free_rwsem()
      is fine in atomic context. But we are going to change this, it
      will be might_sleep() after we merge the rcu_sync() patches.
      
      And even after that we do not really need destroy_super_work(),
      we will kill it in any case. Instead, destroy_super_rcu() should
      just check that rss->cb_state == CB_IDLE and do call_rcu() again
      in the (very unlikely) case this is not true.
      
      So this is just the temporary kludge which helps us to avoid the
      conflicts with the changes which will be (hopefully) routed via
      rcu tree.
      Signed-off-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: 's avatarJan Kara <jack@suse.com>
      853b39a7
    • Oleg Nesterov's avatar
      document rwsem_release() in sb_wait_write() · 0e28e01f
      Oleg Nesterov authored
      Not only we need to avoid the warning from lockdep_sys_exit(), the
      caller of freeze_super() can never release this lock. Another thread
      can do this, so there is another reason for rwsem_release().
      
      Plus the comment should explain why we have to fool lockdep.
      Signed-off-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: 's avatarJan Kara <jack@suse.com>
      0e28e01f
    • Oleg Nesterov's avatar
      fix the broken lockdep logic in __sb_start_write() · f4b554af
      Oleg Nesterov authored
      1. wait_event(frozen < level) without rwsem_acquire_read() is just
         wrong from lockdep perspective. If we are going to deadlock
         because the caller is buggy, lockdep can't detect this problem.
      
      2. __sb_start_write() can race with thaw_super() + freeze_super(),
         and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.
      
      3. The "tell lockdep we are doing trylock" hack doesn't look nice.
      
         I think this is correct, but this logic should be more explicit.
         Yes, the recursive read_lock() is fine if we hold the lock on a
         higher level. But we do not need to fool lockdep. If we can not
         deadlock in this case then try-lock must not fail and we can use
         use wait == F throughout this code.
      
      Note: as Dave Chinner explains, the "trylock" hack and the fat comment
      can be probably removed. But this needs a separate change and it will
      be trivial: just kill __sb_start_write() and rename do_sb_start_write()
      back to __sb_start_write().
      Signed-off-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: 's avatarJan Kara <jack@suse.com>
      f4b554af
  22. 01 Jul, 2015 1 commit
  23. 14 Apr, 2015 1 commit
    • Vladimir Davydov's avatar
      cleancache: remove limit on the number of cleancache enabled filesystems · 3cb29d11
      Vladimir Davydov authored
      The limit equals 32 and is imposed by the number of entries in the
      fs_poolid_map and shared_fs_poolid_map.  Nowadays it is insufficient,
      because with containers on board a Linux host can have hundreds of
      active fs mounts.
      
      These maps were introduced by commit 49a9ab81 ("mm: cleancache:
      lazy initialization to allow tmem backends to build/run as modules") in
      order to allow compiling cleancache drivers as modules.  Real pool ids
      are stored in these maps while super_block->cleancache_poolid points to
      an entry in the map, so that on cleancache registration we can walk over
      all (if there are <= 32 of them, of course) cleancache-enabled super
      blocks and assign real pool ids.
      
      Actually, there is absolutely no need in these maps, because we can
      iterate over all super blocks immediately using iterate_supers.  This is
      not racy, because cleancache_init_ops is called from mount_fs with
      super_block->s_umount held for writing, while iterate_supers takes this
      semaphore for reading, so if we call iterate_supers after setting
      cleancache_ops, all super blocks that had been created before
      cleancache_register_ops was called will be assigned pool ids by the
      action function of iterate_supers while all newer super blocks will
      receive it in cleancache_init_fs.
      
      This patch therefore removes the maps and hence the artificial limit on
      the number of cleancache enabled filesystems.
      Signed-off-by: 's avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      3cb29d11
  24. 22 Feb, 2015 1 commit
    • Konstantin Khlebnikov's avatar
      trylock_super(): replacement for grab_super_passive() · eb6ef3df
      Konstantin Khlebnikov authored
      I've noticed significant locking contention in memory reclaimer around
      sb_lock inside grab_super_passive(). Grab_super_passive() is called from
      two places: in icache/dcache shrinkers (function super_cache_scan) and
      from writeback (function __writeback_inodes_wb). Both are required for
      progress in memory allocator.
      
      Grab_super_passive() acquires sb_lock to increment sb->s_count and check
      sb->s_instances. It seems sb->s_umount locked for read is enough here:
      super-block deactivation always runs under sb->s_umount locked for write.
      Protecting super-block itself isn't a problem: in super_cache_scan() sb
      is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
      are still active. Inside writeback super-block comes from inode from bdi
      writeback list under wb->list_lock.
      
      This patch removes locking sb_lock and checks s_instances under s_umount:
      generic_shutdown_super() unlinks it under sb->s_umount locked for write.
      New variant is called trylock_super() and since it only locks semaphore,
      callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
      they're done.
      Signed-off-by: 's avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      eb6ef3df
  25. 13 Feb, 2015 3 commits
    • Vladimir Davydov's avatar
      fs: shrinker: always scan at least one object of each type · 49e7e7ff
      Vladimir Davydov authored
      In super_cache_scan() we divide the number of objects of particular type
      by the total number of objects in order to distribute pressure among As a
      result, in some corner cases we can get nr_to_scan=0 even if there are
      some objects to reclaim, e.g.  dentries=1, inodes=1, fs_objects=1,
      nr_to_scan=1/3=0.
      
      This is unacceptable for per memcg kmem accounting, because this means
      that some objects may never get reclaimed after memcg death, preventing it
      from being freed.
      
      This patch therefore assures that super_cache_scan() will scan at least
      one object of each type if any.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: 's avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      49e7e7ff
    • Vladimir Davydov's avatar
      fs: make shrinker memcg aware · 2acb60a0
      Vladimir Davydov authored
      Now, to make any list_lru-based shrinker memcg aware we should only
      initialize its list_lru as memcg aware.  Let's do it for the general FS
      shrinker (super_block::s_shrink).
      
      There are other FS-specific shrinkers that use list_lru for storing
      objects, such as XFS and GFS2 dquot cache shrinkers, but since they
      reclaim objects that are shared among different cgroups, there is no point
      making them memcg aware.  It's a big question whether we should account
      them to memcg at all.
      Signed-off-by: 's avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      2acb60a0
    • Vladimir Davydov's avatar
      list_lru: organize all list_lrus to list · c0a5b560
      Vladimir Davydov authored
      To make list_lru memcg aware, we need all list_lrus to be kept on a list
      protected by a mutex, so that we could sleep while walking over the
      list.
      
      Therefore after this change list_lru_destroy may sleep.  Fortunately,
      there is only one user that calls it from an atomic context - it's
      put_super - and we can easily fix it by calling list_lru_destroy before
      put_super in destroy_locked_super - anyway we don't longer need lrus by
      that time.
      
      Another point that should be noted is that list_lru_destroy is allowed
      to be called on an uninitialized zeroed-out object, in which case it is
      a no-op.  Before this patch this was guaranteed by kfree, but now we
      need an explicit check there.
      Signed-off-by: 's avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0a5b560