1. 05 Sep, 2018 1 commit
  2. 31 Mar, 2018 1 commit
  3. 08 Apr, 2017 1 commit
    • NeilBrown's avatar
      sysfs: be careful of error returns from ops->show() · c8a139d0
      NeilBrown authored
      ops->show() can return a negative error code.
      Commit 65da3484 ("sysfs: correctly handle short reads on PREALLOC attrs.")
      (in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
      would look like large numbers.
      As a result, if an error is returned, sysfs_kf_read() will return the
      value of 'count', typically 4096.
      Commit 17d0774f ("sysfs: correctly handle read offset on PREALLOC attrs")
      (in v4.8) extended this error to use the unsigned large 'len' as a size for
      Consequently, if ->show returns an error, then the first read() on the
      sysfs file will return 4096 and could return uninitialized memory to
      If the application performs a subsequent read, this will trigger a memmove()
      with extremely large count, and is likely to crash the machine is bizarre ways.
      This bug can currently only be triggered by reading from an md
      sysfs attribute declared with __ATTR_PREALLOC() during the
      brief period between when mddev_put() deletes an mddev from
      the ->all_mddevs list, and when mddev_delayed_delete() - which is
      scheduled on a workqueue - completes.
      Before this, an error won't be returned by the ->show()
      After this, the ->show() won't be called.
      I can reproduce it reliably only by putting delay like
      early in mddev_delayed_delete(). Then after creating an
      md device md0 run
        echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state
      The bug can be triggered without the usleep.
      Fixes: 65da3484 ("sysfs: correctly handle short reads on PREALLOC attrs.")
      Fixes: 17d0774f ("sysfs: correctly handle read offset on PREALLOC attrs")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 27 Sep, 2016 1 commit
  5. 31 Aug, 2016 1 commit
  6. 10 Aug, 2016 1 commit
    • Tejun Heo's avatar
      kernfs: make kernfs_path*() behave in the style of strlcpy() · 3abb1d90
      Tejun Heo authored
      kernfs_path*() functions always return the length of the full path but
      the path content is undefined if the length is larger than the
      provided buffer.  This makes its behavior different from strlcpy() and
      requires error handling in all its users even when they don't care
      about truncation.  In addition, the implementation can actully be
      simplified by making it behave properly in strlcpy() style.
      * Update kernfs_path_from_node_locked() to always fill up the buffer
        with path.  If the buffer is not large enough, the output is
        truncated and terminated.
      * kernfs_path() no longer needs error handling.  Make it a simple
        inline wrapper around kernfs_path_from_node().
      * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
        Updated accordingly.
      * cgroup_path()'s use of kernfs_path() updated to retain the old
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@ubuntu.com>
  7. 23 Jun, 2016 2 commits
    • Eric W. Biederman's avatar
      kernfs: The cgroup filesystem also benefits from SB_I_NOEXEC · 29a517c2
      Eric W. Biederman authored
      The cgroup filesystem is in the same boat as sysfs.  No one ever
      permits executables of any kind on the cgroup filesystem, and there is
      no reasonable future case to support executables in the future.
      Therefore move the setting of SB_I_NOEXEC which makes the code proof
      against future mistakes of accidentally creating executables from
      sysfs to kernfs itself.  Making the code simpler and covering the
      sysfs, cgroup, and cgroup2 filesystems.
      Acked-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • Eric W. Biederman's avatar
      mnt: Refactor fs_fully_visible into mount_too_revealing · 8654df4e
      Eric W. Biederman authored
      Replace the call of fs_fully_visible in do_new_mount from before the
      new superblock is allocated with a call of mount_too_revealing after
      the superblock is allocated.   This winds up being a much better location
      for maintainability of the code.
      The first change this enables is the replacement of FS_USERNS_VISIBLE
      with SB_I_USERNS_VISIBLE.  Moving the flag from struct filesystem_type
      to sb_iflags on the superblock.
      Unfortunately mount_too_revealing fundamentally needs to touch
      mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
      times.  If the mnt_flags did not need to be touched the code
      could be easily moved into the filesystem specific mount code.
      Acked-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  8. 18 Oct, 2015 1 commit
  9. 07 Oct, 2015 1 commit
  10. 04 Oct, 2015 1 commit
  11. 10 Jul, 2015 1 commit
    • Eric W. Biederman's avatar
      vfs: Commit to never having exectuables on proc and sysfs. · 90f8572b
      Eric W. Biederman authored
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no exectuables files on proc or sysfs.
      Having any executable files show on proc or sysfs would cause
      a user space visible regression, and most likely security problems.
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforce that flag.
      Test the flag where MNT_NOEXEC is tested today, so that the only user
      visible effect will be that exectuables will be treated as if the
      execute bit is cleared.
      The filesystems proc and sysfs do not currently incoporate any
      executable files so this does not result in any user visible effects.
      This makes it unnecessary to vet changes to proc and sysfs tightly for
      adding exectuable files or changes to chattr that would modify
      existing files, as no matter what the individual file say they will
      not be treated as exectuable files by the vfs.
      Not having to vet changes to closely is important as without this we
      are only one proc_create call (or another goof up in the
      implementation of notify_change) from having problematic executables
      on proc.  Those mistakes are all too easy to make and would create
      a situation where there are security issues or the assumptions of
      some program having to be broken (and cause userspace regressions).
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  12. 01 Jul, 2015 1 commit
  13. 01 Jun, 2015 1 commit
  14. 24 May, 2015 1 commit
  15. 14 May, 2015 1 commit
    • Eric W. Biederman's avatar
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace · 1b852bce
      Eric W. Biederman authored
      Fresh mounts of proc and sysfs are a very special case that works very
      much like a bind mount.  Unfortunately the current structure can not
      preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
      into a form that can be modified to preserve those lock bits.
      Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
      of the filesystem be fully visible in the current mount namespace,
      before the filesystem may be mounted.
      Move the logic for calling fs_fully_visible from proc and sysfs into
      fs/namespace.c where it has greater access to mount namespace state.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  16. 25 Mar, 2015 2 commits
  17. 14 Feb, 2015 1 commit
    • Tejun Heo's avatar
      kernfs: remove KERNFS_STATIC_NAME · dfeb0750
      Tejun Heo authored
      When a new kernfs node is created, KERNFS_STATIC_NAME is used to avoid
      making a separate copy of its name.  It's currently only used for sysfs
      attributes whose filenames are required to stay accessible and unchanged.
      There are rare exceptions where these names are allocated and formatted
      dynamically but for the vast majority of cases they're consts in the
      rodata section.
      Now that kernfs is converted to use kstrdup_const() and kfree_const(),
      there's little point in keeping KERNFS_STATIC_NAME around.  Remove it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Andrzej Hajda <a.hajda@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  18. 03 Feb, 2015 1 commit
  19. 07 Nov, 2014 3 commits
    • NeilBrown's avatar
      sysfs/kernfs: make read requests on pre-alloc files use the buffer. · 4ef67a8c
      NeilBrown authored
      To match the previous patch which used the pre-alloc buffer for
      writes, this patch causes reads to use the same buffer.
      This is not strictly necessary as the current seq_read() will allocate
      on first read, so user-space can trigger the required pre-alloc.  But
      consistency is valuable.
      The read function is somewhat simpler than seq_read() and, for example,
      does not support reading from an offset into the file: reads must be
      at the start of the file.
      As seq_read() does not use the prealloc buffer, ->seq_show is
      incompatible with ->prealloc and caused an EINVAL return from open().
      sysfs code which calls into kernfs always chooses the correct function.
      As the buffer is shared with writes and other reads, the mutex is
      extended to cover the copy_to_user.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NeilBrown's avatar
      sysfs/kernfs: allow attributes to request write buffer be pre-allocated. · 2b75869b
      NeilBrown authored
      md/raid allows metadata management to be performed in user-space.
      A various times, particularly on device failure, the metadata needs
      to be updated before further writes can be permitted.
      This means that the user-space program which updates metadata much
      not block on writeout, and so must not allocate memory.
      mlockall(MCL_CURRENT|MCL_FUTURE) and pre-allocation can avoid all
      memory allocation issues for user-memory, but that does not help
      kernel memory.
      Several kernel objects can be pre-allocated.  e.g. files opened before
      any writes to the array are permitted.
      However some kernel allocation happens in places that cannot be
      In particular, writes to sysfs files (to tell md that it can now
      allow writes to the array) allocate a buffer using GFP_KERNEL.
      This patch allows attributes to be marked as "PREALLOC".  In that case
      the maximal buffer is allocated when the file is opened, and then used
      on each write instead of allocating a new buffer.
      As the same buffer is now shared for all writes on the same file
      description, the mutex is extended to cover full use of the buffer
      including the copy_from_user().
      The new __ATTR_PREALLOC() 'or's a new flag in to the 'mode', which is
      inspected by sysfs_add_file_mode_ns() to determine if the file should be
      marked as requiring prealloc.
      Despite the comment, we *do* use ->seq_show together with ->prealloc
      in this patch.  The next patch fixes that.
      Signed-off-by: default avatarNeilBrown  <neilb@suse.de>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Vladimir Zapolskiy's avatar
      fs: sysfs: return EGBIG on write if offset is larger than file size · 09368960
      Vladimir Zapolskiy authored
      According to the user expectations common utilities like dd or sh
      redirection operator > should work correctly over binary files from
      sysfs. At the moment doing excessive write can not be completed:
        write(1, "\0\0\0\0\0\0\0\0", 8)         = 4
        write(1, "\0\0\0\0", 4)                 = 0
        write(1, "\0\0\0\0", 4)                 = 0
        write(1, "\0\0\0\0", 4)                 = 0
      Fix the problem by returning EFBIG described in man 2 write.
      Signed-off-by: default avatarVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  20. 03 Jun, 2014 1 commit
  21. 27 May, 2014 2 commits
  22. 20 May, 2014 1 commit
    • Tejun Heo's avatar
      sysfs: make sure read buffer is zeroed · f5c16f29
      Tejun Heo authored
      13c589d5 ("sysfs: use seq_file when reading regular files")
      switched sysfs from custom read implementation to seq_file to enable
      later transition to kernfs.  After the change, the buffer passed to
      ->show() is acquired through seq_get_buf(); unfortunately, this
      introduces a subtle behavior change.  Before the commit, the buffer
      passed to ->show() was always zero as it was allocated using
      get_zeroed_page().  Because seq_file doesn't clear buffers on
      allocation and neither does seq_get_buf(), after the commit, depending
      on the behavior of ->show(), we may end up exposing uninitialized data
      to userland thus possibly altering userland visible behavior and
      leaking information.
      Fix it by explicitly clearing the buffer.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarRon <ron@debian.org>
      Fixes: 13c589d5 ("sysfs: use seq_file when reading regular files")
      Cc: stable <stable@vger.kernel.org> # 3.13+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  23. 13 May, 2014 1 commit
    • Tejun Heo's avatar
      kernfs, sysfs, cgroup: restrict extra perm check on open to sysfs · 555724a8
      Tejun Heo authored
      The kernfs open method - kernfs_fop_open() - inherited extra
      permission checks from sysfs.  While the vfs layer allows ignoring the
      read/write permissions checks if the issuer has CAP_DAC_OVERRIDE,
      sysfs explicitly denied open regardless of the cap if the file doesn't
      have any of the UGO perms of the requested access or doesn't implement
      the requested operation.  It can be debated whether this was a good
      idea or not but the behavior is too subtle and dangerous to change at
      this point.
      After cgroup got converted to kernfs, this extra perm check also got
      applied to cgroup breaking libcgroup which opens write-only files with
      O_RDWR as root.  This patch gates the extra open permission check with
      a new flag KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK and enables it for sysfs.
      For sysfs, nothing changes.  For cgroup, root now can perform any
      operation regardless of the permissions as it was before kernfs
      conversion.  Note that kernfs still fails unimplemented operations
      with -EINVAL.
      While at it, add comments explaining KERNFS_ROOT flags.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarAndrey Wagin <avagin@gmail.com>
      Tested-by: default avatarAndrey Wagin <avagin@gmail.com>
      Cc: Li Zefan <lizefan@huawei.com>
      References: http://lkml.kernel.org/g/CANaxB-xUm3rJ-Cbp72q-rQJO5mZe1qK6qXsQM=vh0U8upJ44+A@mail.gmail.com
      Fixes: 2bd59d48 ("cgroup: convert to kernfs")
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  24. 16 Apr, 2014 1 commit
  25. 26 Mar, 2014 1 commit
  26. 23 Mar, 2014 1 commit
  27. 25 Feb, 2014 1 commit
    • Li Zefan's avatar
      sysfs: fix namespace refcnt leak · fed95bab
      Li Zefan authored
      As mount() and kill_sb() is not a one-to-one match, we shoudn't get
      ns refcnt unconditionally in sysfs_mount(), and instead we should
      get the refcnt only when kernfs_mount() allocated a new superblock.
      - Changed the name of the new argument, suggested by Tejun.
      - Made the argument optional, suggested by Tejun.
      - Make the new argument as second-to-last arg, suggested by Tejun.
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
       fs/kernfs/mount.c      | 8 +++++++-
       fs/sysfs/mount.c       | 5 +++--
       include/linux/kernfs.h | 9 +++++----
       3 files changed, 15 insertions(+), 7 deletions(-)
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  28. 15 Feb, 2014 1 commit
    • Cody P Schafer's avatar
      sysfs: create bin_attributes under the requested group · aabaf4c2
      Cody P Schafer authored
      bin_attributes created/updated in create_files() (such as those listed
      via (struct device).attribute_groups) were not placed under the
      specified group, and instead appeared in the base kobj directory.
      Fix this by making bin_attributes use creating code similar to normal
      A quick grep shows that no one is using bin_attrs in a named attribute
      group yet, so we can do this without breaking anything in usespace.
      Note that I do not add is_visible() support to
      bin_attributes, though that could be done as well.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  29. 08 Feb, 2014 2 commits
    • Tejun Heo's avatar
      kernfs: add CONFIG_KERNFS · ba341d55
      Tejun Heo authored
      As sysfs was kernfs's only user, kernfs has been piggybacking on
      CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very
      soon.  Introduce a separate config option CONFIG_KERNFS which is to be
      selected by kernfs users.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tejun Heo's avatar
      kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends · 3eef34ad
      Tejun Heo authored
      kernfs_node->parent and ->name are currently marked as "published"
      indicating that kernfs users may access them directly; however, those
      fields may get updated by kernfs_rename[_ns]() and unrestricted access
      may lead to erroneous values or oops.
      Protect ->parent and ->name updates with a irq-safe spinlock
      kernfs_rename_lock and implement the following accessors for these
      * kernfs_name()		- format the node's name into the specified buffer
      * kernfs_path()		- format the node's path into the specified buffer
      * pr_cont_kernfs_name()	- pr_cont a node's name (doesn't need buffer)
      * pr_cont_kernfs_path()	- pr_cont a node's path (doesn't need buffer)
      * kernfs_get_parent()	- pin and return a node's parent
      All can be called under any context.  The recursive sysfs_pathname()
      in fs/sysfs/dir.c is replaced with kernfs_path() and
      sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of
      dereferencing parent directly.
      v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing
          static inline making it cause a lot of build warnings.  Add it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  30. 07 Feb, 2014 3 commits
    • Tejun Heo's avatar
      kernfs: allow nodes to be created in the deactivated state · d35258ef
      Tejun Heo authored
      Currently, kernfs_nodes are made visible to userland on creation,
      which makes it difficult for kernfs users to atomically succeed or
      fail creation of multiple nodes.  In addition, if something fails
      after creating some nodes, the created nodes might already be in use
      and their active refs need to be drained for removal, which has the
      potential to introduce tricky reverse locking dependency on active_ref
      depending on how the error path is synchronized.
      This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED.
      If set, all nodes under the root are created in the deactivated state
      and stay invisible to userland until explicitly enabled by the new
      kernfs_activate() API.  Also, nodes which have never been activated
      are guaranteed to bypass draining on removal thus allowing error paths
      to not worry about lockding dependency on active_ref draining.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tejun Heo's avatar
      sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner() · ce8b04aa
      Tejun Heo authored
      All device_schedule_callback_owner() users are converted to use
      device_remove_file_self().  Remove now unused
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tejun Heo's avatar
      kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers · 6b0afc2a
      Tejun Heo authored
      Sometimes it's necessary to implement a node which wants to delete
      nodes including itself.  This isn't straightforward because of kernfs
      active reference.  While a file operation is in progress, an active
      reference is held and kernfs_remove() waits for all such references to
      drain before completing.  For a self-deleting node, this is a deadlock
      as kernfs_remove() ends up waiting for an active reference that itself
      is sitting on top of.
      This currently is worked around in the sysfs layer using
      sysfs_schedule_callback() which makes such removals asynchronous.
      While it works, it's rather cumbersome and inherently breaks
      synchronicity of the operation - the file operation which triggered
      the operation may complete before the removal is finished (or even
      started) and the removal may fail asynchronously.  If a removal
      operation is immmediately followed by another operation which expects
      the specific name to be available (e.g. removal followed by rename
      onto the same name), there's no way to make the latter operation
      The thing is there's no inherent reason for this to be asynchrnous.
      All that's necessary to do this synchronous is a dedicated operation
      which drops its own active ref and deactivates self.  This patch
      implements kernfs_remove_self() and its wrappers in sysfs and driver
      core.  kernfs_remove_self() is to be called from one of the file
      operations, drops the active ref the task is holding, removes the self
      node, and restores active ref to the dead node so that the ref is
      balanced afterwards.  __kernfs_remove() is updated so that it takes an
      early exit if the target node is already fully removed so that the
      active ref restored by kernfs_remove_self() after removal doesn't
      confuse the deactivation path.
      This makes implementing self-deleting nodes very easy.  The normal
      removal path doesn't even need to be changed to use
      kernfs_remove_self() for the self-deleting node.  The method can
      invoke kernfs_remove_self() on itself before proceeding the normal
      removal path.  kernfs_remove() invoked on the node by the normal
      deletion path will simply be ignored.
      This will replace sysfs_schedule_callback().  A subtle feature of
      sysfs_schedule_callback() is that it collapses multiple invocations -
      even if multiple removals are triggered, the removal callback is run
      only once.  An equivalent effect can be achieved by testing the return
      value of kernfs_remove_self() - only the one which gets %true return
      value should proceed with actual deletion.  All other instances of
      kernfs_remove_self() will wait till the enclosing kernfs operation
      which invoked the winning instance of kernfs_remove_self() finishes
      and then return %false.  This trivially makes all users of
      kernfs_remove_self() automatically show correct synchronous behavior
      even when there are multiple concurrent operations - all "echo 1 >
      delete" instances will finish only after the whole operation is
      completed by one of the instances.
      Note that manipulation of active ref is implemented in separate public
      functions - kernfs_[un]break_active_protection().
      kernfs_remove_self() is the only user at the moment but this will be
      used to cater to more complex cases.
      v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
          and sysfs_remove_file_self() had incorrect return type.  Fix it.
          Reported by kbuild test bot.
      v3: kernfs_[un]break_active_protection() separated out from
          kernfs_remove_self() and exposed as public API.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  31. 13 Jan, 2014 2 commits