1. 20 Jul, 2017 1 commit
    • Eric W. Biederman's avatar
      userns,pidns: Verify the userns for new pid namespaces · a2b42626
      Eric W. Biederman authored
      It is pointless and confusing to allow a pid namespace hierarchy and
      the user namespace hierarchy to get out of sync.  The owner of a child
      pid namespace should be the owner of the parent pid namespace or
      a descendant of the owner of the parent pid namespace.
      Otherwise it is possible to construct scenarios where a process has a
      capability over a parent pid namespace but does not have the
      capability over a child pid namespace.  Which confusingly makes
      permission checks non-transitive.
      It requires use of setns into a pid namespace (but not into a user
      namespace) to create such a scenario.
      Add the function in_userns to help in making this determination.
      v2: Optimized in_userns by using level as suggested
          by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Ref: 49f4d8b9 ("pidns: Capture the user namespace and filter ns_last_pid")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  2. 02 Mar, 2017 1 commit
  3. 23 Sep, 2016 2 commits
  4. 22 Sep, 2016 1 commit
  5. 08 Aug, 2016 5 commits
  6. 24 Jun, 2016 1 commit
  7. 04 Jan, 2016 1 commit
  8. 04 Sep, 2015 1 commit
    • Andy Lutomirski's avatar
      capabilities: ambient capabilities · 58319057
      Andy Lutomirski authored
      Credit where credit is due: this idea comes from Christoph Lameter with
      a lot of valuable input from Serge Hallyn.  This patch is heavily based
      on Christoph's patch.
      ===== The status quo =====
      On Linux, there are a number of capabilities defined by the kernel.  To
      perform various privileged tasks, processes can wield capabilities that
      they hold.
      Each task has four capability masks: effective (pE), permitted (pP),
      inheritable (pI), and a bounding set (X).  When the kernel checks for a
      capability, it checks pE.  The other capability masks serve to modify
      what capabilities can be in pE.
      Any task can remove capabilities from pE, pP, or pI at any time.  If a
      task has a capability in pP, it can add that capability to pE and/or pI.
      If a task has CAP_SETPCAP, then it can add any capability to pI, and it
      can remove capabilities from X.
      Tasks are not the only things that can have capabilities; files can also
      have capabilities.  A file can have no capabilty information at all [1].
      If a file has capability information, then it has a permitted mask (fP)
      and an inheritable mask (fI) as well as a single effective bit (fE) [2].
      File capabilities modify the capabilities of tasks that execve(2) them.
      A task that successfully calls execve has its capabilities modified for
      the file ultimately being excecuted (i.e.  the binary itself if that
      binary is ELF or for the interpreter if the binary is a script.) [3] In
      the capability evolution rules, for each mask Z, pZ represents the old
      value and pZ' represents the new value.  The rules are:
        pP' = (X & fP) | (pI & fI)
        pI' = pI
        pE' = (fE ? pP' : 0)
        X is unchanged
      For setuid binaries, fP, fI, and fE are modified by a moderately
      complicated set of rules that emulate POSIX behavior.  Similarly, if
      euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
      (primary, fP and fI usually end up being the full set).  For nonroot
      users executing binaries with neither setuid nor file caps, fI and fP
      are empty and fE is false.
      As an extra complication, if you execute a process as nonroot and fE is
      set, then the "secure exec" rules are in effect: AT_SECURE gets set,
      LD_PRELOAD doesn't work, etc.
      This is rather messy.  We've learned that making any changes is
      dangerous, though: if a new kernel version allows an unprivileged
      program to change its security state in a way that persists cross
      execution of a setuid program or a program with file caps, this
      persistent state is surprisingly likely to allow setuid or file-capped
      programs to be exploited for privilege escalation.
      ===== The problem =====
      Capability inheritance is basically useless.
      If you aren't root and you execute an ordinary binary, fI is zero, so
      your capabilities have no effect whatsoever on pP'.  This means that you
      can't usefully execute a helper process or a shell command with elevated
      capabilities if you aren't root.
      On current kernels, you can sort of work around this by setting fI to
      the full set for most or all non-setuid executable files.  This causes
      pP' = pI for nonroot, and inheritance works.  No one does this because
      it's a PITA and it isn't even supported on most filesystems.
      If you try this, you'll discover that every nonroot program ends up with
      secure exec rules, breaking many things.
      This is a problem that has bitten many people who have tried to use
      capabilities for anything useful.
      ===== The proposed change =====
      This patch adds a fifth capability mask called the ambient mask (pA).
      pA does what most people expect pI to do.
      pA obeys the invariant that no bit can ever be set in pA if it is not
      set in both pP and pI.  Dropping a bit from pP or pI drops that bit from
      pA.  This ensures that existing programs that try to drop capabilities
      still do so, with a complication.  Because capability inheritance is so
      broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
      then calling execve effectively drops capabilities.  Therefore,
      setresuid from root to nonroot conditionally clears pA unless
      SECBIT_NO_SETUID_FIXUP is set.  Processes that don't like this can
      re-add bits to pA afterwards.
      The capability evolution rules are changed:
        pA' = (file caps or setuid or setgid ? 0 : pA)
        pP' = (X & fP) | (pI & fI) | pA'
        pI' = pI
        pE' = (fE ? pP' : pA')
        X is unchanged
      If you are nonroot but you have a capability, you can add it to pA.  If
      you do so, your children get that capability in pA, pP, and pE.  For
      example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
      automatically bind low-numbered ports.  Hallelujah!
      Unprivileged users can create user namespaces, map themselves to a
      nonzero uid, and create both privileged (relative to their namespace)
      and unprivileged process trees.  This is currently more or less
      impossible.  Hallelujah!
      You cannot use pA to try to subvert a setuid, setgid, or file-capped
      program: if you execute any such program, pA gets cleared and the
      resulting evolution rules are unchanged by this patch.
      Users with nonzero pA are unlikely to unintentionally leak that
      capability.  If they run programs that try to drop privileges, dropping
      privileges will still work.
      It's worth noting that the degree of paranoia in this patch could
      possibly be reduced without causing serious problems.  Specifically, if
      we allowed pA to persist across executing non-pA-aware setuid binaries
      and across setresuid, then, naively, the only capabilities that could
      leak as a result would be the capabilities in pA, and any attacker
      *already* has those capabilities.  This would make me nervous, though --
      setuid binaries that tried to privilege-separate might fail to do so,
      and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
      unexpected side effects.  (Whether these unexpected side effects would
      be exploitable is an open question.) I've therefore taken the more
      paranoid route.  We can revisit this later.
      An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
      ambient capabilities.  I think that this would be annoying and would
      make granting otherwise unprivileged users minor ambient capabilities
      (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
      it is with this patch.
      ===== Footnotes =====
      [1] Files that are missing the "security.capability" xattr or that have
      unrecognized values for that xattr end up with has_cap set to false.
      The code that does that appears to be complicated for no good reason.
      [2] The libcap capability mask parsers and formatters are dangerously
      misleading and the documentation is flat-out wrong.  fE is *not* a mask;
      it's a single bit.  This has probably confused every single person who
      has tried to use file capabilities.
      [3] Linux very confusingly processes both the script and the interpreter
      if applicable, for reasons that elude me.  The results from thinking
      about a script's file capabilities and/or setuid bits are mostly
      Preliminary userspace code is here, but it needs updating:
      Here is a test program that can be used to verify the functionality
      (from Christoph):
       * Test program for the ambient capabilities. This program spawns a shell
       * that allows running processes with a defined set of capabilities.
       * (C) 2015 Christoph Lameter <cl@linux.com>
       * Released under: GPL v3 or later.
       * Compile using:
       *	gcc -o ambient_test ambient_test.o -lcap-ng
       * This program must have the following capabilities to run properly:
       * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
       * A command to equip the binary with the right caps is:
       *	setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
       * To get a shell with additional caps that can be inherited by other processes:
       *	./ambient_test /bin/bash
       * Verifying that it works:
       * From the bash spawed by ambient_test run
       *	cat /proc/$$/status
       * and have a look at the capabilities.
      #include <stdlib.h>
      #include <stdio.h>
      #include <errno.h>
      #include <cap-ng.h>
      #include <sys/prctl.h>
      #include <linux/capability.h>
       * Definitions from the kernel header files. These are going to be removed
       * when the /usr/include files have these defined.
      #define PR_CAP_AMBIENT 47
      #define PR_CAP_AMBIENT_IS_SET 1
      #define PR_CAP_AMBIENT_RAISE 2
      #define PR_CAP_AMBIENT_LOWER 3
      #define PR_CAP_AMBIENT_CLEAR_ALL 4
      static void set_ambient_cap(int cap)
      	int rc;
      	rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
      	if (rc) {
      		printf("Cannot add inheritable cap\n");
      	/* Note the two 0s at the end. Kernel checks for these */
      	if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
      		perror("Cannot set cap");
      int main(int argc, char **argv)
      	int rc;
      	printf("Ambient_test forking shell\n");
      	if (execv(argv[1], argv + 1))
      		perror("Cannot exec");
      	return 0;
      Signed-off-by: Christoph Lameter <cl@linux.com> # Original author
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Aaron Jones <aaronmdjones@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: Markku Savela <msa@moth.iki.fi>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  9. 12 Aug, 2015 1 commit
  10. 12 Dec, 2014 3 commits
    • Eric W. Biederman's avatar
      userns; Correct the comment in map_write · 36476bea
      Eric W. Biederman authored
      It is important that all maps are less than PAGE_SIZE
      or else setting the last byte of the buffer to '0'
      could write off the end of the allocated storage.
      Correct the misleading comment.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • Eric W. Biederman's avatar
      userns: Allow setting gid_maps without privilege when setgroups is disabled · 66d2f338
      Eric W. Biederman authored
      Now that setgroups can be disabled and not reenabled, setting gid_map
      without privielge can now be enabled when setgroups is disabled.
      This restores most of the functionality that was lost when unprivileged
      setting of gid_map was removed.  Applications that use this functionality
      will need to check to see if they use setgroups or init_groups, and if they
      don't they can be fixed by simply disabling setgroups before writing to
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • Eric W. Biederman's avatar
      userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Eric W. Biederman authored
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
        A value of "deny" means the setgroups system call is disabled in the
        current processes user namespace and can not be enabled in the
        future in this user namespace.
        A value of "allow" means the segtoups system call is enabled.
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Prodcess with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  11. 09 Dec, 2014 5 commits
  12. 06 Dec, 2014 1 commit
    • Eric W. Biederman's avatar
      userns: Document what the invariant required for safe unprivileged mappings. · 0542f17b
      Eric W. Biederman authored
      The rule is simple.  Don't allow anything that wouldn't be allowed
      without unprivileged mappings.
      It was previously overlooked that establishing gid mappings would
      allow dropping groups and potentially gaining permission to files and
      directories that had lesser permissions for a specific group than for
      all other users.
      This is the rule needed to fix CVE-2014-8989 and prevent any other
      security issues with new_idmap_permitted.
      The reason for this rule is that the unix permission model is old and
      there are programs out there somewhere that take advantage of every
      little corner of it.  So allowing a uid or gid mapping to be
      established without privielge that would allow anything that would not
      be allowed without that mapping will result in expectations from some
      code somewhere being violated.  Violated expectations about the
      behavior of the OS is a long way to say a security issue.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  13. 04 Dec, 2014 5 commits
  14. 08 Aug, 2014 1 commit
  15. 06 Jun, 2014 1 commit
  16. 14 Apr, 2014 1 commit
    • Mikulas Patocka's avatar
      user namespace: fix incorrect memory barriers · e79323bd
      Mikulas Patocka authored
      smp_read_barrier_depends() can be used if there is data dependency between
      the readers - i.e. if the read operation after the barrier uses address
      that was obtained from the read operation before the barrier.
      In this file, there is only control dependency, no data dependecy, so the
      use of smp_read_barrier_depends() is incorrect. The code could fail in the
      following way:
      * the cpu predicts that idx < entries is true and starts executing the
        body of the for loop
      * the cpu fetches map->extent[0].first and map->extent[0].count
      * the cpu fetches map->nr_extents
      * the cpu verifies that idx < extents is true, so it commits the
        instructions in the body of the for loop
      The problem is that in this scenario, the cpu read map->extent[0].first
      and map->nr_extents in the wrong order. We need a full read memory barrier
      to prevent it.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  17. 03 Apr, 2014 1 commit
    • Paul Gortmaker's avatar
      kernel: audit/fix non-modular users of module_init in core code · c96d6660
      Paul Gortmaker authored
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      The audit targets the following module_init users for change:
       kernel/user.c                  obj-y
       kernel/kexec.c                 bool KEXEC (one instance per arch)
       kernel/profile.c               bool PROFILING
       kernel/hung_task.c             bool DETECT_HUNG_TASK
       kernel/sched/stats.c           bool SCHEDSTATS
       kernel/user_namespace.c        bool USER_NS
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6-device to level
      4-subsys (i.e.  slightly earlier).  However no observable impact of that
      difference has been observed during testing.
      Also, two instances of missing ";" at EOL are fixed in kexec.
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  18. 20 Feb, 2014 1 commit
  19. 19 Feb, 2014 1 commit
  20. 24 Sep, 2013 1 commit
    • David Howells's avatar
      KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches · f36f8c75
      David Howells authored
      Add support for per-user_namespace registers of persistent per-UID kerberos
      caches held within the kernel.
      This allows the kerberos cache to be retained beyond the life of all a user's
      processes so that the user's cron jobs can work.
      The kerberos cache is envisioned as a keyring/key tree looking something like:
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 big_key	- A ccache blob
      			\___ tkt12345 big_key	- Another ccache blob
      Or possibly:
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 keyring	- A ccache
      				\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
      				\___ http/REDHAT.COM@REDHAT.COM user
      				\___ afs/REDHAT.COM@REDHAT.COM user
      				\___ nfs/REDHAT.COM@REDHAT.COM user
      				\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
      				\___ http/KERNEL.ORG@KERNEL.ORG big_key
      What goes into a particular Kerberos cache is entirely up to userspace.  Kernel
      support is limited to giving you the Kerberos cache keyring that you want.
      The user asks for their Kerberos cache by:
      	krb_cache = keyctl_get_krbcache(uid, dest_keyring);
      The uid is -1 or the user's own UID for the user's own cache or the uid of some
      other user's cache (requires CAP_SETUID).  This permits rpc.gssd or whatever to
      mess with the cache.
      The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
      search, clear, invalidate, unlink from and add links to.  Active LSMs get a
      chance to rule on whether the caller is permitted to make a link.
      Each uid's cache keyring is created when it first accessed and is given a
      timeout that is extended each time this function is called so that the keyring
      goes away after a while.  The timeout is configurable by sysctl but defaults to
      three days.
      Each user_namespace struct gets a lazily-created keyring that serves as the
      register.  The cache keyrings are added to it.  This means that standard key
      search and garbage collection facilities are available.
      The user_namespace struct's register goes away when it does and anything left
      in it is then automatically gc'd.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarSimo Sorce <simo@redhat.com>
      cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      cc: Eric W. Biederman <ebiederm@xmission.com>
  21. 27 Aug, 2013 1 commit
    • Eric W. Biederman's avatar
      userns: Better restrictions on when proc and sysfs can be mounted · e51db735
      Eric W. Biederman authored
      Rely on the fact that another flavor of the filesystem is already
      mounted and do not rely on state in the user namespace.
      Verify that the mounted filesystem is not covered in any significant
      way.  I would love to verify that the previously mounted filesystem
      has no mounts on top but there are at least the directories
      /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
      for other filesystems to mount on top of.
      Refactor the test into a function named fs_fully_visible and call that
      function from the mount routines of proc and sysfs.  This makes this
      test local to the filesystems involved and the results current of when
      the mounts take place, removing a weird threading of the user
      namespace, the mount namespace and the filesystems themselves.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  22. 08 Aug, 2013 1 commit
  23. 06 Aug, 2013 1 commit
  24. 01 May, 2013 1 commit
  25. 15 Apr, 2013 1 commit