1. 01 Nov, 2018 1 commit
    • sched: ipipe: enable task migration between domains · 957ac4c9
      Philippe Gerum authored
      This is the basic code enabling alternate control of tasks between the
      regular kernel and an embedded co-kernel. The changes cover the
      following aspects:
      
      - extend the per-thread information block with a private area usable
        by the co-kernel for storing additional state information
      
      - provide the API enabling a scheduler exchange mechanism, so that
        tasks can run under the control of either kernel alternately. This
        includes a service to move the current task to the head domain under
        the control of the co-kernel, and the converse service to re-enter
        the root domain once the co-kernel has released the task.
      
      - ensure the generic context switching code can be used from any
        domain, serializing execution as required.
      
      These changes have to be paired with arch-specific code further
      enabling context switching from the head domain.
  2. 03 Mar, 2018 1 commit
  3. 05 Jan, 2018 1 commit
  4. 20 Dec, 2017 1 commit
  5. 05 Dec, 2017 1 commit
  6. 20 Oct, 2017 1 commit
    • membarrier: Provide register expedited private command · a961e409
      Mathieu Desnoyers authored
      This introduces a "register private expedited" membarrier command which
      allows eventual removal of important memory barrier constraints on the
      scheduler fast-paths. It changes how the "private expedited" membarrier
      command (new to 4.14) is used from user-space.
      
      This new command allows processes to register their intent to use the
      private expedited command.  This affects how the expedited private
      command introduced in 4.14-rc is meant to be used, and should be merged
      before 4.14 final.
      
      Processes are now required to register before using
      MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
      
      This fixes a problem that arose when designing requested extensions to
      sys_membarrier() to allow JITs to efficiently flush old code from
      instruction caches.  Several potential algorithms are much less painful
      if the user registers intent to use this functionality early on, for
      example, before the process spawns the second thread.  Registering at
      this time removes the need to interrupt each and every thread in that
      process at the first expedited sys_membarrier() system call.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
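The register-then-use rule in the membarrier commit above can be illustrated with a small userspace model. This is only a sketch: the `Process` class and its `membarrier` method are hypothetical stand-ins that never call the kernel, and the two command values are assumed to match `linux/membarrier.h`.

```python
import errno

# Command values assumed to match linux/membarrier.h (4.14 and later).
MEMBARRIER_CMD_PRIVATE_EXPEDITED = 1 << 3
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = 1 << 4


class Process:
    """Hypothetical model of one process's membarrier registration state."""

    def __init__(self):
        self.registered = False

    def membarrier(self, cmd):
        """Return 0 on success or a negative errno, like the raw syscall."""
        if cmd == MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
            self.registered = True  # record intent once, ideally early
            return 0
        if cmd == MEMBARRIER_CMD_PRIVATE_EXPEDITED:
            if not self.registered:
                return -errno.EPERM  # unregistered callers are refused
            return 0  # the real kernel would issue the barriers here
        return -errno.EINVAL


proc = Process()
before = proc.membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED)
registered = proc.membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)
after = proc.membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED)
print(before, registered, after)
```

In this model the first expedited call fails with EPERM, and the same call succeeds once the process has registered, which is the behaviour the commit message describes for user-space.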
  7. 04 Oct, 2017 1 commit
  8. 15 Sep, 2017 1 commit
  9. 14 Sep, 2017 1 commit
    • mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko authored
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  Its
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      high-level GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing callers provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      encourages cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with high-level semantics should be clearly defined to prevent
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      converting all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get the __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect short-term
      allocations but I propose starting from a clear semantic definition and
      only then adding users with proper justification.
      
      This was brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) of its current users.  The follow up discussion has revealed that
      opinions on what might be a temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 04 Sep, 2017 2 commits
  11. 01 Aug, 2017 10 commits
    • exec: Consolidate pdeath_signal clearing · fe8993b3
      Kees Cook authored
      Instead of an additional secureexec check for pdeath_signal, just move it
      up into the initial secureexec test. Neither perf nor arch code touches
      pdeath_signal, so the relocation shouldn't change anything.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge@hallyn.com>
    • exec: Use sane stack rlimit under secureexec · 64701dee
      Kees Cook authored
      For a secureexec, before memory layout selection has happened, reset the
      stack rlimit to something sane to avoid the caller having control over
      the resulting layouts.
      
      $ ulimit -s
      8192
      $ ulimit -s unlimited
      $ /bin/sh -c 'ulimit -s'
      unlimited
      $ sudo /bin/sh -c 'ulimit -s'
      8192
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: James Morris <james.l.morris@oracle.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
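The effect shown in the `ulimit -s` transcript of the stack rlimit commit above can be modelled in a few lines. This is a sketch under stated assumptions: `stack_limit_after_exec` is a hypothetical helper, and the "sane" value is assumed to be 8 MiB because the transcript reports 8192 (KiB) after the setuid `sudo` exec. It does not change any real limit.

```python
import resource

SANE_STACK_LIMIT = 8 * 1024 * 1024  # assumed: 8192 KiB, as in the transcript


def stack_limit_after_exec(soft_limit, secureexec):
    """Return the soft stack rlimit a new program would see (model only)."""
    too_large = (soft_limit == resource.RLIM_INFINITY
                 or soft_limit > SANE_STACK_LIMIT)
    if secureexec and too_large:
        return SANE_STACK_LIMIT  # caller-chosen huge limits are reset
    return soft_limit  # ordinary execs keep the caller's limit


current_soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
print("current soft limit:", current_soft)
print("after a secureexec:", stack_limit_after_exec(current_soft, True))
```

Limits at or below the assumed sane value are left alone in this model, so only an oversized or unlimited stack is affected.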
    • exec: Consolidate dumpability logic · 473d8963
      Kees Cook authored
      Since it's already valid to set dumpability in the early part of
      setup_new_exec(), we can consolidate the logic into a single place.
      The BINPRM_FLAGS_ENFORCE_NONDUMP is set during would_dump() calls
      before setup_new_exec(), so its test is safe to move as well.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Reviewed-by: James Morris <james.l.morris@oracle.com>
    • exec: Use secureexec for clearing pdeath_signal · a70423df
      Kees Cook authored
      Like dumpability, clearing pdeath_signal happens both in setup_new_exec()
      and later in commit_creds(). The test in setup_new_exec() is different
      from all other privilege comparisons, though: it is checking the new cred
      (bprm) uid vs the old cred (current) euid. This appears to be a bug,
      introduced by commit a6f76f23 ("CRED: Make execve() take advantage of
      copy-on-write credentials"):
      
      -       if (bprm->e_uid != current_euid() ||
      -           bprm->e_gid != current_egid()) {
      -               set_dumpable(current->mm, suid_dumpable);
      +       if (bprm->cred->uid != current_euid() ||
      +           bprm->cred->gid != current_egid()) {
      
      It was bprm euid vs current euid (and egids), but the effective got
      dropped. Nothing in the exec flow changes bprm->cred->uid (nor gid).
      The call traces are:
      
      	prepare_bprm_creds()
      	    prepare_exec_creds()
      	        prepare_creds()
      	            memcpy(new_creds, old_creds, ...)
      	            security_prepare_creds() (unimplemented by commoncap)
      	...
      	prepare_binprm()
      	    bprm_fill_uid()
      	        resets euid/egid to current euid/egid
      	        sets euid/egid on bprm based on set*id file bits
      	    security_bprm_set_creds()
      		cap_bprm_set_creds()
      		        handle all caps-based manipulations
      
      so this test is effectively a test of current_uid() vs current_euid(),
      which is wrong, just like the prior dumpability tests were wrong.
      
      The commit log says "Clear pdeath_signal and set dumpable on
      certain circumstances that may not be covered by commit_creds()." This
      may refer to the earlier old euid vs new euid (and egid) test that
      got changed.
      
      Luckily, as with dumpability, this is all masked by commit_creds()
      which performs old/new euid and egid tests and clears pdeath_signal.
      
      And again, like dumpability, we should include LSM secureexec logic for
      pdeath_signal clearing. For example, Smack goes out of its way to clear
      pdeath_signal when it finds a secureexec condition.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Reviewed-by: James Morris <james.l.morris@oracle.com>
    • exec: Use secureexec for setting dumpability · e37fdb78
      Kees Cook authored
      The examination of "current" to decide dumpability is wrong. This was a
      check of an euid/uid (or egid/gid) mismatch in the existing process,
      not the newly created one. This appears to stretch back into even the
      "history.git" tree. Luckily, dumpability is later set in commit_creds().
      In earlier kernel versions before creds existed, similar checks also
      existed late in the exec flow, covering up the mistake as far back as I
      could find.
      
      Note that because the commit_creds() check examines differences of euid,
      uid, egid, gid, and capabilities between the old and new creds, it would
      look like the setup_new_exec() dumpability test could be entirely removed.
      However, the secureexec test may cover a different set of tests (specific
      to the LSMs) than what commit_creds() checks for. So, fix this test to
      use secureexec (the removed euid tests are redundant to the commoncap
      secureexec checks now).
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Reviewed-by: James Morris <james.l.morris@oracle.com>
    • LSM: drop bprm_secureexec hook · 2af62280
      Kees Cook authored
      This removes the bprm_secureexec hook since the logic has been folded into
      the bprm_set_creds hook for all LSMs now.
      
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: John Johansen <john.johansen@canonical.com>
      Acked-by: James Morris <james.l.morris@oracle.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
    • commoncap: Refactor to remove bprm_secureexec hook · 46d98eb4
      Kees Cook authored
      The commoncap implementation of the bprm_secureexec hook is the only LSM
      that depends on the final call to its bprm_set_creds hook (since it may
      be called for multiple files, it ignores bprm->called_set_creds). As a
      result, it cannot safely _clear_ bprm->secureexec since other LSMs may
      have set it.  Instead, remove the bprm_secureexec hook by introducing a
      new flag to bprm specific to commoncap: cap_elevated. This is similar to
      cap_effective, but that is used for a specific subset of elevated
      privileges, and exists solely to track state from bprm_set_creds to
      bprm_secureexec. As such, it will be removed in the next patch.
      
      Here, set the new bprm->cap_elevated flag when setuid/setgid has happened
      from bprm_fill_uid() or fscapabilities have been prepared. This temporarily
      moves the bprm_secureexec hook to a static inline. The helper will be
      removed in the next patch; this makes the step easier to review and bisect,
      since this does not introduce any changes to inputs nor outputs to the
      "elevated privileges" calculation.
      
      The new flag is merged with the bprm->secureexec flag in setup_new_exec()
      since this marks the end of any further prepare_binprm() calls.
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: James Morris <james.l.morris@oracle.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
    • binfmt: Introduce secureexec flag · c425e189
      Kees Cook authored
      The bprm_secureexec hook can be moved earlier. Right now, it is called
      during create_elf_tables(), via load_binary(), via search_binary_handler(),
      via exec_binprm(). Nearly all (see exception below) state used by
      bprm_secureexec is created during the bprm_set_creds hook, called from
      prepare_binprm().
      
      For all LSMs (except commoncaps described next), only the first execution
      of bprm_set_creds takes any effect (they all check bprm->called_set_creds
      which prepare_binprm() sets after the first call to the bprm_set_creds
      hook).  However, all these LSMs also only do anything with bprm_secureexec
      when they detected a secure state during their first run of bprm_set_creds.
      Therefore, it is functionally identical to move the detection into
      bprm_set_creds, since the results from secureexec here only need to be
      based on the first call to the LSM's bprm_set_creds hook.
      
      The single exception is that the commoncaps secureexec hook also examines
      euid/uid and egid/gid differences which are controlled by bprm_fill_uid(),
      via prepare_binprm(), which can be called multiple times (e.g.
      binfmt_script, binfmt_misc), and may clear the euid/egid for the final
      load (i.e. the script interpreter). However, while commoncaps specifically
      ignores bprm->cred_prepared, and runs its bprm_set_creds hook each time
      prepare_binprm() may get called, it needs to base the secureexec decision
      on the final call to bprm_set_creds. As a result, it will need special
      handling.
      
      To begin this refactoring, this adds the secureexec flag to the bprm
      struct, and calls the secureexec hook during setup_new_exec(). This is
      safe since all the cred work is finished (and past the point of no return).
      This explicit call will be removed in later patches once the hook has been
      removed.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: John Johansen <john.johansen@canonical.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Reviewed-by: James Morris <james.l.morris@oracle.com>
    • exec: Correct comments about "point of no return" · a9208e42
      Kees Cook authored
      In commit 221af7f8 ("Split 'flush_old_exec' into two functions"),
      the comment about the point of no return should have stayed in
      flush_old_exec() since it refers to the "bprm->mm = NULL;" line, but prior
      changes in commits c89681ed ("remove steal_locks()"), and
      fd8328be ("sanitize handling of shared descriptor tables in failing
      execve()") made it look like it meant the current->sas_ss_sp line instead.
      
      The comment was referring to the fact that once bprm->mm is NULL, all
      failures from a binfmt load_binary hook (e.g. load_elf_binary), will
      get SEGV raised against current. Move this comment and expand the
      explanation a bit, putting it above the assignment this time, and add
      details about the true nature of "point of no return" being the call
      to flush_old_exec() itself.
      
      This also removes an erroneous comment about when credentials are being
      installed. That has its own dedicated function, install_exec_creds(),
      which carries a similar (and correct) comment, so remove the bogus comment
      where installation is not actually happening.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
    • exec: Rename bprm->cred_prepared to called_set_creds · ddb4a144
      Kees Cook authored
      The cred_prepared bprm flag has a misleading name. It has nothing to do
      with the bprm_prepare_cred hook, and actually tracks if bprm_set_creds has
      been called. Rename this flag and improve its comment.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: John Johansen <john.johansen@canonical.com>
      Acked-by: James Morris <james.l.morris@oracle.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
  12. 08 Jul, 2017 1 commit
  13. 23 Jun, 2017 1 commit
    • fs/exec.c: account for argv/envp pointers · 98da7d08
      Kees Cook authored
      When limiting the argv/envp strings during exec to 1/4 of the stack limit,
      the storage of the pointers to the strings was not included.  This means
      that an exec with huge numbers of tiny strings could eat 1/4 of the stack
      limit in strings and then additional space would be later used by the
      pointers to the strings.
      
      For example, on 32-bit with an 8MB stack rlimit, an exec with 1677721
      single-byte strings would consume less than 2MB of stack, the max (8MB /
      4) amount allowed, but the pointers to the strings would consume the
      remaining additional stack space (1677721 * 4 == 6710884).
      
      The result (1677721 + 6710884 == 8388605) would exhaust stack space
      entirely.  Controlling this stack exhaustion could result in
      pathological behavior in setuid binaries (CVE-2017-1000365).
      
      [akpm@linux-foundation.org: additional commenting from Kees]
      Fixes: b6a2fea3 ("mm: variable length argument support")
      Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Qualys Security Advisory <qsa@qualys.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
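The arithmetic in the argv/envp commit above can be checked directly. The sketch below reuses the figures from the commit message; `within_limit` is a hypothetical helper that only models the idea of charging the pointer array against the 1/4 limit, not the kernel's actual check.

```python
STACK_RLIMIT = 8 * 1024 * 1024      # 8MB stack rlimit from the example
STRING_LIMIT = STACK_RLIMIT // 4    # exec allows strings up to 1/4 of it
POINTER_SIZE = 4                    # bytes per pointer on 32-bit
NUM_STRINGS = 1677721               # single-byte strings in the example

string_bytes = NUM_STRINGS * 1
pointer_bytes = NUM_STRINGS * POINTER_SIZE
total_bytes = string_bytes + pointer_bytes


def within_limit(count_pointers):
    """Check the 1/4 limit, optionally charging the pointer array too."""
    used = string_bytes + (pointer_bytes if count_pointers else 0)
    return used <= STRING_LIMIT


print(string_bytes, pointer_bytes, total_bytes)
print(within_limit(False), within_limit(True))
```

When only the strings are counted, the exec stays under the 2MB limit even though strings plus pointers come within three bytes of the whole 8MB stack; once the pointers are charged as well, the same exec is rejected.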
  14. 20 Mar, 2017 1 commit
    • x86/arch_prctl: Add ARCH_[GET|SET]_CPUID · e9ea1e7f
      Kyle Huey authored
      Intel supports faulting on the CPUID instruction beginning with Ivy Bridge.
      When enabled, the processor will fault on attempts to execute the CPUID
      instruction with CPL>0. Exposing this feature to userspace will allow a
      ptracer to trap and emulate the CPUID instruction.
      
      When supported, this feature is controlled by toggling bit 0 of
      MSR_MISC_FEATURES_ENABLES. It is documented in detail in Section 2.3.2 of
      https://bugzilla.kernel.org/attachment.cgi?id=243991
      
      Implement a new pair of arch_prctls, available on both x86-32 and x86-64.
      
      ARCH_GET_CPUID: Returns the current CPUID state, either 0 if CPUID faulting
          is enabled (and thus the CPUID instruction is not available) or 1 if
          CPUID faulting is not enabled.
      
      ARCH_SET_CPUID: Set the CPUID state to the second argument. If
          cpuid_enabled is 0 CPUID faulting will be activated, otherwise it will
          be deactivated. Returns ENODEV if CPUID faulting is not supported on
          this system.
      
      The state of the CPUID faulting flag is propagated across forks, but reset
      upon exec.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: linux-kselftest@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Robert O'Callahan <robert@ocallahan.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: user-mode-linux-user@lists.sourceforge.net
      Cc: David Matlack <dmatlack@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/20170320081628.18952-9-khuey@kylehuey.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
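The semantics that the ARCH_[GET|SET]_CPUID commit message gives for the two arch_prctls can be summarised as a small state model. Everything here is hypothetical: the `Task` class and its methods only mirror the described return values and the fork/exec rules, and they never touch the real `arch_prctl()` system call or any MSR.

```python
import errno


class Task:
    """Hypothetical per-task CPUID state; not the kernel data structure."""

    def __init__(self, hw_supports_faulting=True, cpuid_enabled=True):
        self.hw_supports_faulting = hw_supports_faulting
        self.cpuid_enabled = cpuid_enabled

    def arch_get_cpuid(self):
        # 1: CPUID executes normally, 0: CPUID faulting is enabled.
        return 1 if self.cpuid_enabled else 0

    def arch_set_cpuid(self, cpuid_enabled):
        if not self.hw_supports_faulting:
            return -errno.ENODEV  # faulting not supported on this system
        self.cpuid_enabled = cpuid_enabled != 0
        return 0

    def fork(self):
        # The faulting state is propagated to the child.
        return Task(self.hw_supports_faulting, self.cpuid_enabled)

    def exec(self):
        # The faulting state is reset when a new program is executed.
        self.cpuid_enabled = True


tracee = Task()
set_result = tracee.arch_set_cpuid(0)      # turn CPUID faulting on
child = tracee.fork()
state_after_fork = child.arch_get_cpuid()  # still faulting in the child
child.exec()
state_after_exec = child.arch_get_cpuid()  # reset by exec
print(set_result, state_after_fork, state_after_exec)
```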
  15. 02 Mar, 2017 6 commits
  16. 14 Feb, 2017 1 commit
    • vfs: Use upper filesystem inode in bprm_fill_uid() · fea6d2a6
      Vivek Goyal authored
      Right now bprm_fill_uid() uses inode fetched from file_inode(bprm->file).
      This in turn returns inode of lower filesystem (in a stacked filesystem
      setup).
      
      I was playing with modified patches of shiftfs posted by James Bottomley
      and realized that through shiftfs the setuid bit does not take effect. The
      reason is that we fetch uid/gid from the inode of the lower fs (and not from
      the shiftfs inode), which results in the following checks failing.
      
      /* We ignore suid/sgid if there are no mappings for them in the ns */
      if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
          !kgid_has_mapping(bprm->cred->user_ns, gid))
      	return;
      
      uid/gid fetched from the lower fs inode might not be mapped inside the user
      namespace of the container. So we need to look at uid/gid fetched from the
      upper filesystem (shiftfs in this particular case); these should be
      mapped, so the setuid bit can take effect.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  17. 23 Jan, 2017 1 commit
  18. 24 Dec, 2016 1 commit
  19. 23 Dec, 2016 1 commit
    • fs: exec: apply CLOEXEC before changing dumpable task flags · 613cc2b6
      Aleksa Sarai authored
      If you have a process that has set itself to be non-dumpable, and it
      then undergoes exec(2), any CLOEXEC file descriptors it has open are
      "exposed" during a race window between the dumpable flags of the process
      being reset for exec(2) and CLOEXEC being applied to the file
      descriptors. This can be exploited by a process by attempting to access
      /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.
      
      The race in question is after set_dumpable has been called (for get_link,
      though the trace is basically the same for readlink):
      
      [vfs]
      -> proc_pid_link_inode_operations.get_link
         -> proc_pid_get_link
            -> proc_fd_access_allowed
               -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
      
      This will return 0 during the race window, and CLOEXEC file descriptors
      will still be open during this window because do_close_on_exec has not
      been called yet. As a result, the ordering of these calls should be
      reversed to avoid this race window.
      
      This is of particular concern to container runtimes, where joining a
      PID namespace with file descriptors referring to the host filesystem
      can result in security issues (since PR_SET_DUMPABLE doesn't protect
      against access of CLOEXEC file descriptors -- file descriptors which may
      reference filesystem objects the container shouldn't have access to).
      
      Cc: dev@opencontainers.org
      Cc: <stable@vger.kernel.org> # v3.2+
      Reported-by: Michael Crosby <crosbymichael@gmail.com>
      Signed-off-by: Aleksa Sarai <asarai@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
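For readers less familiar with the "CLOEXEC file descriptors" that the commit above refers to, the flag can be inspected from user-space. This is a minimal sketch and does not reproduce the race itself; `is_cloexec` is a hypothetical helper built on the standard `fcntl` module.

```python
import fcntl
import os


def is_cloexec(fd):
    """Return True if the descriptor has FD_CLOEXEC set."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


# Python creates pipe descriptors non-inheritable, i.e. with FD_CLOEXEC set.
read_fd, write_fd = os.pipe()
default_state = is_cloexec(read_fd)

# Clearing the flag would let the descriptor survive exec(2).
os.set_inheritable(read_fd, True)
inheritable_state = is_cloexec(read_fd)

# Setting it again restores close-on-exec behaviour.
os.set_inheritable(read_fd, False)
restored_state = is_cloexec(read_fd)

os.close(read_fd)
os.close(write_fd)
print(default_state, inheritable_state, restored_state)
```

A descriptor with the flag set is closed by the kernel during exec(2); the commit is about making sure that close happens before the process becomes dumpable again.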
  20. 15 Dec, 2016 1 commit
    • mm: add locked parameter to get_user_pages_remote() · 5b56d49f
      Lorenzo Stoakes authored
      Patch series "mm: unexport __get_user_pages_unlocked()".
      
      This patch series continues the cleanup of get_user_pages*() functions
      taking advantage of the fact we can now pass gup_flags as we please.
      
      It firstly adds an additional 'locked' parameter to
      get_user_pages_remote() to allow for its callers to utilise
      VM_FAULT_RETRY functionality.  This is necessary as the invocation of
      __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
      this and no other existing higher level function would allow it to do
      so.
      
      Secondly existing callers of __get_user_pages_unlocked() are replaced
      with the appropriate higher-level replacement -
      get_user_pages_unlocked() if the current task and memory descriptor are
      referenced, or get_user_pages_remote() if other task/memory descriptors
      are referenced (having acquired mmap_sem).
      
      This patch (of 2):
      
      Add an int *locked parameter to get_user_pages_remote() to allow
      VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
      
      Taking into account the previous adjustments to get_user_pages*()
      functions allowing for the passing of gup_flags, we are now in a
      position where __get_user_pages_unlocked() need only be exported for its
      ability to allow VM_FAULT_RETRY behaviour. This adjustment allows us to
      subsequently unexport __get_user_pages_unlocked() as well as allowing
      for future flexibility in the use of get_user_pages_remote().
      
      [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
        Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 22 Nov, 2016 2 commits
    • exec: Ensure mm->user_ns contains the execed files · f84df2a6
      Eric W. Biederman authored
      When the user namespace support was merged the need to prevent
      ptrace from revealing the contents of an unreadable executable
      was overlooked.
      
      Correct this oversight by ensuring that the executed file
      or files are in mm->user_ns, by adjusting mm->user_ns.
      
      Use the new function privileged_wrt_inode_uidgid to see if
      the executable is a member of the user namespace, and as such
      if having CAP_SYS_PTRACE in the user namespace should allow
      tracing the executable.  If not update mm->user_ns to
      the parent user namespace until an appropriate parent is found.
      
      Cc: stable@vger.kernel.org
      Reported-by: Jann Horn <jann@thejh.net>
      Fixes: 9e4a36ec ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
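The parent walk described in the mm->user_ns commit above can be sketched as follows. This is a model under stated assumptions: `UserNS` and `adjust_mm_user_ns` are hypothetical, and `privileged_wrt_inode_uidgid` is assumed here to mean "the namespace maps both the file's uid and gid"; the real kernel helper may check more.

```python
class UserNS:
    """Hypothetical stand-in for a user namespace and its ID mappings."""

    def __init__(self, name, parent=None,
                 mapped_uids=range(0), mapped_gids=range(0)):
        self.name = name
        self.parent = parent
        self.mapped_uids = mapped_uids
        self.mapped_gids = mapped_gids


def privileged_wrt_inode_uidgid(ns, inode_uid, inode_gid):
    """Model: the namespace must map both the file's uid and gid."""
    return inode_uid in ns.mapped_uids and inode_gid in ns.mapped_gids


def adjust_mm_user_ns(ns, inode_uid, inode_gid):
    """Walk up until a namespace covers the file or the root is reached."""
    while ns.parent is not None and not privileged_wrt_inode_uidgid(
            ns, inode_uid, inode_gid):
        ns = ns.parent
    return ns


init_ns = UserNS("init", mapped_uids=range(0, 200000),
                 mapped_gids=range(0, 200000))
container_ns = UserNS("container", parent=init_ns,
                      mapped_uids=range(100000, 165536),
                      mapped_gids=range(100000, 165536))

# A file owned by IDs mapped in the container keeps the container namespace.
print(adjust_mm_user_ns(container_ns, 100000, 100000).name)
# A file owned by host root is not mapped there, so the parent is used.
print(adjust_mm_user_ns(container_ns, 0, 0).name)
```

In the second case CAP_SYS_PTRACE in the container namespace no longer suffices to trace the executable, which is the protection the commit restores.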
    • ptrace: Capture the ptracer's creds not PT_PTRACE_CAP · 64b875f7
      Eric W. Biederman authored
      When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
      overlooked.  This can result in incorrect behavior when an application
      like strace traces an exec of a setuid executable.
      
      Further PT_PTRACE_CAP does not have enough information for making good
      security decisions as it does not report which user namespace the
      capability is in.  This has already allowed one mistake through
      insufficient granularity.
      
      I found this issue when I was testing another corner case of exec and
      discovered that I could not get strace to set PT_PTRACE_CAP even when
      running strace as root with a full set of caps.
      
      This change fixes the above issue with strace allowing stracing as
      root a setuid executable without disabling setuid.  More fundamentally
      this change allows what is allowable at all times, by using the correct
      information in its decision.
      
      Cc: stable@vger.kernel.org
      Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  22. 16 Nov, 2016 1 commit
  23. 19 Oct, 2016 1 commit
  24. 02 Aug, 2016 1 commit
    • firmware: support loading into a pre-allocated buffer · a098ecd2
      Stephen Boyd authored
      Some systems are memory constrained but they need to load very large
      firmwares.  The firmware subsystem allows drivers to request this
      firmware be loaded from the filesystem, but this requires that the
      entire firmware be loaded into kernel memory first before it's provided
      to the driver.  This can lead to a situation where we map the firmware
      twice, once to load the firmware into kernel memory and once to copy the
      firmware into the final resting place.
      
      This creates needless memory pressure and delays loading because we have
      to copy from kernel memory to somewhere else.  Let's add a
      request_firmware_into_buf() API that allows drivers to request firmware
      be loaded directly into a pre-allocated buffer.  This skips the
      intermediate step of allocating a buffer in kernel memory to hold the
      firmware image while it's read from the filesystem.  It also requires
      that drivers know how much memory they'll require before requesting the
      firmware and negates any benefits of firmware caching because the
      firmware layer doesn't manage the buffer lifetime.
      
      For a 16MB buffer, about half the time is spent performing a memcpy from
      the buffer to the final resting place.  I see loading times go from
      0.081171 seconds to 0.047696 seconds after applying this patch.  Plus
      the vmalloc pressure is reduced.
      
      This is based on a patch from Vikram Mulukutla on codeaurora.org:
        https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=rel/msm-3.18&id=0a328c5f6cd999f5c591f172216835636f39bcb5
      
      Link: http://lkml.kernel.org/r/20160607164741.31849-4-stephen.boyd@linaro.org
      Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Vikram Mulukutla <markivx@codeaurora.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>