1. 19 Sep, 2018 1 commit
  2. 15 Sep, 2018 1 commit
  3. 09 Aug, 2018 2 commits
  4. 05 Jan, 2018 3 commits
  5. 07 Sep, 2017 1 commit
    • Eric Biggers's avatar
      mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 17c564f6
      Eric Biggers authored
      commit 355627f5 upstream.
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      However, it was overlooked that this introduced an new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      Here is the use-after-free reported by KASAN:
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
          Allocated by task 199:
          Freed by task 199:
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 30 Aug, 2017 1 commit
    • Eric Biggers's avatar
      fork: fix incorrect fput of ->exe_file causing use-after-free · b65b6ac5
      Eric Biggers authored
      commit 2b7e8665 upstream.
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      However, it was overlooked that this introduced an new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      This caused the struct file to later be freed prematurely.
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
          static void *mmap_thread(void *_arg)
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
          static void *fork_thread(void *_arg)
              usleep(rand() % 10000);
          int main(void)
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      Google Bug Id: 64772007
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Tested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 25 May, 2017 2 commits
    • Daniel Micay's avatar
      stackprotector: Increase the per-task stack canary's random range from 32 bits... · f157261b
      Daniel Micay authored
      stackprotector: Increase the per-task stack canary's random range from 32 bits to 64 bits on 64-bit platforms
      commit 5ea30e4e upstream.
      The stack canary is an 'unsigned long' and should be fully initialized to
      random data rather than only 32 bits of random data.
      Signed-off-by: default avatarDaniel Micay <danielmicay@gmail.com>
      Acked-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Arjan van Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-hardening@lists.openwall.com
      Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kirill Tkhai's avatar
      pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes() · 2ea2f891
      Kirill Tkhai authored
      commit 3fd37226 upstream.
      Imagine we have a pid namespace and a task from its parent's pid_ns,
      which made setns() to the pid namespace. The task is doing fork(),
      while the pid namespace's child reaper is dying. We have the race
      between them:
      Task from parent pid_ns             Child reaper
      copy_process()                      ..
        alloc_pid()                       ..
        ..                                zap_pid_ns_processes()
        ..                                  disable_pid_allocation()
        ..                                  read_lock(&tasklist_lock)
        ..                                  iterate over pids in pid_ns
        ..                                    kill tasks linked to pids
        ..                                  read_unlock(&tasklist_lock)
        write_lock_irq(&tasklist_lock);   ..
        attach_pid(p, PIDTYPE_PID);       ..
        ..                                ..
      So, just created task p won't receive SIGKILL signal,
      and the pid namespace will be in contradictory state.
      Only manual kill will help there, but does the userspace
      care about this? I suppose, the most users just inject
      a task into a pid namespace and wait a SIGCHLD from it.
      The patch fixes the problem. It simply checks for
      (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
      We do it under the tasklist_lock, and can't skip
      PIDNS_HASH_ADDING as noted by Oleg:
      "zap_pid_ns_processes() does disable_pid_allocation()
      and then takes tasklist_lock to kill the whole namespace.
      Given that copy_process() checks PIDNS_HASH_ADDING
      under write_lock(tasklist) they can't race;
      if copy_process() takes this lock first, the new child will
      be killed, otherwise copy_process() can't miss
      the change in ->nr_hashed."
      If allocation is disabled, we just return -ENOMEM
      like it's made for such cases in alloc_pid().
      v2: Do not move disable_pid_allocation(), do not
      introduce a new variable in copy_process() and simplify
      the patch as suggested by Oleg Nesterov.
      Account the problem with double irq enabling
      found by Eric W. Biederman.
      Fixes: c876ad76 ("pidns: Stop pid allocation when init dies")
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Ingo Molnar <mingo@kernel.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Oleg Nesterov <oleg@redhat.com>
      CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
      CC: Michal Hocko <mhocko@suse.com>
      CC: Andy Lutomirski <luto@kernel.org>
      CC: "Eric W. Biederman" <ebiederm@xmission.com>
      CC: Andrei Vagin <avagin@openvz.org>
      CC: Cyrill Gorcunov <gorcunov@openvz.org>
      CC: Serge Hallyn <serge@hallyn.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 14 May, 2017 1 commit
  9. 06 Jan, 2017 1 commit
    • Eric W. Biederman's avatar
      mm: Add a user_ns owner to mm_struct and fix ptrace permission checks · 694a95fa
      Eric W. Biederman authored
      commit bfedb589 upstream.
      During exec dumpable is cleared if the file that is being executed is
      not readable by the user executing the file.  A bug in
      ptrace_may_access allows reading the file if the executable happens to
      enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
      unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
      This problem is fixed with only necessary userspace breakage by adding
      a user namespace owner to mm_struct, captured at the time of exec, so
      it is clear in which user namespace CAP_SYS_PTRACE must be present in
      to be able to safely give read permission to the executable.
      The function ptrace_may_access is modified to verify that the ptracer
      has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
      This ensures that if the task changes it's cred into a subordinate
      user namespace it does not become ptraceable.
      The function ptrace_attach is modified to only set PT_PTRACE_CAP when
      CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
      PT_PTRACE_CAP is to be a flag to note that whatever permission changes
      the task might go through the tracer has sufficient permissions for
      it not to be an issue.  task->cred->user_ns is always the same
      as or descendent of mm->user_ns.  Which guarantees that having
      CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
      To prevent regressions mm->dumpable and mm->user_ns are not considered
      when a task has no mm.  As simply failing ptrace_may_attach causes
      regressions in privileged applications attempting to read things
      such as /proc/<pid>/stat
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Fixes: 8409cca7 ("userns: allow ptrace from non-init user namespaces")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 01 Nov, 2016 1 commit
  11. 10 Oct, 2016 2 commits
    • Emese Revfy's avatar
      latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy authored
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: default avatarEmese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
    • Emese Revfy's avatar
      gcc-plugins: Add latent_entropy plugin · 38addce8
      Emese Revfy authored
      This adds a new gcc plugin named "latent_entropy". It is designed to
      extract as much possible uncertainty from a running system at boot time as
      possible, hoping to capitalize on any possible variation in CPU operation
      (due to runtime data differences, hardware differences, SMP ordering,
      thermal timing variation, cache behavior, etc).
      At the very least, this plugin is a much more comprehensive example for
      how to manipulate kernel code using the gcc plugin internals.
      The need for very-early boot entropy tends to be very architecture or
      system design specific, so this plugin is more suited for those sorts
      of special cases. The existing kernel RNG already attempts to extract
      entropy from reliable runtime variation, but this plugin takes the idea to
      a logical extreme by permuting a global variable based on any variation
      in code execution (e.g. a different value (and permutation function)
      is used to permute the global based on loop count, case statement,
      if/then/else branching, etc).
      To do this, the plugin starts by inserting a local variable in every
      marked function. The plugin then adds logic so that the value of this
      variable is modified by randomly chosen operations (add, xor and rol) and
      random values (gcc generates separate static values for each location at
      compile time and also injects the stack pointer at runtime). The resulting
      value depends on the control flow path (e.g., loops and branches taken).
      Before the function returns, the plugin mixes this local variable into
      the latent_entropy global variable. The value of this global variable
      is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
      though it does not credit any bytes of entropy to the pool; the contents
      of the global are just used to mix the pool.
      Additionally, the plugin can pre-initialize arrays with build-time
      random contents, so that two different kernel builds running on identical
      hardware will not have the same starting values.
      Signed-off-by: default avatarEmese Revfy <re.emese@gmail.com>
      [kees: expanded commit message and code comments]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
  12. 08 Oct, 2016 4 commits
    • Aaron Lu's avatar
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu authored
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
      the global counter in two cases:
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      Case used for test on Haswell EP:
        usemem -n 72 --readonly -j 0x200000 100G
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.comSigned-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Michal Hocko's avatar
      mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Michal Hocko authored
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm so we do not need the
      signal_struct counter anymore so let's get rid of it.
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Michal Hocko's avatar
      kernel, oom: fix potential pgd_lock deadlock from __mmdrop · 7283094e
      Michal Hocko authored
      Lockdep complains that __mmdrop is not safe from the softirq context:
        [ INFO: inconsistent lock state ]
        4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
        {SOFTIRQ-ON-W} state was registered at:
        irq event stamp: 105916
        hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
        hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
        softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
        softirqs last disabled at (105879): irq_exit+0x6f/0xd1
        other info that might help us debug this:
         Possible unsafe locking scenario:
         *** DEADLOCK ***
        1 lock held by swapper/1/0:
         #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
        stack backtrace:
        CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
      More over commit a79e53d8 ("x86/mm: Fix pgd_lock deadlock") was
      explicit about pgd_lock not to be called from the irq context.  This
      means that __mmdrop called from free_signal_struct has to be postponed
      to a user context.  We already have a similar mechanism for mmput_async
      so we can use it here as well.  This is safe because mm_count is pinned
      by mm_users.
      This fixes bug introduced by "oom: keep mm of the killed task available"
      Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Michal Hocko's avatar
      oom: keep mm of the killed task available · 26db62f1
      Michal Hocko authored
      oom_reap_task has to call exit_oom_victim in order to make sure that the
      oom vicim will not block the oom killer for ever.  This is, however,
      opening new problems (e.g oom_killer_disable exclusion - see commit
      74070542 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
      race")).  exit_oom_victim should be only called from the victim's
      context ideally.
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
            tsk->mm = NULL;
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      This patch takes a different approach.  We remember tsk->mm into the
      signal_struct and bind it to the signal struct life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
      have to rely on find_lock_task_mm anymore and they will have a reliable
      reference to the mm struct.  As a result all the oom specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
      Increasing the signal_struct for something as unlikely as the oom killer
      is far from ideal but this approach will make the code much more
      reasonable and long term we even might want to move task->mm into the
      signal_struct anyway.  In the next step we might want to make the oom
      killer exclusion and access to memory reserves completely independent
      which would be also nice.
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 16 Sep, 2016 2 commits
  14. 02 Sep, 2016 1 commit
    • Michal Hocko's avatar
      kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd · 735f2770
      Michal Hocko authored
      Commit fec1d011 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
      exit") has caused a subtle regression in nscd which uses
      CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
      shared databases, so that the clients are notified when nscd is
      restarted.  Now, when nscd uses a non-persistent database, clients that
      have it mapped keep thinking the database is being updated by nscd, when
      in fact nscd has created a new (anonymous) one (for non-persistent
      databases it uses an unlinked file as backend).
      The original proposal for the CLONE_CHILD_CLEARTID change claimed
      : The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
      : on behalf of pthread_create() library calls.  This feature is used to
      : request that the kernel clear the thread-id in user space (at an address
      : provided in the syscall) when the thread disassociates itself from the
      : address space, which is done in mm_release().
      : Unfortunately, when a multi-threaded process incurs a core dump (such as
      : from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
      : the other threads, which then proceed to clear their user-space tids
      : before synchronizing in exit_mm() with the start of core dumping.  This
      : misrepresents the state of process's address space at the time of the
      : SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
      : problems (misleading him/her to conclude that the threads had gone away
      : before the fault).
      : The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
      : core dump has been initiated.
      The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
      seems to have a larger scope than the original patch asked for.  It
      seems that limitting the scope of the check to core dumping should work
      for SIGSEGV issue describe above.
      [Changelog partly based on Andreas' description]
      Fixes: fec1d011 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
      Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: default avatarWilliam Preston <wpreston@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Andreas Schwab <schwab@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 31 Aug, 2016 1 commit
  16. 24 Aug, 2016 1 commit
  17. 17 Aug, 2016 1 commit
    • Balbir Singh's avatar
      cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork · 568ac888
      Balbir Singh authored
      cgroup_threadgroup_rwsem is acquired in read mode during process exit
      and fork.  It is also grabbed in write mode during
      __cgroups_proc_write().  I've recently run into a scenario with lots
      of memory pressure and OOM and I am beginning to see
      This thread is waiting on the reader of cgroup_threadgroup_rwsem to
      exit.  The reader itself is under memory pressure and has gone into
      reclaim after fork. There are times the reader also ends up waiting on
      oom_lock as well.
      In the meanwhile, all processes exiting/forking are blocked almost
      stalling the system.
      This patch moves the threadgroup_change_begin from before
      cgroup_fork() to just before cgroup_canfork().  There is no nee to
      worry about threadgroup changes till the task is actually added to the
      threadgroup.  This avoids having to call reclaim with
      cgroup_threadgroup_rwsem held.
      tj: Subject and description edits.
      Signed-off-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Acked-by: default avatarZefan Li <lizefan@huawei.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  18. 08 Aug, 2016 3 commits
  19. 28 Jul, 2016 2 commits
  20. 26 Jul, 2016 1 commit
    • Vladimir Davydov's avatar
      mm: charge/uncharge kmemcg from generic page allocator paths · 4949148a
      Vladimir Davydov authored
      Currently, to charge a non-slab allocation to kmemcg one has to use
      alloc_kmem_pages helper with __GFP_ACCOUNT flag.  A page allocated with
      this helper should finally be freed using free_kmem_pages, otherwise it
      won't be uncharged.
      This API suits its current users fine, but it turns out to be impossible
      to use along with page reference counting, i.e.  when an allocation is
      supposed to be freed with put_page, as it is the case with pipe or unix
      socket buffers.
      To overcome this limitation, this patch moves charging/uncharging to
      generic page allocator paths, i.e.  to __alloc_pages_nodemask and
      free_pages_prepare, and zaps alloc/free_kmem_pages helpers.  This way,
      one can use any of the available page allocation functions to get the
      allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
      just like in case of kmalloc and friends.  A charged page will be
      automatically uncharged on free.
      To make it possible, we need to mark pages charged to kmemcg somehow.
      To avoid introducing a new page flag, we make use of page->_mapcount for
      marking such pages.  Since pages charged to kmemcg are not supposed to
      be mapped to userspace, it should work just fine.  There are other
      (ab)users of page->_mapcount - buddy and balloon pages - but we don't
      conflict with them.
      In case kmemcg is compiled out or not used at runtime, this patch
      introduces no overhead to generic page allocator paths.  If kmemcg is
      used, it will be plus one gfp flags check on alloc and plus one
      page->_mapcount check on free, which shouldn't hurt performance, because
      the data accessed are hot.
      Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.comSigned-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  21. 25 Jun, 2016 1 commit
  22. 24 Jun, 2016 1 commit
    • Linus Torvalds's avatar
      Clarify naming of thread info/stack allocators · b235beea
      Linus Torvalds authored
      We've had the thread info allocated together with the thread stack for
      most architectures for a long time (since the thread_info was split off
      from the task struct), but that is about to change.
      But the patches that move the thread info to be off-stack (and a part of
      the task struct instead) made it clear how confused the allocator and
      freeing functions are.
      Because the common case was that we share an allocation with the thread
      stack and the thread_info, the two pointers were identical.  That
      identity then meant that we would have things like
      	ti = alloc_thread_info_node(tsk, node);
      	tsk->stack = ti;
      which certainly _worked_ (since stack and thread_info have the same
      value), but is rather confusing: why are we assigning a thread_info to
      the stack? And if we move the thread_info away, the "confusing" code
      just gets to be entirely bogus.
      So remove all this confusion, and make it clear that we are doing the
      stack allocation by renaming and clarifying the function names to be
      about the stack.  The fact that the thread_info then shares the
      allocation is an implementation detail, and not really about the
      allocation itself.
      This is a pure renaming and type fix: we pass in the same pointer, it's
      just that we clarify what the pointer means.
      The ia64 code that actually only has one single allocation (for all of
      task_struct, thread_info and kernel thread stack) now looks a bit odd,
      but since "tsk->stack" is actually not even used there, that oddity
      doesn't matter.  It would be a separate thing to clean that up, I
      intentionally left the ia64 changes as a pure brute-force renaming and
      type change.
      Acked-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  23. 26 May, 2016 1 commit
  24. 24 May, 2016 2 commits
  25. 21 May, 2016 2 commits
    • Jiri Slaby's avatar
      fork: free thread in copy_process on failure · 0740aa5f
      Jiri Slaby authored
      When using this program (as root):
      	#include <err.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/io.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      	#define ITER 1000
      	#define FORKERS 15
      	#define THREADS (6000/FORKERS) // 1850 is proc max
      	static void fork_100_wait()
      		unsigned a, to_wait = 0;
      		printf("\t%d forking %d\n", THREADS, getpid());
      		for (a = 0; a < THREADS; a++) {
      			switch (fork()) {
      			case 0:
      			case -1:
      		printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
      		for (a = 0; a < to_wait; a++)
      		printf("\t%d waited from %d\n", THREADS, getpid());
      	static void run_forkers()
      		pid_t forkers[FORKERS];
      		unsigned a;
      		for (a = 0; a < FORKERS; a++) {
      			switch ((forkers[a] = fork())) {
      			case 0:
      			case -1:
      				err(1, "DIE fork of %d'th forker", a);
      		for (a = 0; a < FORKERS; a++)
      			waitpid(forkers[a], NULL, 0);
      	int main()
      		unsigned a;
      		int ret;
      		ret = ioperm(10, 20, 0);
      		if (ret < 0)
      			err(1, "ioperm");
      		for (a = 0; a < ITER; a++)
      		return 0;
      kmemleak reports many occurences of this leak:
      unreferenced object 0xffff8805917c8000 (size 8192):
        comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
          [<ffffffff814cfbf5>] kmemdup+0x25/0x50
          [<ffffffff8103ab43>] copy_thread_tls+0x6c3/0x9a0
          [<ffffffff81150174>] copy_process+0x1a84/0x5790
          [<ffffffff811dc375>] wake_up_new_task+0x2d5/0x6f0
          [<ffffffff8115411d>] _do_fork+0x12d/0x820
      Due to the leakage of the memory items which should have been freed in
      Make sure the memory is freed when fork fails later in copy_process.
      This is done by calling exit_thread with the thread to kill.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Michal Hocko's avatar
      mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Michal Hocko authored
      Tetsuo has properly noted that mmput slow path might get blocked waiting
      for another party (e.g.  exit_aio waits for an IO).  If that happens the
      oom_reaper would be put out of the way and will not be able to process
      next oom victim.  We should strive for making this context as reliable
      and independent on other subsystems as much as possible.
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
      The issue is only theoretical but not impossible.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  26. 03 May, 2016 1 commit
    • Stas Sergeev's avatar
      signals/sigaltstack: Implement SS_AUTODISARM flag · 2a742138
      Stas Sergeev authored
      This patch implements the SS_AUTODISARM flag that can be OR-ed with
      SS_ONSTACK when forming ss_flags.
      When this flag is set, sigaltstack will be disabled when entering
      the signal handler; more precisely, after saving sas to uc_stack.
      When leaving the signal handler, the sigaltstack is restored by
      When this flag is used, it is safe to switch from sighandler with
      swapcontext(). Without this flag, the subsequent signal will corrupt
      the state of the switched-away sighandler.
      To detect the support of this functionality, one can do:
        err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
        if (err && errno == EINVAL)
      Signed-off-by: default avatarStas Sergeev <stsp@list.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ruSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>