1. 01 Nov, 2018 1 commit
    • Philippe Gerum's avatar
      sched: ipipe: enable task migration between domains · 957ac4c9
      Philippe Gerum authored
      This is the basic code enabling alternate control of tasks between the
      regular kernel and an embedded co-kernel. The changes cover the
      following aspects:
      - extend the per-thread information block with a private area usable
        by the co-kernel for storing additional state information
      - provide the API enabling a scheduler exchange mechanism, so that
        tasks can run under the control of either kernel alternatively. This
        includes a service to move the current task to the head domain under
        the control of the co-kernel, and the converse service to re-enter
        the root domain once the co-kernel has released such task.
      - ensure the generic context switching code can be used from any
        domain, serializing execution as required.
      These changes have to be paired with arch-specific code further
      enabling context switching from the head domain.
  2. 15 Sep, 2018 1 commit
  3. 03 Aug, 2018 1 commit
    • Kees Cook's avatar
      fork: unconditionally clear stack on fork · 2d5fc7ff
      Kees Cook authored
      commit e01e8063 upstream.
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      I did some more with perf and cycle counts on running 100,000 execs of
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beastSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 22 Feb, 2018 1 commit
  5. 29 Dec, 2017 1 commit
    • Thomas Gleixner's avatar
      arch, mm: Allow arch_dup_mmap() to fail · ee8e8b2d
      Thomas Gleixner authored
      commit c10e83f5 upstream.
      In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
      allowed to fail. Fix up all instances.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: dan.j.williams@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: linux-mm@kvack.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 13 Oct, 2017 1 commit
  7. 04 Oct, 2017 1 commit
  8. 09 Sep, 2017 2 commits
  9. 07 Sep, 2017 2 commits
    • Rik van Riel's avatar
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel authored
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      It would be better to have the kernel take care of this automatically.
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.comSigned-off-by: default avatarRik van Riel <riel@redhat.com>
      Reported-by: default avatarFlorian Weimer <fweimer@redhat.com>
      Reported-by: default avatarColm MacCártaigh <colm@allcosts.net>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Andrea Arcangeli's avatar
      mm: oom: let oom_reap_task and exit_mmap run concurrently · 21292580
      Andrea Arcangeli authored
      This is purely required because exit_aio() may block and exit_mmap() may
      never start, if the oom_reap_task cannot start running on a mm with
      mm_users == 0.
      At the same time if the OOM reaper doesn't wait at all for the memory of
      the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
      generate a spurious OOM kill.
      If it wasn't because of the exit_aio or similar blocking functions in
      the last mmput, it would be enough to change the oom_reap_task() in the
      case it finds mm_users == 0, to wait for a timeout or to wait for
      __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
      problem here so the concurrency of exit_mmap and oom_reap_task is
      apparently warranted.
      It's a non standard runtime, exit_mmap() runs without mmap_sem, and
      oom_reap_task runs with the mmap_sem for reading as usual (kind of
      The race between the two is solved with a combination of
      tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
      (serialized by a dummy down_write/up_write cycle on the same lines of
      the ksm_exit method).
      If the oom_reap_task() may be running concurrently during exit_mmap,
      exit_mmap will wait it to finish in down_write (before taking down mm
      structures that would make the oom_reap_task fail with use after free).
      If exit_mmap comes first, oom_reap_task() will skip the mm if
      MMF_OOM_SKIP is already set and in turn all memory is already freed and
      furthermore the mm data structures may already have been taken down by
      [aarcange@redhat.com: incremental one liner]
        Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
      [rientjes@google.com: remove unused mmput_async]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
      [aarcange@redhat.com: microoptimization]
        Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
      Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
      Fixes: 26db62f1 ("oom: keep mm of the killed task available")
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarDavid Rientjes <rientjes@google.com>
      Tested-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 31 Aug, 2017 1 commit
    • Eric Biggers's avatar
      mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 355627f5
      Eric Biggers authored
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      However, it was overlooked that this introduced an new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      Here is the use-after-free reported by KASAN:
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
          Allocated by task 199:
          Freed by task 199:
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>    [4.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  11. 25 Aug, 2017 1 commit
    • Eric Biggers's avatar
      fork: fix incorrect fput of ->exe_file causing use-after-free · 2b7e8665
      Eric Biggers authored
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      However, it was overlooked that this introduced an new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      This caused the struct file to later be freed prematurely.
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
          static void *mmap_thread(void *_arg)
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
          static void *fork_thread(void *_arg)
              usleep(rand() % 10000);
          int main(void)
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      Google Bug Id: 64772007
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Tested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[v4.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 15 Aug, 2017 1 commit
    • Mark Rutland's avatar
      fork: allow arch-override of VMAP stack alignment · 48ac3c18
      Mark Rutland authored
      In some cases, an architecture might wish its stacks to be aligned to a
      boundary larger than THREAD_SIZE. For example, using an alignment of
      double THREAD_SIZE can allow for stack overflows smaller than
      THREAD_SIZE to be detected by checking a single bit of the stack
      This patch allows architectures to override the alignment of VMAP'd
      stacks, by defining THREAD_ALIGN. Where not defined, this defaults to
      THREAD_SIZE, as is the case today.
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Reviewed-by: default avatarWill Deacon <will.deacon@arm.com>
      Tested-by: default avatarLaura Abbott <labbott@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: linux-kernel@vger.kernel.org
  13. 10 Aug, 2017 2 commits
    • Nadav Amit's avatar
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit authored
      Patch series "fixes of TLB batching races", v6.
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      This patch (of 7):
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      An actual data corruption was not observed, yet the race was was
      confirmed by adding assertion to check tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range() and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Byungchul Park's avatar
      locking/lockdep: Implement the 'crossrelease' feature · b09be676
      Byungchul Park authored
      Lockdep is a runtime locking correctness validator that detects and
      reports a deadlock or its possibility by checking dependencies between
      locks. It's useful since it does not report just an actual deadlock but
      also the possibility of a deadlock that has not actually happened yet.
      That enables problems to be fixed before they affect real systems.
      However, this facility is only applicable to typical locks, such as
      spinlocks and mutexes, which are normally released within the context in
      which they were acquired. However, synchronization primitives like page
      locks or completions, which are allowed to be released in any context,
      also create dependencies and can cause a deadlock.
      So lockdep should track these locks to do a better job. The 'crossrelease'
      implementation makes these primitives also be tracked.
      Signed-off-by: default avatarByungchul Park <byungchul.park@lge.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
  14. 18 Jul, 2017 1 commit
  15. 12 Jul, 2017 3 commits
  16. 06 Jul, 2017 1 commit
  17. 05 Jul, 2017 2 commits
  18. 22 May, 2017 1 commit
    • Vegard Nossum's avatar
      kthread: Fix use-after-free if kthread fork fails · 4d6501dc
      Vegard Nossum authored
      If a kthread forks (e.g. usermodehelper since commit 1da5c46f) but
      fails in copy_process() between calling dup_task_struct() and setting
      p->set_child_tid, then the value of p->set_child_tid will be inherited
      from the parent and get prematurely freed by free_kthread_struct().
           - worker_thread()
              - process_one_work()
              |  - call_usermodehelper_exec_work()
              |     - kernel_thread()
              |        - _do_fork()
              |           - copy_process()
              |              - dup_task_struct()
              |                 - arch_dup_task_struct()
              |                    - tsk->set_child_tid = current->set_child_tid // implied
              |              - ...
              |              - goto bad_fork_*
              |              - ...
              |              - free_task(tsk)
              |                 - free_kthread_struct(tsk)
              |                    - kfree(tsk->set_child_tid)
              - ...
              - schedule()
                 - __schedule()
                    - wq_worker_sleeping()
                       - kthread_data(task)->flags // UAF
      The problem started showing up with commit 1da5c46f since it reused
      ->set_child_tid for the kthread worker data.
      A better long-term solution might be to get rid of the ->set_child_tid
      abuse. The comment in set_kthread_struct() also looks slightly wrong.
      Debugged-by: default avatarJamie Iles <jamie.iles@oracle.com>
      Fixes: 1da5c46f ("kthread: Make struct kthread kmalloc'ed")
      Signed-off-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  19. 13 May, 2017 1 commit
    • Kirill Tkhai's avatar
      pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes() · 3fd37226
      Kirill Tkhai authored
      Imagine we have a pid namespace and a task from its parent's pid_ns,
      which made setns() to the pid namespace. The task is doing fork(),
      while the pid namespace's child reaper is dying. We have the race
      between them:
      Task from parent pid_ns             Child reaper
      copy_process()                      ..
        alloc_pid()                       ..
        ..                                zap_pid_ns_processes()
        ..                                  disable_pid_allocation()
        ..                                  read_lock(&tasklist_lock)
        ..                                  iterate over pids in pid_ns
        ..                                    kill tasks linked to pids
        ..                                  read_unlock(&tasklist_lock)
        write_lock_irq(&tasklist_lock);   ..
        attach_pid(p, PIDTYPE_PID);       ..
        ..                                ..
      So, just created task p won't receive SIGKILL signal,
      and the pid namespace will be in contradictory state.
      Only manual kill will help there, but does the userspace
      care about this? I suppose, the most users just inject
      a task into a pid namespace and wait a SIGCHLD from it.
      The patch fixes the problem. It simply checks for
      (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
      We do it under the tasklist_lock, and can't skip
      PIDNS_HASH_ADDING as noted by Oleg:
      "zap_pid_ns_processes() does disable_pid_allocation()
      and then takes tasklist_lock to kill the whole namespace.
      Given that copy_process() checks PIDNS_HASH_ADDING
      under write_lock(tasklist) they can't race;
      if copy_process() takes this lock first, the new child will
      be killed, otherwise copy_process() can't miss
      the change in ->nr_hashed."
      If allocation is disabled, we just return -ENOMEM
      like it's made for such cases in alloc_pid().
      v2: Do not move disable_pid_allocation(), do not
      introduce a new variable in copy_process() and simplify
      the patch as suggested by Oleg Nesterov.
      Account the problem with double irq enabling
      found by Eric W. Biederman.
      Fixes: c876ad76 ("pidns: Stop pid allocation when init dies")
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Ingo Molnar <mingo@kernel.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Oleg Nesterov <oleg@redhat.com>
      CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
      CC: Michal Hocko <mhocko@suse.com>
      CC: Andy Lutomirski <luto@kernel.org>
      CC: "Eric W. Biederman" <ebiederm@xmission.com>
      CC: Andrei Vagin <avagin@openvz.org>
      CC: Cyrill Gorcunov <gorcunov@openvz.org>
      CC: Serge Hallyn <serge@hallyn.com>
      Cc: stable@vger.kernel.org
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
  20. 09 May, 2017 2 commits
  21. 05 May, 2017 1 commit
  22. 18 Apr, 2017 1 commit
    • Paul E. McKenney's avatar
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney authored
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
  23. 04 Apr, 2017 1 commit
    • Xunlei Pang's avatar
      sched/rtmutex/deadline: Fix a PI crash for deadline tasks · e96a7705
      Xunlei Pang authored
      A crash happened while I was playing with deadline PI rtmutex.
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
          PGD 232a75067 PUD 230947067 PMD 0
          Oops: 0000 [#1] SMP
          CPU: 1 PID: 10994 Comm: a.out Not tainted
          Call Trace:
          [<ffffffff810b658c>] enqueue_task+0x2c/0x80
          [<ffffffff810ba763>] activate_task+0x23/0x30
          [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
          [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
          [<ffffffff8164e783>] __schedule+0xd3/0x900
          [<ffffffff8164efd9>] schedule+0x29/0x70
          [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
          [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
          [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
          [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
          [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
          [<ffffffff810edd50>] SyS_futex+0x80/0x180
      This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
      are only protected by pi_lock when operating pi waiters, while
      rt_mutex_get_top_task(), will access them with rq lock held but
      not holding pi_lock.
      In order to tackle it, we introduce new "pi_top_task" pointer
      cached in task_struct, and add new rt_mutex_update_top_task()
      to update its value, it can be called by rt_mutex_setprio()
      which held both owner's pi_lock and rq lock. Thus "pi_top_task"
      can be safely accessed by enqueue_task_dl() under rq lock.
      Originally-From: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarXunlei Pang <xlpang@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  24. 28 Mar, 2017 1 commit
    • Tetsuo Handa's avatar
      LSM: Revive security_task_alloc() hook and per "struct task_struct" security blob. · e4e55b47
      Tetsuo Handa authored
      We switched from "struct task_struct"->security to "struct cred"->security
      in Linux 2.6.29. But not all LSM modules were happy with that change.
      TOMOYO LSM module is an example which want to use per "struct task_struct"
      security blob, for TOMOYO's security context is defined based on "struct
      task_struct" rather than "struct cred". AppArmor LSM module is another
      example which want to use it, for AppArmor is currently abusing the cred
      a little bit to store the change_hat and setexeccon info. Although
      security_task_free() hook was revived in Linux 3.4 because Yama LSM module
      wanted to release per "struct task_struct" security blob,
      security_task_alloc() hook and "struct task_struct"->security field were
      not revived. Nowadays, we are getting proposals of lightweight LSM modules
      which want to use per "struct task_struct" security blob.
      We are already allowing multiple concurrent LSM modules (up to one fully
      armored module which uses "struct cred"->security field or exclusive hooks
      like security_xfrm_state_pol_flow_match(), plus unlimited number of
      lightweight modules which do not use "struct cred"->security nor exclusive
      hooks) as long as they are built into the kernel. But this patch does not
      implement variable length "struct task_struct"->security field which will
      become needed when multiple LSM modules want to use "struct task_struct"->
      security field. Although it won't be difficult to implement variable length
      "struct task_struct"->security field, let's think about it after we merged
      this patch.
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarJohn Johansen <john.johansen@canonical.com>
      Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
      Acked-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      Tested-by: default avatarDjalal Harouni <tixxdz@gmail.com>
      Acked-by: default avatarJosé Bollo <jobol@nonadev.net>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: José Bollo <jobol@nonadev.net>
      Signed-off-by: default avatarJames Morris <james.l.morris@oracle.com>
  25. 13 Mar, 2017 1 commit
    • Hari Bathini's avatar
      perf: Add PERF_RECORD_NAMESPACES to include namespaces related info · e4222673
      Hari Bathini authored
      With the advert of container technologies like docker, that depend on
      namespaces for isolation, there is a need for tracing support for
      namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for
      recording namespaces related info. By recording info for every
      namespace, it is left to userspace to take a call on the definition of a
      container and trace containers by updating perf tool accordingly.
      Each namespace has a combination of device and inode numbers. Though
      every namespace has the same device number currently, that may change in
      future to avoid the need for a namespace of namespaces. Considering such
      possibility, record both device and inode numbers separately for each
      Signed-off-by: default avatarHari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
  26. 08 Mar, 2017 1 commit
    • Josh Poimboeuf's avatar
      livepatch: change to a per-task consistency model · d83a7cb3
      Josh Poimboeuf authored
      Change livepatch to use a basic per-task consistency model.  This is the
      foundation which will eventually enable us to patch those ~10% of
      security patches which change function or data semantics.  This is the
      biggest remaining piece needed to make livepatch more generally useful.
      This code stems from the design proposal made by Vojtech [1] in November
      2014.  It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
      consistency and syscall barrier switching combined with kpatch's stack
      trace switching.  There are also a number of fallback options which make
      it quite flexible.
      Patches are applied on a per-task basis, when the task is deemed safe to
      switch over.  When a patch is enabled, livepatch enters into a
      transition state where tasks are converging to the patched state.
      Usually this transition state can complete in a few seconds.  The same
      sequence occurs when a patch is disabled, except the tasks converge from
      the patched state to the unpatched state.
      An interrupt handler inherits the patched state of the task it
      interrupts.  The same is true for forked tasks: the child inherits the
      patched state of the parent.
      Livepatch uses several complementary approaches to determine when it's
      safe to patch tasks:
      1. The first and most effective approach is stack checking of sleeping
         tasks.  If no affected functions are on the stack of a given task,
         the task is patched.  In most cases this will patch most or all of
         the tasks on the first try.  Otherwise it'll keep trying
         periodically.  This option is only available if the architecture has
         reliable stacks (HAVE_RELIABLE_STACKTRACE).
      2. The second approach, if needed, is kernel exit switching.  A
         task is switched when it returns to user space from a system call, a
         user space IRQ, or a signal.  It's useful in the following cases:
         a) Patching I/O-bound user tasks which are sleeping on an affected
            function.  In this case you have to send SIGSTOP and SIGCONT to
            force it to exit the kernel and be patched.
         b) Patching CPU-bound user tasks.  If the task is highly CPU-bound
            then it will get patched the next time it gets interrupted by an
         c) In the future it could be useful for applying patches for
            architectures which don't yet have HAVE_RELIABLE_STACKTRACE.  In
            this case you would have to signal most of the tasks on the
            system.  However this isn't supported yet because there's
            currently no way to patch kthreads without
      3. For idle "swapper" tasks, since they don't ever exit the kernel, they
         instead have a klp_update_patch_state() call in the idle loop which
         allows them to be patched before the CPU enters the idle state.
         (Note there's not yet such an approach for kthreads.)
      All the above approaches may be skipped by setting the 'immediate' flag
      in the 'klp_patch' struct, which will disable per-task consistency and
      patch all tasks immediately.  This can be useful if the patch doesn't
      change any function or data semantics.  Note that, even with this flag
      set, it's possible that some tasks may still be running with an old
      version of the function, until that function returns.
      There's also an 'immediate' flag in the 'klp_func' struct which allows
      you to specify that certain functions in the patch can be applied
      without per-task consistency.  This might be useful if you want to patch
      a common function like schedule(), and the function change doesn't need
      consistency but the rest of the patch does.
      For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
      must set patch->immediate which causes all tasks to be patched
      immediately.  This option should be used with care, only when the patch
      doesn't change any function or data semantics.
      In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
      may be allowed to use per-task consistency if we can come up with
      another way to patch kthreads.
      The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
      is in transition.  Only a single patch (the topmost patch on the stack)
      can be in transition at a given time.  A patch can remain in transition
      indefinitely, if any of the tasks are stuck in the initial patch state.
      A transition can be reversed and effectively canceled by writing the
      opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
      the transition is in progress.  Then all the tasks will attempt to
      converge back to the original patch state.
      [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.czSigned-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Acked-by: Ingo Molnar <mingo@kernel.org>        # for the scheduler changes
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
  27. 03 Mar, 2017 1 commit
  28. 02 Mar, 2017 6 commits