1. 21 Jul, 2019 2 commits
    • Eiichi Tsukata's avatar
      cpu/hotplug: Fix out-of-bounds read when setting fail state · 8f5e9235
      Eiichi Tsukata authored
      [ Upstream commit 33d4a5a7a5b4d02915d765064b2319e90a11cbde ]
      
      Setting invalid value to /sys/devices/system/cpu/cpuX/hotplug/fail
      can control `struct cpuhp_step *sp` address, results in the following
      global-out-of-bounds read.
      
      Reproducer:
      
        # echo -2 > /sys/devices/system/cpu/cpu0/hotplug/fail
      
      KASAN report:
      
        BUG: KASAN: global-out-of-bounds in write_cpuhp_fail+0x2cd/0x2e0
        Read of size 8 at addr ffffffff89734438 by task bash/1941
      
        CPU: 0 PID: 1941 Comm: bash Not tainted 5.2.0-rc6+ #31
        Call Trace:
         write_cpuhp_fail+0x2cd/0x2e0
         dev_attr_store+0x58/0x80
         sysfs_kf_write+0x13d/0x1a0
         kernfs_fop_write+0x2bc/0x460
         vfs_write+0x1e1/0x560
         ksys_write+0x126/0x250
         do_syscall_64+0xc1/0x390
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f05e4f4c970
      
        The buggy address belongs to the variable:
         cpu_hotplug_lock+0x98/0xa0
      
        Memory state around the buggy address:
         ffffffff89734300: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
         ffffffff89734380: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
        >ffffffff89734400: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
                                                ^
         ffffffff89734480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffffffff89734500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Add a sanity check for the value written from user space.
      
      Fixes: 1db49484 ("smp/hotplug: Hotplug state fail injection")
      Signed-off-by: default avatarEiichi Tsukata <devel@etsukata.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Link: https://lkml.kernel.org/r/20190627024732.31672-1-devel@etsukata.comSigned-off-by: default avatarSasha Levin <sashal@kernel.org>
      8f5e9235
    • Peter Zijlstra's avatar
      perf/core: Fix perf_sample_regs_user() mm check · a68c5a52
      Peter Zijlstra authored
      [ Upstream commit 085ebfe937d7a7a5df1729f35a12d6d655fea68c ]
      
      perf_sample_regs_user() uses 'current->mm' to test for the presence of
      userspace, but this is insufficient, consider use_mm().
      
      A better test is: '!(current->flags & PF_KTHREAD)', exec() clears
      PF_KTHREAD after it sets the new ->mm but before it drops to userspace
      for the first time.
      
      Possibly obsoletes: bf05fc25 ("powerpc/perf: Fix oops when kthread execs user process")
      Reported-by: default avatarRavi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Reported-by: default avatarYoung Xiao <92siuyang@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4018994f ("perf: Add ability to attach user level registers dump to sample")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      a68c5a52
  2. 10 Jul, 2019 6 commits
    • Petr Mladek's avatar
      ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() · 0c0b5477
      Petr Mladek authored
      commit d5b844a2cf507fc7642c9ae80a9d585db3065c28 upstream.
      
      The commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text
      permissions race") causes a possible deadlock between register_kprobe()
      and ftrace_run_update_code() when ftrace is using stop_machine().
      
      The existing dependency chain (in reverse order) is:
      
      -> #1 (text_mutex){+.+.}:
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             __mutex_lock+0x88/0x908
             mutex_lock_nested+0x32/0x40
             register_kprobe+0x254/0x658
             init_kprobes+0x11a/0x168
             do_one_initcall+0x70/0x318
             kernel_init_freeable+0x456/0x508
             kernel_init+0x22/0x150
             ret_from_fork+0x30/0x34
             kernel_thread_starter+0x0/0xc
      
      -> #0 (cpu_hotplug_lock.rw_sem){++++}:
             check_prev_add+0x90c/0xde0
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             cpus_read_lock+0x62/0xd0
             stop_machine+0x2e/0x60
             arch_ftrace_update_code+0x2e/0x40
             ftrace_run_update_code+0x40/0xa0
             ftrace_startup+0xb2/0x168
             register_ftrace_function+0x64/0x88
             klp_patch_object+0x1a2/0x290
             klp_enable_patch+0x554/0x980
             do_one_initcall+0x70/0x318
             do_init_module+0x6e/0x250
             load_module+0x1782/0x1990
             __s390x_sys_finit_module+0xaa/0xf0
             system_call+0xd8/0x2d0
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(text_mutex);
                                     lock(cpu_hotplug_lock.rw_sem);
                                     lock(text_mutex);
        lock(cpu_hotplug_lock.rw_sem);
      
      It is similar problem that has been solved by the commit 2d1e38f5
      ("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
      To be on the safe side, text_mutex must become a low level lock taken
      after cpu_hotplug_lock.rw_sem.
      
      This can't be achieved easily with the current ftrace design.
      For example, arm calls set_all_modules_text_rw() already in
      ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
      This functions is called:
      
        + outside stop_machine() from ftrace_run_update_code()
        + without stop_machine() from ftrace_module_enable()
      
      Fortunately, the problematic fix is needed only on x86_64. It is
      the only architecture that calls set_all_modules_text_rw()
      in ftrace path and supports livepatching at the same time.
      
      Therefore it is enough to move text_mutex handling from the generic
      kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
      
         ftrace_arch_code_modify_prepare()
         ftrace_arch_code_modify_post_process()
      
      This patch basically reverts the ftrace part of the problematic
      commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module
      text permissions race"). And provides x86_64 specific-fix.
      
      Some refactoring of the ftrace code will be needed when livepatching
      is implemented for arm or nds32. These architectures call
      set_all_modules_text_rw() and use stop_machine() at the same time.
      
      Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
      
      Fixes: 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text permissions race")
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reported-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Reviewed-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
      [
        As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
        ftrace_run_update_code() as it is a void function.
      ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0c0b5477
    • Eiichi Tsukata's avatar
      tracing/snapshot: Resize spare buffer if size changed · 90b89546
      Eiichi Tsukata authored
      commit 46cc0b44428d0f0e81f11ea98217fc0edfbeab07 upstream.
      
      Current snapshot implementation swaps two ring_buffers even though their
      sizes are different from each other, that can cause an inconsistency
      between the contents of buffer_size_kb file and the current buffer size.
      
      For example:
      
        # cat buffer_size_kb
        7 (expanded: 1408)
        # echo 1 > events/enable
        # grep bytes per_cpu/cpu0/stats
        bytes: 1441020
        # echo 1 > snapshot             // current:1408, spare:1408
        # echo 123 > buffer_size_kb     // current:123,  spare:1408
        # echo 1 > snapshot             // current:1408, spare:123
        # grep bytes per_cpu/cpu0/stats
        bytes: 1443700
        # cat buffer_size_kb
        123                             // != current:1408
      
      And also, a similar per-cpu case hits the following WARNING:
      
      Reproducer:
      
        # echo 1 > per_cpu/cpu0/snapshot
        # echo 123 > buffer_size_kb
        # echo 1 > per_cpu/cpu0/snapshot
      
      WARNING:
      
        WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
        Modules linked in:
        CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
        Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
        RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
        RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
        RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
        RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
        R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
        R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
        FS:  00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
        Call Trace:
         ? trace_array_printk_buf+0x140/0x140
         ? __mutex_lock_slowpath+0x10/0x10
         tracing_snapshot_write+0x4c8/0x7f0
         ? trace_printk_init_buffers+0x60/0x60
         ? selinux_file_permission+0x3b/0x540
         ? tracer_preempt_off+0x38/0x506
         ? trace_printk_init_buffers+0x60/0x60
         __vfs_write+0x81/0x100
         vfs_write+0x1e1/0x560
         ksys_write+0x126/0x250
         ? __ia32_sys_read+0xb0/0xb0
         ? do_syscall_64+0x1f/0x390
         do_syscall_64+0xc1/0x390
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This patch adds resize_buffer_duplicate_size() to check if there is a
      difference between current/spare buffer sizes and resize a spare buffer
      if necessary.
      
      Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com
      
      Cc: stable@vger.kernel.org
      Fixes: ad909e21 ("tracing: Add internal tracing_snapshot() functions")
      Signed-off-by: default avatarEiichi Tsukata <devel@etsukata.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      90b89546
    • Jann Horn's avatar
      ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME · bf71ef96
      Jann Horn authored
      commit 6994eefb0053799d2e07cd140df6c2ea106c41ee upstream.
      
      Fix two issues:
      
      When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU
      reference to the parent's objective credentials, then give that pointer
      to get_cred().  However, the object lifetime rules for things like
      struct cred do not permit unconditionally turning an RCU reference into
      a stable reference.
      
      PTRACE_TRACEME records the parent's credentials as if the parent was
      acting as the subject, but that's not the case.  If a malicious
      unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
      at a later point, the parent process becomes attacker-controlled
      (because it drops privileges and calls execve()), the attacker ends up
      with control over two processes with a privileged ptrace relationship,
      which can be abused to ptrace a suid binary and obtain root privileges.
      
      Fix both of these by always recording the credentials of the process
      that is requesting the creation of the ptrace relationship:
      current_cred() can't change under us, and current is the proper subject
      for access control.
      
      This change is theoretically userspace-visible, but I am not aware of
      any code that it will actually break.
      
      Fixes: 64b875f7 ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf71ef96
    • Wei Li's avatar
      ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper() · f6880316
      Wei Li authored
      [ Upstream commit 04e03d9a616c19a47178eaca835358610e63a1dd ]
      
      The mapper may be NULL when called from register_ftrace_function_probe()
      with probe->data == NULL.
      
      This issue can be reproduced as follow (it may be covered by compiler
      optimization sometime):
      
      / # cat /sys/kernel/debug/tracing/set_ftrace_filter
      #### all functions enabled ####
      / # echo foo_bar:dump > /sys/kernel/debug/tracing/set_ftrace_filter
      [  206.949100] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  206.952402] Mem abort info:
      [  206.952819]   ESR = 0x96000006
      [  206.955326]   Exception class = DABT (current EL), IL = 32 bits
      [  206.955844]   SET = 0, FnV = 0
      [  206.956272]   EA = 0, S1PTW = 0
      [  206.956652] Data abort info:
      [  206.957320]   ISV = 0, ISS = 0x00000006
      [  206.959271]   CM = 0, WnR = 0
      [  206.959938] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000419f3a000
      [  206.960483] [0000000000000000] pgd=0000000411a87003, pud=0000000411a83003, pmd=0000000000000000
      [  206.964953] Internal error: Oops: 96000006 [#1] SMP
      [  206.971122] Dumping ftrace buffer:
      [  206.973677]    (ftrace buffer empty)
      [  206.975258] Modules linked in:
      [  206.976631] Process sh (pid: 281, stack limit = 0x(____ptrval____))
      [  206.978449] CPU: 10 PID: 281 Comm: sh Not tainted 5.2.0-rc1+ #17
      [  206.978955] Hardware name: linux,dummy-virt (DT)
      [  206.979883] pstate: 60000005 (nZCv daif -PAN -UAO)
      [  206.980499] pc : free_ftrace_func_mapper+0x2c/0x118
      [  206.980874] lr : ftrace_count_free+0x68/0x80
      [  206.982539] sp : ffff0000182f3ab0
      [  206.983102] x29: ffff0000182f3ab0 x28: ffff8003d0ec1700
      [  206.983632] x27: ffff000013054b40 x26: 0000000000000001
      [  206.984000] x25: ffff00001385f000 x24: 0000000000000000
      [  206.984394] x23: ffff000013453000 x22: ffff000013054000
      [  206.984775] x21: 0000000000000000 x20: ffff00001385fe28
      [  206.986575] x19: ffff000013872c30 x18: 0000000000000000
      [  206.987111] x17: 0000000000000000 x16: 0000000000000000
      [  206.987491] x15: ffffffffffffffb0 x14: 0000000000000000
      [  206.987850] x13: 000000000017430e x12: 0000000000000580
      [  206.988251] x11: 0000000000000000 x10: cccccccccccccccc
      [  206.988740] x9 : 0000000000000000 x8 : ffff000013917550
      [  206.990198] x7 : ffff000012fac2e8 x6 : ffff000012fac000
      [  206.991008] x5 : ffff0000103da588 x4 : 0000000000000001
      [  206.991395] x3 : 0000000000000001 x2 : ffff000013872a28
      [  206.991771] x1 : 0000000000000000 x0 : 0000000000000000
      [  206.992557] Call trace:
      [  206.993101]  free_ftrace_func_mapper+0x2c/0x118
      [  206.994827]  ftrace_count_free+0x68/0x80
      [  206.995238]  release_probe+0xfc/0x1d0
      [  206.995555]  register_ftrace_function_probe+0x4a8/0x868
      [  206.995923]  ftrace_trace_probe_callback.isra.4+0xb8/0x180
      [  206.996330]  ftrace_dump_callback+0x50/0x70
      [  206.996663]  ftrace_regex_write.isra.29+0x290/0x3a8
      [  206.997157]  ftrace_filter_write+0x44/0x60
      [  206.998971]  __vfs_write+0x64/0xf0
      [  206.999285]  vfs_write+0x14c/0x2f0
      [  206.999591]  ksys_write+0xbc/0x1b0
      [  206.999888]  __arm64_sys_write+0x3c/0x58
      [  207.000246]  el0_svc_common.constprop.0+0x408/0x5f0
      [  207.000607]  el0_svc_handler+0x144/0x1c8
      [  207.000916]  el0_svc+0x8/0xc
      [  207.003699] Code: aa0003f8 a9025bf5 aa0103f5 f946ea80 (f9400303)
      [  207.008388] ---[ end trace 7b6d11b5f542bdf1 ]---
      [  207.010126] Kernel panic - not syncing: Fatal exception
      [  207.011322] SMP: stopping secondary CPUs
      [  207.013956] Dumping ftrace buffer:
      [  207.014595]    (ftrace buffer empty)
      [  207.015632] Kernel Offset: disabled
      [  207.017187] CPU features: 0x002,20006008
      [  207.017985] Memory Limit: none
      [  207.019825] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190606031754.10798-1-liwei391@huawei.comSigned-off-by: default avatarWei Li <liwei391@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      f6880316
    • Josh Poimboeuf's avatar
      module: Fix livepatch/ftrace module text permissions race · 2d1a9468
      Josh Poimboeuf authored
      [ Upstream commit 9f255b632bf12c4dd7fc31caee89aa991ef75176 ]
      
      It's possible for livepatch and ftrace to be toggling a module's text
      permissions at the same time, resulting in the following panic:
      
        BUG: unable to handle page fault for address: ffffffffc005b1d9
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
        Oops: 0003 [#1] PREEMPT SMP PTI
        CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339ca5 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
        RIP: 0010:apply_relocate_add+0xbe/0x14c
        Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
        RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
        RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
        RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
        RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
        R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
        R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
        FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         klp_init_object_loaded+0x10f/0x219
         ? preempt_latency_start+0x21/0x57
         klp_enable_patch+0x662/0x809
         ? virt_to_head_page+0x3a/0x3c
         ? kfree+0x8c/0x126
         patch_init+0x2ed/0x1000 [livepatch_test02]
         ? 0xffffffffc0060000
         do_one_initcall+0x9f/0x1c5
         ? kmem_cache_alloc_trace+0xc4/0xd4
         ? do_init_module+0x27/0x210
         do_init_module+0x5f/0x210
         load_module+0x1c41/0x2290
         ? fsnotify_path+0x3b/0x42
         ? strstarts+0x2b/0x2b
         ? kernel_read+0x58/0x65
         __do_sys_finit_module+0x9f/0xc3
         ? __do_sys_finit_module+0x9f/0xc3
         __x64_sys_finit_module+0x1a/0x1c
         do_syscall_64+0x52/0x61
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above panic occurs when loading two modules at the same time with
      ftrace enabled, where at least one of the modules is a livepatch module:
      
      CPU0					CPU1
      klp_enable_patch()
        klp_init_object_loaded()
          module_disable_ro()
          					ftrace_module_enable()
      					  ftrace_arch_code_modify_post_process()
      				    	    set_all_modules_text_ro()
            klp_write_object_relocations()
              apply_relocate_add()
      	  *patches read-only code* - BOOM
      
      A similar race exists when toggling ftrace while loading a livepatch
      module.
      
      Fix it by ensuring that the livepatch and ftrace code patching
      operations -- and their respective permissions changes -- are protected
      by the text_mutex.
      
      Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.comReported-by: default avatarJohannes Erdfelt <johannes@erdfelt.com>
      Fixes: 444d13ff ("modules: add ro_after_init support")
      Acked-by: default avatarJessica Yu <jeyu@kernel.org>
      Reviewed-by: default avatarPetr Mladek <pmladek@suse.com>
      Reviewed-by: default avatarMiroslav Benes <mbenes@suse.cz>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      2d1a9468
    • Joel Savitz's avatar
      cpuset: restore sanity to cpuset_cpus_allowed_fallback() · 675a1a49
      Joel Savitz authored
      [ Upstream commit d477f8c202d1f0d4791ab1263ca7657bbe5cf79e ]
      
      In the case that a process is constrained by taskset(1) (i.e.
      sched_setaffinity(2)) to a subset of available cpus, and all of those are
      subsequently offlined, the scheduler will set tsk->cpus_allowed to
      the current value of task_cs(tsk)->effective_cpus.
      
      This is done via a call to do_set_cpus_allowed() in the context of
      cpuset_cpus_allowed_fallback() made by the scheduler when this case is
      detected. This is the only call made to cpuset_cpus_allowed_fallback()
      in the latest mainline kernel.
      
      However, this is not sane behavior.
      
      I will demonstrate this on a system running the latest upstream kernel
      with the following initial configuration:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,fffffff
      	Cpus_allowed_list:	0-63
      
      (Where cpus 32-63 are provided via smt.)
      
      If we limit our current shell process to cpu2 only and then offline it
      and reonline it:
      
      	# taskset -p 4 $$
      	pid 2272's current affinity mask: ffffffffffffffff
      	pid 2272's new affinity mask: 4
      
      	# echo off > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -3
      	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
      	[ 2195.872700] IRQ 114: no longer affine to CPU2
      	[ 2195.879128] smpboot: CPU 2 is now offline
      
      	# echo on > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -1
      	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
      
      We see that our current process now has an affinity mask containing
      every cpu available on the system _except_ the one we originally
      constrained it to:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:   ffffffff,fffffffb
      	Cpus_allowed_list:      0-1,3-63
      
      This is not sane behavior, as the scheduler can now not only place the
      process on previously forbidden cpus, it can't even schedule it on
      the cpu it was originally constrained to!
      
      Other cases result in even more exotic affinity masks. Take for instance
      a process with an affinity mask containing only cpus provided by smt at
      the moment that smt is toggled, in a configuration such as the following:
      
      	# taskset -p f000000000 $$
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	000000f0,00000000
      	Cpus_allowed_list:	36-39
      
      A double toggle of smt results in the following behavior:
      
      	# echo off > /sys/devices/system/cpu/smt/control
      	# echo on > /sys/devices/system/cpu/smt/control
      	# grep -i cpus /proc/$$/status
      	Cpus_allowed:	ffffff00,ffffffff
      	Cpus_allowed_list:	0-31,40-63
      
      This is even less sane than the previous case, as the new affinity mask
      excludes all smt-provided cpus with ids less than those that were
      previously in the affinity mask, as well as those that were actually in
      the mask.
      
      With this patch applied, both of these cases end in the following state:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      The original policy is discarded. Though not ideal, it is the simplest way
      to restore sanity to this fallback case without reinventing the cpuset
      wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
      for the previous affinity mask to be restored in this fallback case can use
      that mechanism instead.
      
      This patch modifies scheduler behavior by instead resetting the mask to
      task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
      mode. I tested the cases above on both modes.
      
      Note that the scheduler uses this fallback mechanism if and only if
      _every_ other valid avenue has been traveled, and it is the last resort
      before calling BUG().
      Suggested-by: default avatarWaiman Long <longman@redhat.com>
      Suggested-by: default avatarPhil Auld <pauld@redhat.com>
      Signed-off-by: default avatarJoel Savitz <jsavitz@redhat.com>
      Acked-by: default avatarPhil Auld <pauld@redhat.com>
      Acked-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      675a1a49
  3. 03 Jul, 2019 2 commits
  4. 25 Jun, 2019 1 commit
    • Miguel Ojeda's avatar
      tracing: Silence GCC 9 array bounds warning · 50bbae7d
      Miguel Ojeda authored
      commit 0c97bf863efce63d6ab7971dad811601e6171d2f upstream.
      
      Starting with GCC 9, -Warray-bounds detects cases when memset is called
      starting on a member of a struct but the size to be cleared ends up
      writing over further members.
      
      Such a call happens in the trace code to clear, at once, all members
      after and including `seq` on struct trace_iterator:
      
          In function 'memset',
              inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
          ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
          [8505, 8560] from the object at 'iter' is out of the bounds of
          referenced subobject 'seq' with type 'struct trace_seq' at offset
          4368 [-Warray-bounds]
            344 |  return __builtin_memset(p, c, size);
                |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      In order to avoid GCC complaining about it, we compute the address
      ourselves by adding the offsetof distance instead of referring
      directly to the member.
      
      Since there are two places doing this clear (trace.c and trace_kdb.c),
      take the chance to move the workaround into a single place in
      the internal header.
      
      Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.comSigned-off-by: default avatarMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [ Removed unnecessary parenthesis around "iter" ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      50bbae7d
  5. 22 Jun, 2019 3 commits
    • Peter Zijlstra's avatar
      perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data · a9a83106
      Peter Zijlstra authored
      [ Upstream commit 4d839dd9e4356bbacf3eb0ab13a549b83b008c21 ]
      
      We must use {READ,WRITE}_ONCE() on rb->user_page data such that
      concurrent usage will see whole values. A few key sites were missing
      this.
      Suggested-by: default avatarYabin Cui <yabinc@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: 7b732a75 ("perf_counter: new output ABI - part 1")
      Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      a9a83106
    • Peter Zijlstra's avatar
      perf/ring_buffer: Add ordering to rb->nest increment · 2bc92a4f
      Peter Zijlstra authored
      [ Upstream commit 3f9fbe9bd86c534eba2faf5d840fd44c6049f50e ]
      
      Similar to how decrementing rb->next too early can cause data_head to
      (temporarily) be observed to go backward, so too can this happen when
      we increment too late.
      
      This barrier() ensures the rb->head load happens after the increment,
      both the one in the 'goto again' path, as the one from
      perf_output_get_handle() -- albeit very unlikely to matter for the
      latter.
      Suggested-by: default avatarYabin Cui <yabinc@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
      Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      2bc92a4f
    • Yabin Cui's avatar
      perf/ring_buffer: Fix exposing a temporarily decreased data_head · 9e2de43b
      Yabin Cui authored
      [ Upstream commit 1b038c6e05ff70a1e66e3e571c2e6106bdb75f53 ]
      
      In perf_output_put_handle(), an IRQ/NMI can happen in below location and
      write records to the same ring buffer:
      
      	...
      	local_dec_and_test(&rb->nest)
      	...                          <-- an IRQ/NMI can happen here
      	rb->user_page->data_head = head;
      	...
      
      In this case, a value A is written to data_head in the IRQ, then a value
      B is written to data_head after the IRQ. And A > B. As a result,
      data_head is temporarily decreased from A to B. And a reader may see
      data_head < data_tail if it read the buffer frequently enough, which
      creates unexpected behaviors.
      
      This can be fixed by moving dec(&rb->nest) to after updating data_head,
      which prevents the IRQ/NMI above from updating data_head.
      
      [ Split up by peterz. ]
      Signed-off-by: default avatarYabin Cui <yabinc@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mark.rutland@arm.com
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
      Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      9e2de43b
  6. 19 Jun, 2019 3 commits
    • Peter Zijlstra's avatar
      x86/uaccess, kcov: Disable stack protector · 8f6f2ee1
      Peter Zijlstra authored
      [ Upstream commit 40ea97290b08be2e038b31cbb33097d1145e8169 ]
      
      New tooling noticed this mishap:
      
        kernel/kcov.o: warning: objtool: write_comp_data()+0x138: call to __stack_chk_fail() with UACCESS enabled
        kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc()+0xd9: call to __stack_chk_fail() with UACCESS enabled
      
      All the other instrumentation (KASAN,UBSAN) also have stack protector
      disabled.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8f6f2ee1
    • Jann Horn's avatar
      ptrace: restore smp_rmb() in __ptrace_may_access() · 7013ea8d
      Jann Horn authored
      commit f6581f5b55141a95657ef5742cf6a6bfa20a109f upstream.
      
      Restore the read memory barrier in __ptrace_may_access() that was deleted
      a couple years ago. Also add comments on this barrier and the one it pairs
      with to explain why they're there (as far as I understand).
      
      Fixes: bfedb589 ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks")
      Cc: stable@vger.kernel.org
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7013ea8d
    • Eric W. Biederman's avatar
      signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO · 50f806a5
      Eric W. Biederman authored
      [ Upstream commit f6e2aa91a46d2bc79fce9b93a988dbe7655c90c0 ]
      
      Recently syzbot in conjunction with KMSAN reported that
      ptrace_peek_siginfo can copy an uninitialized siginfo to userspace.
      Inspecting ptrace_peek_siginfo confirms this.
      
      The problem is that off when initialized from args.off can be
      initialized to a negaive value.  At which point the "if (off >= 0)"
      test to see if off became negative fails because off started off
      negative.
      
      Prevent the core problem by adding a variable found that is only true
      if a siginfo is found and copied to a temporary in preparation for
      being copied to userspace.
      
      Prevent args.off from being truncated when being assigned to off by
      testing that off is <= the maximum possible value of off.  Convert off
      to an unsigned long so that we should not have to truncate args.off,
      we have well defined overflow behavior so if we add another check we
      won't risk fighting undefined compiler behavior, and so that we have a
      type whose maximum value is easy to test for.
      
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+0d602a1b0d8c95bdf299@syzkaller.appspotmail.com
      Fixes: 84c751bd ("ptrace: add ability to retrieve signals without removing from a queue (v4)")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      50f806a5
  7. 15 Jun, 2019 3 commits
  8. 11 Jun, 2019 1 commit
    • Jiri Kosina's avatar
      x86/power: Fix 'nosmt' vs hibernation triple fault during resume · b9284404
      Jiri Kosina authored
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command-line,
      all the HT siblings are as a result sitting in mwait or cpudile after
      going through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
      Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Acked-by: default avatarPavel Machek <pavel@ucw.cz>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9284404
  9. 09 Jun, 2019 1 commit
  10. 31 May, 2019 9 commits
    • Paul E. McKenney's avatar
      rcuperf: Fix cleanup path for invalid perf_type strings · 824343ed
      Paul E. McKenney authored
      [ Upstream commit ad092c027713a68a34168942a5ef422e42e039f4 ]
      
      If the specified rcuperf.perf_type is not in the rcu_perf_init()
      function's perf_ops[] array, rcuperf prints some console messages and
      then invokes rcu_perf_cleanup() to set state so that a future torture
      test can run.  However, rcu_perf_cleanup() also attempts to end the
      test that didn't actually start, and in doing so relies on the value
      of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case and
      inserts a check near the beginning of rcu_perf_cleanup(), thus avoiding
      relying on an irrelevant cur_ops value.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      824343ed
    • Paul E. McKenney's avatar
      rcutorture: Fix cleanup path for invalid torture_type strings · 014be4da
      Paul E. McKenney authored
      [ Upstream commit b813afae7ab6a5e91b4e16cc567331d9c2ae1f04 ]
      
      If the specified rcutorture.torture_type is not in the rcu_torture_init()
      function's torture_ops[] array, rcutorture prints some console messages
      and then invokes rcu_torture_cleanup() to set state so that a future
      torture test can run.  However, rcu_torture_cleanup() also attempts to
      end the test that didn't actually start, and in doing so relies on the
      value of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case
      and inserts a check near the beginning of rcu_torture_cleanup(),
      thus avoiding relying on an irrelevant cur_ops value.
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      014be4da
    • Peter Zijlstra's avatar
      x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP · 8190d6fb
      Peter Zijlstra authored
      [ Upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9 ]
      
      For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely things get
      overloaded and generate callouts to this code, and thus also when
      AC=1.
      
      Make it safe.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8190d6fb
    • Konstantin Khlebnikov's avatar
      sched/core: Handle overflow in cpu_shares_write_u64 · e58b6d6a
      Konstantin Khlebnikov authored
      [ Upstream commit 5b61d50ab4ef590f5e1d4df15cd2cea5f5715308 ]
      
      Bit shift in scale_load() could overflow shares. This patch saturates
      it to MAX_SHARES like following sched_group_set_shares().
      
      Example:
      
       # echo 9223372036854776832 > cpu.shares
       # cat cpu.shares
      
      Before patch: 1024
      After pattch: 262144
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzzSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      e58b6d6a
    • Konstantin Khlebnikov's avatar
      sched/rt: Check integer overflow at usec to nsec conversion · 3c06c178
      Konstantin Khlebnikov authored
      [ Upstream commit 1a010e29cfa00fee2888fd2fd4983f848cbafb58 ]
      
      Example of unhandled overflows:
      
       # echo 18446744073709651 > cpu.rt_runtime_us
       # cat cpu.rt_runtime_us
       99
      
       # echo 18446744073709900 > cpu.rt_period_us
       # cat cpu.rt_period_us
       348
      
      After this patch they will fail with -EINVAL.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzzSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      3c06c178
    • Konstantin Khlebnikov's avatar
      sched/core: Check quota and period overflow at usec to nsec conversion · a5e3cfc8
      Konstantin Khlebnikov authored
      [ Upstream commit 1a8b4540db732ca16c9e43ac7c08b1b8f0b252d8 ]
      
      Large values could overflow u64 and pass following sanity checks.
      
       # echo 18446744073750000 > cpu.cfs_period_us
       # cat cpu.cfs_period_us
       40448
      
       # echo 18446744073750000 > cpu.cfs_quota_us
       # cat cpu.cfs_quota_us
       40448
      
      After this patch they will fail with -EINVAL.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzzSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      a5e3cfc8
    • Roman Gushchin's avatar
      cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · d17cd67a
      Roman Gushchin authored
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked, and for changing both locks should be acquired.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      d17cd67a
    • Wenwen Wang's avatar
      audit: fix a memory leak bug · b5785152
      Wenwen Wang authored
      [ Upstream commit 70c4cf17e445264453bc5323db3e50aa0ac9e81f ]
      
      In audit_rule_change(), audit_data_to_entry() is firstly invoked to
      translate the payload data to the kernel's rule representation. In
      audit_data_to_entry(), depending on the audit field type, an audit tree may
      be created in audit_make_tree(), which eventually invokes kmalloc() to
      allocate the tree.  Since this tree is a temporary tree, it will be then
      freed in the following execution, e.g., audit_add_rule() if the message
      type is AUDIT_ADD_RULE or audit_del_rule() if the message type is
      AUDIT_DEL_RULE. However, if the message type is neither AUDIT_ADD_RULE nor
      AUDIT_DEL_RULE, i.e., the default case of the switch statement, this
      temporary tree is not freed.
      
      To fix this issue, only allocate the tree when the type is AUDIT_ADD_RULE
      or AUDIT_DEL_RULE.
      Signed-off-by: default avatarWenwen Wang <wang6495@umn.edu>
      Reviewed-by: default avatarRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: default avatarPaul Moore <paul@paul-moore.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      b5785152
    • Eric Dumazet's avatar
      bpf: devmap: fix use-after-free Read in __dev_map_entry_free · ddfe0bfd
      Eric Dumazet authored
      commit 2baae3545327632167c0180e9ca1d467416f1919 upstream.
      
      synchronize_rcu() is fine when the rcu callbacks only need
      to free memory (kfree_rcu() or direct kfree() call rcu call backs)
      
      __dev_map_entry_free() is a bit more complex, so we need to make
      sure that call queued __dev_map_entry_free() callbacks have completed.
      
      sysbot report:
      
      BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
      [inline]
      BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
      kernel/bpf/devmap.c:379
      Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18
      
      CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x1b9/0x294 lib/dump_stack.c:113
        print_address_description+0x6c/0x20b mm/kasan/report.c:256
        kasan_report_error mm/kasan/report.c:354 [inline]
        kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
        __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
        dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
        __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
        __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
        rcu_do_batch kernel/rcu/tree.c:2558 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
        __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
        rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
        __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
        run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
        smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      Allocated by task 6675:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
        kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
        kmalloc include/linux/slab.h:513 [inline]
        kzalloc include/linux/slab.h:706 [inline]
        dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
        find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
        map_create+0x393/0x1010 kernel/bpf/syscall.c:453
        __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
        __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
        __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
        do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 26:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
        kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
        __cache_free mm/slab.c:3498 [inline]
        kfree+0xd9/0x260 mm/slab.c:3813
        dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
        bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
        process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
        worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      The buggy address belongs to the object at ffff8801b8da37c0
        which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 264 bytes inside of
        512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
      The buggy address belongs to the page:
      page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940
      index:0xffff8801b8da3540
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
      raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
        ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      > ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                     ^
        ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      
      Fixes: 546ac1ff ("bpf: add devmap, a map for storing net device references")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ddfe0bfd
  11. 25 May, 2019 4 commits
    • Daniel Borkmann's avatar
      bpf, lru: avoid messing with eviction heuristics upon syscall lookup · c7af97a3
      Daniel Borkmann authored
      commit 50b045a8c0ccf44f76640ac3eea8d80ca53979a3 upstream.
      
      One of the biggest issues we face right now with picking LRU map over
      regular hash table is that a map walk out of user space, for example,
      to just dump the existing entries or to remove certain ones, will
      completely mess up LRU eviction heuristics and wrong entries such
      as just created ones will get evicted instead. The reason for this
      is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
      system call lookup side as well. Thus upon walk, all entries are
      being marked, so information of actual least recently used ones
      are "lost".
      
      In case of Cilium where it can be used (besides others) as a BPF
      based connection tracker, this current behavior causes disruption
      upon control plane changes that need to walk the map from user space
      to evict certain entries. Discussion result from bpfconf [0] was that
      we should simply just remove marking from system call side as no
      good use case could be found where it's actually needed there.
      Therefore this patch removes marking for regular LRU and per-CPU
      flavor. If there ever should be a need in future, the behavior could
      be selected via map creation flag, but due to mentioned reason we
      avoid this here.
      
        [0] http://vger.kernel.org/bpfconf.html
      
      Fixes: 29ba732a ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
      Fixes: 8f844938 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c7af97a3
    • Daniel Borkmann's avatar
      bpf: add map_lookup_elem_sys_only for lookups from syscall side · 9f229230
      Daniel Borkmann authored
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations
      could use over map_lookup_elem() from system call side in case the
      map implementation needs to handle the latter differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, this will
      be preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for LRU map, but once development window
      opens, we can convert other map types from map_lookup_elem() (here,
      the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
      the callback to simplify and clean up the latter.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      9f229230
    • Tobin C. Harding's avatar
      sched/cpufreq: Fix kobject memleak · abda2930
      Tobin C. Harding authored
      [ Upstream commit 9a4f26cc98d81b67ecc23b890c28e2df324e29f3 ]
      
      Currently the error return path from kobject_init_and_add() is not
      followed by a call to kobject_put() - which means we are leaking
      the kobject.
      
      Fix it by adding a call to kobject_put() in the error path of
      kobject_init_and_add().
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      abda2930
    • Elazar Leibovich's avatar
      tracing: Fix partial reading of trace event's id file · cfa82959
      Elazar Leibovich authored
      commit cbe08bcbbe787315c425dde284dcb715cfbf3f39 upstream.
      
      When reading only part of the id file, the ppos isn't tracked correctly.
      This is taken care by simple_read_from_buffer.
      
      Reading a single byte, and then the next byte would result EOF.
      
      While this seems like not a big deal, this breaks abstractions that
      reads information from files unbuffered. See for example
      https://github.com/golang/go/issues/29399
      
      This code was mentioned as problematic in
      commit cd458ba9
      ("tracing: Do not (ab)use trace_seq in event_id_read()")
      
      An example C code that show this bug is:
      
        #include <stdio.h>
        #include <stdint.h>
      
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>
      
        int main(int argc, char **argv) {
          if (argc < 2)
            return 1;
          int fd = open(argv[1], O_RDONLY);
          char c;
          read(fd, &c, 1);
          printf("First  %c\n", c);
          read(fd, &c, 1);
          printf("Second %c\n", c);
        }
      
      Then run with, e.g.
      
        sudo ./a.out /sys/kernel/debug/tracing/events/tcp/tcp_set_state/id
      
      You'll notice you're getting the first character twice, instead of the
      first two characters in the id file.
      
      Link: http://lkml.kernel.org/r/20181231115837.4932-1-elazar@lightbitslabs.com
      
      Cc: Orit Wasserman <orit.was@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 23725aee ("ftrace: provide an id file for each event")
      Signed-off-by: default avatarElazar Leibovich <elazar@lightbitslabs.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cfa82959
  12. 21 May, 2019 2 commits
    • Andrea Arcangeli's avatar
      userfaultfd: use RCU to free the task struct when fork fails · 851d1a7c
      Andrea Arcangeli authored
      commit c3f3ce049f7d97cc7ec9c01cb51d9ec74e0f37c2 upstream.
      
      The task structure is freed while get_mem_cgroup_from_mm() holds
      rcu_read_lock() and dereferences mm->owner.
      
        get_mem_cgroup_from_mm()                failing fork()
        ----                                    ---
        task = mm->owner
                                                mm->owner = NULL;
                                                free(task)
        if (task) *task; /* use after free */
      
      The fix consists in freeing the task with RCU also in the fork failure
      case, exactly like it always happens for the regular exit(2) path.  That
      is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
      (left side above) effective to avoid a use after free when dereferencing
      the task structure.
      
      An alternate possible fix would be to defer the delivery of the
      userfaultfd contexts to the monitor until after fork() is guaranteed to
      succeed.  Such a change would require more changes because it would
      create a strict ordering dependency where the uffd methods would need to
      be called beyond the last potentially failing branch in order to be
      safe.  This solution as opposed only adds the dependency to common code
      to set mm->owner to NULL and to free the task struct that was pointed by
      mm->owner with RCU, if fork ends up failing.  The userfaultfd methods
      can still be called anywhere during the fork runtime and the monitor
      will keep discarding orphaned "mm" coming from failed forks in userland.
      
      This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
      time.
      
      [aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
        Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
      Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
      Fixes: 893e26e6 ("userfaultfd: non-cooperative: Add fork() event")
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: default avatarzhong jiang <zhongjiang@huawei.com>
      Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      851d1a7c
    • Waiman Long's avatar
      locking/rwsem: Prevent decrement of reader count before increment · 8a5e7aef
      Waiman Long authored
      [ Upstream commit a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 ]
      
      During my rwsem testing, it was found that after a down_read(), the
      reader count may occasionally become 0 or even negative. Consequently,
      a writer may steal the lock at that time and execute with the reader
      in parallel thus breaking the mutual exclusion guarantee of the write
      lock. In other words, both readers and writer can become rwsem owners
      simultaneously.
      
      The current reader wakeup code does it in one pass to clear waiter->task
      and put them into wake_q before fully incrementing the reader count.
      Once waiter->task is cleared, the corresponding reader may see it,
      finish the critical section and do unlock to decrement the count before
      the count is incremented. This is not a problem if there is only one
      reader to wake up as the count has been pre-incremented by 1.  It is
      a problem if there are more than one readers to be woken up and writer
      can steal the lock.
      
      The wakeup was actually done in 2 passes before the following v4.9 commit:
      
        70800c3c ("locking/rwsem: Scan the wait_list for readers only once")
      
      To fix this problem, the wakeup is now done in two passes
      again. In the first pass, we collect the readers and count them.
      The reader count is then fully incremented. In the second pass, the
      waiter->task is then cleared and they are put into wake_q to be woken
      up later.
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Fixes: 70800c3c ("locking/rwsem: Scan the wait_list for readers only once")
      Link: http://lkml.kernel.org/r/20190428212557.13482-2-longman@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8a5e7aef
  13. 16 May, 2019 1 commit
    • Steven Rostedt (VMware)'s avatar
      tracing/fgraph: Fix set_graph_function from showing interrupts · 02cf6220
      Steven Rostedt (VMware) authored
      [ Upstream commit 5cf99a0f3161bc3ae2391269d134d6bf7e26f00e ]
      
      The tracefs file set_graph_function is used to only function graph functions
      that are listed in that file (or all functions if the file is empty). The
      way this is implemented is that the function graph tracer looks at every
      function, and if the current depth is zero and the function matches
      something in the file then it will trace that function. When other functions
      are called, the depth will be greater than zero (because the original
      function will be at depth zero), and all functions will be traced where the
      depth is greater than zero.
      
      The issue is that when a function is first entered, and the handler that
      checks this logic is called, the depth is set to zero. If an interrupt comes
      in and a function in the interrupt handler is traced, its depth will be
      greater than zero and it will automatically be traced, even if the original
      function was not. But because the logic only looks at depth it may trace
      interrupts when it should not be.
      
      The recent design change of the function graph tracer to fix other bugs
      caused the depth to be zero while the function graph callback handler is
      being called for a longer time, widening the race of this happening. This
      bug was actually there for a longer time, but because the race window was so
      small it seldom happened. The Fixes tag below is for the commit that widen
      the race window, because that commit belongs to a series that will also help
      fix the original bug.
      
      Cc: stable@kernel.org
      Fixes: 39eb456dacb5 ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack")
      Reported-by: default avatarJoe Lawrence <joe.lawrence@redhat.com>
      Tested-by: default avatarJoe Lawrence <joe.lawrence@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      02cf6220
  14. 14 May, 2019 1 commit
    • Josh Poimboeuf's avatar
      cpu/speculation: Add 'mitigations=' cmdline option · ed1dfe83
      Josh Poimboeuf authored
      commit 98af8452945c55652de68536afdde3b520fec429 upstream
      
      Keeping track of the number of mitigations for all the CPU speculation
      bugs has become overwhelming for many users.  It's getting more and more
      complicated to decide which mitigations are needed for a given
      architecture.  Complicating matters is the fact that each arch tends to
      have its own custom way to mitigate the same vulnerability.
      
      Most users fall into a few basic categories:
      
      a) they want all mitigations off;
      
      b) they want all reasonable mitigations on, with SMT enabled even if
         it's vulnerable; or
      
      c) they want all reasonable mitigations on, with SMT disabled if
         vulnerable.
      
      Define a set of curated, arch-independent options, each of which is an
      aggregation of existing options:
      
      - mitigations=off: Disable all mitigations.
      
      - mitigations=auto: [default] Enable all the default mitigations, but
        leave SMT enabled, even if it's vulnerable.
      
      - mitigations=auto,nosmt: Enable all the default mitigations, disabling
        SMT if needed by a mitigation.
      
      Currently, these options are placeholders which don't actually do
      anything.  They will be fleshed out in upcoming patches.
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
      Reviewed-by: default avatarJiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-arch@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed1dfe83
  15. 10 May, 2019 1 commit
    • Will Deacon's avatar
      locking/futex: Allow low-level atomic operations to return -EAGAIN · d2045151
      Will Deacon authored
      commit 6b4f4bc9cb22875f97023984a625386f0c7cc1c0 upstream.
      
      Some futex() operations, including FUTEX_WAKE_OP, require the kernel to
      perform an atomic read-modify-write of the futex word via the userspace
      mapping. These operations are implemented by each architecture in
      arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
      are called in atomic context with the relevant hash bucket locks held.
      
      Although these routines may return -EFAULT in response to a page fault
      generated when accessing userspace, they are expected to succeed (i.e.
      return 0) in all other cases. This poses a problem for architectures
      that do not provide bounded forward progress guarantees or fairness of
      contended atomic operations and can lead to starvation in some cases.
      
      In these problematic scenarios, we must return back to the core futex
      code so that we can drop the hash bucket locks and reschedule if
      necessary, much like we do in the case of a page fault.
      
      Allow architectures to return -EAGAIN from their implementations of
      arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
      will cause the core futex code to reschedule if necessary and return
      back to the architecture code later on.
      
      Cc: <stable@kernel.org>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2045151