1. 17 Dec, 2018 1 commit
  2. 08 Dec, 2018 5 commits
  3. 01 Dec, 2018 3 commits
  4. 23 Nov, 2018 1 commit
  5. 13 Nov, 2018 8 commits
  6. 10 Nov, 2018 5 commits
    • posix-timers: Sanitize overrun handling · 65cb24de
      Thomas Gleixner authored
      [ Upstream commit 78c9c4dfbf8c04883941445a195276bb4bb92c76 ]
      
      The posix timer overrun handling is broken because the forwarding functions
      can return a huge number of overruns which does not fit in an int. As a
      consequence timer_getoverrun(2) and siginfo::si_overrun can turn into
      random number generators.
      
      The k_clock::timer_forward() callbacks return a 64 bit value now. Make
      k_itimer::ti_overrun[_last] 64 bit as well, so the kernel internal
      accounting is correct. Remove the temporary (int) casts.
      
      Add a helper function which clamps the overrun value returned to user space
      via timer_getoverrun(2) or siginfo::si_overrun to a positive value between
      0 and INT_MAX. INT_MAX is an indicator for user space that the overrun
      value has been clamped.
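      
      A minimal sketch of such a clamping helper (name and exact signature are
      assumptions for illustration, not the verbatim upstream code):
      
        /* Sketch: clamp the 64-bit internal overrun count to what
         * timer_getoverrun(2) and siginfo::si_overrun can report. */
        static int timer_overrun_to_int(s64 overrun)
        {
                if (overrun < 0)
                        return 0;
                if (overrun > INT_MAX)
                        return INT_MAX; /* signals "clamped" to user space */
                return (int)overrun;
        }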
      Reported-by: Team OWL337 <icytxw@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Link: https://lkml.kernel.org/r/20180626132705.018623573@linutronix.de
      [florian: Make patch apply to v4.9.135]
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      65cb24de
    • sched/fair: Fix throttle_list starvation with low CFS quota · bc1fccc7
      Phil Auld authored
      commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.
      
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve.  Essentially, the same X processes will get pulled
      off the list, given CPU time and then, when expired, get put back on the
      head of the list, where distribute_cfs_runtime will give runtime to the same
      set of processes, starving the rest.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
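      
      A sketch of the resulting enqueue decision in throttle_cfs_rq() (flag
      name per the upstream patch; simplified here):
      
        /* If distribute_cfs_runtime() is currently walking the list, add
         * at the head so the running distribution does not see this entry
         * again; otherwise add at the tail so that already-queued
         * runqueues are served first. */
        if (cfs_b->distribute_running)
                list_add_rcu(&cfs_rq->throttled_list,
                             &cfs_b->throttled_cfs_rq);
        else
                list_add_tail_rcu(&cfs_rq->throttled_list,
                                  &cfs_b->throttled_cfs_rq);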
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc1fccc7
    • futex: futex_wake_op, do not fail on invalid op · da92e8a2
      Jiri Slaby authored
      [ Upstream commit e78c38f6 ]
      
      In commit 30d6e0a4 ("futex: Remove duplicated code and fix undefined
      behaviour"), I made FUTEX_WAKE_OP fail on an invalid op, namely when the
      op is to be treated as a shift and the shift is out of range (< 0 or > 31).
      
      But strace's test suite does this madness:
      
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);
      
      Picking the first one, 0xa0caffee decodes as:
      
        0x80000000 & 0xa0caffee: oparg is shift
        0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
        0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
        0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
        0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18
      
      That means the op tries to do this:
      
        (futex |= (1 << (-849))) == -18
      
      which is completely bogus. The new check of op in the code is:
      
              if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                      if (oparg < 0 || oparg > 31)
                              return -EINVAL;
                      oparg = 1 << oparg;
              }
      
      which obviously results in the "Invalid argument" errno:
      
        FAIL: futex
        ===========
      
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
        futex.test: failed test: ../futex failed with code 1
      
      So let us soften the failure: print only a (ratelimited) message, crop
      the value, and continue as if it were right.  Once user space catches up,
      we can switch this to return -EINVAL again.
      
      [v2] Do not return 0 immediately, proceed with the cropped value.
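      
      The softened check then looks roughly like this (close to the upstream
      patch, lightly simplified):
      
        if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                if (oparg < 0 || oparg > 31) {
                        char comm[sizeof(current->comm)];
                        /* warn (ratelimited) instead of failing, then crop */
                        pr_info_ratelimited("futex_wake_op: %s tries to shift op by %d; fix this program\n",
                                            get_task_comm(comm, current), oparg);
                        oparg &= 31;
                }
                oparg = 1 << oparg;
        }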
      
      Fixes: 30d6e0a4 ("futex: Remove duplicated code and fix undefined behaviour")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      da92e8a2
    • perf/core: Fix locking for children siblings group read · 6279fc56
      Jiri Olsa authored
      [ Upstream commit 2aeb1883 ]
      
      We're missing the ctx lock when iterating children siblings
      within the perf_read path for group reading. The following
      race and crash can happen:
      
      User space doing read syscall on event group leader:
      
      T1:
        perf_read
          lock event->ctx->mutex
          perf_read_group
            lock leader->child_mutex
            __perf_read_group_add(child)
              list_for_each_entry(sub, &leader->sibling_list, group_entry)
      
      ---->   sub might be invalid at this point, because it could
              get removed via perf_event_exit_task_context in T2
      
      Child exiting and cleaning up its events:
      
      T2:
        perf_event_exit_task_context
          lock ctx->mutex
          list_for_each_entry_safe(child_event, next, &child_ctx->event_list,...
            perf_event_exit_event(child)
              lock ctx->lock
              perf_group_detach(child)
              unlock ctx->lock
      
      ---->   child is removed from sibling_list without any sync
              with T1 path above
      
              ...
              free_event(child)
      
      Before the child is removed from the leader's child_list
      (and thus omitted from perf_read_group processing), we
      need to ensure that perf_read_group touches the child's
      siblings under its ctx->lock.
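      
      A simplified sketch of the fix in __perf_read_group_add() (only the
      locked sibling walk is shown; the surrounding function is elided):
      
        unsigned long flags;
      
        /* hold ctx->lock so a concurrent perf_group_detach() in the
         * exiting child (T2) cannot free a sibling mid-iteration */
        raw_spin_lock_irqsave(&ctx->lock, flags);
        list_for_each_entry(sub, &leader->sibling_list, group_entry)
                values[n++] += perf_event_count(sub);
        raw_spin_unlock_irqrestore(&ctx->lock, flags);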
      
      Peter further notes:
      
      | One additional note; this bug got exposed by commit:
      |
      |   ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      |
      | which made it possible to actually trigger this code-path.
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ba5213ae ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
      Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6279fc56
    • perf/ring_buffer: Prevent concurrent ring buffer access · 4d083666
      Jiri Olsa authored
      [ Upstream commit cd6fb677ce7e460c25bdd66f689734102ec7d642 ]
      
      Some of the scheduling tracepoints allow the perf_tp_event
      code to write to a ring buffer on a different CPU than the
      one the code is running on.
      
      This results in corrupted ring buffer data, demonstrated by
      the following perf commands:
      
        # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
        # Running 'sched/messaging' benchmark:
        # 20 sender and receiver processes per group
        # 10 groups == 400 processes run
      
             Total time: 0.383 [sec]
        [ perf record: Woken up 8 times to write data ]
        0x42b890 [0]: failed to process type: -1765585640
        [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]
      
        # perf report --stdio
        0x42b890 [0]: failed to process type: -1765585640
      
      The reason for the corruption is that some of the scheduling tracepoints
      have __perf_task defined and thus allow storing data in another
      CPU's ring buffer:
      
        sched_waking
        sched_wakeup
        sched_wakeup_new
        sched_stat_wait
        sched_stat_sleep
        sched_stat_iowait
        sched_stat_blocked
      
      The perf_tp_event() function first stores samples for the current CPU's
      events defined for the tracepoint:
      
          hlist_for_each_entry_rcu(event, head, hlist_entry)
            perf_swevent_event(event, count, &data, regs);
      
      It then iterates the 'task' events and stores the sample
      for any of the task's events that pass the tracepoint checks:
      
        ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
      
        list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
          if (event->attr.type != PERF_TYPE_TRACEPOINT)
            continue;
          if (event->attr.config != entry->type)
            continue;
      
          perf_swevent_event(event, count, &data, regs);
        }
      
      The above code can race with the same code running on another CPU,
      ending up with two CPUs trying to store into the same ring
      buffer, which is specifically not allowed.
      
      This patch prevents the problem by allowing only events on the
      current CPU to receive the event.
      
      NOTE: this requires the use of (per-task-)per-cpu buffers for this
      feature to work; perf-record does this.
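      
      A sketch of the added guard in the task-event iteration (per the
      description above; simplified):
      
        list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
                /* only deliver to events bound to this CPU, so two
                 * CPUs never store into the same ring buffer */
                if (event->cpu != smp_processor_id())
                        continue;
                if (event->attr.type != PERF_TYPE_TRACEPOINT)
                        continue;
                if (event->attr.config != entry->type)
                        continue;
      
                perf_swevent_event(event, count, &data, regs);
        }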
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      [peterz: small edits to Changelog]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: e6dab5ff ("perf/trace: Add ability to set a target task for events")
      Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4d083666
  7. 20 Oct, 2018 3 commits
  8. 13 Oct, 2018 1 commit
    • cgroup: Fix deadlock in cpu hotplug path · 35b80e75
      Prateek Sood authored
      commit 116d2f74 upstream.
      
      A deadlock can occur during cgroup migration from the cpu hotplug path
      when a task T is being moved from the source to the destination cgroup.
      
      kworker/0:0
      cpuset_hotplug_workfn()
         cpuset_hotplug_update_tasks()
            hotplug_update_tasks_legacy()
              remove_tasks_in_empty_cpuset()
                cgroup_transfer_tasks() // stuck in iterator loop
                  cgroup_migrate()
                    cgroup_migrate_add_task()
      
      cgroup_migrate_add_task() checks the PF_EXITING flag of task T, so
      task T will not migrate to the destination cgroup. The css_task_iter
      loop will keep pointing to task T, waiting for task T's cg_list node
      to be removed.
      
      Task T
      do_exit()
        exit_signals() // sets PF_EXITING
        exit_task_namespaces()
          switch_task_namespaces()
            free_nsproxy()
              put_mnt_ns()
                drop_collected_mounts()
                  namespace_unlock()
                    synchronize_rcu()
                      _synchronize_rcu_expedited()
                        schedule_work() // on cpu0 low priority worker pool
                        wait_event() // waiting for work item to execute
      
      Task T inserted a work item into the worklist of the cpu0 low priority
      worker pool. It is waiting for the expedited grace period work item
      to execute, but this work item will only be executed once kworker/0:0
      completes execution of cpuset_hotplug_workfn().
      
      kworker/0:0 ==> Task T ==> kworker/0:0
      
      When the task being migrated from the source to the destination cgroup
      has PF_EXITING set, migrate the next available task in the source
      cgroup instead, as sketched below.
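      
      A sketch of the fix in cgroup_transfer_tasks() (simplified; the
      iterator simply skips exiting tasks instead of spinning on them):
      
        css_task_iter_start(&from->self, &it);
      
        /* skip tasks that are already exiting; they will never be
         * migrated and would otherwise stall the iterator forever */
        do {
                task = css_task_iter_next(&it);
        } while (task && (task->flags & PF_EXITING));
      
        if (task)
                get_task_struct(task);
        css_task_iter_end(&it);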
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      [AmitP: Upstream commit cherry-pick failed, so I picked the
              backported changes from CAF/msm-4.9 tree instead:
              https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?id=49b74f1696417b270c89cd893ca9f37088928078]
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      35b80e75
  9. 10 Oct, 2018 1 commit
  10. 04 Oct, 2018 2 commits
  11. 29 Sep, 2018 2 commits
  12. 26 Sep, 2018 2 commits
  13. 19 Sep, 2018 4 commits
    • timers: Clear timer_base::must_forward_clk with timer_base::lock held · 05cb385f
      Gaurav Kohli authored
      [ Upstream commit 363e934d8811d799c88faffc5bfca782fd728334 ]
      
      timer_base::must_forward_clk indicates that the base clock might be
      stale due to a long idle sleep.
      
      The forwarding of the base clock takes place in the timer softirq or when a
      timer is enqueued to a base which is idle. If the enqueue of timer to an
      idle base happens from a remote CPU, then the following race can happen:
      
        CPU0					CPU1
        run_timer_softirq			mod_timer
      
      					base = lock_timer_base(timer);
        base->must_forward_clk = false
      					if (base->must_forward_clk)
      				       	    forward(base); -> skipped
      
      					enqueue_timer(base, timer, idx);
      					-> idx is calculated high due to
      					   stale base
      					unlock_timer_base(timer);
        base = lock_timer_base(timer);
        forward(base);
      
      The root cause is that timer_base::must_forward_clk is cleared outside the
      timer_base::lock held region, so the remote queuing CPU observes it as
      cleared, but the base clock is still stale. This can cause large
      granularity values for timers, i.e. the accuracy of the expiry time
      suffers.
      
      Prevent this by clearing the flag with timer_base::lock held, so that the
      forwarding takes place before the cleared flag is observable by a remote
      CPU.
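      
      A sketch of the fix: the clearing moves from run_timer_softirq()
      (outside the lock) into __run_timers(), after the base lock is taken
      (simplified; lock primitives vary by kernel version):
      
        spin_lock_irq(&base->lock);
      
        /* Clear the flag only while base->lock is held, so a remote CPU
         * enqueueing a timer can never observe the flag cleared while
         * the base clock is still stale. */
        base->must_forward_clk = false;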
      Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: john.stultz@linaro.org
      Cc: sboyd@kernel.org
      Cc: linux-arm-msm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      05cb385f
    • locking/osq_lock: Fix osq_lock queue corruption · 2009f179
      Prateek Sood authored
      commit 50972fe7 upstream.
      
      Fix the ordering of link creation between node->prev and prev->next in
      osq_lock(). Consider a case in which the status of the optimistic spin
      queue is CPU6->CPU2, where CPU6 has acquired the lock.
      
              tail
                v
        ,-. <- ,-.
        |6|    |2|
        `-' -> `-'
      
      At this point if CPU0 comes in to acquire osq_lock, it will update the
      tail count.
      
        CPU2			CPU0
        ----------------------------------
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-' -> `-'    `-'
      
      After the tail count update, if CPU2 starts to unqueue itself from the
      optimistic spin queue, it will find an updated tail count with CPU0 and
      update CPU2's node->next to NULL in osq_wait_next().
      
        unqueue-A
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-B
      
        ->tail != curr && !node->next
      
      If reordering of the following stores happens, then prev->next (prev
      being CPU2) would be updated to point to the CPU0 node:
      
      				       tail
      				         v
      			  ,-. <- ,-.    ,-.
      			  |6|    |2|    |0|
      			  `-'    `-' -> `-'
      
        osq_wait_next()
          node->next <- 0
          xchg(node->next, NULL)
      
      	       tail
      	         v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
      
        unqueue-C
      
      At this point, if the next instruction
      	WRITE_ONCE(next->prev, prev);
      in the CPU2 path is committed before the CPU0 update node->prev = prev,
      then CPU0's node->prev will point to the CPU6 node.
      
      	       tail
          v----------. v
        ,-. <- ,-.    ,-.
        |6|    |2|    |0|
        `-'    `-'    `-'
           `----------^
      
      At this point, the CPU0 path's node->prev = prev is committed, changing
      CPU0's prev back to the CPU2 node, while CPU2's node->next is currently
      NULL:
      
      				       tail
      			                 v
      			  ,-. <- ,-. <- ,-.
      			  |6|    |2|    |0|
      			  `-'    `-'    `-'
      			     `----------^
      
      so if CPU0 gets into the unqueue path of osq_lock() it will keep spinning
      in an infinite loop, as the condition prev->next == node will never be true.
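      
      A sketch of the fix in osq_lock(): publish node->prev before prev->next,
      with a write barrier between the two stores (simplified from the
      upstream patch):
      
        prev = decode_cpu(old);
        node->prev = prev;
      
        /* order the two link stores: a CPU that observes prev->next
         * pointing to us must also observe our node->prev */
        smp_wmb();
      
        WRITE_ONCE(prev->next, node);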
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      [ Added pictures, rewrote comments. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2009f179
    • locking/rwsem-xadd: Fix missed wakeup due to reordering of load · 0cbde6c5
      Prateek Sood authored
      commit 9c29c318 upstream.
      
      If a spinner is present, there is a chance that the load of
      rwsem_has_spinner() in rwsem_wake() can be reordered with
      respect to the decrement of the rwsem count in __up_write(),
      leading to the wakeup being missed:
      
       spinning writer                  up_write caller
       ---------------                  -----------------------
       [S] osq_unlock()                 [L] osq
        spin_lock(wait_lock)
        sem->count=0xFFFFFFFF00000001
                  +0xFFFFFFFF00000000
        count=sem->count
        MB
                                         sem->count=0xFFFFFFFE00000001
                                                   -0xFFFFFFFF00000001
                                         spin_trylock(wait_lock)
                                         return
       rwsem_try_write_lock(count)
       spin_unlock(wait_lock)
       schedule()
      
      Reordering of atomic_long_sub_return_release() in __up_write()
      and rwsem_has_spinner() in rwsem_wake() can cause the wakeup to be
      missed in the up_write() context. In the spinning writer, sem->count
      and the local variable count are both 0xFFFFFFFE00000001, which
      results in rwsem_try_write_lock() failing to acquire the rwsem and
      the spinning writer going to sleep in rwsem_down_write_failed().
      
      Adding an smp_rmb() makes sure that the spinner state is consulted
      after sem->count is updated in the up_write() context, as sketched below.
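      
      A sketch of the fix in rwsem_wake() (simplified; the real code still
      attempts the wakeup under wait_lock when no spinner is present):
      
        /* make sure the spinner state is read only after the count
         * update done by __up_write()/__up_read() is visible */
        smp_rmb();
      
        /* a present spinner will observe the count change and take
         * care of the wakeup itself */
        if (rwsem_has_spinner(sem))
                return sem;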
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: longman@redhat.com
      Cc: parri.andrea@gmail.com
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0cbde6c5
    • kthread: Fix use-after-free if kthread fork fails · a70e46bc
      Vegard Nossum authored
      commit 4d6501dc upstream.
      
      If a kthread forks (e.g. usermodehelper since commit 1da5c46f) but
      fails in copy_process() between calling dup_task_struct() and setting
      p->set_child_tid, then the value of p->set_child_tid will be inherited
      from the parent and get prematurely freed by free_kthread_struct().
      
          kthread()
           - worker_thread()
              - process_one_work()
              |  - call_usermodehelper_exec_work()
              |     - kernel_thread()
              |        - _do_fork()
              |           - copy_process()
              |              - dup_task_struct()
              |                 - arch_dup_task_struct()
              |                    - tsk->set_child_tid = current->set_child_tid // implied
              |              - ...
              |              - goto bad_fork_*
              |              - ...
              |              - free_task(tsk)
              |                 - free_kthread_struct(tsk)
              |                    - kfree(tsk->set_child_tid)
              - ...
              - schedule()
                 - __schedule()
                    - wq_worker_sleeping()
                       - kthread_data(task)->flags // UAF
      
      The problem started showing up with commit 1da5c46f since it reused
      ->set_child_tid for the kthread worker data.
      
      A better long-term solution might be to get rid of the ->set_child_tid
      abuse. The comment in set_kthread_struct() also looks slightly wrong.
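      
      A sketch of the fix in copy_process() (per the upstream change;
      simplified):
      
        p = dup_task_struct(current, node);
        if (!p)
                goto fork_out;
      
        /* This must happen before any bad_fork_* error path can call
         * free_task() -> free_kthread_struct(), which would otherwise
         * kfree() the set_child_tid value inherited from the parent
         * kthread. */
        p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ?
                           child_tidptr : NULL;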
      Debugged-by: Jamie Iles <jamie.iles@oracle.com>
      Fixes: 1da5c46f ("kthread: Make struct kthread kmalloc'ed")
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a70e46bc
  14. 15 Sep, 2018 1 commit
  15. 09 Sep, 2018 1 commit
    • printk/tracing: Do not trace printk_nmi_enter() · 6c6d1748
      Steven Rostedt (VMware) authored
      commit d1c392c9e2a301f38998a353f467f76414e38725 upstream.
      
      I hit the following splat in my tests:
      
      ------------[ cut here ]------------
      IRQs not enabled as expected
      WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c
      Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      EIP: tick_nohz_idle_enter+0x44/0x8c
      Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00
      75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0
      e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b
      EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007
      ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38
      DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296
      CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0
      Call Trace:
       do_idle+0x33/0x202
       cpu_startup_entry+0x61/0x63
       start_secondary+0x18e/0x1ed
       startup_32_smp+0x164/0x168
      irq event stamp: 18773830
      hardirqs last  enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10
      hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10
      softirqs last  enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf
      softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b
      ---[ end trace b7c64aa79e17954a ]---
      
      After a bit of debugging, I found what was happening. This would trigger
      when performing "perf" with a high NMI interrupt rate, while enabling and
      disabling function tracer. Ftrace uses breakpoints to convert the nops at
      the start of functions to calls to the function trampolines. The breakpoint
      traps disable interrupts and this makes calls into lockdep via the
      trace_hardirqs_off_thunk in the entry.S code. What happens is the following:
      
        do_idle {
      
          [interrupts enabled]
      
          <interrupt> [interrupts disabled]
      	TRACE_IRQS_OFF [lockdep says irqs off]
      	[...]
      	TRACE_IRQS_IRET
      	    test if pt_regs say return to interrupts enabled [yes]
      	    TRACE_IRQS_ON [lockdep says irqs are on]
      
      	    <nmi>
      		nmi_enter() {
      		    printk_nmi_enter() [traced by ftrace]
      		    [ hit ftrace breakpoint ]
      		    <breakpoint exception>
      			TRACE_IRQS_OFF [lockdep says irqs off]
      			[...]
      			TRACE_IRQS_IRET [return from breakpoint]
      			   test if pt_regs say interrupts enabled [no]
      			   [iret back to interrupt]
      	   [iret back to code]
      
          tick_nohz_idle_enter() {
      
      	lockdep_assert_irqs_enabled() [lockdep say no!]
      
      Although interrupts are indeed enabled, lockdep thinks they are not, and
      since we now do asserts via lockdep, it gives a false warning. The issue
      here is that printk_nmi_enter() is called before lockdep_off(), which
      disables lockdep (for this reason) in NMIs. By simply not allowing ftrace
      to see printk_nmi_enter() (via the notrace annotation) we keep lockdep
      from getting confused.
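      
      A sketch of the change (the function body varies across kernel
      versions; the fix is only the notrace annotation):
      
        void notrace printk_nmi_enter(void)
        {
                /* body unchanged by the fix; shown here as in recent
                 * kernels */
                this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
        }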
      
      Cc: stable@vger.kernel.org
      Fixes: 42a0bb3f ("printk/nmi: generic solution for safe printk in NMI")
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6c6d1748