1. 09 Dec, 2017 1 commit
  2. 30 Nov, 2017 4 commits
    • genirq: Track whether the trigger type has been set · c01dd3ad
      Marc Zyngier authored
      commit 4f8413a3 upstream.
      When requesting a shared interrupt, we assume that the firmware
      support code (DT or ACPI) has called irqd_set_trigger_type
      already, so that we can retrieve it and check that the requester
      is being reasonable.
      Unfortunately, we still have non-DT, non-ACPI systems around,
      and these guys won't call irqd_set_trigger_type before requesting
      the interrupt. The consequence is that we fail the request that
      would have worked before.
      We can either chase all these use cases (boring), or address it
      in core code (easier). Let's have a per-irq_desc flag that
      indicates whether irqd_set_trigger_type has been called, and
      let's just check it when checking for a shared interrupt.
      If it hasn't been set, just take whatever the interrupt
      requester asks.
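      The mechanism described above can be sketched in userspace C (all
      identifiers here are illustrative stand-ins, not the actual genirq
      names): a per-descriptor flag records whether a trigger type was ever
      set, and the sharing check only compares types once that flag is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: the flag is set by the firmware support code
 * (DT/ACPI); if it was never set, the check adopts the requester's type. */
struct irq_desc_model {
    unsigned int trigger_type;
    bool trigger_type_set;
};

static bool shared_irq_type_ok(struct irq_desc_model *desc,
                               unsigned int requested)
{
    if (!desc->trigger_type_set) {
        desc->trigger_type = requested;   /* take the requester's type */
        desc->trigger_type_set = true;
        return true;
    }
    return desc->trigger_type == requested;   /* must match for sharing */
}
```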
      Fixes: 382bd4de ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs")
      Reported-and-tested-by: Petr Cvek <petrcvekcz@gmail.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/rt: Simplify the IPI based RT balancing logic · f17c786b
      Steven Rostedt (Red Hat) authored
      commit 4bdced5c upstream.
      When a CPU lowers its priority (schedules out a high priority task for a
      lower priority one), a check is made to see if any other CPU has overloaded
      RT tasks (more than one). It checks the rto_mask to determine this and if so
      it will request to pull one of those tasks to itself if the non running RT
      task is of higher priority than the new priority of the next task to run on
      the current CPU.
      When we deal with a large number of CPUs, the original pull logic suffered
      from large lock contention on a single CPU run queue, which caused a huge
      latency across all CPUs. This was caused by only having one CPU having
      overloaded RT tasks and a bunch of other CPUs lowering their priority. To
      solve this issue, commit:
        b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
      changed the way to request a pull. Instead of grabbing the lock of the
      overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
      Although the IPI logic worked very well in removing the large latency build
      up, it still could suffer from a large number of IPIs being sent to a single
      CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
      when I tested this on a 120 CPU box, with a stress test that had lots of
      RT tasks scheduling on all CPUs, it actually triggered the hard lockup
      detector! One CPU had so many IPIs sent to it, and due to the restart
      mechanism that is triggered when the source run queue has a priority status
      change, the CPU spent minutes! processing the IPIs.
      Thinking about this further, I realized there's no reason for each run queue
      to send its own IPI. Since all CPUs with overloaded tasks must be scanned
      regardless of whether one or many CPUs are lowering their priority, and
      there is currently no way to find the CPU with the highest priority task
      that can schedule to one of these CPUs, only one IPI needs to be sent
      around at a time.
      This greatly simplifies the code!
      The new approach is to have each root domain have its own irq work, as the
      rto_mask is per root domain. The root domain has the following fields
      attached to it:
        rto_push_work	 - the irq work to process each CPU set in rto_mask
        rto_lock	 - the lock to protect some of the other rto fields
        rto_loop_start - an atomic that keeps contention down on rto_lock;
      		    the first CPU scheduling in a lower priority task
      		    is the one to kick off the process.
        rto_loop_next	 - an atomic that gets incremented for each CPU that
      		    schedules in a lower priority task.
        rto_loop	 - a variable protected by rto_lock that is used to
      		    compare against rto_loop_next
        rto_cpu	 - The cpu to send the next IPI to, also protected by
      		    the rto_lock.
      When a CPU schedules in a lower priority task and wants to make sure
      overloaded CPUs know about it, it increments rto_loop_next. Then it
      atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
      then it is done, as another CPU is kicking off the IPI loop. If the old
      value is "0", then it will take the rto_lock to synchronize with a possible
      IPI being sent around to the overloaded CPUs.
      If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
      IPI being sent around, or one is about to finish. Then rto_cpu is set to the
      first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set
      in rto_mask, then there's nothing to be done.
      When the CPU receives the IPI, it will first try to push any RT tasks that are
      queued on the CPU but can't run because a higher priority RT task is
      currently running on that CPU.
      Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
      finds one, it simply sends an IPI to that CPU and the process continues.
      If there are no more CPUs in the rto_mask, then rto_loop is compared with
      rto_loop_next. If they match, everything is done and the process is over. If
      they do not match, then a CPU scheduled in a lower priority task as the IPI
      was being passed around, and the process needs to start again. The first CPU
      in rto_mask is sent the IPI.
      This change removes this duplication of work in the IPI logic, and greatly
      lowers the latency caused by the IPIs. This removed the lockup happening on
      the 120 CPU machine. It also simplifies the code tremendously. What else
      could anyone ask for?
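      The rto_loop_start gate described above can be sketched in userspace C.
      This is a simplified, single-threaded model based only on the commit
      message (the struct and the IPI stand-in are illustrative, not the
      scheduler's real code): only the first caller to flip the atomic from
      0 to 1 kicks off the IPI loop; everyone else just bumps rto_loop_next.

```c
#include <assert.h>
#include <stdatomic.h>

struct rd_model {
    atomic_int rto_loop_start;   /* gate: is an IPI loop in flight? */
    atomic_int rto_loop_next;    /* bumped for every priority drop */
    int ipis_started;            /* stand-in for actually sending an IPI */
};

static int rto_start_trylock(atomic_int *v)
{
    int zero = 0;
    /* only the first CPU to flip 0 -> 1 kicks off the IPI loop */
    return atomic_compare_exchange_strong(v, &zero, 1);
}

static void rto_start_unlock(atomic_int *v)
{
    atomic_store(v, 0);
}

static void tell_cpu_to_push(struct rd_model *rd)
{
    atomic_fetch_add(&rd->rto_loop_next, 1);   /* note a new priority drop */
    if (!rto_start_trylock(&rd->rto_loop_start))
        return;                                /* a loop is already running */
    rd->ipis_started++;                        /* kick off the (modelled) IPI */
}
```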
      Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
      supplying me with the rto_start_trylock() and rto_start_unlock() helper
      functions.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched: Make resched_cpu() unconditional · 9f088f6a
      Paul E. McKenney authored
      commit 7c2102e5 upstream.
      The current implementation of synchronize_sched_expedited() incorrectly
      assumes that resched_cpu() is unconditional, which it is not.  This means
      that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
      fails as follows (analysis by Neeraj Upadhyay):
      o	CPU1 is waiting for the expedited wait to complete:
      	     rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
      	     IPI sent to CPU5
      	     ret = swait_event_timeout(rsp->expedited_wq, ...)
      	expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())
      o	CPU5 handles the IPI but fails to acquire the rq lock:
      	     resched_cpu()'s trylock on rq->lock fails, so it returns
      	     without setting need_resched
      o	CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes
      	idle (schedule() is not called).
      o	CPU 1 reports an RCU stall.
      Given that resched_cpu() is now used only by RCU, this commit fixes the
      assumption by making resched_cpu() unconditional.
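      The difference can be illustrated with a small userspace model (the
      struct and function names are made up; the real code takes the
      runqueue lock): the old resched_cpu() used a trylock and silently
      dropped the request on contention, while the fixed version acquires
      the lock unconditionally so the resched request can never be lost.

```c
#include <assert.h>
#include <stdbool.h>

struct rq_model {
    bool contended;       /* stand-in for rq->lock held by someone else */
    bool need_resched;
};

/* Old behaviour: trylock failure loses the request. */
static bool resched_cpu_old(struct rq_model *rq)
{
    if (rq->contended)
        return false;             /* trylock failed: request dropped */
    rq->need_resched = true;
    return true;
}

/* Fixed behaviour: unconditional lock, modelled as always proceeding
 * once the lock is (eventually) acquired. */
static bool resched_cpu_fixed(struct rq_model *rq)
{
    rq->need_resched = true;
    return true;
}
```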
      Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq · b7997494
      Viresh Kumar authored
      commit 07458f6a upstream.
      'cached_raw_freq' is used to get the next frequency quickly but should
      always be in sync with sg_policy->next_freq. There is a case where it is
      not, and in such cases it should be reset to avoid switching to an incorrect
      frequency. Consider this case for example:
       - policy->cur is 1.2 GHz (Max)
       - New request comes for 780 MHz and we store that in cached_raw_freq.
       - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
       - We then see the CPU wasn't idle recently and choose to keep the next
         freq as 1.2 GHz.
       - Now cached_raw_freq is 780 MHz and sg_policy->next_freq is
         1.2 GHz.
       - Now if the utilization doesn't change in the next request, then the
         next target frequency will still be 780 MHz and it will match
         cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
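      A minimal userspace sketch of the fix (names and the 1.2 GHz constant
      are illustrative, and 0 is used here as the "cache invalid" value;
      this is not the real schedutil code): whenever the busy-CPU shortcut
      forces next_freq away from the freshly computed value, cached_raw_freq
      is invalidated so the next identical request is recomputed instead of
      wrongly hitting the fast path.

```c
#include <assert.h>

struct sg_policy_model {
    unsigned int cached_raw_freq;   /* 0 means "invalid" in this model */
    unsigned int next_freq;
};

static unsigned int pick_freq(struct sg_policy_model *sg,
                              unsigned int raw, int busy)
{
    /* fast path: same raw request as last time, reuse the result */
    if (raw == sg->cached_raw_freq && sg->cached_raw_freq)
        return sg->next_freq;
    sg->cached_raw_freq = raw;
    if (busy && raw < 1200000) {    /* busy CPU: keep current max freq */
        sg->cached_raw_freq = 0;    /* the fix: invalidate the cache */
        sg->next_freq = 1200000;
        return sg->next_freq;
    }
    sg->next_freq = raw;
    return sg->next_freq;
}
```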
      Fixes: b7eaf1aa (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 24 Nov, 2017 1 commit
  4. 04 Nov, 2017 1 commit
  5. 02 Nov, 2017 2 commits
    • futex: futex_wake_op, do not fail on invalid op · e78c38f6
      Jiri Slaby authored
      In commit 30d6e0a4 ("futex: Remove duplicated code and fix undefined
      behaviour"), I let FUTEX_WAKE_OP fail on an invalid op, namely when the op
      should be considered a shift and the shift is out of range (< 0 or > 31).
      But strace's test suite does this madness:
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);
      When I pick the first 0xa0caffee, it decodes as:
        0x80000000 & 0xa0caffee: oparg is shift
        0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
        0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
        0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
        0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18
      That means the op tries to do this:
        (futex |= (1 << (-849))) == -18
      which is completely bogus. The new check of op in the code is:
              if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                      if (oparg < 0 || oparg > 31)
                              return -EINVAL;
                      oparg = 1 << oparg;
              }
      which obviously results in the "Invalid argument" errno:
        FAIL: futex
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
        futex.test: failed test: ../futex failed with code 1
      So let us soften the failure to print only a (ratelimited) message, crop
      the value and continue as if it were right.  When userspace keeps up, we
      can switch this to return -EINVAL again.
      [v2] Do not return 0 immediately; proceed with the cropped value.
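      The decode quoted above can be reproduced with a small userspace
      sketch. The bit masks are the ones given in the commit message; the
      struct and function names are made up for illustration, and
      crop_shift() models the softened behaviour (crop instead of -EINVAL):

```c
#include <assert.h>
#include <stdint.h>

struct futex_op {
    int shift_flag;   /* 0x80000000: oparg is a shift count */
    int op;           /* 0x70000000 */
    int cmp;          /* 0x0f000000 */
    int oparg;        /* 0x00fff000, sign-extended from 12 bits */
    int cmparg;       /* 0x00000fff, sign-extended from 12 bits */
};

static int sign_extend12(int v)
{
    return (v ^ 0x800) - 0x800;
}

static struct futex_op decode_op(uint32_t encoded_op)
{
    struct futex_op d;
    d.shift_flag = (encoded_op >> 31) & 1;
    d.op = (encoded_op >> 28) & 7;
    d.cmp = (encoded_op >> 24) & 0xf;
    d.oparg = sign_extend12((encoded_op >> 12) & 0xfff);
    d.cmparg = sign_extend12(encoded_op & 0xfff);
    return d;
}

/* The softened failure mode: crop the shift into 0..31 and carry on. */
static int crop_shift(int oparg)
{
    return oparg & 31;
}
```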
      Fixes: 30d6e0a4 ("futex: Remove duplicated code and fix undefined behaviour")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side-by-side results from the output
      of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      should be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      The criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines of source).
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when neither scanner could find any license traces, the file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with the SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 01 Nov, 2017 5 commits
    • signal: Fix name of SIGEMT in #if defined() check · c3aff086
      Andrew Clayton authored
      Commit cc731525 ("signal: Remove kernel interal si_code magic")
      added a check for SIGMET and NSIGEMT being defined. That SIGMET should
      in fact be SIGEMT, with SIGEMT being defined only on some architectures.
      This was actually pointed out by Ben Hutchings in an lwn.net comment:
      https://lwn.net/Comments/734608/
      Fixes: cc731525 ("signal: Remove kernel interal si_code magic")
      Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • watchdog/hardlockup/perf: Use atomics to track in-use cpu counter · 42f930da
      Don Zickus authored
      Guenter reported:
        There is still a problem. When running 
          echo 6 > /proc/sys/kernel/watchdog_thresh
          echo 5 > /proc/sys/kernel/watchdog_thresh
        repeatedly, the message
         NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
        stops after a while (after ~10-30 iterations, with fluctuations).
        Maybe watchdog_cpus needs to be atomic ?
      That's correct as this again is affected by the asynchronous nature of the
      smpboot thread unpark mechanism.
      CPU 0				CPU1			CPU2
      write(watchdog_thresh, 6)	
      write(watchdog_thresh, 5)				thread->unpark()
          park()			thread->park()
      				   cnt--;		  cnt++;
      That's not a functional problem, it just affects the informational message.
      Convert watchdog_cpus to atomic_t to prevent the problem.
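      A userspace sketch of the converted counter (function names are
      illustrative, not the kernel's): with atomic increments and
      decrements, concurrent park()/unpark() cannot lose updates, so the
      "first consumer" message fires reliably exactly on the 0 -> 1
      transition.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int watchdog_cpus;   /* number of CPUs using the hw counter */

/* Returns nonzero when this is the first consumer, i.e. when the
 * informational message should be printed. */
static int watchdog_enable_cpu(void)
{
    return atomic_fetch_add(&watchdog_cpus, 1) == 0;
}

static void watchdog_disable_cpu(void)
{
    atomic_fetch_sub(&watchdog_cpus, 1);
}
```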
      Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20171101181126.j727fqjmdthjz4xk@redhat.com
    • watchdog/hardlockup/perf: Revert a33d4484 ("watchdog/hardlockup/perf: Simplify deferred event destroy") · 9c388a5e
      Thomas Gleixner authored
      Guenter reported a crash in the watchdog/perf code, which is caused by
      cleanup() and enable() running concurrently. The reason for this is:
      The watchdog functions are serialized via the watchdog_mutex and cpu
      hotplug locking, but the enable of the perf based watchdog happens in
      context of the unpark callback of the smpboot thread. But that unpark
      function is not synchronous inside the locking. The unparking of the thread
      just wakes it up and leaves, so there is no guarantee when the thread is
      actually running.
      If it starts running _before_ the cleanup happened then it will create an
      event and overwrite the dead event pointer. The new event is then cleaned
      up because the event is marked dead.
      	cpus_read_unlock();		thread runs()
      					  overwrite dead event ptr
      	cleanup();
      	  free new event, which is active inside perf....
      The park side is safe as that actually waits for the thread to reach
      parked state.
      Commit a33d4484 removed the protection against this kind of scenario
      under the stupid assumption that the hotplug serialization and the
      watchdog_mutex cover everything. 
      Bring it back.
      Reverts: a33d4484 ("watchdog/hardlockup/perf: Simplify deferred event destroy")
      Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Thomas Feels-stupid Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710312145190.1942@nanos
    • futex: Fix more put_pi_state() vs. exit_pi_state_list() races · 153fbd12
      Peter Zijlstra authored
      Dmitry (through syzbot) reported being able to trigger the WARN in
      get_pi_state() and a use-after-free on the pi_state.
      Both are due to this race:
        exit_pi_state_list()				put_pi_state()
        while() {
      	pi_state = list_first_entry(head);
      	hb = hash_futex(&pi_state->key);
      	lock(&pi_state->pi_mutex.wait_lock)	// uaf if pi_state free'd
      	get_pi_state();				// WARN; refcount==0
      The problem is that we take the reference count too late, and don't allow it
      to be 0. Fix it by using inc_not_zero() and simply retrying the loop
      when we fail to get a refcount. In that case put_pi_state() should
      remove the entry from the list.
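      The inc_not_zero pattern can be sketched in userspace C (simplified,
      single-threaded model; the function name mirrors the kernel idiom but
      the body here is illustrative): a reference is only taken while the
      count is still nonzero, so an object already being freed is skipped
      rather than resurrected.

```c
#include <assert.h>
#include <stdatomic.h>

/* Returns 1 and bumps the count if it was nonzero; returns 0 if the
 * object's refcount already hit zero (i.e. it is being freed). */
static int refcount_inc_not_zero(atomic_int *refs)
{
    int old = atomic_load(refs);
    while (old != 0) {
        /* on failure, 'old' is reloaded and the loop retries */
        if (atomic_compare_exchange_weak(refs, &old, old + 1))
            return 1;
    }
    return 0;   /* do not resurrect a dying object */
}
```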
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Gratian Crisan <gratian.crisan@ni.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: dvhart@infradead.org
      Cc: syzbot <bot+2af19c9e1ffe4d4ee1d16c56ae7580feaee75765@syzkaller.appspotmail.com>
      Cc: syzkaller-bugs@googlegroups.com
      Cc: <stable@vger.kernel.org>
      Fixes: c74aef2d ("futex: Fix pi_state->owner serialization")
      Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: remove SK_REDIRECT from UAPI · 04686ef2
      John Fastabend authored
      Now that SK_REDIRECT is no longer a valid return code, remove it
      from the UAPI completely. Then do a namespace remapping internal
      to sockmap so SK_REDIRECT is no longer externally visible.
      The patch's primary change is a rename of SK_REDIRECT to an
      internal, non-UAPI name.
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 30 Oct, 2017 2 commits
    • workqueue: Fix NULL pointer dereference · cef572ad
      Li Bin authored
      When queue_work() is used in irq context (not in task context), there is
      a potential case that triggers a NULL pointer dereference.
      	worker_thread()
      	  |-worker->current_pwq = pwq
      	  |-worker->current_pwq = NULL
      				//interrupt here
      				//assuming that the wq is draining
      				//Here, 'current' is the interrupted worker!
      				|-current->current_pwq is NULL here!
      Avoid it by checking for task context in current_wq_worker(), and
      if not in task context, we shouldn't use 'current' to check the
      condition.
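      A userspace sketch of the check (types and the explicit context flag
      are illustrative; the kernel tests task context directly): in irq
      context, 'current' is merely the interrupted task, so the helper must
      bail out before interpreting it as a workqueue worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model {
    bool is_worker;
    void *current_pwq;
};

/* Returns the worker only when called from task context; in irq context
 * 'current' must not be trusted to be a worker at all. */
static struct task_model *current_wq_worker(struct task_model *current_task,
                                            bool in_task_context)
{
    if (!in_task_context)          /* the fix: bail out in irq context */
        return NULL;
    return current_task->is_worker ? current_task : NULL;
}
```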
      Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 8d03ecfe ("workqueue: reimplement is_chained_work() using current_wq_worker()")
      Cc: stable@vger.kernel.org # v3.9+
    • perf/cgroup: Fix perf cgroup hierarchy support · be96b316
      Tejun Heo authored
      The following commit:
        864c2357 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
      made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event
      targets %current's cgroup.
      This breaks perf_event's hierarchical support because events which target one
      of the ancestors get ignored.
      Fix it by using the cgroup_is_descendant() test instead of equality.
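      The descendant test can be sketched with a minimal parent-pointer
      model (illustrative only, not the kernel's css hierarchy): an event's
      cgroup now matches the task's cgroup or any of its descendants,
      instead of requiring strict equality.

```c
#include <assert.h>

struct cg {
    struct cg *parent;   /* NULL at the root */
};

/* Walk up from c; match if we reach 'ancestor' (equality included). */
static int cg_is_descendant(struct cg *c, struct cg *ancestor)
{
    for (; c; c = c->parent)
        if (c == ancestor)
            return 1;
    return 0;
}
```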
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel-team@fb.com
      Cc: stable@vger.kernel.org # v4.9+
      Fixes: 864c2357 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
      Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 29 Oct, 2017 2 commits
  9. 21 Oct, 2017 4 commits
  10. 20 Oct, 2017 7 commits
  11. 19 Oct, 2017 2 commits
  12. 18 Oct, 2017 1 commit
    • bpf: disallow arithmetic operations on context pointer · 28e33f9d
      Jakub Kicinski authored
      Commit f1174f77 ("bpf/verifier: rework value tracking")
      removed the crafty selection of which pointer types are
      allowed to be modified.  This is OK for most pointer types
      since adjust_ptr_min_max_vals() will catch operations on
      immutable pointers.  One exception is PTR_TO_CTX, which is
      now allowed to be offset freely.
      The intent of aforementioned commit was to allow context
      access via modified registers.  The offset passed to
      ->is_valid_access() verifier callback has been adjusted
      by the value of the variable offset.
      What is missing, however, is taking the variable offset
      into account when the context register is used, i.e. in terms
      of the code, adding the offset to the value passed to the
      ->convert_ctx_access() callback.  This leads to the following
      eBPF user code:
           r1 += 68
           r0 = *(u32 *)(r1 + 8)
      being translated to this in kernel space:
         0: (07) r1 += 68
         1: (61) r0 = *(u32 *)(r1 +180)
         2: (95) exit
      Offset 8 corresponds to 180 in the kernel, but offset
      76 is valid too.  The verifier will "accept" access to offset
      68+8=76 but then "convert" the access to offset 8, i.e. 180.
      The effective access to offset 248 is beyond the kernel context.
      (This is a __sk_buff example on a debug-heavy kernel -
      packet mark is 8 -> 180, 76 would be data.)
      Dereferencing the modified context pointer is not as easy
      as dereferencing other types, because we have to translate
      the access to reading a field in kernel structures which is
      usually at a different offset and often of a different size.
      To allow modifying the pointer we would have to make sure
      that given eBPF instruction will always access the same
      field or the fields accessed are "compatible" in terms of
      offset and size...
      Disallow dereferencing modified context pointers and add
      to selftests the test case described here.
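      The added rule can be modelled with a small userspace sketch (the
      enum, struct, and field names are stand-ins, not the verifier's real
      state): a dereference through a context pointer is only allowed while
      the pointer carries no program-added offset, while other pointer
      types may still be offset.

```c
#include <assert.h>

enum reg_type_model { CTX_PTR, PKT_PTR };

struct reg_model {
    enum reg_type_model type;
    int off;                  /* offset the program added to the pointer */
};

static int ctx_deref_allowed(const struct reg_model *reg)
{
    if (reg->type == CTX_PTR && reg->off != 0)
        return 0;             /* dereferencing a modified ctx pointer */
    return 1;
}
```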
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 13 Oct, 2017 3 commits
  14. 11 Oct, 2017 2 commits
  15. 10 Oct, 2017 3 commits
    • seccomp: make function __get_seccomp_filter static · 084f5601
      Colin Ian King authored
      The function __get_seccomp_filter is local to the source and does
      not need to be in global scope, so make it static.
      Cleans up sparse warning:
      symbol '__get_seccomp_filter' was not declared. Should it be static?
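      The kind of change made here, illustrated with a hypothetical helper
      (the name and body below are made up, not the real
      __get_seccomp_filter): marking a file-local function 'static' gives
      it internal linkage, which is exactly what the sparse warning asks
      for when a symbol has no users outside its translation unit.

```c
#include <assert.h>

/* 'static' limits this helper to the current translation unit, so sparse
 * no longer warns that it "was not declared". Hypothetical stand-in body. */
static int __get_filter_refcount(int refs)
{
    return refs + 1;
}
```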
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Fixes: 66a733ea ("seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
    • Tejun Heo's avatar
      workqueue: replace pool->manager_arb mutex with a flag · 692b4825
      Tejun Heo authored
       Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
       lockdep:
       [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
       [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
       [ 1270.473240] -----------------------------------------------------
       [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
       [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
       [ 1270.474994]
       [ 1270.474994] and this task is already holding:
       [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
       [ 1270.476046] which would create a new lock dependency:
       [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
       [ 1270.476949]
       [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
       [ 1270.477553]  (&pool->lock/1){-.-.}
       [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
       [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
       [ 1270.494735]  Possible interrupt unsafe locking scenario:
       [ 1270.494735]
       [ 1270.495250]        CPU0                    CPU1
       [ 1270.495600]        ----                    ----
       [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
       [ 1270.496295]                                local_irq_disable();
       [ 1270.496753]                                lock(&pool->lock/1);
       [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
       [ 1270.497744]   <Interrupt>
       [ 1270.497948]     lock(&pool->lock/1);
       which will cause an irq inversion deadlock if the above lock
       scenario happens.
      The root cause of this safe -> unsafe lock order is the
       mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
       held.
       Unlocking a mutex while holding an irq spinlock was never safe.
       The problem has been around forever, but it never got noticed
       because the mutex is usually only trylocked while holding the
       irq spinlock, making actual failures very unlikely, and the
       lockdep annotation missed the condition until the recent
       b9c16a0e ("locking/mutex: Fix
       lockdep_assert_held() fail").
       Using a mutex for pool->manager_arb has always been a bit of a
       stretch.  It is primarily a mechanism to arbitrate managership
       between workers
      which can easily be done with a pool flag.  The only reason it became
      a mutex is that pool destruction path wants to exclude parallel
      managing operations.
      This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
       and makes the destruction path wait for the current manager on a
       wait queue.
      v2: Drop unnecessary flag clearing before pool destruction as
          suggested by Boqun.
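       The scheme can be sketched in user space with pthreads (the
       names mirror the patch, but the locking here is an ordinary
       pthread mutex and condition variable standing in for the
       irq-safe pool->lock and the kernel wait queue; this is an
       illustrative analogue, not the kernel code):

       ```c
       #include <assert.h>
       #include <pthread.h>
       #include <stdbool.h>
       #include <stdio.h>

       struct pool {
               pthread_mutex_t lock;       /* stands in for pool->lock */
               pthread_cond_t  manager_wq; /* stands in for the wait queue */
               bool            manager_active; /* POOL_MANAGER_ACTIVE */
       };

       /* A worker claims managership: a test-and-set under pool->lock,
        * no separate mutex needed. */
       static bool maybe_become_manager(struct pool *p)
       {
               bool won;

               pthread_mutex_lock(&p->lock);
               won = !p->manager_active;
               if (won)
                       p->manager_active = true;
               pthread_mutex_unlock(&p->lock);
               return won;
       }

       static void release_managership(struct pool *p)
       {
               pthread_mutex_lock(&p->lock);
               p->manager_active = false;
               pthread_cond_broadcast(&p->manager_wq); /* wake destroyer */
               pthread_mutex_unlock(&p->lock);
       }

       /* Destruction waits for the current manager to finish instead
        * of grabbing an arbitration mutex. */
       static void destroy_pool(struct pool *p)
       {
               pthread_mutex_lock(&p->lock);
               while (p->manager_active)
                       pthread_cond_wait(&p->manager_wq, &p->lock);
               pthread_mutex_unlock(&p->lock);
       }

       int main(void)
       {
               struct pool p = {
                       .lock = PTHREAD_MUTEX_INITIALIZER,
                       .manager_wq = PTHREAD_COND_INITIALIZER,
               };

               assert(maybe_become_manager(&p));  /* first worker wins  */
               assert(!maybe_become_manager(&p)); /* second backs off   */
               release_managership(&p);
               destroy_pool(&p);                  /* no manager: returns */
               puts("ok");
               return 0;
       }
       ```

       Because the flag is manipulated only under pool->lock, no
       mutex_unlock() ever happens with the spinlock held, which is
       what removes the HARDIRQ-safe -> HARDIRQ-unsafe dependency.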
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarLai Jiangshan <jiangshanlai@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: stable@vger.kernel.org
    • Peter Zijlstra's avatar
      sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Peter Zijlstra authored
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
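       The shape of the check can be sketched with plain bitmasks
       (the helper name and mask values here are made up for
       illustration; the kernel uses the cpumask API):

       ```c
       #include <assert.h>
       #include <stdio.h>

       typedef unsigned long cpumask_t;

       /* A CPU is eligible only if it sits in 'domain-mask & active-mask'. */
       static int cpu_allowed(int cpu, cpumask_t domain, cpumask_t active)
       {
               return ((domain & active) >> cpu) & 1UL;
       }

       int main(void)
       {
               cpumask_t domain = 0xF; /* CPUs 0-3 in the sched domain */
               cpumask_t active = 0x7; /* CPU 3 is going offline       */
               int src = 1, dst = 3;

               /* Before the fix only the source side was masked;
                * the destination must pass the same test. */
               assert(cpu_allowed(src, domain, active));
               assert(!cpu_allowed(dst, domain, active)); /* bail out */
               puts("ok");
               return 0;
       }
       ```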
      Reported-by: default avatarLevin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>