1. 03 Apr, 2019 40 commits
    • Philippe Gerum's avatar
      ipipe: add kernel event notifiers · 79db6170
      Philippe Gerum authored
      Add the core API for enabling (regular) kernel event notifications to
      a co-kernel running over the head domain. For instance, such a
      co-kernel may need to know when a task is about to be resumed upon
      signal receipt, or when it gets an access fault trap.
      This commit adds the client-side API for enabling such notification
      for class of events, but does not provide the notification points per
      se, which comes later.
    • Philippe Gerum's avatar
      printk: ipipe: add raw console channel · 56ed1979
      Philippe Gerum authored
      A raw output handler (.write_raw) is added to the console descriptor
      for writing (short) text output unmodified, without any logging,
      header or preparation whatsoever, usable from any pipeline domain.
      The dedicated raw_printk() variant formats the output message then
      passes it on to the handler holding a hard spinlock, irqs off.
      This is a very basic debug channel for situations when resorting to
      the fairly complex printk() handling is not an option. Unlike early
      consoles, regular consoles can provide a raw output service past the
      boot sequence. Raw output handlers are typically provided by serial
      console devices.
    • Philippe Gerum's avatar
      dump_stack: ipipe: make dump_stack() domain-aware · 6f013d57
      Philippe Gerum authored
      When dumping a stack backtrace, we neither need nor want to disable
      root stage IRQs over the head stage, where CPU migration can't
      Conversely, we neither need nor want to disable hard IRQs from the
      head stage, so that latency won't skyrocket either.
    • Philippe Gerum's avatar
      driver core: ipipe: defer dev_printk() from head domain · 5dbea799
      Philippe Gerum authored
      Just like printk(), dev_printk() cannot run from the head domain
      and/or with hard IRQs disabled. In such a case, log the output
      directly into the staging buffer we use for printk().
      NOTE: when redirected to the buffer, the output does not include the
      dev_printk() header text but is merely sent as-is to the log.
    • Philippe Gerum's avatar
      printk: ipipe: defer printk() from head domain · c469670d
      Philippe Gerum authored
      The printk() machinery cannot immediately invoke the console driver(s)
      when called from the head domain, since such driver code belongs to
      the root domain and cannot be shared between domains.
      Output issued from the head domain is formatted then logged into a
      staging buffer, and a dedicated virtual IRQ is posted to the root
      domain for notification. When the virtual IRQ handler runs, the
      contents of the staging buffer is flushed to the printk() interface
      anew, which may eventually pass the output on to the console drivers
      from such a context.
    • Philippe Gerum's avatar
      PM / hibernate: ipipe: protect against out-of-band interrupts · 5fb7e241
      Philippe Gerum authored
      We must not allow out-of-band activity to resume while we are busy
      suspending the devices in the system, until the PM sleep state has
      been fully entered.
      Pair existing virtual IRQ disabling calls which only apply to the root
      domain with hard ones.
    • Philippe Gerum's avatar
      module: ipipe: enable try_module_get() from hard atomic context · b7c424c4
      Philippe Gerum authored
      We might have out-of-band code calling try_module_get() from the head
      domain, or from a section covered by a hard spinlock where the root
      domain must not reschedule. This requires the preemption management
      calls in try_module_get() (and the converse module_put()) to be
      converted to their hard variant.
      REVISIT: using try_module_get() from such contexts is questionable,
      client domains should be fixed.
    • Philippe Gerum's avatar
      KGDB: ipipe: enable debugging over the head domain · 5f37bade
      Philippe Gerum authored
      Make the KGDB stub runnable over the head domain since we may take
      traps and interrupts from that context too, by converting the locks to
      hard spinlocks.
    • Philippe Gerum's avatar
      context_tracking: ipipe: do not track over the head domain · 5ac01b95
      Philippe Gerum authored
      Context tracking is a prerequisite for FULL_NOHZ, so that the RCU
      subsystem can detect CPU idleness without relying on the (regular)
      timer tick.
      Out-of-band activity running over the head domain should by definition
      not be involved in such detection logic, as the root domain has no
      knowledge of what happens - and when - on the head domain whatsoever.
    • Philippe Gerum's avatar
      lib/smp_processor_id: ipipe: exclude head domain from preemption check · bc085b2b
      Philippe Gerum authored
      There can be no CPU migration from the head stage, however the
      out-of-band code currently running smp_processor_id() might have
      preempted the regular kernel code from within a preemptible section,
      which might cause false positive in the end.
      These are the two reasons why we certainly neither need nor want to do
      the preemption check in that case.
    • Philippe Gerum's avatar
      atomic: ipipe: keep atomic when pipelining IRQs · 43381d19
      Philippe Gerum authored
      Because of the virtualization of interrupt masking for the regular
      kernel code when the pipeline is enabled, atomic helpers relying on
      common interrupt disabling helpers such as local_irq_save/restore
      pairs would not be atomic anymore, leading to data corruption.
      This commit restores true atomicity for the atomic helpers that would
      be otherwise affected by interrupt virtualization.
    • Philippe Gerum's avatar
      preempt: ipipe: : add preemption-safe hard_preempt_{enable, disable}() ops · e68fafcc
      Philippe Gerum authored
      Some inner code of the interrupt pipeline may have to traverse regular
      kernel code which manipulates the preemption count, expecting full
      serialization including with out-of-band contexts.
      The hard_preempt_*() variants are substituted to the original
      preempt_{enable, disable}() calls in these cases.
    • Philippe Gerum's avatar
      genirq: ipipe: protect generic chip against domain preemption · 4d7b948f
      Philippe Gerum authored
      As described in Documentation/ipipe.rst, irq_chip drivers need to be
      specifically adapted for dealing with interrupt pipelining safely.
      The basic issue to address is proper serialization between some
      irq_chip handlers which may be called from out-of-band context
      immediately upon receipt of an IRQ, and the rest of the driver which
      may access the same data / IO registers from the regular - in-band -
      context on the same CPU.
      This commit converts the generic irq_chip lock to a hard spinlock,
      which ensures such serialization.
    • Philippe Gerum's avatar
      ipipe: add latency tracer · a8de4bad
      Philippe Gerum authored
      The latency tracer is a variant of ftrace's 'function' tracer
      providing detailed information about the current interrupt state at
      each function entry (i.e. virtual interrupt flag and CPU interrupt
      disable bit). This commit introduces the generic tracer code, which
      builds upon the regular ftrace API.
      The arch-specific code should provide for ipipe_read_tsc(), a helper
      routine returning a 64bit monotonic time value for timestamping
      purpose. HAVE_IPIPE_TRACER_SUPPORT should be selected by the
      arch-specific code for enabling the tracer, which in turn makes
      CONFIG_IPIPE_TRACE available from the Kconfig interface.
    • Philippe Gerum's avatar
      ipipe: add out-of-band tick device · 216ce73c
      Philippe Gerum authored
      The out-of-band tick device manages the timer hardware by interposing
      on selected clockevent handlers transparently, so that a client domain
      (e.g. a co-kernel) eventually controls such hardware for scheduling
      the high-precision timer events it needs to. Those events are
      delivered to out-of-hand activities running on the head stage,
      unimpeded by (only virtually) interrupt-free sections of the regular
      kernel code.
      This commit introduces the generic API for controlling the out-of-band
      tick device from a co-kernel. It also provides for the internal API
      clock event chip drivers should use for enabling high-precision
      timing for their hardware.
    • Philippe Gerum's avatar
      locking: ipipe: add hard lock alternative to regular spinlocks · d46d1207
      Philippe Gerum authored
      Hard spinlocks manipulate the CPU interrupt mask, without affecting
      the kernel preemption state in locking/unlocking operations.
      This type of spinlock is useful for implementing a critical section to
      serialize concurrent accesses from both in-band and out-of-band
      contexts, i.e. from root and head stages.
      Hard spinlocks exclusively depend on the pre-existing arch-specific
      bits which implement regular spinlocks. They can be seen as basic
      spinlocks still affecting the CPU's interrupt state when all other
      spinlock types only deal with the virtual interrupt flag managed by
      the pipeline core - i.e. only disable interrupts for the regular
      in-band kernel activity.
    • Philippe Gerum's avatar
      genirq: add generic I-pipe core · b4d5ff3f
      Philippe Gerum authored
      This commit provides the arch-independent bits for implementing the
      interrupt pipeline core, a lightweight layer introducing a separate,
      high-priority execution stage for handling all IRQs in pseudo-NMI
      mode, which cannot be delayed by the regular kernel code. See
      Documentation/ipipe.rst for details about interrupt pipelining.
      Architectures which support interrupt pipelining should select
      HAVE_IPIPE_SUPPORT, along with implementing the required arch-specific
      code. In such a case, CONFIG_IPIPE becomes available to the user via
      the Kconfig interface for enabling the feature.
    • Greg Kroah-Hartman's avatar
      Linux 4.19.33 · 4b3a3ab0
      Greg Kroah-Hartman authored
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Remove the old connections for the muxes · 11008a9b
      Heikki Krogerus authored
      commit 148b0aa78e4e1077e38f928124bbc9c2d2d24006 upstream.
      USB Type-C class driver now expects the muxes to be always
      assigned to the ports and not controllers, so the
      connections for the mux and fusb302 can be removed.
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Heikki Krogerus's avatar
      usb: typec: class: Don't use port parent for getting mux handles · 056cda45
      Heikki Krogerus authored
      commit 23481121c81d984193edf1532f5e123637e50903 upstream.
      It is not possible to use the parent of the port device when
      requesting mux handles as the parent may be a multiport USB
      Type-C or PD controller. The muxes must be assigned to the
      ports, not the controllers.
      This will also move the requesting of the muxes after the
      port device is initialized.
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Add connections for the USB Type-C port · 6875404a
      Heikki Krogerus authored
      commit 495965a1002a0b301bf4fbfd1aed3233f3e7db1b upstream.
      Assigning the mux to the USB Type-C port on top of fusb302.
      That will prepare this driver for the change in the USB
      Type-C class code, where the class driver will assume the
      muxes to be always assigned to the ports and not the
      Once the USB Type-C class driver has been updated, the
      connections between the mux and fusb302 can be dropped.
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Add connection for the DP alt mode · 681a9fc1
      Heikki Krogerus authored
      commit 78d2b54b134ea6059e2b1554ad53fab2300a4cc6 upstream.
      Adding a connection for the DisplayPort alternate mode.
      PI3USB30532 is used for muxing the port to DisplayPort on
      CHT platforms. The connection allows the alternate mode
      device to get handle to the mux, and therefore make it
      possible to use the USB Type-C connector as DisplayPort.
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Register all connections at once · 3bb446a3
      Heikki Krogerus authored
      commit 140a4ec4adddda615b4e8e8055ca37a30c7fe5e8 upstream.
      We can register all device connection descriptors with a
      single call to device_connections_add().
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Heikki Krogerus's avatar
      drivers: base: Helpers for adding device connection descriptions · e99d90ce
      Heikki Krogerus authored
      commit cd7753d371388e712e3ee52b693459f9b71aaac2 upstream.
      Introducing helpers for adding and removing multiple device
      connection descriptions at once.
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Xu Yu's avatar
      bpf: do not restore dst_reg when cur_state is freed · f5959dec
      Xu Yu authored
      commit 0803278b0b4d8eeb2b461fb698785df65a725d9e upstream.
      Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      Call trace:
      Fault injection trace:
        push_stack+0x216/0x2f0	          <- inject failslab
      When kzalloc() fails in push_stack(), free_verifier_state() will free
      current verifier state. As push_stack() returns, dst_reg was restored
      if ptr_is_dst_reg is false. However, as member of the cur_state,
      dst_reg is also freed, and error occurs when dereferencing dst_reg.
      Simply fix it by testing ret of push_stack() before restoring dst_reg.
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: default avatarXu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Gao Xiang's avatar
      staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir() · 738dda85
      Gao Xiang authored
      commit 33bac912840fe64dbc15556302537dc6a17cac63 upstream.
      After commit 419d6efc50e9, kernel cannot be crashed in the namei
      path. However, corrupted nameoff can do harm in the process of
      readdir for scenerios without dm-verity as well. Fix it now.
      Fixes: 3aa8ec71 ("staging: erofs: add directory operations")
      Cc: <stable@vger.kernel.org> # 4.19+
      Signed-off-by: default avatarGao Xiang <gaoxiang25@huawei.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Gao Xiang's avatar
      staging: erofs: fix error handling when failed to read compresssed data · 83bbd66b
      Gao Xiang authored
      commit b6391ac73400eff38377a4a7364bd3df5efb5178 upstream.
      Complete read error handling paths for all three kinds of
      compressed pages:
       1) For cache-managed pages, PG_uptodate will be checked since
          read_endio will unlock and SetPageUptodate for these pages;
       2) For inplaced pages, read_endio cannot SetPageUptodate directly
          since it should be used to mark the final decompressed data,
          PG_error will be set with page locked for IO error instead;
       3) For staging pages, PG_error is used, which is similar to
          what we do for inplaced pages.
      Fixes: 3883a79a ("staging: erofs: introduce VLE decompression support")
      Cc: <stable@vger.kernel.org> # 4.19+
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarGao Xiang <gaoxiang25@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Sean Christopherson's avatar
      KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 3a18eaba
      Sean Christopherson authored
      commit 0cf9135b773bf32fba9dd8e6699c1b331ee4b749 upstream.
      The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarXiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Sean Christopherson's avatar
      KVM: x86: update %rip after emulating IO · b9733a74
      Sean Christopherson authored
      commit 45def77ebf79e2e8942b89ed79294d97ce914fa0 upstream.
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks becase re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      Updating %rip prior to executing to userspace has several drawbacks:
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single step exits to userspace for are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcenment suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Sean Christopherson's avatar
      KVM: Reject device ioctls from processes other than the VM's creator · 7ceedcef
      Sean Christopherson authored
      commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream.
      KVM's API requires thats ioctls must be issued from the same process
      that created the VM.  In other words, userspace can play games with a
      VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
      creator can do anything useful.  Explicitly reject device ioctls that
      are issued by a process other than the VM's creator, and update KVM's
      API documentation to extend its requirements to device ioctls.
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Thomas Gleixner's avatar
      x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y · a0713e81
      Thomas Gleixner authored
      commit bebd024e4815b1a170fcd21ead9c2222b23ce9e6 upstream.
      The SMT disable 'nosmt' command line argument is not working properly when
      CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs which are
      required to be brought up due to the MCE issues, cannot work. The CPUs are
      then kept in a half dead state.
      As the 'nosmt' functionality has become popular due to the speculative
      hardware vulnerabilities, the half torn down state is not a proper solution
      to the problem.
      Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
      Reported-by: default avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Thomas Gleixner's avatar
      cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n · a56aa02e
      Thomas Gleixner authored
      commit 206b92353c839c0b27a0b9bec24195f93fd6cf7a upstream.
      Tianyu reported a crash in a CPU hotplug teardown callback when booting a
      kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
      It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
      forever in case that a bringup callback fails. Unfortunately this issue was
      not recognized when the CPU hotplug code was reworked, so the shortcoming
      just stayed in place.
      When a bringup callback fails, the CPU hotplug code rolls back the
      operation and takes the CPU offline.
      The 'nosmt' command line argument uses a bringup failure to abort the
      bringup of SMT sibling CPUs. This partial bringup is required due to the
      MCE misdesign on Intel CPUs.
      With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
      CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
      teardown of a CPU including the synchronizations in various facilities like
      RCU, NOHZ and others.
      As a consequence the teardown callbacks which must be executed on the
      outgoing CPU within stop machine with interrupts disabled are executed on
      the control CPU in interrupt enabled and preemptible context causing the
      kernel to crash and burn. The pre state machine code has a different
      failure mode which is more subtle and resulting in a less obvious use after
      free crash because the control side frees resources which are still in use
      by the undead CPU.
      But this is not a x86 only problem. Any architecture which supports the
      SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
      likely to be triggered because in 99.99999% of the cases all bringup
      callbacks succeed.
      The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
      all architectures as the following architectures have either no hotplug
      support at all or not all subarchitectures support it:
       alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
      Crashing the kernel in such a situation is not an acceptable state
      Implement a minimal rollback variant by limiting the teardown to the point
      where all regular teardown callbacks have been invoked and leave the CPU in
      the 'dead' idle state. This has the following consequences:
       - the CPU is brought down to the point where the stop_machine takedown
         would happen.
       - the CPU stays there forever and is idle
       - The CPU is cleared in the CPU active mask, but not in the CPU online
         mask which is a legit state.
       - Interrupts are not forced away from the CPU
       - All facilities which only look at online mask would still see it, but
         that is the case during normal hotplug/unplug operations as well. It's
         just a (way) longer time frame.
      This will expose issues, which haven't been exposed before or only seldom,
      because now the normally transient state of being non active but online is
      a permanent state. In testing this exposed already an issue vs. work queues
      where the vmstat code schedules work on the almost dead CPU which ends up
      in an unbound workqueue and triggers 'preemtible context' warnings. This is
      not a problem of this change, it merily exposes an already existing issue.
      Still this is better than crashing fully without a chance to debug it.
      This is mainly thought as workaround for those architectures which do not
      support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
      Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
      Reported-by: default avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Thomas Gleixner's avatar
      watchdog: Respect watchdog cpumask on CPU hotplug · 336f6b23
      Thomas Gleixner authored
      commit 7dd47617114921fdd8c095509e5e7b4373cc44a1 upstream.
      The rework of the watchdog core to use cpu_stop_work broke the watchdog
      cpumask on CPU hotplug.
      The watchdog_enable/disable() functions are now called unconditionally from
      the hotplug callback, i.e. even on CPUs which are not in the watchdog
      cpumask. As a consequence the watchdog can become unstoppable.
      Only invoke them when the plugged CPU is in the watchdog cpumask.
      Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
      Reported-by: default avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Michael Ellerman's avatar
      powerpc/64: Fix memcmp reading past the end of src/dest · c91d07ad
      Michael Ellerman authored
      commit d9470757398a700d9450a43508000bcfd010c7a4 upstream.
      Chandan reported that fstests' generic/026 test hit a crash:
        BUG: Unable to handle kernel data access at 0xc00000062ac40000
        Faulting instruction address: 0xc000000000092240
        Oops: Kernel access of bad area, sig: 11 [#1]
        CPU: 0 PID: 27828 Comm: chacl Not tainted 5.0.0-rc2-next-20190115-00001-g6de6dba64dda #1
        NIP:  c000000000092240 LR: c00000000066a55c CTR: 0000000000000000
        REGS: c00000062c0c3430 TRAP: 0300   Not tainted  (5.0.0-rc2-next-20190115-00001-g6de6dba64dda)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000842  XER: 20000000
        CFAR: 00007fff7f3108ac DAR: c00000062ac40000 DSISR: 40000000 IRQMASK: 0
        GPR00: 0000000000000000 c00000062c0c36c0 c0000000017f4c00 c00000000121a660
        GPR04: c00000062ac3fff9 0000000000000004 0000000000000020 00000000275b19c4
        GPR08: 000000000000000c 46494c4500000000 5347495f41434c5f c0000000026073a0
        GPR12: 0000000000000000 c0000000027a0000 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: c00000062ea70020 c00000062c0c38d0 0000000000000002 0000000000000002
        GPR24: c00000062ac3ffe8 00000000275b19c4 0000000000000001 c00000062ac30000
        GPR28: c00000062c0c38d0 c00000062ac30050 c00000062ac30058 0000000000000000
        NIP memcmp+0x120/0x690
        LR  xfs_attr3_leaf_lookup_int+0x53c/0x5b0
        Call Trace:
          xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
        Instruction dump:
        7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 2c050000
        4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 7d4a3436 7c295040
      The instruction dump decodes as:
        subfic  r6,r5,8
        rlwinm  r6,r6,3,0,28
        ldbrx   r9,0,r3
        ldbrx   r10,0,r4      <-
      Which shows us doing an 8 byte load from c00000062ac3fff9, which
      crosses the page boundary at c00000062ac40000 and faults.
      It's not OK for memcmp to read past the end of the source or
      destination buffers if that would cross a page boundary, because we
      don't know that the next page is mapped.
      As pointed out by Segher, we can read past the end of the source or
      destination as long as we don't cross a 4K boundary, because that's
      our minimum page size on all platforms.
      The bug is in the code at the .Lcmp_rest_lt8bytes label. When we get
      there we know that s1 is 8-byte aligned and we have at least 1 byte to
      read, so a single 8-byte load won't read past the end of s1 and cross
      a page boundary.
      But we have to be more careful with s2. So check if it's within 8
      bytes of a 4K boundary and if so go to the byte-by-byte loop.
      Fixes: 2d9ee327 ("powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()")
      Cc: stable@vger.kernel.org # v4.19+
      Reported-by: default avatarChandan Rajendra <chandan@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: default avatarSegher Boessenkool <segher@kernel.crashing.org>
      Tested-by: default avatarChandan Rajendra <chandan@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Gautham R. Shenoy's avatar
      powerpc/pseries/energy: Use OF accessor functions to read ibm,drc-indexes · d7c00bbb
      Gautham R. Shenoy authored
      commit ce9afe08e71e3f7d64f337a6e932e50849230fc2 upstream.
      In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent,
      we currently use of_read_property() to obtain the pointer to the array
      corresponding to the property "ibm,drc-indexes". The elements of this
      array are of type __be32, but are accessed without any conversion to
      the OS-endianness, which is buggy on a Little Endian OS.
      Fix this by using of_property_read_u32_index() accessor function to
      safely read the elements of the array.
      Fixes: e83636ac ("pseries/drc-info: Search DRC properties for CPU indexes")
      Cc: stable@vger.kernel.org # v4.16+
      Reported-by: default avatarPavithra R. Prakash <pavrampu@in.ibm.com>
      Signed-off-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: default avatarVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      [mpe: Make the WARN_ON a WARN_ON_ONCE so it's not retriggerable]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Rolf Eike Beer's avatar
      objtool: Query pkg-config for libelf location · 0603e3a9
      Rolf Eike Beer authored
      commit 056d28d135bca0b1d0908990338e00e9dadaf057 upstream.
      If it is not in the default location, compilation fails at several points.
      Signed-off-by: default avatarRolf Eike Beer <eb@emlix.com>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/91a25e992566a7968fedc89ec80e7f4c83ad0548.1553622500.git.jpoimboe@redhat.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Adrian Hunter's avatar
      perf intel-pt: Fix TSC slip · a436cf64
      Adrian Hunter authored
      commit f3b4e06b3bda759afd042d3d5fa86bea8f1fe278 upstream.
      A TSC packet can slip past MTC packets so that the timestamp appears to
      go backwards. One estimate is that can be up to about 40 CPU cycles,
      which is certainly less than 0x1000 TSC ticks, but accept slippage an
      order of magnitude more to be on the safe side.
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 79b58424 ("perf tools: Add Intel PT support for decoding MTC packets")
      Link: http://lkml.kernel.org/r/20190325135135.18348-1-adrian.hunter@intel.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kan Liang's avatar
      perf pmu: Fix parser error for uncore event alias · 5f936633
      Kan Liang authored
      commit e94d6b7f615e6dfbaf9fba7db6011db561461d0c upstream.
      Perf fails to parse uncore event alias, for example:
        # perf stat -e unc_m_clockticks -a --no-merge sleep 1
        event syntax error: 'unc_m_clockticks'
                             \___ parser error
      Current code assumes that the event alias is from one specific PMU.
      To find the PMU, perf strcmps the PMU name of event alias with the real
      PMU name on the system.
      However, the uncore event alias may be from multiple PMUs with common
      prefix. The PMU name of uncore event alias is the common prefix.
      For example, UNC_M_CLOCKTICKS is clock event for iMC, which include 6
      PMUs with the same prefix "uncore_imc" on a skylake server.
      The real PMU names on the system for iMC are uncore_imc_0 ...
      The strncmp is used to only check the common prefix for uncore event
      With the patch:
        # perf stat -e unc_m_clockticks -a --no-merge sleep 1
        Performance counter stats for 'system wide':
             723,594,722      unc_m_clockticks [uncore_imc_5]
             724,001,954      unc_m_clockticks [uncore_imc_3]
             724,042,655      unc_m_clockticks [uncore_imc_1]
             724,161,001      unc_m_clockticks [uncore_imc_4]
             724,293,713      unc_m_clockticks [uncore_imc_2]
             724,340,901      unc_m_clockticks [uncore_imc_0]
             1.002090060 seconds time elapsed
      Signed-off-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Thomas Richter <tmricht@linux.ibm.com>
      Cc: stable@vger.kernel.org
      Fixes: ea1fa48c055f ("perf stat: Handle different PMU names with common prefix")
      Link: http://lkml.kernel.org/r/1552672814-156173-1-git-send-email-kan.liang@linux.intel.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Lars Persson's avatar
      mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate · f70ddae2
      Lars Persson authored
      commit d2b2c6dd227ba5b8a802858748ec9a780cb75b47 upstream.
      Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL
      and SIGSEGV that could not be traced back to a userspace code bug.  They
      had all the magic signs of an I/D cache coherency issue.
      Now recently we noticed that the /proc/sys/vm/compact_memory interface
      was quite efficient at provoking this class of userspace crashes.
      Studying the code in mm/migrate.c there is a distinction made between
      migrating a page that is mapped at the instant of migration and one that
      is not mapped.  Our problem turned out to be the non-mapped pages.
      For the non-mapped page the code performs a copy of the page content and
      all relevant meta-data of the page without doing the required D-cache
      maintenance.  This leaves dirty data in the D-cache of the CPU and on
      the 1004K cores this data is not visible to the I-cache.  A subsequent
      page-fault that triggers a mapping of the page will happily serve the
      process with potentially stale code.
      What about ARM then, this bug should have seen greater exposure? Well
      ARM became immune to this flaw back in 2010, see commit c0177800
      ("ARM: 6379/1: Assume new page cache pages have dirty D-cache").
      My proposed fix moves the D-cache maintenance inside move_to_new_page to
      make it common for both cases.
      Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
      Fixes: 97ee0524 ("flush cache before installing new page at migraton")
      Signed-off-by: default avatarLars Persson <larper@axis.com>
      Reviewed-by: default avatarPaul Burton <paul.burton@mips.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Yang Shi's avatar
      mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified · 5966777d
      Yang Shi authored
      commit a7f40cfe3b7ada57af9b62fd28430eeb4a7cfcb7 upstream.
      When MPOL_MF_STRICT was specified and an existing page was already on a
      node that does not follow the policy, mbind() should return -EIO.  But
      commit 6f4576e3 ("mempolicy: apply page table walker on
      queue_pages_range()") broke the rule.
      And commit c8633798 ("mm: mempolicy: mbind and migrate_pages support
      thp migration") didn't return the correct value for THP mbind() too.
      If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it
      reaches queue_pages_to_pte_range() or queue_pages_pmd() to check if an
      existing page was already on a node that does not follow the policy.
      And, non-migratable vma may be used, return -EIO too if MPOL_MF_MOVE or
      MPOL_MF_MOVE_ALL was specified.
      Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c
      [akpm@linux-foundation.org: tweak code comment]
      Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: 6f4576e3 ("mempolicy: apply page table walker on queue_pages_range()")
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reported-by: default avatarCyril Hrubis <chrubis@suse.cz>
      Suggested-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>