1. 03 Apr, 2019 40 commits
    • Philippe Gerum's avatar
      mm: ipipe: disable ondemand memory · 466697cd
      Philippe Gerum authored
      Co-kernels cannot bear with the extra latency caused by memory access
      faults involved in COW or
      overcommit. __ipipe_disable_ondemand_mappings() force commits all
      common memory mappings with physical RAM.
      
      In addition, the architecture code is given a chance to pre-load page
      table entries for ioremap and vmalloc memory, for preventing further
      minor faults accessing such memory due to PTE misses (if that ever
      makes sense for them).
      
      Revisit: Further COW breaking in copy_user_page() and copy_pte_range()
      may be useless once __ipipe_disable_ondemand_mappings() has run for a
      co-kernel task, since all of its mappings have been populated, and
      unCOWed if applicable.
      466697cd
    • Philippe Gerum's avatar
      fork: ipipe: announce mm dismantling · 0e1b063d
      Philippe Gerum authored
      IPIPE_KEVT_CLEANUP is emitted before a process memory context is
      entirely dropped, after all the mappings have been exited. Per-process
      resources which might be maintained by the co-kernel could be released
      there, as all tasks have exited.
      0e1b063d
    • Philippe Gerum's avatar
      sched: ipipe: announce CPU affinity change · bbba26eb
      Philippe Gerum authored
      Emit IPIPE_KEVT_SETAFFINITY to the co-kernel when the target task is
      about to move to another CPU.
      
      CPU migration can only take place from the root domain, the pipeline
      does not provide any support for migrating tasks from the head domain,
      and derives several key assumptions based on this invariant.
      bbba26eb
    • Philippe Gerum's avatar
      sched: ipipe: announce signal receipt · cfe28fb1
      Philippe Gerum authored
      Emit IPIPE_KEVT_SIGWAKE when the target task is about to receive a
      (regular) signal. The co-kernel may decide to schedule a transition of
      the recipient to the root domain in order to have it handle that
      signal asap, which is commonly required for keeping the kernel sane.
      
      This notification is always sent from the context of the issuer.
      cfe28fb1
    • Philippe Gerum's avatar
      sched: ipipe: announce task exit · d94ddc97
      Philippe Gerum authored
      Emit IPIPE_KEVT_EXIT from do_exit() to the co-kernel before the
      current task has dropped the files and mappings it owns.
      d94ddc97
    • Philippe Gerum's avatar
      KVM: ipipe: keep hypervisor state consistent across domain preemption · 88e379f8
      Philippe Gerum authored
      In order for the hypervisor to operate properly in presence of a
      co-kernel, we need:
      
      - the virtualization core to know when the hypervisor stalls due
        to a preemption by the co-kernel.
      
      - to know when the VM enters and leaves guest mode.
      88e379f8
    • Philippe Gerum's avatar
      sched: ipipe: add domain debug checks to common scheduling paths · 0088b024
      Philippe Gerum authored
      Catch invalid calls of root-only code from the head domain from common
      paths which may lead to blocking the current task linux-wise. Checks
      are enabled by CONFIG_IPIPE_DEBUG_CONTEXT.
      0088b024
    • Philippe Gerum's avatar
      sched: ipipe: enable task migration between domains · bf559171
      Philippe Gerum authored
      This is the basic code enabling alternate control of tasks between the
      regular kernel and an embedded co-kernel. The changes cover the
      following aspects:
      
      - extend the per-thread information block with a private area usable
        by the co-kernel for storing additional state information
      
      - provide the API enabling a scheduler exchange mechanism, so that
        tasks can run under the control of either kernel alternatively. This
        includes a service to move the current task to the head domain under
        the control of the co-kernel, and the converse service to re-enter
        the root domain once the co-kernel has released such task.
      
      - ensure the generic context switching code can be used from any
        domain, serializing execution as required.
      
      These changes have to be paired with arch-specific code further
      enabling context switching from the head domain.
      bf559171
    • Philippe Gerum's avatar
      clockevents: ipipe: connect clock chips to abstract tick device · f8f47d4c
      Philippe Gerum authored
      Announce all clock event chips as they are registered to the
      out-of-band tick device infrastructure, so that we can interpose on
      key handlers in their descriptors.
      f8f47d4c
    • Philippe Gerum's avatar
      ipipe: add kernel event notifiers · 79db6170
      Philippe Gerum authored
      Add the core API for enabling (regular) kernel event notifications to
      a co-kernel running over the head domain. For instance, such a
      co-kernel may need to know when a task is about to be resumed upon
      signal receipt, or when it gets an access fault trap.
      
      This commit adds the client-side API for enabling such notification
      for class of events, but does not provide the notification points per
      se, which comes later.
      79db6170
    • Philippe Gerum's avatar
      printk: ipipe: add raw console channel · 56ed1979
      Philippe Gerum authored
      A raw output handler (.write_raw) is added to the console descriptor
      for writing (short) text output unmodified, without any logging,
      header or preparation whatsoever, usable from any pipeline domain.
      
      The dedicated raw_printk() variant formats the output message then
      passes it on to the handler holding a hard spinlock, irqs off.
      
      This is a very basic debug channel for situations when resorting to
      the fairly complex printk() handling is not an option. Unlike early
      consoles, regular consoles can provide a raw output service past the
      boot sequence. Raw output handlers are typically provided by serial
      console devices.
      56ed1979
    • Philippe Gerum's avatar
      dump_stack: ipipe: make dump_stack() domain-aware · 6f013d57
      Philippe Gerum authored
      When dumping a stack backtrace, we neither need nor want to disable
      root stage IRQs over the head stage, where CPU migration can't
      happen.
      
      Conversely, we neither need nor want to disable hard IRQs from the
      head stage, so that latency won't skyrocket either.
      6f013d57
    • Philippe Gerum's avatar
      driver core: ipipe: defer dev_printk() from head domain · 5dbea799
      Philippe Gerum authored
      Just like printk(), dev_printk() cannot run from the head domain
      and/or with hard IRQs disabled. In such a case, log the output
      directly into the staging buffer we use for printk().
      
      NOTE: when redirected to the buffer, the output does not include the
      dev_printk() header text but is merely sent as-is to the log.
      5dbea799
    • Philippe Gerum's avatar
      printk: ipipe: defer printk() from head domain · c469670d
      Philippe Gerum authored
      The printk() machinery cannot immediately invoke the console driver(s)
      when called from the head domain, since such driver code belongs to
      the root domain and cannot be shared between domains.
      
      Output issued from the head domain is formatted then logged into a
      staging buffer, and a dedicated virtual IRQ is posted to the root
      domain for notification. When the virtual IRQ handler runs, the
      contents of the staging buffer is flushed to the printk() interface
      anew, which may eventually pass the output on to the console drivers
      from such a context.
      c469670d
    • Philippe Gerum's avatar
      PM / hibernate: ipipe: protect against out-of-band interrupts · 5fb7e241
      Philippe Gerum authored
      We must not allow out-of-band activity to resume while we are busy
      suspending the devices in the system, until the PM sleep state has
      been fully entered.
      
      Pair existing virtual IRQ disabling calls which only apply to the root
      domain with hard ones.
      5fb7e241
    • Philippe Gerum's avatar
      module: ipipe: enable try_module_get() from hard atomic context · b7c424c4
      Philippe Gerum authored
      We might have out-of-band code calling try_module_get() from the head
      domain, or from a section covered by a hard spinlock where the root
      domain must not reschedule. This requires the preemption management
      calls in try_module_get() (and the converse module_put()) to be
      converted to their hard variant.
      
      REVISIT: using try_module_get() from such contexts is questionable,
      client domains should be fixed.
      b7c424c4
    • Philippe Gerum's avatar
      KGDB: ipipe: enable debugging over the head domain · 5f37bade
      Philippe Gerum authored
      Make the KGDB stub runnable over the head domain since we may take
      traps and interrupts from that context too, by converting the locks to
      hard spinlocks.
      5f37bade
    • Philippe Gerum's avatar
      context_tracking: ipipe: do not track over the head domain · 5ac01b95
      Philippe Gerum authored
      Context tracking is a prerequisite for FULL_NOHZ, so that the RCU
      subsystem can detect CPU idleness without relying on the (regular)
      timer tick.
      
      Out-of-band activity running over the head domain should by definition
      not be involved in such detection logic, as the root domain has no
      knowledge of what happens - and when - on the head domain whatsoever.
      5ac01b95
    • Philippe Gerum's avatar
      lib/smp_processor_id: ipipe: exclude head domain from preemption check · bc085b2b
      Philippe Gerum authored
      There can be no CPU migration from the head stage, however the
      out-of-band code currently running smp_processor_id() might have
      preempted the regular kernel code from within a preemptible section,
      which might cause false positive in the end.
      
      These are the two reasons why we certainly neither need nor want to do
      the preemption check in that case.
      bc085b2b
    • Philippe Gerum's avatar
      atomic: ipipe: keep atomic when pipelining IRQs · 43381d19
      Philippe Gerum authored
      Because of the virtualization of interrupt masking for the regular
      kernel code when the pipeline is enabled, atomic helpers relying on
      common interrupt disabling helpers such as local_irq_save/restore
      pairs would not be atomic anymore, leading to data corruption.
      
      This commit restores true atomicity for the atomic helpers that would
      be otherwise affected by interrupt virtualization.
      43381d19
    • Philippe Gerum's avatar
      preempt: ipipe: : add preemption-safe hard_preempt_{enable, disable}() ops · e68fafcc
      Philippe Gerum authored
      Some inner code of the interrupt pipeline may have to traverse regular
      kernel code which manipulates the preemption count, expecting full
      serialization including with out-of-band contexts.
      
      The hard_preempt_*() variants are substituted to the original
      preempt_{enable, disable}() calls in these cases.
      e68fafcc
    • Philippe Gerum's avatar
      genirq: ipipe: protect generic chip against domain preemption · 4d7b948f
      Philippe Gerum authored
      As described in Documentation/ipipe.rst, irq_chip drivers need to be
      specifically adapted for dealing with interrupt pipelining safely.
      
      The basic issue to address is proper serialization between some
      irq_chip handlers which may be called from out-of-band context
      immediately upon receipt of an IRQ, and the rest of the driver which
      may access the same data / IO registers from the regular - in-band -
      context on the same CPU.
      
      This commit converts the generic irq_chip lock to a hard spinlock,
      which ensures such serialization.
      4d7b948f
    • Philippe Gerum's avatar
      ipipe: add latency tracer · a8de4bad
      Philippe Gerum authored
      The latency tracer is a variant of ftrace's 'function' tracer
      providing detailed information about the current interrupt state at
      each function entry (i.e. virtual interrupt flag and CPU interrupt
      disable bit). This commit introduces the generic tracer code, which
      builds upon the regular ftrace API.
      
      The arch-specific code should provide for ipipe_read_tsc(), a helper
      routine returning a 64bit monotonic time value for timestamping
      purpose. HAVE_IPIPE_TRACER_SUPPORT should be selected by the
      arch-specific code for enabling the tracer, which in turn makes
      CONFIG_IPIPE_TRACE available from the Kconfig interface.
      a8de4bad
    • Philippe Gerum's avatar
      ipipe: add out-of-band tick device · 216ce73c
      Philippe Gerum authored
      The out-of-band tick device manages the timer hardware by interposing
      on selected clockevent handlers transparently, so that a client domain
      (e.g. a co-kernel) eventually controls such hardware for scheduling
      the high-precision timer events it needs to. Those events are
      delivered to out-of-hand activities running on the head stage,
      unimpeded by (only virtually) interrupt-free sections of the regular
      kernel code.
      
      This commit introduces the generic API for controlling the out-of-band
      tick device from a co-kernel. It also provides for the internal API
      clock event chip drivers should use for enabling high-precision
      timing for their hardware.
      216ce73c
    • Philippe Gerum's avatar
      locking: ipipe: add hard lock alternative to regular spinlocks · d46d1207
      Philippe Gerum authored
      Hard spinlocks manipulate the CPU interrupt mask, without affecting
      the kernel preemption state in locking/unlocking operations.
      
      This type of spinlock is useful for implementing a critical section to
      serialize concurrent accesses from both in-band and out-of-band
      contexts, i.e. from root and head stages.
      
      Hard spinlocks exclusively depend on the pre-existing arch-specific
      bits which implement regular spinlocks. They can be seen as basic
      spinlocks still affecting the CPU's interrupt state when all other
      spinlock types only deal with the virtual interrupt flag managed by
      the pipeline core - i.e. only disable interrupts for the regular
      in-band kernel activity.
      d46d1207
    • Philippe Gerum's avatar
      genirq: add generic I-pipe core · b4d5ff3f
      Philippe Gerum authored
      This commit provides the arch-independent bits for implementing the
      interrupt pipeline core, a lightweight layer introducing a separate,
      high-priority execution stage for handling all IRQs in pseudo-NMI
      mode, which cannot be delayed by the regular kernel code. See
      Documentation/ipipe.rst for details about interrupt pipelining.
      
      Architectures which support interrupt pipelining should select
      HAVE_IPIPE_SUPPORT, along with implementing the required arch-specific
      code. In such a case, CONFIG_IPIPE becomes available to the user via
      the Kconfig interface for enabling the feature.
      b4d5ff3f
    • Greg Kroah-Hartman's avatar
      Linux 4.19.33 · 4b3a3ab0
      Greg Kroah-Hartman authored
      4b3a3ab0
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Remove the old connections for the muxes · 11008a9b
      Heikki Krogerus authored
      commit 148b0aa78e4e1077e38f928124bbc9c2d2d24006 upstream.
      
      USB Type-C class driver now expects the muxes to be always
      assigned to the ports and not controllers, so the
      connections for the mux and fusb302 can be removed.
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      11008a9b
    • Heikki Krogerus's avatar
      usb: typec: class: Don't use port parent for getting mux handles · 056cda45
      Heikki Krogerus authored
      commit 23481121c81d984193edf1532f5e123637e50903 upstream.
      
      It is not possible to use the parent of the port device when
      requesting mux handles as the parent may be a multiport USB
      Type-C or PD controller. The muxes must be assigned to the
      ports, not the controllers.
      
      This will also move the requesting of the muxes after the
      port device is initialized.
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      056cda45
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Add connections for the USB Type-C port · 6875404a
      Heikki Krogerus authored
      commit 495965a1002a0b301bf4fbfd1aed3233f3e7db1b upstream.
      
      Assigning the mux to the USB Type-C port on top of fusb302.
      That will prepare this driver for the change in the USB
      Type-C class code, where the class driver will assume the
      muxes to be always assigned to the ports and not the
      controllers.
      
      Once the USB Type-C class driver has been updated, the
      connections between the mux and fusb302 can be dropped.
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6875404a
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Add connection for the DP alt mode · 681a9fc1
      Heikki Krogerus authored
      commit 78d2b54b134ea6059e2b1554ad53fab2300a4cc6 upstream.
      
      Adding a connection for the DisplayPort alternate mode.
      PI3USB30532 is used for muxing the port to DisplayPort on
      CHT platforms. The connection allows the alternate mode
      device to get handle to the mux, and therefore make it
      possible to use the USB Type-C connector as DisplayPort.
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      681a9fc1
    • Heikki Krogerus's avatar
      platform: x86: intel_cht_int33fe: Register all connections at once · 3bb446a3
      Heikki Krogerus authored
      commit 140a4ec4adddda615b4e8e8055ca37a30c7fe5e8 upstream.
      
      We can register all device connection descriptors with a
      single call to device_connections_add().
      Acked-by: default avatarAndy Shevchenko <andy.shevchenko@gmail.com>
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bb446a3
    • Heikki Krogerus's avatar
      drivers: base: Helpers for adding device connection descriptions · e99d90ce
      Heikki Krogerus authored
      commit cd7753d371388e712e3ee52b693459f9b71aaac2 upstream.
      
      Introducing helpers for adding and removing multiple device
      connection descriptions at once.
      Acked-by: default avatarHans de Goede <hdegoede@redhat.com>
      Tested-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e99d90ce
    • Xu Yu's avatar
      bpf: do not restore dst_reg when cur_state is freed · f5959dec
      Xu Yu authored
      commit 0803278b0b4d8eeb2b461fb698785df65a725d9e upstream.
      
      Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      
      Call trace:
      
        dump_stack+0xbf/0x12e
        print_address_description+0x6a/0x280
        kasan_report+0x237/0x360
        sanitize_ptr_alu+0x85a/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fault injection trace:
      
        kfree+0xea/0x290
        free_func_state+0x4a/0x60
        free_verifier_state+0x61/0xe0
        push_stack+0x216/0x2f0	          <- inject failslab
        sanitize_ptr_alu+0x2b1/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When kzalloc() fails in push_stack(), free_verifier_state() will free
      current verifier state. As push_stack() returns, dst_reg was restored
      if ptr_is_dst_reg is false. However, as member of the cur_state,
      dst_reg is also freed, and error occurs when dereferencing dst_reg.
      Simply fix it by testing ret of push_stack() before restoring dst_reg.
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: default avatarXu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5959dec
    • Gao Xiang's avatar
      staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir() · 738dda85
      Gao Xiang authored
      commit 33bac912840fe64dbc15556302537dc6a17cac63 upstream.
      
      After commit 419d6efc50e9, kernel cannot be crashed in the namei
      path. However, corrupted nameoff can do harm in the process of
      readdir for scenerios without dm-verity as well. Fix it now.
      
      Fixes: 3aa8ec71 ("staging: erofs: add directory operations")
      Cc: <stable@vger.kernel.org> # 4.19+
      Signed-off-by: default avatarGao Xiang <gaoxiang25@huawei.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      738dda85
    • Gao Xiang's avatar
      staging: erofs: fix error handling when failed to read compresssed data · 83bbd66b
      Gao Xiang authored
      commit b6391ac73400eff38377a4a7364bd3df5efb5178 upstream.
      
      Complete read error handling paths for all three kinds of
      compressed pages:
      
       1) For cache-managed pages, PG_uptodate will be checked since
          read_endio will unlock and SetPageUptodate for these pages;
      
       2) For inplaced pages, read_endio cannot SetPageUptodate directly
          since it should be used to mark the final decompressed data,
          PG_error will be set with page locked for IO error instead;
      
       3) For staging pages, PG_error is used, which is similar to
          what we do for inplaced pages.
      
      Fixes: 3883a79a ("staging: erofs: introduce VLE decompression support")
      Cc: <stable@vger.kernel.org> # 4.19+
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarGao Xiang <gaoxiang25@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      83bbd66b
    • Sean Christopherson's avatar
      KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 3a18eaba
      Sean Christopherson authored
      commit 0cf9135b773bf32fba9dd8e6699c1b331ee4b749 upstream.
      
      The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarXiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a18eaba
    • Sean Christopherson's avatar
      KVM: x86: update %rip after emulating IO · b9733a74
      Sean Christopherson authored
      commit 45def77ebf79e2e8942b89ed79294d97ce914fa0 upstream.
      
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks becase re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to executing to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single step exits to userspace for are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcenment suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9733a74
    • Sean Christopherson's avatar
      KVM: Reject device ioctls from processes other than the VM's creator · 7ceedcef
      Sean Christopherson authored
      commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream.
      
      KVM's API requires thats ioctls must be issued from the same process
      that created the VM.  In other words, userspace can play games with a
      VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
      creator can do anything useful.  Explicitly reject device ioctls that
      are issued by a process other than the VM's creator, and update KVM's
      API documentation to extend its requirements to device ioctls.
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ceedcef
    • Thomas Gleixner's avatar
      x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y · a0713e81
      Thomas Gleixner authored
      commit bebd024e4815b1a170fcd21ead9c2222b23ce9e6 upstream.
      
      The SMT disable 'nosmt' command line argument is not working properly when
      CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs which are
      required to be brought up due to the MCE issues, cannot work. The CPUs are
      then kept in a half dead state.
      
      As the 'nosmt' functionality has become popular due to the speculative
      hardware vulnerabilities, the half torn down state is not a proper solution
      to the problem.
      
      Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
      possible.
      Reported-by: default avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0713e81