1. 05 Apr, 2019 3 commits
  2. 23 Feb, 2019 1 commit
    • KVM: VMX: Fix x2apic check in vmx_msr_bitmap_mode() · 49e1a9d1
      Joerg Roedel authored
      The stable backport of upstream commit
      
      	904e14fb KVM: VMX: make MSR bitmaps per-VCPU
      
      has a bug in vmx_msr_bitmap_mode(). It enables the x2apic
      MSR-bitmap when the kernel emulates x2apic for the guest in
      software. The upstream version of the commit checks whether
      the hardware has virtualization enabled for x2apic
      emulation.
      
      Since KVM emulates x2apic for guests even when the host does
      not support x2apic in hardware, this causes the intercept of
      at least the X2APIC_TASKPRI MSR to be disabled on machines
      not supporting that MSR. The result is undefined behavior;
      on some machines (Intel Westmere based) it causes a crash of
      the guest kernel when it tries to access that MSR.
      
      Change the check in vmx_msr_bitmap_mode() to match the upstream
      code. This fixes the guest crashes observed with stable
      kernels starting with v4.4.168 through v4.4.175.

      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
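The difference between the two checks can be sketched with a small model. This is only an illustration under stated assumptions: `struct vcpu_model`, its fields, the flag values, and both function names are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, chosen for illustration only. */
#define MODE_X2APIC        (1u << 0)
#define MODE_X2APIC_APICV  (1u << 1)

/* Hypothetical stand-in for the state the check consults. */
struct vcpu_model {
    bool guest_uses_x2apic;      /* guest runs in x2apic mode, possibly emulated in software */
    bool hw_virt_x2apic_enabled; /* hardware x2apic virtualization is enabled for this vCPU */
    bool apicv_active;
};

/* Buggy backport: keys off the (possibly software-emulated) guest x2apic mode. */
static uint8_t bitmap_mode_buggy(const struct vcpu_model *v)
{
    uint8_t mode = 0;

    assert(v);
    if (v->guest_uses_x2apic) {
        mode |= MODE_X2APIC;
        if (v->apicv_active)
            mode |= MODE_X2APIC_APICV;
    }
    return mode;
}

/* Fixed check: keys off whether the hardware virtualizes x2apic. */
static uint8_t bitmap_mode_fixed(const struct vcpu_model *v)
{
    uint8_t mode = 0;

    assert(v);
    if (v->hw_virt_x2apic_enabled) {
        mode |= MODE_X2APIC;
        if (v->apicv_active)
            mode |= MODE_X2APIC_APICV;
    }
    return mode;
}
```

On a host without hardware x2apic support, the buggy variant still selects the x2apic bitmap (and so drops the MSR intercepts), while the fixed variant leaves it off.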
  3. 20 Feb, 2019 1 commit
  4. 13 Jan, 2019 1 commit
  5. 17 Dec, 2018 13 commits
  6. 06 Aug, 2018 1 commit
    • kvm: x86: vmx: fix vpid leak · 314b4655
      Roman Kagan authored
      commit 63aff65573d73eb8dda4732ad4ef222dd35e4862 upstream.
      
      VPID for the nested vcpu is allocated at vmx_create_vcpu whenever nested
      vmx is turned on with the module parameter.
      
      However, it's only freed if the L1 guest has executed VMXON which is not
      a given.
      
      As a result, on a system with nested==on every creation+deletion of an
      L1 vcpu without running an L2 guest results in leaking one vpid.  Since
      the total number of vpids is limited to 64k, they can eventually get
      exhausted, preventing L2 from starting.
      
      Delay allocation of the L2 vpid until VMXON emulation, thus matching its
      freeing.
      
      Fixes: 5c614b35
      Cc: stable@vger.kernel.org
      Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
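The pairing of allocation and freeing described above can be modelled in a few lines. This is a minimal sketch, not the kernel code: the tiny pool size, `struct vcpu_model`, and all function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VPIDS 8 /* tiny pool for illustration; the message cites a 64k limit */

static bool vpid_in_use[NR_VPIDS];

/* Returns a free vpid, or 0 when the pool is exhausted (vpid 0 is never handed out). */
static int allocate_vpid(void)
{
    for (int i = 1; i < NR_VPIDS; i++) {
        if (!vpid_in_use[i]) {
            vpid_in_use[i] = true;
            return i;
        }
    }
    return 0;
}

static void free_vpid(int vpid)
{
    assert(vpid >= 0 && vpid < NR_VPIDS);
    if (vpid != 0)
        vpid_in_use[vpid] = false;
}

struct vcpu_model {
    bool vmxon;  /* L1 has executed VMXON */
    int vpid02;  /* vpid reserved for L2 */
};

/* After the fix: vCPU creation no longer reserves an L2 vpid. */
static void create_vcpu(struct vcpu_model *v)
{
    v->vmxon = false;
    v->vpid02 = 0;
}

/* Allocation now happens here, mirroring where the vpid is freed. */
static void emulate_vmxon(struct vcpu_model *v)
{
    if (v->vmxon)
        return;
    v->vpid02 = allocate_vpid();
    v->vmxon = true;
}

/* Freeing only ever ran when VMXON had been executed. */
static void free_nested(struct vcpu_model *v)
{
    if (!v->vmxon)
        return;
    free_vpid(v->vpid02);
    v->vpid02 = 0;
    v->vmxon = false;
}

static int vpids_in_use(void)
{
    int n = 0;

    for (int i = 1; i < NR_VPIDS; i++)
        n += vpid_in_use[i];
    return n;
}
```

With allocation moved into the VMXON path, creating and destroying a vCPU that never runs L2 leaves the pool untouched.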
  7. 25 Jul, 2018 1 commit
  8. 16 Jun, 2018 1 commit
  9. 30 May, 2018 1 commit
    • KVM: VMX: raise internal error for exception during invalid protected mode state · ffd0502d
      Sean Christopherson authored
      [ Upstream commit add5ff7a216ee545a214013f26d1ef2f44a9c9f8 ]
      
      Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter
      an exception in Protected Mode while emulating guest due to invalid
      guest state.  Unlike Big RM, KVM doesn't support emulating exceptions
      in PM, i.e. PM exceptions are always injected via the VMCS.  Because
      we will never do VMRESUME due to emulation_required, the exception is
      never realized and we'll keep emulating the faulting instruction over
      and over until we receive a signal.
      
      Exit to userspace iff there is a pending exception, i.e. don't exit
      simply on a requested event. The purpose of this check and exit is to
      aid in debugging a guest that is in all likelihood already doomed.
      Invalid guest state in PM is extremely limited in normal operation,
      e.g. it generally only occurs for a few instructions early in BIOS,
      and any exception at this time is all but guaranteed to be fatal.
      Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly
      handled/emulated, while checking for vectored interrupts, e.g. INTR
      and NMI, without hitting false positives would add a fair amount of
      complexity for almost no benefit (getting hit by lightning seems
      more likely than encountering this specific scenario).
      
      Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an
      exception via the VMCS and emulation_required is true.

      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
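The exit condition described in this message can be written as a small decision function. This is a sketch under assumptions: the struct, its fields, and the enum are hypothetical names used only to illustrate "exit iff an exception is pending while emulating invalid protected-mode state".

```c
#include <assert.h>
#include <stdbool.h>

enum emu_result { EMU_CONTINUE, EMU_EXIT_INTERNAL_ERROR };

/* Hypothetical snapshot of the state consulted after emulating one instruction. */
struct vcpu_model {
    bool emulation_required;  /* guest state is invalid, so KVM must emulate */
    bool real_mode_emulation; /* "Big RM": exceptions can be emulated here */
    bool exception_pending;
    bool event_requested;     /* some event was requested; not sufficient to exit */
};

static enum emu_result check_after_emulation(const struct vcpu_model *v)
{
    assert(v);
    /* Only a pending exception in protected mode can never be delivered. */
    if (v->emulation_required && !v->real_mode_emulation && v->exception_pending)
        return EMU_EXIT_INTERNAL_ERROR;
    return EMU_CONTINUE;
}
```

A requested event alone keeps the emulation loop running; only a pending exception that cannot be injected ends it.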
  10. 13 Apr, 2018 1 commit
  11. 28 Mar, 2018 1 commit
    • kvm/x86: fix icebp instruction handling · 5e4e65a9
      Linus Torvalds authored
      commit 32d43cd391bacb5f0814c2624399a5dad3501d09 upstream.
      
      The undocumented 'icebp' instruction (aka 'int1') works pretty much like
      'int3' in the absence of in-circuit probing equipment (except,
      obviously, that it raises #DB instead of raising #BP), and is used by
      some validation test-suites as such.
      
      But Andy Lutomirski noticed that his test suite acted differently in kvm
      than on bare hardware.
      
      The reason is that kvm used an inexact test for the icebp instruction:
      it just assumed that an all-zero VM exit qualification value meant that
      the VM exit was due to icebp.
      
      That is not unlike the guess that do_debug() does for the actual
      exception handling case, but it's purely a heuristic, not an absolute
      rule.  do_debug() does it because it wants to ascribe _some_ reasons to
      the #DB that happened, and an empty %dr6 value means that 'icebp' is the
      most likely cause and we have no better information.
      
      But kvm can just do it right, because unlike the do_debug() case, kvm
      actually sees the real reason for the #DB in the VM-exit interruption
      information field.
      
      So instead of relying on an inexact heuristic, just use the actual VM
      exit information that says "it was 'icebp'".
      
      Right now the 'icebp' instruction isn't technically documented by Intel,
      but that will hopefully change.  The special "privileged software
      exception" information _is_ actually mentioned in the Intel SDM, even
      though the cause of it isn't enumerated.

      Reported-by: Andy Lutomirski <luto@kernel.org>
      Tested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
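The contrast between the heuristic and the exact test can be sketched as follows. The bit layout (vector in bits 7:0, interruption type in bits 10:8, valid bit 31, type 5 for a privileged software exception) is my reading of the Intel SDM; the macro and function names are illustrative assumptions, not a quotation of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the VM-exit interruption information field. */
#define INTR_INFO_VECTOR_MASK        0xffu
#define INTR_INFO_INTR_TYPE_MASK     0x700u
#define INTR_INFO_VALID_MASK         0x80000000u
#define INTR_TYPE_HARD_EXCEPTION     (3u << 8)
#define INTR_TYPE_PRIV_SW_EXCEPTION  (5u << 8)
#define DB_VECTOR                    1u

/* Helper for building test values; hypothetical, for illustration only. */
static uint32_t make_intr_info(uint32_t type, uint32_t vector)
{
    assert((type & ~INTR_INFO_INTR_TYPE_MASK) == 0);
    assert((vector & ~INTR_INFO_VECTOR_MASK) == 0);
    return INTR_INFO_VALID_MASK | type | vector;
}

/* Heuristic: an all-zero exit qualification is taken to mean icebp. */
static bool guess_icebp_from_qualification(uint64_t exit_qualification)
{
    return exit_qualification == 0;
}

/* Exact test: the interruption type itself says it was icebp. */
static bool is_icebp(uint32_t intr_info)
{
    return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) ==
           (INTR_TYPE_PRIV_SW_EXCEPTION | INTR_INFO_VALID_MASK);
}
```

A hardware #DB that happens to arrive with an empty exit qualification satisfies the heuristic but not the exact test, which is the kind of mismatch the message describes.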
  12. 25 Feb, 2018 6 commits
  13. 16 Feb, 2018 1 commit
    • KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2 · 0e524b26
      Liran Alon authored
      commit 6b697711 upstream.
      
      Consider the following scenario:
      1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI
      to CPU B via virtual posted-interrupt mechanism.
      2. CPU B is currently executing L2 guest.
      3. vmx_deliver_nested_posted_interrupt() calls
      kvm_vcpu_trigger_posted_interrupt() which will note that
      vcpu->mode == IN_GUEST_MODE.
      4. Assume that before CPU A sends the physical POSTED_INTR_NESTED_VECTOR
      IPI, CPU B exits from L2 to L0 during event-delivery
      (valid IDT-vectoring-info).
      5. CPU A now sends the physical IPI. The IPI is received in host and
      its handler (smp_kvm_posted_intr_nested_ipi()) does nothing.
      6. Assume that before CPU A sets pi_pending=true and KVM_REQ_EVENT,
      CPU B continues to run in L0 and reaches vcpu_enter_guest(). As
      KVM_REQ_EVENT is not set yet, vcpu_enter_guest() will continue and resume
      L2 guest.
      7. At this point, CPU A sets pi_pending=true and KVM_REQ_EVENT but
      it's too late! CPU B already entered L2 and KVM_REQ_EVENT will only be
      consumed at next L2 entry!
      
      Another scenario to consider:
      1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI
      to CPU B via virtual posted-interrupt mechanism.
      2. Assume that before CPU A calls kvm_vcpu_trigger_posted_interrupt(),
      CPU B is at L0 and is about to resume into L2. Further assume that it is
      in vcpu_enter_guest() after check for KVM_REQ_EVENT.
      3. At this point, CPU A calls kvm_vcpu_trigger_posted_interrupt(), which
      will note that vcpu->mode != IN_GUEST_MODE and therefore do nothing and
      return false. CPU A then sets pi_pending=true and KVM_REQ_EVENT.
      4. Now CPU B continues and resumes into the L2 guest without processing
      the posted-interrupt until the next L2 entry!
      
      To fix both issues, we just need to change
      vmx_deliver_nested_posted_interrupt() to set pi_pending=true and
      KVM_REQ_EVENT before calling kvm_vcpu_trigger_posted_interrupt().
      
      It will fix the first scenario by changing step (6) so that CPU B notes
      that KVM_REQ_EVENT is set and pi_pending=true, and therefore processes
      the nested posted-interrupt.
      
      It will fix the second scenario in one of two ways:
      1. If kvm_vcpu_trigger_posted_interrupt() is called after CPU B has changed
      vcpu->mode to IN_GUEST_MODE, the physical IPI will be sent and will be received
      when CPU B resumes into L2.
      2. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B hasn't yet
      changed vcpu->mode to IN_GUEST_MODE, then after CPU B changes
      vcpu->mode it will call kvm_request_pending(), which will return true and
      therefore force another round of vcpu_enter_guest(), which will note that
      KVM_REQ_EVENT is set and pi_pending=true and therefore process the nested
      posted-interrupt.
      
      Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing")
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      [Add kvm_vcpu_kick to also handle the case where L1 doesn't intercept L2 HLT
       and L2 executes HLT instruction. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
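The ordering that closes both races can be sketched with a single-threaded model. This is only an illustration: the struct and function names are hypothetical, the flags stand in for pi_pending and KVM_REQ_EVENT, and the memory barriers a real multi-CPU implementation would need are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

struct vcpu_model {
    enum vcpu_mode mode;
    bool pi_pending;
    bool req_event; /* stands in for KVM_REQ_EVENT */
    int ipis_sent;  /* physical IPIs sent to the target CPU */
    int kicks;      /* fallback wake-ups when no IPI was sent */
};

/* Sends the physical IPI only if the target is currently in guest mode. */
static bool trigger_posted_interrupt(struct vcpu_model *v)
{
    if (v->mode == IN_GUEST_MODE) {
        v->ipis_sent++;
        return true;
    }
    return false;
}

/* Sender side after the fix: publish the pending state first, then trigger. */
static void deliver_nested_pi(struct vcpu_model *v)
{
    assert(v);
    v->pi_pending = true;
    v->req_event = true;
    if (!trigger_posted_interrupt(v))
        v->kicks++;
}

/*
 * Target side: switch to guest mode, then re-check for pending requests.
 * Returns true if the entry must be aborted and retried.
 */
static bool enter_guest_must_retry(struct vcpu_model *v)
{
    v->mode = IN_GUEST_MODE;
    if (v->req_event) {
        v->mode = OUTSIDE_GUEST_MODE;
        return true;
    }
    return false;
}
```

Because the flags are set before the mode is inspected, the target either receives the IPI while in guest mode or sees the pending request when it re-checks on entry.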
  14. 03 Feb, 2018 2 commits
    • KVM: VMX: Fix rflags cache during vCPU reset · fe7832ff
      Wanpeng Li authored
      
      [ Upstream commit c37c2873 ]
      
      Reported by syzkaller:
      
         *** Guest State ***
         CR0: actual=0x0000000080010031, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
         CR4: actual=0x0000000000002061, shadow=0x0000000000000000, gh_mask=ffffffffffffe8f1
         CR3 = 0x000000002081e000
         RSP = 0x000000000000fffa  RIP = 0x0000000000000000
         RFLAGS=0x00023000         DR7 = 0x00000000000000
                ^^^^^^^^^^
         ------------[ cut here ]------------
         WARNING: CPU: 6 PID: 24431 at /home/kernel/linux/arch/x86/kvm//x86.c:7302 kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
         CPU: 6 PID: 24431 Comm: reprotest Tainted: G        W  OE   4.14.0+ #26
         RIP: 0010:kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
         RSP: 0018:ffff880291d179e0 EFLAGS: 00010202
         Call Trace:
          kvm_vcpu_ioctl+0x479/0x880 [kvm]
          do_vfs_ioctl+0x142/0x9a0
          SyS_ioctl+0x74/0x80
          entry_SYSCALL_64_fastpath+0x23/0x9a
      
      The failed vmentry is triggered by the following beautified testcase:
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <string.h>
          #include <stdint.h>
          #include <linux/kvm.h>
          #include <fcntl.h>
          #include <sys/ioctl.h>
      
          long r[5];
          int main()
          {
              struct kvm_debugregs dr = { 0 };
      
              r[2] = open("/dev/kvm", O_RDONLY);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
              r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
              struct kvm_guest_debug debug = {
                      .control = 0xf0403,
                      .arch = {
                              .debugreg[6] = 0x2,
                              .debugreg[7] = 0x2
                      }
              };
              ioctl(r[4], KVM_SET_GUEST_DEBUG, &debug);
              ioctl(r[4], KVM_RUN, 0);
          }
      
      This testcase tries to set up the processor-specific debug
      registers and configure the vCPU for handling guest debug events through
      KVM_SET_GUEST_DEBUG.  The KVM_SET_GUEST_DEBUG ioctl gets and sets
      rflags in order to set the TF bit if single-stepping is needed. During
      vCPU reset, all register caches are marked avail and the GUEST_RFLAGS
      vmcs field is reset to 0x2. However, the rflags cache itself is not
      reset. Because the cache is marked avail, vmx_get_rflags() returns the
      stale cached value, which is 0 after boot. Vmentry fails if the
      rflags reserved bit 1 is 0.
      
      This patch fixes it by resetting both the GUEST_RFLAGS vmcs field and
      its cache to 0x2 during vCPU reset.

      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
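The stale-cache problem can be reproduced in a toy model of a cached register. This is a sketch under assumptions: `struct vcpu_model` and the function names are hypothetical and only mimic the "cache marked avail but never rewritten" behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RFLAGS_FIXED_BIT1 0x2ull /* reserved bit 1, must be set for a valid vmentry */

struct vcpu_model {
    uint64_t vmcs_guest_rflags; /* stands in for the GUEST_RFLAGS vmcs field */
    uint64_t rflags_cache;
    bool rflags_cache_avail;
};

/* Reads through the cache; the vmcs field is consulted only on a miss. */
static uint64_t get_rflags(struct vcpu_model *v)
{
    if (!v->rflags_cache_avail) {
        v->rflags_cache = v->vmcs_guest_rflags;
        v->rflags_cache_avail = true;
    }
    return v->rflags_cache;
}

/* Writes both the cache and the vmcs field so they cannot diverge. */
static void set_rflags(struct vcpu_model *v, uint64_t val)
{
    assert(val & RFLAGS_FIXED_BIT1); /* the model only accepts enterable values */
    v->rflags_cache = val;
    v->rflags_cache_avail = true;
    v->vmcs_guest_rflags = val;
}

/* Buggy reset: cache is marked avail, but only the vmcs field is rewritten. */
static void vcpu_reset_buggy(struct vcpu_model *v)
{
    v->rflags_cache_avail = true;
    v->vmcs_guest_rflags = RFLAGS_FIXED_BIT1;
}

/* Fixed reset: field and cache are reset together. */
static void vcpu_reset_fixed(struct vcpu_model *v)
{
    set_rflags(v, RFLAGS_FIXED_BIT1);
}
```

After the buggy reset a freshly zeroed vCPU reports rflags of 0 from the cache even though the vmcs field holds 0x2; after the fixed reset both agree.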
    • KVM: x86: Don't re-execute instruction when not passing CR2 value · 80d2b5af
      Liran Alon authored
      
      [ Upstream commit 9b8ae637 ]
      
      In case of instruction-decode failure or emulation failure,
      x86_emulate_instruction() will call reexecute_instruction() which will
      attempt to use the cr2 value passed to x86_emulate_instruction().
      However, when x86_emulate_instruction() is called from
      emulate_instruction(), cr2 is not passed (passed as 0) and therefore
      it doesn't make sense to execute reexecute_instruction() logic at all.
      
      Fixes: 51d8b661 ("KVM: cleanup emulate_instruction")
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  15. 23 Jan, 2018 1 commit
  16. 17 Jan, 2018 2 commits
  17. 25 Dec, 2017 1 commit
    • KVM: VMX: Fix enable VPID conditions · f1fdf68b
      Wanpeng Li authored
      
      [ Upstream commit 08d839c4 ]
      
      This can be reproduced by running L2 on L1 with VPID disabled on L0.
      Without commit "KVM: nVMX: Fix nested VPID vmx exec control", L2
      crashes as below:
      
      KVM: entry failed, hardware error 0x7
      EAX=00000000 EBX=00000000 ECX=00000000 EDX=000306c3
      ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
      EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =0000 00000000 0000ffff 00009300
      CS =f000 ffff0000 0000ffff 00009b00
      SS =0000 00000000 0000ffff 00009300
      DS =0000 00000000 0000ffff 00009300
      FS =0000 00000000 0000ffff 00009300
      GS =0000 00000000 0000ffff 00009300
      LDT=0000 00000000 0000ffff 00008200
      TR =0000 00000000 0000ffff 00008b00
      GDT=     00000000 0000ffff
      IDT=     00000000 0000ffff
      CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
      DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
      DR6=00000000ffff0ff0 DR7=0000000000000400
      EFER=0000000000000000
      
      Reference SDM 30.3 INVVPID:
      
      Protected Mode Exceptions
      - #UD
        - If not in VMX operation.
        - If the logical processor does not support VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=0).
        - If the logical processor supports VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=1) but does
          not support the INVVPID instruction (IA32_VMX_EPT_VPID_CAP[32]=0).
      
      So we should check both the VPID enable bit in the vmx exec control and the INVVPID
      support bit in the vmx capability MSRs before enabling VPID. This patch guarantees
      that VPID is not enabled if either INVVPID or single-context/all-context
      invalidation is not exposed in the vmx capability MSRs.

      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
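The combined capability check can be sketched as a pure function over the two MSR values. Bits 37 and 32 are the ones quoted from the SDM above; bits 41 and 42 for single-context and all-context invalidation are my reading of the IA32_VMX_EPT_VPID_CAP layout. The struct, macro, and function names are hypothetical, and requiring at least one of the two invalidation types is an interpretation of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROCBASED_CTLS2_ENABLE_VPID  (1ull << 37) /* VPID allowed in secondary controls */
#define EPT_VPID_CAP_INVVPID         (1ull << 32) /* INVVPID instruction supported */
#define EPT_VPID_CAP_INVVPID_SINGLE  (1ull << 41) /* assumed: single-context invalidation */
#define EPT_VPID_CAP_INVVPID_ALL     (1ull << 42) /* assumed: all-context invalidation */

/* Hypothetical container for the two capability MSR values. */
struct vmx_caps_model {
    uint64_t procbased_ctls2; /* IA32_VMX_PROCBASED_CTLS2 */
    uint64_t ept_vpid_cap;    /* IA32_VMX_EPT_VPID_CAP */
};

/* VPID is usable only if it can be enabled and its TLB entries can be invalidated. */
static bool can_enable_vpid(const struct vmx_caps_model *caps)
{
    assert(caps);
    if (!(caps->procbased_ctls2 & PROCBASED_CTLS2_ENABLE_VPID))
        return false;
    if (!(caps->ept_vpid_cap & EPT_VPID_CAP_INVVPID))
        return false;
    return (caps->ept_vpid_cap &
            (EPT_VPID_CAP_INVVPID_SINGLE | EPT_VPID_CAP_INVVPID_ALL)) != 0;
}
```

A configuration that advertises the VPID control but hides INVVPID, as a nested L1 may see it, is rejected instead of leading to a #UD on the first invalidation.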
  18. 16 Dec, 2017 2 commits
    • KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset · ccf72fe2
      Wanpeng Li authored
      
      [ Upstream commit 2f707d97 ]
      
      Reported by syzkaller:
      
          WARNING: CPU: 1 PID: 27742 at arch/x86/kvm/vmx.c:11029
          nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029
          CPU: 1 PID: 27742 Comm: a.out Not tainted 4.10.0+ #229
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:15 [inline]
           dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
           panic+0x1fb/0x412 kernel/panic.c:179
           __warn+0x1c4/0x1e0 kernel/panic.c:540
           warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
           nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029
           vmx_leave_nested arch/x86/kvm/vmx.c:11136 [inline]
           vmx_set_msr+0x1565/0x1910 arch/x86/kvm/vmx.c:3324
           kvm_set_msr+0xd4/0x170 arch/x86/kvm/x86.c:1099
           do_set_msr+0x11e/0x190 arch/x86/kvm/x86.c:1128
           __msr_io arch/x86/kvm/x86.c:2577 [inline]
           msr_io+0x24b/0x450 arch/x86/kvm/x86.c:2614
           kvm_arch_vcpu_ioctl+0x35b/0x46a0 arch/x86/kvm/x86.c:3497
           kvm_vcpu_ioctl+0x232/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2721
           vfs_ioctl fs/ioctl.c:43 [inline]
           do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683
           SYSC_ioctl fs/ioctl.c:698 [inline]
           SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
           entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      The syzkaller folks reported a nested_run_pending warning when userspace
      clears a VMX capability that was previously exposed to L1.
      
      The warning gets thrown while doing
      
      (*(uint32_t*)0x20aecfe8 = (uint32_t)0x1);
      (*(uint32_t*)0x20aecfec = (uint32_t)0x0);
      (*(uint32_t*)0x20aecff0 = (uint32_t)0x3a);
      (*(uint32_t*)0x20aecff4 = (uint32_t)0x0);
      (*(uint64_t*)0x20aecff8 = (uint64_t)0x0);
      r[29] = syscall(__NR_ioctl, r[4], 0x4008ae89ul,
      		0x20aecfe8ul, 0, 0, 0, 0, 0, 0);
      
      i.e. KVM_SET_MSRS ioctl with
      
      struct kvm_msrs {
      	.nmsrs = 1,
      		.pad = 0,
      		.entries = {
      			{.index = MSR_IA32_FEATURE_CONTROL,
      			 .reserved = 0,
      			 .data = 0}
      		}
      }
      
      The VMLAUNCH/VMRESUME emulation should be stopped since the CPU is going to
      be reset here. This patch resets nested_run_pending, since the CPU is going
      to be reset and hence there should be nothing pending.

      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kvm: nVMX: VMCLEAR should not cause the vCPU to shut down · f9b291ae
      Jim Mattson authored
      
      [ Upstream commit 587d7e72 ]
      
      VMCLEAR should silently ignore a failure to clear the launch state of
      the VMCS referenced by the operand.

      Signed-off-by: Jim Mattson <jmattson@google.com>
      [Changed "kvm_write_guest(vcpu->kvm" to "kvm_vcpu_write_guest(vcpu".]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>