Commit 15303ba5 authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'kvm-4.16-1' of git://

Pull KVM updates from Radim Krčmář:

   - icache invalidation optimizations, improving VM startup time

   - support for forwarded level-triggered interrupts, improving
     performance for timers and passthrough platform devices

   - a small fix for power-management notifiers, and some cosmetic


   - add MMIO emulation for vector loads and stores

   - allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
     requiring the complex thread synchronization of older CPU versions

   - improve the handling of escalation interrupts with the XIVE
     interrupt controller

   - support decrement register migration

   - various cleanups and bugfixes.


   - Cornelia Huck passed maintainership to Janosch Frank

   - exitless interrupts for emulated devices

   - cleanup of cpuflag handling

   - kvm_stat counter improvements

   - VSIE improvements

   - mm cleanup


   - hypervisor part of SEV

   - UMIP, RDPID, and MSR_SMI_COUNT emulation

   - paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit

   - allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
     AVX512 features

   - show vcpu id in its anonymous inode name

   - many fixes and cleanups

   - per-VCPU MSR bitmaps (already merged through x86/pti branch)

   - stable KVM clock when nesting on Hyper-V (merged through

* tag 'kvm-4.16-1' of git:// (197 commits)
  KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
  KVM: PPC: Book3S HV: Branch inside feature section
  KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
  KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
  KVM: PPC: Book3S PR: Fix broken select due to misspelling
  KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
  KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
  KVM: PPC: Book3S HV: Drop locks before reading guest memory
  kvm: x86: remove efer_reload entry in kvm_vcpu_stat
  KVM: x86: AMD Processor Topology Information
  x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
  kvm: embed vcpu id to dentry of vcpu anon inode
  kvm: Map PFN-type memory regions as writable (if possible)
  x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
  KVM: arm/arm64: Fixup userspace irqchip static key optimization
  KVM: arm/arm64: Fix userspace_irqchip_in_use counting
  KVM: arm/arm64: Fix incorrect timer_is_pending logic
  MAINTAINERS: update KVM/s390 maintainers
  MAINTAINERS: add Halil as additional vfio-ccw maintainer
  MAINTAINERS: add David as a reviewer for KVM/s390
parents 9a61df9e 1ab03c07
......@@ -26,3 +26,6 @@ s390-diag.txt
- Diagnose hypercall description (for IBM S/390)
- timekeeping virtualization for x86-based architectures.
- notes on AMD Secure Encrypted Virtualization feature and SEV firmware
command description
Secure Encrypted Virtualization (SEV)
Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
SEV is an extension to the AMD-V architecture which supports running
virtual machines (VMs) under the control of a hypervisor. When enabled,
the memory contents of a VM will be transparently encrypted with a key
unique to that VM.
The hypervisor can determine the SEV support through the CPUID
instruction. The CPUID function 0x8000001f reports information related
to SEV::
Bit[1] indicates support for SEV
Bits[31:0] Number of encrypted guests supported simultaneously
If support for SEV is present, MSR 0xc001_0010 (MSR_K8_SYSCFG) and MSR 0xc001_0015
(MSR_K7_HWCR) can be used to determine if it can be enabled::
Bit[23] 1 = memory encryption can be enabled
0 = memory encryption can not be enabled
Bit[0] 1 = memory encryption can be enabled
0 = memory encryption can not be enabled
When SEV support is available, it can be enabled in a specific VM by
setting the SEV bit before executing VMRUN.::
Bit[1] 1 = SEV is enabled
0 = SEV is disabled
SEV hardware uses ASIDs to associate a memory encryption key with a VM.
Hence, the ASID for the SEV-enabled guests must be from 1 to a maximum value
defined in the CPUID 0x8000001f[ecx] field.
SEV Key Management
The SEV guest key management is handled by a separate processor called the AMD
Secure Processor (AMD-SP). Firmware running inside the AMD-SP provides a secure
key management interface to perform common hypervisor activities such as
encrypting bootstrap code, snapshot, migrating and debugging the guest. For more
information, see the SEV Key Management spec [api-spec]_
KVM implements the following commands to support common lifecycle events of SEV
guests, such as launching, running, snapshotting, migrating and decommissioning.
The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform
context. In a typical workflow, this command should be the first command issued.
Returns: 0 on success, -negative on error
The KVM_SEV_LAUNCH_START command is used for creating the memory encryption
context. To create the encryption context, user must provide a guest policy,
the owner's public Diffie-Hellman (PDH) key and session information.
Parameters: struct kvm_sev_launch_start (in/out)
Returns: 0 on success, -negative on error
struct kvm_sev_launch_start {
__u32 handle; /* if zero then firmware creates a new handle */
__u32 policy; /* guest's policy */
__u64 dh_uaddr; /* userspace address pointing to the guest owner's PDH key */
__u32 dh_len;
__u64 session_addr; /* userspace address which points to the guest session information */
__u32 session_len;
On success, the 'handle' field contains a new handle and on error, a negative value.
For more details, see SEV spec Section 6.2.
The KVM_SEV_LAUNCH_UPDATE_DATA is used for encrypting a memory region. It also
calculates a measurement of the memory contents. The measurement is a signature
of the memory contents that can be sent to the guest owner as an attestation
that the memory was encrypted correctly by the firmware.
Parameters (in): struct kvm_sev_launch_update_data
Returns: 0 on success, -negative on error
struct kvm_sev_launch_update {
__u64 uaddr; /* userspace address to be encrypted (must be 16-byte aligned) */
__u32 len; /* length of the data to be encrypted (must be 16-byte aligned) */
For more details, see SEV spec Section 6.3.
The KVM_SEV_LAUNCH_MEASURE command is used to retrieve the measurement of the
data encrypted by the KVM_SEV_LAUNCH_UPDATE_DATA command. The guest owner may
wait to provide the guest with confidential information until it can verify the
measurement. Since the guest owner knows the initial contents of the guest at
boot, the measurement can be verified by comparing it to what the guest owner
Parameters (in): struct kvm_sev_launch_measure
Returns: 0 on success, -negative on error
struct kvm_sev_launch_measure {
__u64 uaddr; /* where to copy the measurement */
__u32 len; /* length of measurement blob */
For more details on the measurement verification flow, see SEV spec Section 6.4.
After completion of the launch flow, the KVM_SEV_LAUNCH_FINISH command can be
issued to make the guest ready for the execution.
Returns: 0 on success, -negative on error
The KVM_SEV_GUEST_STATUS command is used to retrieve status information about a
SEV-enabled guest.
Parameters (out): struct kvm_sev_guest_status
Returns: 0 on success, -negative on error
struct kvm_sev_guest_status {
__u32 handle; /* guest handle */
__u32 policy; /* guest policy */
__u8 state; /* guest state (see enum below) */
SEV guest state:
enum {
SEV_STATE_LAUNCHING, /* guest is currently being launched */
SEV_STATE_SECRET, /* guest is being launched and ready to accept the ciphertext data */
SEV_STATE_RUNNING, /* guest is fully launched and running */
SEV_STATE_RECEIVING, /* guest is being migrated in from another SEV machine */
SEV_STATE_SENDING /* guest is getting migrated out to another SEV machine */
The KVM_SEV_DEBUG_DECRYPT command can be used by the hypervisor to request the
firmware to decrypt the data at the given memory region.
Parameters (in): struct kvm_sev_dbg
Returns: 0 on success, -negative on error
struct kvm_sev_dbg {
__u64 src_uaddr; /* userspace address of data to decrypt */
__u64 dst_uaddr; /* userspace address of destination */
__u32 len; /* length of memory region to decrypt */
The command returns an error if the guest policy does not allow debugging.
The KVM_SEV_DEBUG_ENCRYPT command can be used by the hypervisor to request the
firmware to encrypt the data at the given memory region.
Parameters (in): struct kvm_sev_dbg
Returns: 0 on success, -negative on error
struct kvm_sev_dbg {
__u64 src_uaddr; /* userspace address of data to encrypt */
__u64 dst_uaddr; /* userspace address of destination */
__u32 len; /* length of memory region to encrypt */
The command returns an error if the guest policy does not allow debugging.
The KVM_SEV_LAUNCH_SECRET command can be used by the hypervisor to inject secret
data after the measurement has been validated by the guest owner.
Parameters (in): struct kvm_sev_launch_secret
Returns: 0 on success, -negative on error
struct kvm_sev_launch_secret {
__u64 hdr_uaddr; /* userspace address containing the packet header */
__u32 hdr_len;
__u64 guest_uaddr; /* the guest memory region where the secret should be injected */
__u32 guest_len;
__u64 trans_uaddr; /* the hypervisor memory region which contains the secret */
__u32 trans_len;
.. [white-paper]
.. [api-spec]
.. [amd-apm] (section 15.34)
.. [kvm-forum]
......@@ -1841,6 +1841,7 @@ registers, find a list below:
......@@ -3403,7 +3404,7 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
or if no page table is present for the addresses (e.g. when using
Architectures: powerpc
......@@ -3449,6 +3450,57 @@ array bounds check and the array access.
These fields use the same bit definitions as the new
Capability: basic
Architectures: x86
Type: system
Parameters: an opaque platform specific structure (in/out)
Returns: 0 on success; -1 on error
If the platform supports creating encrypted VMs then this ioctl can be used
for issuing platform-specific memory encryption commands to manage those
encrypted VMs.
Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Capability: basic
Architectures: x86
Type: system
Parameters: struct kvm_enc_region (in)
Returns: 0 on success; -1 on error
This ioctl can be used to register a guest memory region which may
contain encrypted data (e.g. guest RAM, SMRAM etc).
It is used in the SEV-enabled guest. When encryption is enabled, a guest
memory region may contain encrypted data. The SEV memory encryption
engine uses a tweak such that two identical plaintext pages, each at
different locations will have differing ciphertexts. So swapping or
moving ciphertext of those pages will not result in plaintext being
swapped. So relocating (or migrating) physical backing pages for the SEV
guest will require some additional steps.
Note: The current SEV key management spec does not provide commands to
swap or migrate (move) ciphertext pages. Hence, for now we pin the guest
memory region registered with the ioctl.
Capability: basic
Architectures: x86
Type: system
Parameters: struct kvm_enc_region (in)
Returns: 0 on success; -1 on error
This ioctl can be used to unregister the guest memory region registered
5. The kvm_run structure
KVM/ARM VGIC Forwarded Physical Interrupts
The KVM/ARM code implements software support for the ARM Generic
Interrupt Controller's (GIC's) hardware support for virtualization by
allowing software to inject virtual interrupts to a VM, which the guest
OS sees as regular interrupts. The code is famously known as the VGIC.
Some of these virtual interrupts, however, correspond to physical
interrupts from real physical devices. One example could be the
architected timer, which itself supports virtualization, and therefore
lets a guest OS program the hardware device directly to raise an
interrupt at some point in time. When such an interrupt is raised, the
host OS initially handles the interrupt and must somehow signal this
event as a virtual interrupt to the guest. Another example could be a
passthrough device, where the physical interrupts are initially handled
by the host, but the device driver for the device lives in the guest OS
and KVM must therefore somehow inject a virtual interrupt on behalf of
the physical one to the guest OS.
These virtual interrupts corresponding to a physical interrupt on the
host are called forwarded physical interrupts, but are also sometimes
referred to as 'virtualized physical interrupts' and 'mapped interrupts'.
Forwarded physical interrupts are handled slightly differently compared
to virtual interrupts generated purely by a software emulated device.
The HW bit
Virtual interrupts are signalled to the guest by programming the List
Registers (LRs) on the GIC before running a VCPU. The LR is programmed
with the virtual IRQ number and the state of the interrupt (Pending,
Active, or Pending+Active). When the guest ACKs and EOIs a virtual
interrupt, the LR state moves from Pending to Active, and finally to
The LRs include an extra bit, called the HW bit. When this bit is set,
KVM must also program an additional field in the LR, the physical IRQ
number, to link the virtual with the physical IRQ.
When the HW bit is set, KVM must EITHER set the Pending OR the Active
bit, never both at the same time.
Setting the HW bit causes the hardware to deactivate the physical
interrupt on the physical distributor when the guest deactivates the
corresponding virtual interrupt.
Forwarded Physical Interrupts Life Cycle
The state of forwarded physical interrupts is managed in the following way:
- The physical interrupt is acked by the host, and becomes active on
the physical distributor (*).
- KVM sets the LR.Pending bit, because this is the only way the GICV
interface is going to present it to the guest.
- LR.Pending will stay set as long as the guest has not acked the interrupt.
- LR.Pending transitions to LR.Active on the guest read of the IAR, as
- On guest EOI, the *physical distributor* active bit gets cleared,
but the LR.Active is left untouched (set).
- KVM clears the LR on VM exits when the physical distributor
active state has been cleared.
(*): The host handling is slightly more complicated. For some forwarded
interrupts (shared), KVM directly sets the active state on the physical
distributor before entering the guest, because the interrupt is never actually
handled on the host (see details on the timer as an example below). For other
forwarded interrupts (non-shared) the host does not deactivate the interrupt
when the host ISR completes, but leaves the interrupt active until the guest
deactivates it. Leaving the interrupt active is allowed, because Linux
configures the physical GIC with EOIMode=1, which causes EOI operations to
perform a priority drop allowing the GIC to receive other interrupts of the
default priority.
Forwarded Edge and Level Triggered PPIs and SPIs
Forwarded physical interrupts injected should always be active on the
physical distributor when injected to a guest.
Level-triggered interrupts will keep the interrupt line to the GIC
asserted, typically until the guest programs the device to deassert the
line. This means that the interrupt will remain pending on the physical
distributor until the guest has reprogrammed the device. Since we
always run the VM with interrupts enabled on the CPU, a pending
interrupt will exit the guest as soon as we switch into the guest,
preventing the guest from ever making progress as the process repeats
over and over. Therefore, the active state on the physical distributor
must be set when entering the guest, preventing the GIC from forwarding
the pending interrupt to the CPU. As soon as the guest deactivates the
interrupt, the physical line is sampled by the hardware again and the host
takes a new interrupt if and only if the physical line is still asserted.
Edge-triggered interrupts do not exhibit the same problem with
preventing guest execution that level-triggered interrupts do. One
option is to not use HW bit at all, and inject edge-triggered interrupts
from a physical device as pure virtual interrupts. But that would
potentially slow down handling of the interrupt in the guest, because a
physical interrupt occurring in the middle of the guest ISR would
preempt the guest for the host to handle the interrupt. Additionally,
if you configure the system to handle interrupts on a separate physical
core from that running your VCPU, you still have to interrupt the VCPU
to queue the pending state onto the LR, even though the guest won't use
this information until the guest ISR completes. Therefore, the HW
bit should always be set for forwarded edge-triggered interrupts. With
the HW bit set, the virtual interrupt is injected and additional
physical interrupts occurring before the guest deactivates the interrupt
simply mark the state on the physical distributor as Pending+Active. As
soon as the guest deactivates the interrupt, the host takes another
interrupt if and only if there was a physical interrupt between injecting
the forwarded interrupt to the guest and the guest deactivating the
Consequently, whenever we schedule a VCPU with one or more LRs with the
HW bit set, the interrupt must also be active on the physical
Forwarded LPIs
LPIs, introduced in GICv3, are always edge-triggered and do not have an
active state. They become pending when a device signal them, and as
soon as they are acked by the CPU, they are inactive again.
It therefore doesn't make sense, and is not supported, to set the HW bit
for physical LPIs that are forwarded to a VM as virtual interrupts,
typically virtual SPIs.
For LPIs, there is no other choice than to preempt the VCPU thread if
necessary, and queue the pending state onto the LR.
Putting It Together: The Architected Timer
The architected timer is a device that signals interrupts with level
triggered semantics. The timer hardware is directly accessed by VCPUs
which program the timer to fire at some point in time. Each VCPU on a
system programs the timer to fire at different times, and therefore the
hardware is multiplexed between multiple VCPUs. This is implemented by
context-switching the timer state along with each VCPU thread.
However, this means that a scenario like the following is entirely
possible, and in fact, typical:
1. KVM runs the VCPU
2. The guest programs the time to fire in T+100
3. The guest is idle and calls WFI (wait-for-interrupts)
4. The hardware traps to the host
5. KVM stores the timer state to memory and disables the hardware timer
6. KVM schedules a soft timer to fire in T+(100 - time since step 2)
7. KVM puts the VCPU thread to sleep (on a waitqueue)
8. The soft timer fires, waking up the VCPU thread
9. KVM reprograms the timer hardware with the VCPU's values
10. KVM marks the timer interrupt as active on the physical distributor
11. KVM injects a forwarded physical interrupt to the guest
12. KVM runs the VCPU
Notice that KVM injects a forwarded physical interrupt in step 11 without
the corresponding interrupt having actually fired on the host. That is
exactly why we mark the timer interrupt as active in step 10, because
the active state on the physical distributor is part of the state
belonging to the timer hardware, which is context-switched along with
the VCPU thread.
If the guest does not idle because it is busy, the flow looks like this
1. KVM runs the VCPU
2. The guest programs the time to fire in T+100
4. At T+100 the timer fires and a physical IRQ causes the VM to exit
(note that this initially only traps to EL2 and does not run the host ISR
until KVM has returned to the host).
5. With interrupts still disabled on the CPU coming back from the guest, KVM
stores the virtual timer state to memory and disables the virtual hw timer.
6. KVM looks at the timer state (in memory) and injects a forwarded physical
interrupt because it concludes the timer has expired.
7. KVM marks the timer interrupt as active on the physical distributor
7. KVM enables the timer, enables interrupts, and runs the VCPU
Notice that again the forwarded physical interrupt is injected to the
guest without having actually been handled on the host. In this case it
is because the physical interrupt is never actually seen by the host because the
timer is disabled upon guest return, and the virtual forwarded interrupt is
injected on the KVM guest entry path.
......@@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit
|| || before enabling paravirtualized
|| || spinlock support.
KVM_FEATURE_PV_TLB_FLUSH || 9 || guest checks this feature bit
|| || before enabling paravirtualized
|| || tlb flush.
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
......@@ -7748,7 +7748,9 @@ F: arch/powerpc/kernel/kvm*
M: Christian Borntraeger <>
M: Cornelia Huck <>
M: Janosch Frank <>
R: David Hildenbrand <>
R: Cornelia Huck <>
T: git git://
......@@ -12026,6 +12028,7 @@ F: drivers/pci/hotplug/s390_pci_hpc.c
M: Cornelia Huck <>
M: Dong Jia Shi <>
M: Halil Pasic <>
S: Supported
......@@ -131,7 +131,7 @@ static inline bool mode_has_spsr(struct kvm_vcpu *vcpu)
static inline bool vcpu_mode_priv(struct kvm_vcpu *vcpu)
unsigned long cpsr_mode = vcpu->arch.ctxt.gp_regs.usr_regs.ARM_cpsr & MODE_MASK;
return cpsr_mode > USR_MODE;;
return cpsr_mode > USR_MODE;
static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
......@@ -48,6 +48,8 @@
u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode);
int __attribute_const__ kvm_target_cpu(void);
int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
......@@ -21,7 +21,6 @@
#include <linux/compiler.h>
#include <linux/kvm_host.h>
#include <asm/cp15.h>
#include <asm/kvm_mmu.h>
#include <asm/vfp.h>
#define __hyp_text __section(.hyp.text) notrace
......@@ -69,6 +68,8 @@
#define HIFAR __ACCESS_CP15(c6, 4, c0, 2)
#define HPFAR __ACCESS_CP15(c6, 4, c0, 4)
#define ICIALLUIS __ACCESS_CP15(c7, 0, c1, 0)
#define BPIALLIS __ACCESS_CP15(c7, 0, c1, 6)
#define ICIMVAU __ACCESS_CP15(c7, 0, c5, 1)
#define ATS1CPR __ACCESS_CP15(c7, 0, c8, 0)
#define TLBIALLIS __ACCESS_CP15(c8, 0, c3, 0)
#define TLBIALL __ACCESS_CP15(c8, 0, c7, 0)
......@@ -37,6 +37,8 @@
#include <linux/highmem.h>
#include <asm/cacheflush.h>
#include <asm/cputype.h>
#include <asm/kvm_hyp.h>
#include <asm/pgalloc.h>
#include <asm/stage2_pgtable.h>
......@@ -83,6 +85,18 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
return pmd;
static inline pte_t kvm_s2pte_mkexec(pte_t pte)
pte_val(pte) &= ~L_PTE_XN;
return pte;
static inline pmd_t kvm_s2pmd_mkexec(pmd_t pmd)
pmd_val(pmd) &= ~PMD_SECT_XN;
return pmd;