1. 01 Nov, 2018 3 commits
    • PM: ipipe: converge to Dovetail's CPUIDLE management · cb5702e0
      Philippe Gerum authored
      Handle requests for transitioning to deeper C-states the way Dovetail
      does, which prevents us from losing the timer when grabbed by a
      co-kernel, in presence of a CPUIDLE driver.
    • ipipe: add cpuidle control interface · caf90a26
      Philippe Gerum authored
      Add a kernel interface for sharing CPU idling control between the host
      kernel and a co-kernel. The former invokes ipipe_cpuidle_control()
      which the latter should implement, for determining whether entering a
      sleep state is ok. This hook should return boolean true if so.
      
      The co-kernel may veto such entry if need be, in order to prevent
      latency spikes, as exiting sleep states might be costly depending on
      the CPU idling operation being used.
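
The veto protocol described above can be sketched as a small user-space model. This is only an illustration: the function names, the use of an exit latency as the hook's argument, and the latency-budget policy are assumptions made for the sketch, not the actual signature of ipipe_cpuidle_control().

```python
from typing import Callable, Optional

# Hypothetical registry for the co-kernel's hook; None means no co-kernel is present.
_cokernel_hook: Optional[Callable[[int], bool]] = None


def register_cokernel_hook(hook: Optional[Callable[[int], bool]]) -> None:
    """Install (or clear, with None) the co-kernel's idle-control hook."""
    global _cokernel_hook
    _cokernel_hook = hook


def cpuidle_control(exit_latency_us: int) -> bool:
    """Host side: ask whether entering a sleep state is OK (True means yes)."""
    if _cokernel_hook is None:
        return True  # nobody is there to veto the transition
    return bool(_cokernel_hook(exit_latency_us))


def make_latency_veto(budget_us: int) -> Callable[[int], bool]:
    """Example co-kernel policy (assumed): veto states that are too slow to exit."""
    def hook(exit_latency_us: int) -> bool:
        return exit_latency_us <= budget_us
    return hook
```

With a hook built by make_latency_veto(100), a state costing 50 us to exit is allowed while one costing 500 us is vetoed. What the host does after a veto is not described in the commit message and is left out of the model.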
    • genirq: add generic I-pipe core · d9f057db
      Philippe Gerum authored
      This commit provides the arch-independent bits for implementing the
      interrupt pipeline core, a lightweight layer introducing a separate,
      high-priority execution stage for handling all IRQs in pseudo-NMI
      mode, which cannot be delayed by the regular kernel code. See
      Documentation/ipipe.rst for details about interrupt pipelining.
      
      Architectures which support interrupt pipelining should select
      HAVE_IPIPE_SUPPORT, along with implementing the required arch-specific
      code. In such a case, CONFIG_IPIPE becomes available to the user via
      the Kconfig interface for enabling the feature.
  2. 18 Oct, 2018 1 commit
  3. 04 Oct, 2018 1 commit
  4. 19 Sep, 2018 2 commits
    • inet: frags: break the 2GB limit for frags storage · 990204dd
      Eric Dumazet authored
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error, if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true :
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
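
The overflow can be reproduced with a small model, assuming both the memory counter and high_thresh are 32-bit signed integers (as "one atomic_t and integers" suggests). The helper names below are hypothetical and only mirror the comparison quoted above.

```python
import ctypes


def wrap_int32(value: int) -> int:
    """Truncate a Python int to a 32-bit signed value, as a C int would store it."""
    return ctypes.c_int32(value).value


def over_limit_32(mem_bytes: int, high_thresh: int) -> bool:
    """The quoted comparison, with both sides stored as 32-bit signed integers."""
    return wrap_int32(mem_bytes) > wrap_int32(high_thresh)


def over_limit_64(mem_bytes: int, high_thresh: int) -> bool:
    """The same comparison with 64-bit storage, as on 64-bit arches after the patch."""
    return ctypes.c_int64(mem_bytes).value > ctypes.c_int64(high_thresh).value


HIGH_THRESH = 2**31 - 1      # the "~2GB" setting: the largest 32-bit signed value
USED = 2**31 + 4096          # slightly more memory than the threshold
```

Here over_limit_32(USED, HIGH_THRESH) is False, because the counter wraps to a negative number and can never exceed the threshold, while over_limit_64(USED, HIGH_THRESH) is True.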
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      (cherry picked from commit 3e67f106)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • inet: frags: use rhashtables for reassembly units · 9aee41ef
      Eric Dumazet authored
      Some applications still rely on IP fragmentation, and to be fair the Linux
      reassembly unit does not work under any serious load.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      
      A work queue is supposed to garbage collect items when the host is under memory
      pressure, and to do a hash rebuild, changing the seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speedup netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
      
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      
      After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
      of storage for the fragments (exact number depends on frags being evicted
      after timeout)
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
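
As a worked check of the figures quoted in this message, the sketch below compares the old static table's ceiling with the sockstat sample. It is plain arithmetic on the numbers above; the variable names are illustrative, and the per-fragment figure is derived from this one sample only.

```python
# Old design: a static table of 1024 buckets with up to 128 items per bucket.
OLD_BUCKETS = 1024
OLD_MAX_ITEMS_PER_BUCKET = 128
old_capacity = OLD_BUCKETS * OLD_MAX_ITEMS_PER_BUCKET  # 131072 items at most

# Sample taken after the patch: "FRAG: inuse 1966916 memory 2140004608".
inuse = 1966916
memory_bytes = 2140004608

# Average memory charged per in-use entry in this sample, and the remainder.
bytes_per_entry, remainder = divmod(memory_bytes, inuse)

# How many times the old ceiling the sampled table holds.
growth = inuse / old_capacity

print(old_capacity, bytes_per_entry, remainder, round(growth, 1))
```

The sample therefore holds roughly fifteen times more entries than the old table could, at 1088 bytes charged per entry.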
      
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      (cherry picked from commit 648700f7)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 17 Aug, 2018 1 commit
  6. 15 Aug, 2018 10 commits
    • KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 1110cb2a
      Paolo Bonzini authored
      commit 5b76a3cf upstream
      
      When nested virtualization is in use, VMENTER operations from the nested
      hypervisor into the nested guest will always be processed by the bare metal
      hypervisor, and KVM's "conditional cache flushes" mode in particular does a
      flush on nested vmentry.  Therefore, include the "skip L1D flush on
      vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
      
      Add the relevant Documentation.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Add a framework for supporting MSR-based features · f0660d58
      Tom Lendacky authored
      commit 801e459a upstream
      
      Provide a new KVM capability that allows bits within MSRs to be recognized
      as features.  Two new ioctls are added to the /dev/kvm ioctl routine to
      retrieve the list of these MSRs and then retrieve their values. A kvm_x86_ops
      callback is used to determine support for the listed MSR-based features.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [Tweaked documentation. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Documentation/l1tf: Remove Yonah processors from not vulnerable list · dc6c443e
      Thomas Gleixner authored
      commit 58331136 upstream
      
      Dave reported, that it's not confirmed that Yonah processors are
      unaffected. Remove them from the list.
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Documentation/l1tf: Fix typos · 40b696da
      Tony Luck authored
      commit 1949f9f4 upstream
      
      Fix spelling and other typos
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Documentation: Add section about CPU vulnerabilities · a20c88c2
      Thomas Gleixner authored
      commit 3ec8ce5d upstream
      
      Add documentation for the L1TF vulnerability and the mitigation mechanisms:
      
        - Explain the problem and risks
        - Document the mitigation mechanisms
        - Document the command line controls
        - Document the sysfs files
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180713142323.287429944@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/bugs, kvm: Introduce boot-time control of L1TF mitigations · fc083988
      Jiri Kosina authored
      commit d90a7a0e upstream
      
      Introduce the 'l1tf=' kernel command line option to allow for boot-time
      switching of mitigation that is used on processors affected by L1TF.
      
      The possible values are:
      
        full
      	Provides all available mitigations for the L1TF vulnerability. Disables
      	SMT and enables all mitigations in the hypervisors. SMT control via
      	/sys/devices/system/cpu/smt/control is still possible after boot.
      	Hypervisors will issue a warning when the first VM is started in
      	a potentially insecure configuration, i.e. SMT enabled or L1D flush
      	disabled.
      
        full,force
      	Same as 'full', but disables SMT control. Implies the 'nosmt=force'
      	command line option. sysfs control of SMT and the hypervisor flush
      	control is disabled.
      
        flush
      	Leaves SMT enabled and enables the conditional hypervisor mitigation.
      	Hypervisors will issue a warning when the first VM is started in a
      	potentially insecure configuration, i.e. SMT enabled or L1D flush
      	disabled.
      
        flush,nosmt
      	Disables SMT and enables the conditional hypervisor mitigation. SMT
      	control via /sys/devices/system/cpu/smt/control is still possible
      	after boot. If SMT is reenabled or flushing disabled at runtime
      	hypervisors will issue a warning.
      
        flush,nowarn
      	Same as 'flush', but hypervisors will not warn when
      	a VM is started in a potentially insecure configuration.
      
        off
      	Disables hypervisor mitigations and doesn't emit any warnings.
      
      Default is 'flush'.
      
      Let KVM adhere to these semantics, which means:
      
        - 'l1tf=full,force'	: Perform L1D flushes. No runtime control
          			  possible.
      
        - 'l1tf=full'
        - 'l1tf=flush'
        - 'l1tf=flush,nosmt'	: Perform L1D flushes and warn on VM start if
      			  SMT has been runtime enabled or L1D flushing
      			  has been run-time disabled
      
        - 'l1tf=flush,nowarn'	: Perform L1D flushes and no warnings are emitted.
      
        - 'l1tf=off'		: L1D flushes are not performed and no warnings
      			  are emitted.
      
      KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
      module parameter except when l1tf=full,force is set.
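
The option table above can be summarised as a lookup, sketched here as a user-space model. The L1tfPolicy type, its field names and the ValueError on unknown values are assumptions made for illustration; they are not kernel data structures, and the handling of unrecognised values is not described in this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class L1tfPolicy:
    disable_smt: bool       # SMT is disabled at boot
    flush: bool             # hypervisor L1D flush mitigation is enabled
    warn: bool              # hypervisors warn about insecure configurations
    runtime_control: bool   # SMT / flush settings can still be changed after boot


_POLICIES = {
    "full":         L1tfPolicy(True,  True,  True,  True),
    # With no runtime control nothing can become insecure later, so the model
    # records no warning for this mode (a modelling assumption).
    "full,force":   L1tfPolicy(True,  True,  False, False),
    "flush":        L1tfPolicy(False, True,  True,  True),
    "flush,nosmt":  L1tfPolicy(True,  True,  True,  True),
    "flush,nowarn": L1tfPolicy(False, True,  False, True),
    "off":          L1tfPolicy(False, False, False, True),
}


def parse_l1tf(value: Optional[str] = None) -> L1tfPolicy:
    """Map an 'l1tf=' value to a policy; a missing option means the default, 'flush'."""
    key = "flush" if value is None else value.strip()
    if key not in _POLICIES:
        raise ValueError(f"unknown l1tf= value: {value!r}")
    return _POLICIES[key]
```

For example, parse_l1tf("flush,nosmt") yields a policy with SMT disabled and warnings enabled, while parse_l1tf() with no argument is the same as parse_l1tf("flush").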
      
      This makes KVM's private 'nosmt' option redundant, and as it is a bit
      non-systematic anyway (this is something to control globally, not on
      hypervisor level), remove that option.
      
      Add the missing Documentation entry for the l1tf vulnerability sysfs file
      while at it.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Kosina <jkosina@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/KVM/VMX: Add module argument for L1TF mitigation · 77c8220e
      Konrad Rzeszutek Wilk authored
      commit a399477e upstream
      
      Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
      L1 terminal fault. The valid arguments are:
      
       - "always" 	L1D cache flush on every VMENTER.
       - "cond"	Conditional L1D cache flush, explained below
       - "never"	Disable the L1D cache flush mitigation
      
      "cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
      between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
      interesting information into L1D which might be exploited.
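
The three modes amount to a simple decision on each VMENTER, sketched below as a user-space model. The function name and the boolean that stands for "code run since the last VMEXIT is not considered safe" are assumptions for illustration; how KVM actually tracks that condition is not described here.

```python
def should_flush_l1d(mode: str, ran_unsafe_code: bool) -> bool:
    """Decide whether to flush the L1D cache before entering the guest.

    mode            -- value of the 'vmentry_l1d_flush' parameter
    ran_unsafe_code -- True if code executed between VMEXIT and VMENTER might
                       have brought interesting data into L1D (assumed input)
    """
    if mode == "always":
        return True              # flush on every VMENTER
    if mode == "never":
        return False             # mitigation disabled
    if mode == "cond":
        return ran_unsafe_code   # flush only when the exit path was not "safe"
    raise ValueError(f"invalid vmentry_l1d_flush mode: {mode!r}")
```

In "cond" mode a VMEXIT handled entirely by code considered safe skips the flush, which is where the performance benefit over "always" comes from.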
      
      [ tglx: Split out from a larger patch ]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present · c2fdbbb4
      Konrad Rzeszutek Wilk authored
      commit 26acfb66 upstream
      
      If the L1TF CPU bug is present we allow the KVM module to be loaded as the
      majority of users that use Linux and KVM have trusted guests and do not want a
      broken setup.
      
      Cloud vendors are the ones that are uncomfortable with CVE-2018-3620 and as
      such they are the ones that should set nosmt to one.
      
      Setting 'nosmt' means that the system administrator also needs to disable
      SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
      parameter, or via the /sys/devices/system/cpu/smt/control. See commit
      05736e4a ("cpu/hotplug: Provide knobs to control SMT").
      
      Other mitigations are to use task affinity, cpu sets, interrupt binding,
      etc - anything to make sure that _only_ the same guests vCPUs are running
      on sibling threads.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "x86/apic: Ignore secondary threads if nosmt=force" · f3e68ab4
      Thomas Gleixner authored
      commit 506a66f3 upstream
      
      Dave Hansen reported, that it's outright dangerous to keep SMT siblings
      disabled completely so they are stuck in the BIOS and wait for SIPI.
      
      The reason is that Machine Check Exceptions are broadcasted to siblings and
      the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
      logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
      reboots the machine. The MCE chapter in the SDM contains the following
      blurb:
      
          Because the logical processors within a physical package are tightly
          coupled with respect to shared hardware resources, both logical
          processors are notified of machine check errors that occur within a
          given physical processor. If machine-check exceptions are enabled when
          a fatal error is reported, all the logical processors within a physical
          package are dispatched to the machine-check exception handler. If
          machine-check exceptions are disabled, the logical processors enter the
          shutdown state and assert the IERR# signal. When enabling machine-check
          exceptions, the MCE flag in control register CR4 should be set for each
          logical processor.
      
      Reverting the commit which ignores siblings at enumeration time solves only
      half of the problem. The core cpuhotplug logic needs to be adjusted as
      well.
      
      This thoughtfully engineered mechanism also turns the boot process on all
      Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
      before the secondary CPUs are brought up. Depending on the number of
      physical cores the window in which this situation can happen is smaller or
      larger. On a HSW-EX it's about 750ms:
      
      MCE is enabled on the boot CPU:
      
      [    0.244017] mce: CPU supports 22 MCE banks
      
      The corresponding sibling #72 boots:
      
      [    1.008005] .... node  #0, CPUs:    #72
      
      That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
      between these two points the machine is going to shut down. At least it's a
      known safe state.
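
The "about 750ms" figure follows from the two boot-log timestamps quoted above. A small worked check, with illustrative variable names:

```python
# Timestamps (in seconds) taken from the two log lines quoted above.
MCE_ENABLED_AT = 0.244017      # "mce: CPU supports 22 MCE banks"
SIBLING_BOOTED_AT = 1.008005   # ".... node  #0, CPUs:    #72"


def exposure_window_ms(mce_enabled_at: float, sibling_booted_at: float) -> float:
    """Time during which the boot CPU has MCE enabled but its sibling is not up yet."""
    return (sibling_booted_at - mce_enabled_at) * 1000.0


window_ms = exposure_window_ms(MCE_ENABLED_AT, SIBLING_BOOTED_AT)
print(f"{window_ms:.3f} ms")
```

The difference is 763.988 ms, consistent with the rounded figure in the text.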
      
      It's obvious that the early boot can be hit by an MCE as well and then runs
      into the same situation because MCEs are not yet enabled on the boot CPU.
      But after enabling them on the boot CPU, it does not make any sense to
      prevent the kernel from recovering.
      
      Adjust the nosmt kernel parameter documentation as well.
      
      Reverts: 2207def7 ("x86/apic: Ignore secondary threads if nosmt=force")
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpu/hotplug: Provide knobs to control SMT · c5ac43ee
      Thomas Gleixner authored
      commit 05736e4a upstream
      
      Provide a command line and a sysfs knob to control SMT.
      
      The command line options are:
      
       'nosmt':	Enumerate secondary threads, but do not online them
      
       'nosmt=force': Ignore secondary threads completely during enumeration
       		via MP table and ACPI/MADT.
      
      The sysfs control file has the following states (read/write):
      
       'on':		 SMT is enabled. Secondary threads can be freely onlined
       'off':		 SMT is disabled. Secondary threads, even if enumerated
       		 cannot be onlined
       'forceoff':	 SMT is permanently disabled. Writes to the control
       		 file are rejected.
       'notsupported': SMT is not supported by the CPU
      
      The command line option 'nosmt' sets the sysfs control to 'off'. This
      can be changed to 'on' to reenable SMT during runtime.
      
      The command line option 'nosmt=force' sets the sysfs control to
      'forceoff'. This cannot be changed during runtime.
      
      When SMT is 'on' and the control file is changed to 'off' then all online
      secondary threads are offlined and attempts to online a secondary thread
      later on are rejected.
      
      When SMT is 'off' and the control file is changed to 'on' then secondary
      threads can be onlined again. The 'off' -> 'on' transition does not
      automatically online the secondary threads.
      
      When the control file is set to 'forceoff', the behaviour is the same as
      setting it to 'off', but the operation is irreversible and later writes to
      the control file are rejected.
      
      When the control status is 'notsupported' then writes to the control file
      are rejected.
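
The transitions described above form a small state machine, sketched here as a user-space model. The SmtControl class, its method names and the choice of PermissionError for rejected operations are assumptions made for illustration; the real control is the sysfs file, and the error codes it returns are not given in this message.

```python
from typing import Optional


class SmtControl:
    """Toy model of the SMT control file and the set of online secondary threads."""

    WRITABLE = ("on", "off", "forceoff")

    def __init__(self, cmdline: Optional[str] = None, supported: bool = True):
        self.secondaries_online = set()
        if not supported:
            self.state = "notsupported"
        elif cmdline == "nosmt=force":
            self.state = "forceoff"
        elif cmdline == "nosmt":
            self.state = "off"
        else:
            self.state = "on"

    def write(self, value: str) -> None:
        """Model a write to the control file."""
        if self.state in ("forceoff", "notsupported"):
            raise PermissionError("writes to the control file are rejected")
        if value not in self.WRITABLE:
            raise ValueError(f"invalid control value: {value!r}")
        if value in ("off", "forceoff"):
            # All online secondary threads are offlined.
            self.secondaries_online.clear()
        # Note: 'off' -> 'on' does not automatically online any secondary thread.
        self.state = value

    def online_secondary(self, cpu: int) -> None:
        """Model an attempt to online a secondary thread."""
        if self.state != "on":
            raise PermissionError("secondary threads cannot be onlined")
        self.secondaries_online.add(cpu)
```

Booting the model with 'nosmt' starts in 'off'; writing 'on' re-enables onlining without bringing any thread up, and a later 'forceoff' makes every further write fail.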
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 03 Aug, 2018 4 commits
  8. 22 Jul, 2018 1 commit
  9. 17 Jul, 2018 1 commit
  10. 08 Jul, 2018 1 commit
  11. 03 Jul, 2018 2 commits
    • cxl: Disable prefault_mode in Radix mode · c9debbd1
      Vaibhav Jain authored
      commit b6c84ba2 upstream.
      
      Currently we see a kernel-oops reported on Power-9 while attaching a
      context to an AFU, with radix-mode and sysfs attr 'prefault_mode' set
      to anything other than 'none'. The backtrace of the oops is of this
      form:
      
        Unable to handle kernel paging request for data at address 0x00000080
        Faulting instruction address: 0xc00800000bcf3b20
        cpu 0x1: Vector: 300 (Data Access) at [c00000037f003800]
            pc: c00800000bcf3b20: cxl_load_segment+0x178/0x290 [cxl]
            lr: c00800000bcf39f0: cxl_load_segment+0x48/0x290 [cxl]
            sp: c00000037f003a80
           msr: 9000000000009033
           dar: 80
         dsisr: 40000000
          current = 0xc00000037f280000
          paca    = 0xc0000003ffffe600   softe: 3        irq_happened: 0x01
            pid   = 3529, comm = afp_no_int
        <snip>
        cxl_prefault+0xfc/0x248 [cxl]
        process_element_entry_psl9+0xd8/0x1a0 [cxl]
        cxl_attach_dedicated_process_psl9+0x44/0x130 [cxl]
        native_attach_process+0xc0/0x130 [cxl]
        afu_ioctl+0x3f4/0x5e0 [cxl]
        do_vfs_ioctl+0xdc/0x890
        ksys_ioctl+0x68/0xf0
        sys_ioctl+0x40/0xa0
        system_call+0x58/0x6c
      
      The issue is caused as on Power-8 the AFU attr 'prefault_mode' was
      used to improve initial storage fault performance by prefaulting
      process segments. However on Power-9 with radix mode we don't have
      Storage-Segments that we can prefault. Also prefaulting process Pages
      will be too costly and fine-grained.
      
      Hence, since the prefaulting mechanism doesn't make sense for
      radix-mode, this patch updates prefault_mode_store() to not allow any
      other value apart from CXL_PREFAULT_NONE when radix mode is enabled.
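
The resulting check can be sketched as a user-space model of the store handler. The mode names other than 'none', the radix_enabled argument and the use of ValueError are assumptions for illustration; the actual error code returned by prefault_mode_store() is not stated in this message.

```python
# Mode strings accepted by the model. Only 'none' appears in the text above;
# the other two names are assumed here for the sake of the example.
PREFAULT_MODES = ("none", "work_element_descriptor", "all")


def prefault_mode_store(requested: str, radix_enabled: bool) -> str:
    """Validate a write to the 'prefault_mode' attribute and return the new mode."""
    mode = requested.strip()
    if mode not in PREFAULT_MODES:
        raise ValueError(f"unknown prefault mode: {requested!r}")
    if radix_enabled and mode != "none":
        # With radix there are no storage segments to prefault, so only
        # the equivalent of CXL_PREFAULT_NONE is accepted.
        raise ValueError("prefaulting is not available in radix mode")
    return mode
```

In this model a write of 'all' succeeds when radix mode is off and is rejected when it is on, whereas 'none' is accepted in both cases.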
      
      Fixes: f24be42a ("cxl: Add psl9 specific code")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • lib/vsprintf: Remove atomic-unsafe support for %pCr · ea0ac01f
      Geert Uytterhoeven authored
      commit 666902e4 upstream.
      
      "%pCr" formats the current rate of a clock, and calls clk_get_rate().
      The latter obtains a mutex, hence it must not be called from atomic
      context.
      
      Remove support for this rarely-used format, as vsprintf() (and e.g.
      printk()) must be callable from any context.
      
      Any remaining out-of-tree users will start seeing the clock's name
      printed instead of its rate.
      Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
      Fixes: 900cca29 ("lib/vsprintf: add %pC{,n,r} format specifiers for clocks")
      Link: http://lkml.kernel.org/r/1527845302-12159-5-git-send-email-geert+renesas@glider.be
      To: Jia-Ju Bai <baijiaju1990@gmail.com>
      To: Jonathan Corbet <corbet@lwn.net>
      To: Michael Turquette <mturquette@baylibre.com>
      To: Stephen Boyd <sboyd@kernel.org>
      To: Zhang Rui <rui.zhang@intel.com>
      To: Eduardo Valentin <edubezval@gmail.com>
      To: Eric Anholt <eric@anholt.net>
      To: Stefan Wahren <stefan.wahren@i2se.com>
      To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-doc@vger.kernel.org
      Cc: linux-clk@vger.kernel.org
      Cc: linux-pm@vger.kernel.org
      Cc: linux-serial@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-renesas-soc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: stable@vger.kernel.org # 4.1+
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. 20 Jun, 2018 6 commits
  13. 11 Jun, 2018 1 commit
  14. 30 May, 2018 3 commits
  15. 22 May, 2018 3 commits