1. 05 Dec, 2018 2 commits
    • Thomas Gleixner's avatar
      x86/Kconfig: Select SCHED_SMT if SMP enabled · 44ac7cd0
      Thomas Gleixner authored
      commit dbe733642e01dd108f71436aaea7b328cb28fd87 upstream
      
      CONFIG_SCHED_SMT is enabled by all distros, so there is not a real point to
      have it configurable. The runtime overhead in the core scheduler code is
      minimal because the actual SMT scheduling parts are conditional on a static
      key.
      
      This allows to expose the scheduler's SMT state static key to the
      speculation control code. Alternatively the scheduler's static key could be
      made always available when CONFIG_SMP is enabled, but that's just adding an
      unused static key to every other architecture for nothing.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.337452245@linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44ac7cd0
    • Zhenzhong Duan's avatar
      x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support · a9c90037
      Zhenzhong Duan authored
      commit 4cd24de3a0980bf3100c9dcb08ef65ca7c31af48 upstream
      
      Since retpoline capable compilers are widely available, make
      CONFIG_RETPOLINE hard depend on the compiler capability.
      
      Break the build when CONFIG_RETPOLINE is enabled and the compiler does not
      support it. Emit an error message in that case:
      
       "arch/x86/Makefile:226: *** You are building kernel with non-retpoline
        compiler, please update your compiler..  Stop."
      
      [dwmw: Fail the build with non-retpoline compiler]
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarZhenzhong Duan <zhenzhong.duan@oracle.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michal Marek <michal.lkml@markovi.net>
      Cc: <srinivas.eeda@oracle.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/cca0cb20-f9e2-4094-840b-fb0f8810cd34@defaultSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a9c90037
  2. 25 Sep, 2018 6 commits
  3. 05 Sep, 2018 1 commit
    • Peter Zijlstra's avatar
      mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE · e9afa7c1
      Peter Zijlstra authored
      commit d86564a2 upstream.
      
      Jann reported that x86 was missing required TLB invalidates when he
      hit the !*batch slow path in tlb_remove_table().
      
      This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
      invalidates, the PowerPC-hash where this code originated and the
      Sparc-hash where this was subsequently used did not need that. ARM
      which later used this put an explicit TLB invalidate in their
      __p*_free_tlb() functions, and PowerPC-radix followed that example.
      
      But when we hooked up x86 we failed to consider this. Fix this by
      (optionally) hooking tlb_remove_table() into the TLB invalidate code.
      
      NOTE: s390 was also needing something like this and might now
            be able to use the generic code again.
      
      [ Modified to be on top of Nick's cleanups, which simplified this patch
        now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]
      
      Fixes: 9e52fc2b ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarRik van Riel <riel@surriel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9afa7c1
  4. 15 Aug, 2018 1 commit
    • Thomas Gleixner's avatar
      cpu/hotplug: Provide knobs to control SMT · c5ac43ee
      Thomas Gleixner authored
      commit 05736e4a upstream
      
      Provide a command line and a sysfs knob to control SMT.
      
      The command line options are:
      
       'nosmt':	Enumerate secondary threads, but do not online them
      
       'nosmt=force': Ignore secondary threads completely during enumeration
       		via MP table and ACPI/MADT.
      
      The sysfs control file has the following states (read/write):
      
       'on':		 SMT is enabled. Secondary threads can be freely onlined
       'off':		 SMT is disabled. Secondary threads, even if enumerated
       		 cannot be onlined
       'forceoff':	 SMT is permanentely disabled. Writes to the control
       		 file are rejected.
       'notsupported': SMT is not supported by the CPU
      
      The command line option 'nosmt' sets the sysfs control to 'off'. This
      can be changed to 'on' to reenable SMT during runtime.
      
      The command line option 'nosmt=force' sets the sysfs control to
      'forceoff'. This cannot be changed during runtime.
      
      When SMT is 'on' and the control file is changed to 'off' then all online
      secondary threads are offlined and attempts to online a secondary thread
      later on are rejected.
      
      When SMT is 'off' and the control file is changed to 'on' then secondary
      threads can be onlined again. The 'off' -> 'on' transition does not
      automatically online the secondary threads.
      
      When the control file is set to 'forceoff', the behaviour is the same as
      setting it to 'off', but the operation is irreversible and later writes to
      the control file are rejected.
      
      When the control status is 'notsupported' then writes to the control file
      are rejected.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5ac43ee
  5. 15 Mar, 2018 1 commit
  6. 22 Feb, 2018 1 commit
  7. 17 Jan, 2018 2 commits
  8. 29 Dec, 2017 1 commit
    • Thomas Gleixner's avatar
      x86/Kconfig: Limit NR_CPUS on 32-bit to a sane amount · 662fd946
      Thomas Gleixner authored
      commit 7bbcbd3d upstream.
      
      The recent cpu_entry_area changes fail to compile on 32-bit when BIGSMP=y
      and NR_CPUS=512, because the fixmap area becomes too big.
      
      Limit the number of CPUs with BIGSMP to 64, which is already way to big for
      32-bit, but it's at least a working limitation.
      
      We performed a quick survey of 32-bit-only machines that might be affected
      by this change negatively, but found none.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      662fd946
  9. 25 Dec, 2017 3 commits
  10. 10 Dec, 2017 1 commit
  11. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  12. 09 Sep, 2017 2 commits
  13. 07 Sep, 2017 1 commit
    • Rik van Riel's avatar
      x86,mpx: make mpx depend on x86-64 to free up VMA flag · df3735c5
      Rik van Riel authored
      Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      This patch (of 2):
      
      MPX only seems to be available on 64 bit CPUs, starting with Skylake and
      Goldmont.  Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
      order to free up a VMA flag.
      
      Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.comSigned-off-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Colm MacCártaigh <colm@allcosts.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df3735c5
  14. 31 Aug, 2017 2 commits
    • Robin Murphy's avatar
      libnvdimm, nd_blk: remove mmio_flush_range() · 5deb67f7
      Robin Murphy authored
      mmio_flush_range() suffers from a lack of clearly-defined semantics,
      and is somewhat ambiguous to port to other architectures where the
      scope of the writeback implied by "flush" and ordering might matter,
      but MMIO would tend to imply non-cacheable anyway. Per the rationale
      in 67a3e8fe ("nd_blk: change aperture mapping from WC to WB"), the
      only existing use is actually to invalidate clean cache lines for
      ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
      cleanup of the pmem API, that also now happens to be the exact purpose
      of arch_invalidate_pmem(), which would be a far more well-defined tool
      for the job.
      
      Rather than risk potentially inconsistent implementations of
      mmio_flush_range() for the sake of one callsite, streamline things by
      removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
      definitions up to the libnvdimm level, so they can be shared by NFIT
      as well. This allows NFIT to be enabled for arm64.
      Signed-off-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      5deb67f7
    • Vitaly Kuznetsov's avatar
      x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y) · 9e52fc2b
      Vitaly Kuznetsov authored
      There's a subtle bug in how some of the paravirt guest code handles
      page table freeing on x86:
      
      On x86 software page table walkers depend on the fact that remote TLB flush
      does an IPI: walk is performed lockless but with interrupts disabled and in
      case the page table is freed the freeing CPU will get blocked as remote TLB
      flush is required. On other architectures which don't require an IPI to do
      remote TLB flush we have an RCU-based mechanism (see
      include/asm-generic/tlb.h for more details).
      
      In virtualized environments we may want to override the ->flush_tlb_others
      callback in pv_mmu_ops and use a hypercall asking the hypervisor to do a
      remote TLB flush for us. This breaks the assumption about IPIs. Xen PV has
      been doing this for years and the upcoming remote TLB flush for Hyper-V will
      do it too.
      
      This is not safe, as software page table walkers may step on an already
      freed page.
      
      Fix the bug by enabling the RCU-based page table freeing mechanism,
      CONFIG_HAVE_RCU_TABLE_FREE=y.
      
      Testing with kernbench and mmap/munmap microbenchmarks, and neither showed
      any noticeable performance impact.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarJuergen Gross <jgross@suse.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/20170828082251.5562-1-vkuznets@redhat.com
      [ Rewrote/fixed/clarified the changelog. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      9e52fc2b
  15. 29 Aug, 2017 1 commit
    • Ingo Molnar's avatar
      locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being · 7b3d61cc
      Ingo Molnar authored
      Mike Galbraith bisected a boot crash back to the following commit:
      
        7a46ec0e ("locking/refcounts, x86/asm: Implement fast refcount overflow protection")
      
      The crash/hang pattern is:
      
       > Symptom is a few splats as below, with box finally hanging.  Network
       > comes up, but neither ssh nor console login is possible.
       >
       >  ------------[ cut here ]------------
       >  WARNING: CPU: 4 PID: 0 at net/netlink/af_netlink.c:374 netlink_sock_destruct+0x82/0xa0
       >  ...
       >  __sk_destruct()
       >  rcu_process_callbacks()
       >  __do_softirq()
       >  irq_exit()
       >  smp_apic_timer_interrupt()
       >  apic_timer_interrupt()
      
      We are at -rc7 already, and the code has grown some dependencies, so
      instead of a plain revert disable the config temporarily, in the hope
      of getting real fixes.
      Reported-by: default avatarMike Galbraith <efault@gmx.de>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/tip-7a46ec0e2f4850407de5e1d19a44edee6efa58ec@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7b3d61cc
  16. 24 Aug, 2017 1 commit
  17. 18 Aug, 2017 2 commits
    • Nicholas Piggin's avatar
      kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdog · 92e5aae4
      Nicholas Piggin authored
      Commit 05a4a952 ("kernel/watchdog: split up config options") lost
      the perf-based hardlockup detector's dependency on PERF_EVENTS, which
      can result in broken builds with some powerpc configurations.
      
      Restore the dependency.  Add it in for x86 too, despite x86 always
      selecting PERF_EVENTS it seems reasonable to make the dependency
      explicit.
      
      Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com
      Fixes: 05a4a952 ("kernel/watchdog: split up config options")
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      92e5aae4
    • Thomas Gleixner's avatar
      kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Thomas Gleixner authored
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
      fired since the last invocation. If not, it assumess a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: default avatarKan Liang <Kan.liang@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
  18. 17 Aug, 2017 1 commit
    • Kees Cook's avatar
      locking/refcounts, x86/asm: Implement fast refcount overflow protection · 7a46ec0e
      Kees Cook authored
      This implements refcount_t overflow protection on x86 without a noticeable
      performance impact, though without the fuller checking of REFCOUNT_FULL.
      
      This is done by duplicating the existing atomic_t refcount implementation
      but with normally a single instruction added to detect if the refcount
      has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
      the handler saturates the refcount_t to INT_MIN / 2. With this overflow
      protection, the erroneous reference release that would follow a wrap back
      to zero is blocked from happening, avoiding the class of refcount-overflow
      use-after-free vulnerabilities entirely.
      
      Only the overflow case of refcounting can be perfectly protected, since
      it can be detected and stopped before the reference is freed and left to
      be abused by an attacker. There isn't a way to block early decrements,
      and while REFCOUNT_FULL stops increment-from-zero cases (which would
      be the state _after_ an early decrement and stops potential double-free
      conditions), this fast implementation does not, since it would require
      the more expensive cmpxchg loops. Since the overflow case is much more
      common (e.g. missing a "put" during an error path), this protection
      provides real-world protection. For example, the two public refcount
      overflow use-after-free exploits published in 2016 would have been
      rendered unexploitable:
      
        http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
      
        http://cyseclabs.com/page?n=02012016
      
      This implementation does, however, notice an unchecked decrement to zero
      (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
      resulted in a zero). Decrements under zero are noticed (since they will
      have resulted in a negative value), though this only indicates that a
      use-after-free may have already happened. Such notifications are likely
      avoidable by an attacker that has already exploited a use-after-free
      vulnerability, but it's better to have them reported than allow such
      conditions to remain universally silent.
      
      On first overflow detection, the refcount value is reset to INT_MIN / 2
      (which serves as a saturation value) and a report and stack trace are
      produced. When operations detect only negative value results (such as
      changing an already saturated value), saturation still happens but no
      notification is performed (since the value was already saturated).
      
      On the matter of races, since the entire range beyond INT_MAX but before
      0 is negative, every operation at INT_MIN / 2 will trap, leaving no
      overflow-only race condition.
      
      As for performance, this implementation adds a single "js" instruction
      to the regular execution flow of a copy of the standard atomic_t refcount
      operations. (The non-"and_test" refcount_dec() function, which is uncommon
      in regular refcount design patterns, has an additional "jz" instruction
      to detect reaching exactly zero.) Since this is a forward jump, it is by
      default the non-predicted path, which will be reinforced by dynamic branch
      prediction. The result is this protection having virtually no measurable
      change in performance over standard atomic_t operations. The error path,
      located in .text.unlikely, saves the refcount location and then uses UD0
      to fire a refcount exception handler, which resets the refcount, handles
      reporting, and returns to regular execution. This keeps the changes to
      .text size minimal, avoiding return jumps and open-coded calls to the
      error reporting routine.
      
      Example assembly comparison:
      
      refcount_inc() before:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
      
      refcount_inc() after:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
        ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
        ...
        .text.unlikely:
        ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
        ffffffff816c36d7:       0f ff                   (bad)
      
      These are the cycle counts comparing a loop of refcount_inc() from 1
      to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
      unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
      (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
      
        2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
      		    cycles		protections
        atomic_t           82249267387	none
        refcount_t-fast    82211446892	overflow, untested dec-to-zero
        refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero
      
      This code is a modified version of the x86 PAX_REFCOUNT atomic_t
      overflow defense from the last public patch of PaX/grsecurity, based
      on my understanding of the code. Changes or omissions from the original
      code are mine and don't reflect the original grsecurity/PaX code. Thanks
      to PaX Team for various suggestions for improvement for repurposing this
      code to be a refcount-only protection.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Hans Liljestrand <ishkamiel@gmail.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arozansk@redhat.com
      Cc: axboe@kernel.dk
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-arch <linux-arch@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170815161924.GA133115@beastSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7a46ec0e
  19. 01 Aug, 2017 1 commit
  20. 26 Jul, 2017 2 commits
    • Josh Poimboeuf's avatar
      x86/kconfig: Consolidate unwinders into multiple choice selection · 81d38719
      Josh Poimboeuf authored
      There are three mutually exclusive unwinders.  Make that more obvious by
      combining them into a multiple-choice selection:
      
        CONFIG_FRAME_POINTER_UNWINDER
        CONFIG_ORC_UNWINDER
        CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)
      
      Frame pointers are still the default (for now).
      
      The old CONFIG_FRAME_POINTER option is still used in some
      arch-independent places, so keep it around, but make it
      invisible to the user on x86 - it's now selected by
      CONFIG_FRAME_POINTER_UNWINDER=y.
      Suggested-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@trebleSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      81d38719
    • Josh Poimboeuf's avatar
      x86/unwind: Add the ORC unwinder · ee9f8fce
      Josh Poimboeuf authored
      Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
      It plugs into the existing x86 unwinder framework.
      
      It relies on objtool to generate the needed .orc_unwind and
      .orc_unwind_ip sections.
      
      For more details on why ORC is used instead of DWARF, see
      Documentation/x86/orc-unwinder.txt - but the short version is
      that it's a simplified, fundamentally more robust debugninfo
      data structure, which also allows up to two orders of magnitude
      faster lookups than the DWARF unwinder - which matters to
      profiling workloads like perf.
      
      Thanks to Andy Lutomirski for the performance improvement ideas:
      splitting the ORC unwind table into two parallel arrays and creating a
      fast lookup table to search a subset of the unwind table.
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
      [ Extended the changelog. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ee9f8fce
  21. 21 Jul, 2017 1 commit
  22. 18 Jul, 2017 2 commits
    • Tom Lendacky's avatar
      x86/mm: Extend early_memremap() support with additional attrs · f88a68fa
      Tom Lendacky authored
      Add early_memremap() support to be able to specify encrypted and
      decrypted mappings with and without write-protection. The use of
      write-protection is necessary when encrypting data "in place". The
      write-protect attribute is considered cacheable for loads, but not
      stores. This implies that the hardware will never give the core a
      dirty line with this memtype.
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kvm@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/479b5832c30fae3efa7932e48f81794e86397229.1500319216.git.thomas.lendacky@amd.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      f88a68fa
    • Tom Lendacky's avatar
      x86/mm: Add Secure Memory Encryption (SME) support · 7744ccdb
      Tom Lendacky authored
      Add support for Secure Memory Encryption (SME). This initial support
      provides a Kconfig entry to build the SME support into the kernel and
      defines the memory encryption mask that will be used in subsequent
      patches to mark pages as encrypted.
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kvm@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/a6c34d16caaed3bc3e2d6f0987554275bd291554.1500319216.git.thomas.lendacky@amd.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7744ccdb
  23. 12 Jul, 2017 2 commits
    • Daniel Micay's avatar
      include/linux/string.h: add the option of fortified string.h functions · 6974f0c4
      Daniel Micay authored
      This adds support for compiling with a rough equivalent to the glibc
      _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer
      overflow checks for string.h functions when the compiler determines the
      size of the source or destination buffer at compile-time.  Unlike glibc,
      it covers buffer reads in addition to writes.
      
      GNU C __builtin_*_chk intrinsics are avoided because they would force a
      much more complex implementation.  They aren't designed to detect read
      overflows and offer no real benefit when using an implementation based
      on inline checks.  Inline checks don't add up to much code size and
      allow full use of the regular string intrinsics while avoiding the need
      for a bunch of _chk functions and per-arch assembly to avoid wrapper
      overhead.
      
      This detects various overflows at compile-time in various drivers and
      some non-x86 core kernel code.  There will likely be issues caught in
      regular use at runtime too.
      
      Future improvements left out of initial implementation for simplicity,
      as it's all quite optional and can be done incrementally:
      
      * Some of the fortified string functions (strncpy, strcat), don't yet
        place a limit on reads from the source based on __builtin_object_size of
        the source buffer.
      
      * Extending coverage to more string functions like strlcat.
      
      * It should be possible to optionally use __builtin_object_size(x, 1) for
        some functions (C strings) to detect intra-object overflows (like
        glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative
        approach to avoid likely compatibility issues.
      
      * The compile-time checks should be made available via a separate config
        option which can be enabled by default (or always enabled) once enough
        time has passed to get the issues it catches fixed.
      
      Kees said:
       "This is great to have. While it was out-of-tree code, it would have
        blocked at least CVE-2016-3858 from being exploitable (improper size
        argument to strlcpy()). I've sent a number of fixes for
        out-of-bounds-reads that this detected upstream already"
      
      [arnd@arndb.de: x86: fix fortified memcpy]
        Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de
      [keescook@chromium.org: avoid panic() in favor of BUG()]
        Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast
      [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help]
      Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com
      Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.orgSigned-off-by: default avatarDaniel Micay <danielmicay@gmail.com>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6974f0c4
    • Nicholas Piggin's avatar
      kernel/watchdog: split up config options · 05a4a952
      Nicholas Piggin authored
      Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
      HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.
      
      LOCKUP_DETECTOR implies the general boot, sysctl, and programming
      interfaces for the lockup detectors.
      
      An architecture that wants to use a hard lockup detector must define
      HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
      minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
      does not implement the LOCKUP_DETECTOR interfaces.
      
      sparc is unusual in that it has started to implement some of the
      interfaces, but not fully yet.  It should probably be converted to a full
      HAVE_HARDLOCKUP_DETECTOR_ARCH.
      
      [npiggin@gmail.com: fix]
        Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
      Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.comSigned-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarDon Zickus <dzickus@redhat.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      05a4a952
  24. 06 Jul, 2017 2 commits
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: clean up ARCH_HAS_GIGANTIC_PAGE · e1073d1e
      Aneesh Kumar K.V authored
      This moves the #ifdef in C code to a Kconfig dependency.  Also we move
      the gigantic_page_supported() function to be arch specific.
      
      This allows architectures to conditionally enable runtime allocation of
      gigantic huge page.  Architectures like ppc64 supports different
      gigantic huge page size (16G and 1G) based on the translation mode
      selected.  This provides an opportunity for ppc64 to enable runtime
      allocation only w.r.t 1G hugepage.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/1494995292-4443-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1073d1e
    • Huang Ying's avatar
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying authored
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6