1. 04 Oct, 2018 1 commit
  2. 15 Aug, 2018 3 commits
    • x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation · 8c35b2fc
      Andi Kleen authored
      commit 6b28baca upstream
      
      When PTEs are set to PROT_NONE the kernel just clears the Present bit and
      preserves the PFN, which creates attack surface for L1TF speculation
      attacks.
      
      This is important inside guests, because L1TF speculation bypasses physical
      page remapping. While the host has its own mitigations preventing leaking
      data from other VMs into the guest, this would still risk leaking the wrong
      page inside the current guest.
      
      This uses the same technique as Linus' swap entry patch: while an entry
      is in PROTNONE state, invert the complete PFN part of it. This ensures
      that the highest bit will point to non-existing memory.
      
      The inversion is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE,
      and pte/pmd/pud_pfn undo it.
      
      This assumes that no code path touches the PFN part of a PTE directly
      without using these primitives.
      
      This doesn't handle the case that MMIO is on the top of the CPU physical
      memory. If such an MMIO region was exposed by an unprivileged driver for
      mmap it would be possible to attack some real memory.  However this
      situation is all rather unlikely.
      
      For 32bit non PAE the inversion is not done because there are really not
      enough bits to protect anything.
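The inversion described above can be sketched as follows. This is a minimal illustration, not the kernel's code: it assumes a 64-bit entry with Present in bit 0 and the PFN in bits 12-51, and it assumes bit 8 marks a PROT_NONE entry. All `SK_*` and `sk_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12
#define SK_PRESENT    (UINT64_C(1) << 0)
#define SK_PROTNONE   (UINT64_C(1) << 8)            /* assumed marker bit */
#define SK_PFN_MASK   UINT64_C(0x000ffffffffff000)  /* assumed PFN field, bits 12..51 */

/* An entry carries an inverted PFN if it is PROT_NONE and not Present. */
static int sk_needs_invert(uint64_t pte)
{
    return (pte & (SK_PRESENT | SK_PROTNONE)) == SK_PROTNONE;
}

/* Build a PROT_NONE entry: Present cleared, PFN field stored inverted. */
static uint64_t sk_mk_protnone(uint64_t pfn, uint64_t flags)
{
    uint64_t pte = ((pfn << SK_PAGE_SHIFT) & SK_PFN_MASK) | (flags & ~SK_PFN_MASK);

    pte = (pte & ~SK_PRESENT) | SK_PROTNONE;
    return pte ^ SK_PFN_MASK;                       /* flip every PFN bit */
}

/* Read the PFN back, undoing the inversion when it was applied. */
static uint64_t sk_pte_pfn(uint64_t pte)
{
    if (sk_needs_invert(pte))
        pte ^= SK_PFN_MASK;
    return (pte & SK_PFN_MASK) >> SK_PAGE_SHIFT;
}

/* Round trip: the PFN survives, and the raw top PFN bit is set. */
static void sk_selfcheck(void)
{
    uint64_t pte = sk_mk_protnone(0x1234, 0);

    assert(sk_pte_pfn(pte) == 0x1234);
    assert(pte & (UINT64_C(1) << 51));
}
```

For a low PFN the stored field has its high bits set, so a speculative load that takes the raw bits as an address lands above installed memory, while every reader that goes through the accessor still sees the original PFN.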
      
      Q: Why does the guest need to be protected when the hypervisor already has
         L1TF mitigations?
      
      A: Here's an example:
      
         Physical pages 1 2 get mapped into a guest as
         GPA 1 -> PA 2
         GPA 2 -> PA 1
         through EPT.
      
         The L1TF speculation ignores the EPT remapping.
      
         Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
         they belong to different users and should be isolated.
      
         A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
         gets read access to the underlying physical page, which in this case
         points to PA 2. It can therefore read process B's data if it happened
         to be in L1, so isolation inside the guest is broken.
      
         There's nothing the hypervisor can do about this. This mitigation has to
         be done in the guest itself.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Protect swap entries against L1TF · 83ef7e8c
      Linus Torvalds authored
      commit 2f22b4cd upstream
      
      With L1 terminal fault the CPU speculates into unmapped PTEs, and the
      resulting side effects allow reading the memory the PTE is pointing to, if
      its values are still in the L1 cache.
      
      For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
      them.
      
      To protect against L1TF it must be ensured that the swap entry is not
      pointing to valid memory, which requires setting higher bits (between bit
      36 and bit 45) that are inside the CPU's physical address space, but outside
      any real memory.
      
      To do this invert the offset to make sure the higher bits are always set,
      as long as the swap file is not too big.
      
      Note there is no workaround for 32bit !PAE, or on systems which have more
      than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on
      real systems.
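A minimal sketch of the offset inversion, under stated assumptions: the layout used here (offset in bits 9-58, type in bits 59-63) is the one described in the companion "Change order of offset/type in swap entry" commit, and the `SW_*` and `sw_*` names are hypothetical rather than the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

#define SW_TYPE_BITS    5
#define SW_OFFSET_FIRST 9                         /* lowest offset bit in the entry */
#define SW_OFFSET_BITS  50                        /* offset occupies bits 9..58 */
#define SW_OFFSET_MASK  ((UINT64_C(1) << SW_OFFSET_BITS) - 1)

/* Encode: the offset is stored inverted, so small offsets set the high bits. */
static uint64_t sw_entry(unsigned type, uint64_t offset)
{
    uint64_t inv = ~offset & SW_OFFSET_MASK;

    return (inv << SW_OFFSET_FIRST) |
           ((uint64_t)(type & 0x1f) << (64 - SW_TYPE_BITS));
}

static unsigned sw_type(uint64_t e)
{
    return (unsigned)(e >> (64 - SW_TYPE_BITS));
}

/* Decode: shift the field down, then invert it back. */
static uint64_t sw_offset(uint64_t e)
{
    return ~(e >> SW_OFFSET_FIRST) & SW_OFFSET_MASK;
}

/* Round trip, plus a check that bits 36..45 are all set for a small offset. */
static void sw_selfcheck(void)
{
    uint64_t e = sw_entry(3, 0x1000);
    uint64_t high = ((UINT64_C(1) << 10) - 1) << 36;

    assert(sw_type(e) == 3);
    assert(sw_offset(e) == 0x1000);
    assert((e & high) == high);
}
```

With this encoding the Present bit (bit 0) stays clear, and the entry only starts to look like a pointer into real memory once the swap offset grows large enough to clear the high bits again.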
      
      [AK: updated description and minor tweaks. Split out from the original
           patch ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation/l1tf: Change order of offset/type in swap entry · 39991a7a
      Linus Torvalds authored
      commit bcd11afa upstream
      
      If pages are swapped out, the swap entry is stored in the corresponding
      PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
      on PTE entries which have the Present bit cleared and would treat the swap
      entry as a physical address (PFN). To mitigate that, the upper bits of the
      PTE must be set so the PTE points to non-existent memory.
      
      The swap entry stores the type and the offset of a swapped out page in the
      PTE. The type is stored in bits 9-13 and the offset in bits 14-63. The
      hardware ignores the bits beyond the physical address space limit, so to
      make the mitigation effective it is required to start 'offset' at the
      lowest possible bit so that even large swap offsets do not reach into the
      physical address space limit bits.
      
      Move offset to bits 9-58 and type to bits 59-63, which are the bits that
      hardware generally doesn't care about.
      
      That, in turn, means that if you are on a desktop chip with only 40 bits
      of physical addressing, now that the offset starts at bit 9, there need to
      be 30 bits of offset actually *in use* until bit 39 ends up being set,
      which means when inverted it will again point into existing memory.
      
      So that's 4 terabytes of swap space (because the offset is counted in
      pages, so 30 bits of offset is 42 bits of actual coverage). With bigger
      physical addressing, that obviously grows further, until the limit of the
      offset is hit (at 50 bits of offset - 62 bits of actual swap file
      coverage).
      
      This is a preparatory change for the actual swap entry inversion to protect
      against L1TF.
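The arithmetic in this changelog can be checked with a short sketch. It assumes 4 KiB pages and the layout stated above (offset starting at bit 9, 50 offset bits); the `cv_*` helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CV_PAGE_SHIFT       12  /* assumes 4 KiB pages */
#define CV_OFFSET_FIRST_BIT 9   /* offset starts at PTE bit 9 */
#define CV_OFFSET_BITS      50  /* offset occupies bits 9..58 */

/*
 * Number of offset bits that can be in use before the top physical
 * address bit (phys_bits - 1) of the entry is affected.
 * Precondition: phys_bits > CV_OFFSET_FIRST_BIT + 1.
 */
static unsigned cv_safe_offset_bits(unsigned phys_bits)
{
    unsigned bits = phys_bits - 1 - CV_OFFSET_FIRST_BIT;

    return bits > CV_OFFSET_BITS ? CV_OFFSET_BITS : bits;
}

/* log2 of the swap size in bytes covered by those offsets. */
static unsigned cv_swap_coverage_log2(unsigned phys_bits)
{
    return cv_safe_offset_bits(phys_bits) + CV_PAGE_SHIFT;
}

static void cv_selfcheck(void)
{
    assert(cv_safe_offset_bits(40) == 30);
    assert(cv_swap_coverage_log2(40) == 42);          /* 2^42 bytes */
    assert((UINT64_C(1) << 42) == UINT64_C(4) << 40); /* which is 4 TiB */
}
```

For 40 physical address bits this gives 40 - 1 - 9 = 30 offset bits, and 30 + 12 = 42 bits of byte coverage, matching the 4 terabytes quoted in the text.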
      
      [ AK: Updated description and minor tweaks. Split into two parts ]
      [ tglx: Massaged changelog ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 09 Mar, 2018 1 commit
  4. 02 Jan, 2018 1 commit
    • x86/mm/pti: Add mapping helper functions · b9feab7d
      Dave Hansen authored
      commit 61e9b367 upstream.
      
      Add the pagetable helper functions to manage the separate user space page
      tables.
      
      [ tglx: Split out from the big combo kaiser patch. Folded Andy's
        simplification and made it out of line as Boris suggested ]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
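The header/source distinction mentioned above can be illustrated with a small sketch. It is not the script Thomas and Greg used; the function name `spdx_line` and its interface are hypothetical. It only relies on the kernel convention that C source files carry the tag in a `//` comment and header files in a `/* */` comment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the SPDX tag line for a file, choosing the comment style from the
 * file extension. Returns the snprintf() result, or -1 for file types this
 * sketch does not handle.
 */
static int spdx_line(const char *path, const char *id, char *buf, size_t len)
{
    const char *dot = strrchr(path, '.');

    if (dot && strcmp(dot, ".h") == 0)
        return snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
    if (dot && strcmp(dot, ".c") == 0)
        return snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
    return -1;
}

static void spdx_selfcheck(void)
{
    char buf[128];

    assert(spdx_line("mm/memory.c", "GPL-2.0", buf, sizeof(buf)) > 0);
    assert(strcmp(buf, "// SPDX-License-Identifier: GPL-2.0") == 0);
}
```

A real tool would also have to handle Makefiles, assembly, scripts and other file types, each with its own comment syntax; this sketch covers only the two cases named in the text.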
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 09 Sep, 2017 2 commits
    • mm: thp: enable thp migration in generic path · 616b8371
      Zi Yan authored
      Add thp migration's core code, including conversions between a PMD entry
      and a swap entry, setting PMD migration entry, removing PMD migration
      entry, and waiting on PMD migration entries.
      
      This patch makes it possible to support thp migration.  If you fail to
      allocate a destination page as a thp, you just split the source thp as
      we do now, and then enter the normal page migration.  If you succeed in
      allocating a destination thp, you enter thp migration.  Subsequent patches
      actually enable thp migration for each caller of page migration by
      allowing its get_new_page() callback to allocate thps.
      
      [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
        Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
      [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 · eee4818b
      Naoya Horiguchi authored
      _PAGE_PSE is used to distinguish between a truly non-present
      (_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and
      should be treated as present.
      
      But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would
      cause confusion between one of those PMDs undergoing a THP split, and a
      soft-dirty PMD.  Dropping _PAGE_PSE check in pmd_present() does not work
      well, because it can hurt optimization of tlb handling in thp split.
      
      Thus, we need to move the bit.
      
      In the current kernel, bits 1-4 are not used in non-present format since
      commit 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work
      around erratum").  So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.  Bit 7
      is used as reserved (always clear), so please don't use it for other
      purposes.
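A minimal sketch of the conflict, under simplifying assumptions: the presence test below checks only the Present bit (bit 0) and the PSE bit (bit 7), and every `SD_*` and `sd_*` name is hypothetical. It shows why a soft-dirty flag sharing bit 7 makes a non-present PMD look present, and why bit 1 does not.

```c
#include <assert.h>
#include <stdint.h>

#define SD_PRESENT            (UINT64_C(1) << 0)
#define SD_PSE                (UINT64_C(1) << 7)
#define SD_SWP_SOFT_DIRTY_OLD (UINT64_C(1) << 7)  /* old position: shares the PSE bit */
#define SD_SWP_SOFT_DIRTY_NEW (UINT64_C(1) << 1)  /* new position */

/* Simplified presence test: Present, or PSE set while a THP split is under way. */
static int sd_pmd_present(uint64_t pmd)
{
    return (pmd & (SD_PRESENT | SD_PSE)) != 0;
}

static void sd_selfcheck(void)
{
    /* Arbitrary non-present, swap-style value with bits 0 and 7 clear. */
    uint64_t swap_pmd = UINT64_C(42) << 14;

    assert(!sd_pmd_present(swap_pmd));
    assert(sd_pmd_present(swap_pmd | SD_SWP_SOFT_DIRTY_OLD));   /* misread as present */
    assert(!sd_pmd_present(swap_pmd | SD_SWP_SOFT_DIRTY_NEW));  /* stays non-present */
}
```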
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-3-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 13 Jun, 2017 3 commits
  8. 23 Apr, 2017 1 commit
    • Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" · 6dd29b3d
      Ingo Molnar authored
      This reverts commit 2947ba05.
      
      Dan Williams reported dax-pmem kernel warnings with the following signature:
      
         WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
         percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
      
      ... and bisected it to this commit, which suggests possible memory corruption
      caused by the x86 fast-GUP conversion.
      
      He also pointed out:
      
       "
        This is similar to the backtrace when we were not properly handling
        pud faults and was fixed with this commit: 220ced16 "mm: fix
        get_user_pages() vs device-dax pud mappings"
      
        I've found some missing _devmap checks in the generic
        get_user_pages_fast() path, but this does not fix the regression
        [...]
       "
      
      So given that there are known bugs, and a pretty robust looking bisection
      points to this commit, suggesting that there are unknown bugs in the
      conversion as well, revert it for the time being - we'll re-try in v4.13.
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dann.frazier@canonical.com
      Cc: dave.hansen@intel.com
      Cc: steve.capper@linaro.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 04 Apr, 2017 1 commit
  10. 27 Mar, 2017 1 commit
  11. 21 Mar, 2017 1 commit
    • x86/headers: Simplify asm/fixmap.h inclusion into asm/pgtable*.h · ef37bc36
      Thomas Garnier authored
      Instead of including fixmap.h twice in pgtable_32.h and pgtable_64.h,
      include it only once, in the common asm/pgtable.h header.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-mm@kvack.org
      Cc: richard.weiyang@gmail.com
      Cc: zijun_hu <zijun_hu@htc.com>
      Link: http://lkml.kernel.org/r/20170321071725.GA15782@gmail.com
      [ Generated this patch from two other patches and wrote changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 18 Mar, 2017 2 commits
  13. 25 Feb, 2017 1 commit
  14. 15 Dec, 2016 1 commit
  15. 11 Aug, 2016 1 commit
    • x86/mm: Fix swap entry comment and macro · ace7fab7
      Dave Hansen authored
      A recent patch changed the format of a swap PTE.
      
      The comment explaining the format of the swap PTE is wrong about
      the bits used for the swap type field.  Amusingly, the ASCII art
      and the patch description are correct, but the comment itself
      is wrong.
      
      As I was looking at this, I also noticed that the
      SWP_OFFSET_FIRST_BIT has an off-by-one error.  This does not
      really hurt anything.  It just wasted a bit of space in the PTE,
      giving us 2^59 bytes of addressable space in our swapfiles
      instead of 2^60.  But, it doesn't match with the comments, and it
      wastes a bit of space, so fix it.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Fixes: 00839ee3 ("x86/mm: Move swap offset/type up in PTE to work around erratum")
      Link: http://lkml.kernel.org/r/20160810172325.E56AD7DA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 13 Jul, 2016 1 commit
    • x86/mm: Move swap offset/type up in PTE to work around erratum · 00839ee3
      Dave Hansen authored
      This erratum can result in Accessed/Dirty getting set by the hardware
      when we do not expect them to be (on !Present PTEs).
      
      Instead of trying to fix them up after this happens, we just
      allow the bits to get set and try to ignore them.  We do this by
      shifting the layout of the bits we use for swap offset/type in
      our 64-bit PTEs.
      
      It looks like this:
      
       bitnrs: |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
       names:  |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
       before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
        after: | OFFSET (14-63)  |  TYPE (9-13) |0|X|X|X| X| X|X|X|0|
      
      Note that D was already a don't care (X) even before.  We just
      move TYPE up and turn its old spot (which could be hit by the
      A bit) into all don't cares.
      
      We take 5 bits away from the offset, but that still leaves us
      with 50 bits which lets us index into a 62-bit swapfile (4 EiB).
      I think that's probably fine for the moment.  We could
      theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it
      doesn't gain us anything.
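The "after" row of the diagram can be sketched as follows: TYPE in bits 9-13, OFFSET in bits 14-63, and the low bits treated as don't-cares when decoding. The `ER_*` and `er_*` names are hypothetical; bit positions for Accessed (5) and Dirty (6) are read off the diagram's "names" row.

```c
#include <assert.h>
#include <stdint.h>

#define ER_TYPE_SHIFT   9
#define ER_TYPE_MASK    UINT64_C(0x1f)     /* 5 type bits, 9..13 */
#define ER_OFFSET_SHIFT 14                 /* offset occupies bits 14..63 */
#define ER_ACCESSED     (UINT64_C(1) << 5)
#define ER_DIRTY        (UINT64_C(1) << 6)

/* Encode; the caller must keep offset below 2^50 so no bits are shifted out. */
static uint64_t er_entry(unsigned type, uint64_t offset)
{
    return ((type & ER_TYPE_MASK) << ER_TYPE_SHIFT) | (offset << ER_OFFSET_SHIFT);
}

static unsigned er_type(uint64_t e)
{
    return (unsigned)((e >> ER_TYPE_SHIFT) & ER_TYPE_MASK);
}

static uint64_t er_offset(uint64_t e)
{
    return e >> ER_OFFSET_SHIFT;
}

/* Hardware setting A/D on the non-present entry does not change the decode. */
static void er_selfcheck(void)
{
    uint64_t e = er_entry(7, 12345) | ER_ACCESSED | ER_DIRTY;

    assert(er_type(e) == 7);
    assert(er_offset(e) == 12345);
    assert(64 - ER_OFFSET_SHIFT == 50);    /* 50 offset bits + 12 page bits = 62 */
}
```

Because neither decoder looks at bits 1-8, a spurious Accessed or Dirty bit is simply ignored, which is the approach the changelog describes instead of fixing the bits up afterwards.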
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: dave.hansen@intel.com
      Cc: linux-mm@kvack.org
      Cc: mhocko@suse.com
      Link: http://lkml.kernel.org/r/20160708001911.9A3FD2B6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 13 Feb, 2015 1 commit
    • mm: remove remaining references to NUMA hinting bits and helpers · 21d9ee3e
      Mel Gorman authored
      This patch removes the NUMA PTE bits and associated helpers.  As a
      side-effect it increases the maximum possible swap space on x86-64.
      
      One potential source of problems is races between the marking of PTEs
      PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
      a PTE being protected is not faulted in parallel, seen as a pte_none and
      corrupting memory.  The base case is safe but transhuge has had problems
      in the past due to a different migration mechanism and a dependence on
      page lock to serialise migrations, and warrants a closer look.
      
      task_work hinting update			parallel fault
      ------------------------			--------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						pmd_none
      						  do_huge_pmd_anonymous_page
      						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
      						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
            pmd_modify
            set_pmd_at
      
      task_work hinting update			parallel migration
      ------------------------			------------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						  do_huge_pmd_numa_page
      						    migrate_misplaced_transhuge_page
      						    pmd_lock waits for updates to complete, recheck pmd_same
            pmd_modify
            set_pmd_at
      
      Both of those are safe and the case where a transhuge page is inserted
      during a protection update is unchanged.  The case where two processes try
      migrating at the same time is unchanged by this series so should still be
      ok.  I could not find a case where we are accidentally depending on the
      PTE not being cleared and flushed.  If one is missed, it'll manifest as
      corruption problems that start triggering shortly after this series is
      merged and only happen when NUMA balancing is enabled.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21d9ee3e
  18. 10 Feb, 2015 1 commit
  19. 16 Sep, 2014 1 commit
    • Yasuaki Ishimatsu's avatar
      x86/mm/hotplug: Modify PGD entry when removing memory · 9661d5bc
      Yasuaki Ishimatsu authored
      When hot-adding or hot-removing memory, sync_global_pgds() is called
      to synchronize the PGD with the PGD entries of every process's mm.
      But when hot-removing memory, sync_global_pgds() does not work
      correctly.
      
      sync_global_pgds() first checks whether the target PGD is none, and
      skips it if so.  But after hot-removing memory the PGD may be none
      because free_pud_table() may have cleared it.  So when
      sync_global_pgds() is called after hot-removing memory, it must not
      skip a PGD just because it is none; it must clear the corresponding
      PGD entries of every process's mm.
      
      Currently sync_global_pgds() does not clear those PGD entries when
      hot-removing memory.  As a result, hot-adding memory that covers the
      same range as previously removed memory triggers the following call
      trace:
      
       kernel BUG at arch/x86/mm/init_64.c:206!
       ...
       [<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
       [<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
       [<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
       [<ffffffff815d03d9>] add_memory+0xb9/0x1b0
       [<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
       [<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
       [<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
       [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
       [<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
       [<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
       [<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
       [<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
       [<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
       [<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
       [<ffffffff8107e02b>] process_one_work+0x17b/0x460
       [<ffffffff8107edfb>] worker_thread+0x11b/0x400
       [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
       [<ffffffff81085aef>] kthread+0xcf/0xe0
       [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
       [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
      
      This patch clears the PGD entries of every process's mm when
      sync_global_pgds() is called after hot-removing memory.
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9661d5bc
  20. 10 Sep, 2014 1 commit
    • Stefan Bader's avatar
      x86/xen: don't copy bogus duplicate entries into kernel page tables · 0b5a5063
      Stefan Bader authored
      When RANDOMIZE_BASE (KASLR) is enabled, or the sum of all loaded
      modules exceeds 512 MiB, loading modules fails with a warning
      (and hence a vmalloc allocation failure) because the PTEs for the
      newly-allocated vmalloc address space are not zero.
      
        WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
                 vmap_page_range_noflush+0x2a1/0x360()
      
      This is caused by xen_setup_kernel_pagetables() copying
      level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
      entries.
      
      Without KASLR, the normal kernel image size only covers the first half
      of level2_kernel_pgt and module space starts after that.
      
      L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                        [256..511]->module
                                [511]->level2_fixmap_pgt[  0..505]->module
      
      This allows 512 MiB of module vmalloc space to be used before
      having to use the corrupted level2_fixmap_pgt entries.
      
      With KASLR enabled, the kernel image uses the full PUD range of 1G and
      module space starts in the level2_fixmap_pgt. So basically:
      
      L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                                [511]->level2_fixmap_pgt[0..505]->module
      
      And now no module vmalloc space can be used without using the corrupt
      level2_fixmap_pgt entries.
      
      Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
      and setting level1_fixmap_pgt as read-only.
      
      A number of comments were also using the wrong L3 offset for
      level2_kernel_pgt.  These have been corrected.
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: stable@vger.kernel.org
      0b5a5063
  21. 04 Jun, 2014 2 commits
    • Cyrill Gorcunov's avatar
      mm: x86 pgtable: drop unneeded preprocessor ifdef · 2373eaec
      Cyrill Gorcunov authored
      _PAGE_BIT_FILE (bit 6) is always less than _PAGE_BIT_PROTNONE (bit 8), so
      drop redundant #ifdef.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2373eaec
    • Mel Gorman's avatar
      x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman authored
      _PAGE_NUMA is currently an alias of _PAGE_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific and is
      conceptually difficult, requiring special casing to distinguish between
      PROTNONE and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults, because a fault is trapped whenever the PTE is
      not present.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
  22. 24 Jan, 2013 1 commit
  23. 17 Nov, 2012 1 commit
  24. 09 Oct, 2012 1 commit
    • David Miller's avatar
      mm: Add and use update_mmu_cache_pmd() in transparent huge page code. · b113da65
      David Miller authored
      The transparent huge page code passes a PMD pointer in as the third
      argument of update_mmu_cache(), which expects a PTE pointer.
      
      This never got noticed because X86 implements update_mmu_cache() as a
      macro and thus we don't get any type checking, and X86 is the only
      architecture which supports transparent huge pages currently.
      
      Before other architectures can support transparent huge pages properly we
      need to add a new interface which will take a PMD pointer as the third
      argument rather than a PTE pointer.
      
      [akpm@linux-foundation.org: implement update_mmu_cache_pmd() for s390]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b113da65
  25. 06 Jun, 2012 1 commit
  26. 14 Jan, 2011 4 commits
    • Johannes Weiner's avatar
      thp: add x86 32bit support · f2d6bfe9
      Johannes Weiner authored
      Add support for transparent hugepages to x86 32bit.
      
      Share the same VM_ bitflag for VM_MAPPED_COPY.  mm/nommu.c will never
      support transparent hugepages.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2d6bfe9
    • Andrea Arcangeli's avatar
      thp: transparent hugepage core · 71e3aac0
      Andrea Arcangeli authored
      Lately I've been working to make KVM use hugepages transparently without
      the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
      see removed:
      
      1) hugepages have to be swappable or the guest physical memory remains
         locked in RAM and can't be paged out to swap
      
      2) if a hugepage allocation fails, regular pages should be allocated
         instead and mixed in the same vma without any failure and without
         userland noticing
      
      3) if some task quits and more hugepages become available in the
         buddy, guest physical memory backed by regular pages should be
         relocated on hugepages automatically in regions under
         madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
         kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
         not null)
      
      4) avoidance of reservation and maximization of use of hugepages whenever
         possible. Reservation (needed to avoid runtime fatal failures) may be ok for
         1 machine with 1 database with 1 database cache with 1 database cache size
         known at boot time. It's definitely not feasible with a virtualization
         hypervisor usage like RHEV-H that runs an unknown number of virtual machines
         with an unknown size of each virtual machine with an unknown amount of
         pagecache that could be potentially useful in the host for guest not using
         O_DIRECT (aka cache=off).
      
      hugepages in the virtualization hypervisor (and also in the guest!) are
      much more important than in a regular host not using virtualization,
      because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
      to 19 in case only the hypervisor uses transparent hugepages, and they
      decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
      linux hypervisor and the linux guest use this patch (though the
      guest will limit the additional speedup to anonymous regions only for
      now...).  Even more important is that the tlb miss handler is much slower
      on a NPT/EPT guest than for a regular shadow paging or no-virtualization
      scenario.  So maximizing the amount of virtual memory cached by the TLB
      pays off significantly more with NPT/EPT than without (even if there would
      be no significant speedup in the tlb-miss runtime).
      
      The first (and more tedious) part of this work requires allowing the VM to
      handle anonymous hugepages mixed with regular pages transparently on
      regular anonymous vmas.  This is what this patch tries to achieve in the
      least intrusive possible way.  We want hugepages and hugetlb to be used in
      a way so that all applications can benefit without changes (as usual we
      leverage the KVM virtualization design: by improving the Linux VM at
      large, KVM gets the performance boost too).
      
      The most important design choice is: always fallback to 4k allocation if
      the hugepage allocation fails!  This is the _very_ opposite of some large
      pagecache patches that failed with -EIO back then if a 64k (or similar)
      allocation failed...
      
      Second important decision (to reduce the impact of the feature on the
      existing pagetable handling code) is that at any time we can split an
      hugepage into 512 regular pages and it has to be done with an operation
      that can't fail.  This way the reliability of the swapping isn't decreased
      (no need to allocate memory when we are short on memory to swap) and it's
      trivial to plug a split_huge_page* one-liner where needed without
      polluting the VM.  Over time we can teach mprotect, mremap and friends to
      handle pmd_trans_huge natively without calling split_huge_page*.  The fact
      it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
      (instead of the current void) we'd need to rollback the mprotect from the
      middle of it (ideally including undoing the split_vma) which would be a
      big change and in the very wrong direction (it'd likely be simpler not to
      call split_huge_page at all and to teach mprotect and friends to handle
      hugepages instead of rolling them back from the middle).  In short the
      very value of split_huge_page is that it can't fail.
      
      The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
      incremental and it'll just be an "harmless" addition later if this initial
      part is agreed upon.  It also should be noted that locking-wise replacing
      regular pages with hugepages is going to be very easy if compared to what
      I'm doing below in split_huge_page, as it will only happen when
      page_count(page) matches page_mapcount(page) if we can take the PG_lock
      and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
      that (unlike split_huge_page) can fail at the minimal sign of trouble and
      we can try again later.  collapse_huge_page will be similar to how KSM
      works and the madvise(MADV_HUGEPAGE) will work similar to
      madvise(MADV_MERGEABLE).
      
      The default I like is that transparent hugepages are used at page fault
      time.  This can be changed with
      /sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
      to three values "always", "madvise", "never" which mean respectively that
      hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
      or never used.  /sys/kernel/mm/transparent_hugepage/defrag instead
      controls if the hugepage allocation should defrag memory aggressively
      "always", only inside "madvise" regions, or "never".
      
      The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
      put_page (from get_user_page users that can't use mmu notifier like
      O_DIRECT) that runs against a __split_huge_page_refcount instead was a
      pain to serialize in a way that would result always in a coherent page
      count for both tail and head.  I think my locking solution with a
      compound_lock taken only after the page_first is valid and is still a
      PageHead should be safe but it surely needs review from SMP race point of
      view.  In short there is no current existing way to serialize the O_DIRECT
      final put_page against split_huge_page_refcount so I had to invent a new
      one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
      returns so...).  And I didn't want to impact all gup/gup_fast users for
      now, maybe if we change the gup interface substantially we can avoid this
      locking, I admit I didn't think too much about it because changing the gup
      unpinning interface would be invasive.
      
      If we ignored O_DIRECT we could stick to the existing compound refcounting
      code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
      (and any other mmu notifier user) would call it without FOLL_GET (and if
      FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
      current task mmu notifier list yet).  But O_DIRECT is fundamental for
      decent performance of virtualized I/O on fast storage so we can't avoid it
      to solve the race of put_page against split_huge_page_refcount to achieve
      a complete hugepage feature for KVM.
      
      Swap and oom works fine (well just like with regular pages ;).  MMU
      notifier is handled transparently too, with the exception of the young bit
      on the pmd, that didn't have a range check but I think KVM will be fine
      because the whole point of hugepages is that EPT/NPT will also use a huge
      pmd when they notice gup returns pages with PageCompound set, so they
      won't care of a range and there's just the pmd young bit to check in that
      case.
      
      NOTE: in some cases if the L2 cache is small, this may slow down and waste
      memory during COWs because 4M of memory are accessed in a single fault
      instead of 8k (the payoff is that after COW the program can run faster).
      So we might want to switch the copy_huge_page (and clear_huge_page too) to
      non-temporal stores.  I also extensively researched ways to avoid this
      cache thrashing with a full prefault logic that would cow in 8k/16k/32k/64k
      up to 1M (I can send those patches that fully implemented prefault) but I
      concluded they're not worth it and they add a huge additional complexity
      and they remove all tlb benefits until the full hugepage has been faulted
      in, to save a little bit of memory and some cache during app startup, but
      they still don't improve substantially the cache thrashing during startup
      if the prefault happens in >4k chunks.  One reason is that those 4k pte
      entries copied are still mapped on a perfectly cache-colored hugepage, so
      the thrashing is the worst one can generate in those copies (cow of 4k page
      copies aren't so well colored so they thrash less, but again this results
      in software running faster after the page fault).  Those prefault patches
      allowed things like a pte where post-cow pages were local 4k regular anon
      pages and the not-yet-cowed pte entries were pointing in the middle of
      some hugepage mapped read-only.  If it doesn't pay off substantially with
      today's hardware it will pay off even less in the future with larger l2
      caches, and the prefault logic would bloat the VM a lot.  If one is
      embedded transparent_hugepage can be disabled during boot with sysfs or
      with the boot commandline parameter transparent_hugepage=0 (or
      transparent_hugepage=2 to restrict hugepages inside madvise regions) that
      will ensure not a single hugepage is allocated at boot time.  It is simple
      enough to just disable transparent hugepage globally and let transparent
      hugepages be allocated selectively by applications in the MADV_HUGEPAGE
      region (both at page fault time, and if enabled with the
      collapse_huge_page too through the kernel daemon).
      
      This patch supports only hugepages mapped in the pmd, archs that have
      smaller hugepages will not fit in this patch alone.  Also some archs like
      power have certain tlb limits that prevent mixing different page sizes in
      the same regions so they will not fit in this framework that requires
      "graceful fallback" to basic PAGE_SIZE in case of physical memory
      fragmentation.  hugetlbfs remains a perfect fit for those because its
      software limits happen to match the hardware limits.  hugetlbfs also
      remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
      to be found not fragmented after a certain system uptime and that would be
      very expensive to defragment with relocation, so requiring reservation.
      hugetlbfs is the "reservation way", the point of transparent hugepages is
      not to have any reservation at all and maximizing the use of cache and
      hugepages at all times automatically.
      
      Some performance results:
      
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566023
      memset tlb miss 453854
      memset second tlb miss 453321
      random access tlb miss 41635
      random access second tlb miss 41658
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566471
      memset tlb miss 453375
      memset second tlb miss 453320
      random access tlb miss 41636
      random access second tlb miss 41637
      vmx andrea # ./largepages3
      memset page fault 1566642
      memset tlb miss 453417
      memset second tlb miss 453313
      random access tlb miss 41630
      random access second tlb miss 41647
      vmx andrea # ./largepages3
      memset page fault 1566872
      memset tlb miss 453418
      memset second tlb miss 453315
      random access tlb miss 41618
      random access second tlb miss 41659
      vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
      vmx andrea # ./largepages3
      memset page fault 2182476
      memset tlb miss 460305
      memset second tlb miss 460179
      random access tlb miss 44483
      random access second tlb miss 44186
      vmx andrea # ./largepages3
      memset page fault 2182791
      memset tlb miss 460742
      memset second tlb miss 459962
      random access tlb miss 43981
      random access second tlb miss 43988
      
      ============
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (3UL*1024*1024*1024)
      
      int main(void)
      {
      	char *p = malloc(SIZE), *p2;
      	struct timeval before, after;
      
      	/* Bail out if the 3G buffer cannot be allocated. */
      	if (!p) {
      		perror("malloc");
      		return 1;
      	}
      
      	/* First memset: every page is faulted in. */
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset page fault %lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	/* Second memset: pages are mapped, only tlb misses remain. */
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset tlb miss %lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	/* Third memset: same measurement repeated. */
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset second tlb miss %lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	/* Touch one byte in every 4k page. */
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access tlb miss %lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	/* Same per-page walk repeated. */
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access second tlb miss %lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	return 0;
      }
      ============
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71e3aac0
    • Andrea Arcangeli's avatar
      thp: add pmd mangling functions to x86 · db3eb96f
      Andrea Arcangeli authored
      Add needed pmd mangling functions with symmetry with their pte
      counterparts.  pmdp_splitting_flush() is the only new addition on the pmd_
      methods and it's needed to serialize the VM against split_huge_page.  It
      simply sets the splitting bit atomically, in the same way that
      pmdp_clear_flush_young atomically clears the accessed bit.
      pmdp_splitting_flush() also flushes the tlb to be effective against
      gup_fast.  The flush itself is not strictly required; it is just the
      simplest operation we can invoke to serialize pmdp_splitting_flush()
      against gup_fast.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db3eb96f
    • Andrea Arcangeli's avatar
      thp: special pmd_trans_* functions · 5f6e8da7
      Andrea Arcangeli authored
      These return 0 at compile time when the config option is disabled, to
      allow gcc to eliminate the transparent hugepage function calls at compile
      time without additional #ifdefs (only the export of those functions has
      to be visible to gcc but they won't be required at link time and
      huge_memory.o can be not built at all).
      
      _PAGE_BIT_UNUSED1 is never used for pmd, only on pte.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f6e8da7
  27. 26 Oct, 2010 1 commit
  28. 26 Aug, 2010 1 commit
  29. 10 Aug, 2010 1 commit
  30. 20 Feb, 2010 1 commit
    • Russell King's avatar
      MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself · 4b3073e1
      Russell King authored
      On VIVT ARM, when we have multiple shared mappings of the same file
      in the same MM, we need to ensure that we have coherency across all
      copies.  We do this via make_coherent() by making the pages
      uncacheable.
      
      This used to work fine, until we allowed highmem with highpte - we
      now have a page table which is mapped as required, and is not available
      for modification via update_mmu_cache().
      
      Ralf Baechle suggested getting rid of the PTE value passed to
      update_mmu_cache():
      
        On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
        to construct a pointer to the pte again.  Passing a pte_t * is much
        more elegant.  Maybe we might even replace the pte argument with the
        pte_t?
      
      Ben Herrenschmidt would also like the pte pointer for PowerPC:
      
        Passing the ptep in there is exactly what I want.  I want that
        -instead- of the PTE value, because I have issue on some ppc cases,
        for I$/D$ coherency, where set_pte_at() may decide to mask out the
        _PAGE_EXEC.
      
      So, pass in the mapped page table pointer into update_mmu_cache(), and
      remove the PTE value, updating all implementations and call sites to
      suit.
      
      Includes a fix from Stephen Rothwell:
      
        sparc: fix fallout from update_mmu_cache API change
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      4b3073e1