1. 08 Dec, 2018 2 commits
  2. 21 Nov, 2018 2 commits
    • mm: migration: fix migration of huge PMD shared pages · 9c34ad0c
      Mike Kravetz authored
      commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.
      
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
      This problem can result in data corruption as writes to the original
      source page can happen after contents of the page are copied to the target
      page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
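
      A minimal userspace sketch of such a reproducer follows.  The hugetlbfs
      mount point, the use of the migratepages(8) tool from numactl to drive
      migrate_pages(), and the 1GB PUD_SIZE value (x86_64) are illustrative
      assumptions, not part of the patch; the huge page pool must be large
      enough to back the mapping.

        /* Map one PUD_SIZE-aligned shared hugetlb region in two processes so
         * the kernel can share the PMD page, then keep writing to it while
         * `migratepages <pid> 0 1` is run repeatedly from another shell. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define PUD_SZ (1UL << 30)          /* PUD_SIZE on x86_64 */

        int main(void)
        {
            int fd = open("/dev/hugepages/pmdshare", O_CREAT | O_RDWR, 0600);
            if (fd < 0 || ftruncate(fd, PUD_SZ)) { perror("setup"); return 1; }

            /* reserve 2*PUD_SZ of address space, then place the file mapping
             * at a PUD_SIZE-aligned address inside that reservation */
            char *hint = mmap(NULL, 2 * PUD_SZ, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (hint == MAP_FAILED) { perror("mmap hint"); return 1; }
            char *aligned = (char *)(((unsigned long)hint + PUD_SZ - 1)
                                     & ~(PUD_SZ - 1));
            char *addr = mmap(aligned, PUD_SZ, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_FIXED, fd, 0);
            if (addr == MAP_FAILED) { perror("mmap file"); return 1; }

            if (fork() == 0)                /* second process mapping the file */
                for (;;)
                    memset(addr, 'x', PUD_SZ);

            printf("writer pid %d: run `migratepages %d 0 1` in a loop\n",
                   getpid(), getpid());
            for (;;)
                memset(addr, 'y', PUD_SZ);  /* writes racing with migration */
        }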
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
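
      A sketch of the shape of the new check in try_to_unmap_one(),
      paraphrasing the upstream hunk; 'start' and 'end' stand for the
      worst-case invalidation range prepared as described above, and the
      surrounding page_vma_mapped_walk() loop is omitted:

        if (PageHuge(page)) {
            if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
                /*
                 * huge_pmd_unshare() unmapped an entire PMD page: flush
                 * the whole range covered by that PMD page, since we
                 * cannot know which of its mappings are cached for this mm.
                 */
                flush_cache_range(vma, start, end);
                flush_tlb_range(vma, start, end);
                mmu_notifier_invalidate_range(mm, start, end);

                /*
                 * The reference on the PMD page was dropped, which is how
                 * map counting is done for shared PMDs, so this vma is done.
                 */
                page_vma_mapped_walk_done(&pvmw);
                break;
            }
        }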
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c34ad0c
    • hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! · f8d4c943
      Mike Kravetz authored
      commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.
      
      This bug has been experienced several times by the Oracle DB team.  The
      BUG is in remove_inode_hugepages() as follows:
      
      	/*
      	 * If page is mapped, it was faulted in after being
      	 * unmapped in caller.  Unmap (again) now after taking
      	 * the fault mutex.  The mutex will prevent faults
      	 * until we finish removing the page.
      	 *
      	 * This race can only happen in the hole punch case.
      	 * Getting here in a truncate operation is a bug.
      	 */
      	if (unlikely(page_mapped(page))) {
      		BUG_ON(truncate_op);
      
      In this case, the elevated map count is not the result of a race.
      Rather it was incorrectly incremented as the result of a bug in the huge
      pmd sharing code.  Consider the following:
      
       - Process A maps a hugetlbfs file of sufficient size and alignment
         (PUD_SIZE) that a pmd page could be shared.
      
       - Process B maps the same hugetlbfs file with the same size and
         alignment such that a pmd page is shared.
      
       - Process B then calls mprotect() to change protections for the mapping
         with the shared pmd. As a result, the pmd is 'unshared'.
      
       - Process B then calls mprotect() again to change protections for the
         mapping back to their original value. pmd remains unshared.
      
       - Process B then forks and process C is created. During the fork
         process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
         tables. Copying page tables for hugetlb mappings is done in the
         routine copy_hugetlb_page_range.
      
      In copy_hugetlb_page_range(), the destination pte is obtained by:
      
      	dst_pte = huge_pte_alloc(dst, addr, sz);
      
      If pmd sharing is possible, the returned pointer will be to a pte in an
      existing page table.  In the situation above, process C could share with
      either process A or process B.  Since process A is first in the list,
      the returned pte is a pointer to a pte in process A's page table.
      
      However, the check for pmd sharing in copy_hugetlb_page_range is:
      
      	/* If the pagetables are shared don't copy or take references */
      	if (dst_pte == src_pte)
      		continue;
      
      Since process C is sharing with process A instead of process B, the
      above test fails.  The code in copy_hugetlb_page_range which follows
      assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
      from src_pte to dst_pte and increments this map count of the associated
      page.  This is how we end up with an elevated map count.
      
      To solve, check the dst_pte entry for huge_pte_none.  If !none, this
      implies PMD sharing so do not copy.
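
      A sketch of the resulting check in copy_hugetlb_page_range(), closely
      following the upstream hunk (the rest of the copy loop is omitted):

        dst_pte = huge_pte_alloc(dst, addr, sz);
        if (!dst_pte) {
            ret = -ENOMEM;
            break;
        }

        /*
         * If dst_pte points at an entry that is already populated,
         * huge_pte_alloc() found a shareable pmd page (possibly owned by a
         * third process).  Do not copy: that would overwrite the shared
         * entry and bump the page's map count a second time.
         */
        dst_entry = huge_ptep_get(dst_pte);
        if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
            continue;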
      
      Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
      Fixes: c5c99429 ("fix hugepages leak due to pagetable page sharing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8d4c943
  3. 13 Nov, 2018 1 commit
    • hugetlbfs: dirty pages as they are added to pagecache · cbf05aa9
      Mike Kravetz authored
      commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream.
      
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When code outside
      hugetlbfs explicitly removes the pages, the appropriate accounting is not
      performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to pagecache.  This
      can easily be reproduced with fallocate as shown above.  Read faulted
      pages will eventually end up being marked dirty.  But there is a window
      where they are clean and could be impacted by code such as drop_caches.
      So, just dirty them all as they are added to the pagecache.
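
      A sketch of what the change amounts to in huge_add_to_page_cache(),
      approximating the upstream patch:

        err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
        if (err)
            return err;
        ClearPagePrivate(page);

        /*
         * Mark the page dirty as soon as it is in the pagecache, so that
         * non-hugetlbfs code such as drop_caches never sees it as a clean,
         * droppable page.
         */
        set_page_dirty(page);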
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbf05aa9
  4. 11 Jul, 2018 1 commit
  5. 05 Dec, 2017 1 commit
  6. 08 Apr, 2017 1 commit
    • mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() · 40c5b99f
      Naoya Horiguchi authored
      commit c9d398fa upstream.
      
      I found the race condition which triggers the following bug when
      move_pages() and soft offline are called on a single hugetlb page
      concurrently.
      
          Soft offlining page 0x119400 at 0x700000000000
          BUG: unable to handle kernel paging request at ffffea0011943820
          IP: follow_huge_pmd+0x143/0x190
          PGD 7ffd2067
          PUD 7ffd1067
          PMD 0
              [61163.582052] Oops: 0000 [#1] SMP
          Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
          CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P           OE   4.11.0-rc2-mm1+ #2
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:follow_huge_pmd+0x143/0x190
          RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
          RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
          RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
          RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
          R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
          R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
          FS:  00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
          Call Trace:
           follow_page_mask+0x270/0x550
           SYSC_move_pages+0x4ea/0x8f0
           SyS_move_pages+0xe/0x10
           do_syscall_64+0x67/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: 0033:0x7fc976e03949
          RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
          RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
          RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
          RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
          R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
          R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
          Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
          RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
          CR2: ffffea0011943820
          ---[ end trace e4f81353a2d23232 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
      
      This bug is triggered when pmd_present() returns true for non-present
      hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
      Using pmd_present() to determine present/non-present for hugetlb is not
      correct, because pmd_present() checks multiple bits (not only
      _PAGE_PRESENT) for historical reasons and it can misjudge hugetlb state.
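
      A sketch of the corrected check in follow_huge_pmd(), paraphrasing the
      upstream patch (locking and the hwpoison path are elided):

        pte = huge_ptep_get((pte_t *)pmd);
        if (pte_present(pte)) {
            page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
            if (flags & FOLL_GET)
                get_page(page);
        } else {
            if (is_hugetlb_entry_migration(pte)) {
                spin_unlock(ptl);
                __migration_entry_wait(mm, (pte_t *)pmd, ptl);
                goto retry;
            }
        }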
      
      Fixes: e66f17ff ("mm/hugetlb: take page table lock in follow_huge_pmd()")
      Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      40c5b99f
  7. 19 Jan, 2017 1 commit
  8. 12 Jan, 2017 1 commit
  9. 11 Nov, 2016 1 commit
  10. 08 Oct, 2016 5 commits
  11. 11 Aug, 2016 1 commit
  12. 02 Aug, 2016 2 commits
  13. 01 Aug, 2016 1 commit
  14. 28 Jul, 2016 1 commit
  15. 26 Jul, 2016 2 commits
  16. 15 Jul, 2016 1 commit
    • mm: thp: refix false positive BUG in page_move_anon_rmap() · 5a49973d
      Hugh Dickins authored
      The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
      worth: the syzkaller fuzzer hit it again.  It's still wrong for some THP
      cases, because linear_page_index() was never intended to apply to
      addresses before the start of a vma.
      
      That's easily fixed with a signed long cast inside linear_page_index();
      and Dmitry has tested such a patch, to verify the false positive.  But
      why extend linear_page_index() just for this case? when the avoidance in
      page_move_anon_rmap() has already grown ugly, and there's no reason for
      the check at all (nothing else there is using address or index).
      
      Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
      remove CONFIG_DEBUG_VM PageTransHuge adjustment.
      
      And one more thing: should the compound_head(page) be done inside or
      outside page_move_anon_rmap()? It's usually pushed down to the lowest
      level nowadays (and mm/memory.c shows no other explicit use of it), so I
      think it's better done in page_move_anon_rmap() than by caller.
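
      The resulting function, as a rough post-patch sketch (not the verbatim
      diff):

        void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
        {
            struct anon_vma *anon_vma = vma->anon_vma;

            page = compound_head(page);     /* pushed down into the callee */

            VM_BUG_ON_PAGE(!PageLocked(page), page);
            VM_BUG_ON_VMA(!anon_vma, vma);

            anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
            /*
             * Write anon_vma and the PAGE_MAPPING_ANON bit in one go, so
             * readers never see a half-updated page->mapping.
             */
            WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
        }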
      
      Fixes: 0798d3c0 ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a49973d
  17. 06 Jul, 2016 1 commit
  18. 25 Jun, 2016 2 commits
    • mm/hugetlb: clear compound_mapcount when freeing gigantic pages · c8cc708a
      Gerald Schaefer authored
      While working on s390 support for gigantic hugepages I ran into the
      following "Bad page state" warning when freeing gigantic pages:
      
        BUG: Bad page state in process bash  pfn:580001
        page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
        flags: 0x7fffc0000000000()
        page dumped because: non-NULL mapping
      
      This is because page->compound_mapcount, which is part of a union with
      page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
      and not cleared again during destroy_compound_gigantic_page().  Fix this
      by clearing the compound_mapcount in destroy_compound_gigantic_page()
      before clearing compound_head.
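
      A sketch of the fix in destroy_compound_gigantic_page(); the
      compound_mapcount reset is the added line, the rest is shown only for
      context:

        static void destroy_compound_gigantic_page(struct page *page,
                        unsigned int order)
        {
            int i;
            int nr_pages = 1 << order;
            struct page *p = page + 1;

            atomic_set(compound_mapcount_ptr(page), 0);   /* the fix */
            for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
                clear_compound_head(p);
                set_page_refcounted(p);
            }

            set_compound_order(page, 0);
            __ClearPageHead(page);
        }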
      
      Interestingly enough, the warning will not show up on x86_64, although
      this should not be architecture specific.  Apparently there is an
      endianness issue, combined with the fact that the union contains both a
      64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
      members.  The resulting bogus page->mapping on x86_64 therefore contains
      00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
      trigger the PageAnon() check in free_pages_prepare() because
      page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
      like x86_64 in this case (the page is not compound anymore,
      ->compound_head was already cleared before).  As a result, page->mapping
      will be cleared before doing the checks in free_pages_check().
      
      Not sure if the bogus "PageAnon() returning true" on x86_64 for the
      first tail page of a gigantic page (at this stage) has other theoretical
      implications, but they would also be fixed with this patch.
      
      Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8cc708a
    • hugetlb: fix nr_pmds accounting with shared page tables · c17b1f42
      Kirill A. Shutemov authored
      We account HugeTLB's shared page table to all processes who share it.
      The accounting happens during huge_pmd_share().
      
      If somebody populates the pud entry under us, we should decrease the page
      table's refcount and decrease nr_pmds of the process.
      
      By mistake, I increase nr_pmds again in this case.  :-/ It will lead to
      "BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.
      
      Let's fix this by increasing nr_pmds only when we're sure that the page
      table will be used.
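
      A sketch of the corrected accounting in huge_pmd_share(), paraphrasing
      the upstream hunk:

        spin_lock(ptl);
        if (pud_none(*pud)) {
            pud_populate(mm, pud,
                    (pmd_t *)((unsigned long)spte & PAGE_MASK));
            mm_inc_nr_pmds(mm);   /* count it only when it is really used */
        } else {
            /* lost the race: drop our reference and do NOT bump nr_pmds */
            put_page(virt_to_page(spte));
        }
        spin_unlock(ptl);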
      
      Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
      Fixes: dc6c9a35 ("mm: account pmd page tables to the process")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: zhongjiang <zhongjiang@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c17b1f42
  19. 09 Jun, 2016 1 commit
    • mm/hugetlb: fix huge page reserve accounting for private mappings · 67961f9d
      Mike Kravetz authored
      When creating a private mapping of a hugetlbfs file, it is possible to
      unmap pages via ftruncate or fallocate hole punch.  If subsequent faults
      repopulate these mappings, the reserve counts will go negative.  This is
      because the code currently assumes all faults to private mappings will
      consume reserves.  The problem can be recreated as follows:
      
       - mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
       - write fault in pages in the mapping
       - fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
       - write fault in pages in the hole
      
      This will result in negative huge page reserve counts and negative
      subpool usage counts for the hugetlbfs.  Note that this can also be
      recreated with ftruncate, but fallocate is more straightforward.
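
      A minimal userspace sketch of those steps; the hugetlbfs mount point,
      the 2MB huge page size and the mapping length are illustrative
      assumptions:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/falloc.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define HPAGE (2UL << 20)               /* assumed huge page size */
        #define LEN   (8 * HPAGE)

        int main(void)
        {
            unsigned long off;
            int fd = open("/dev/hugepages/resv", O_CREAT | O_RDWR, 0600);
            if (fd < 0 || ftruncate(fd, LEN)) { perror("setup"); return 1; }

            char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            for (off = 0; off < LEN; off += HPAGE)
                p[off] = 1;                     /* write fault every page */

            /* punch a hole over the whole mapping, then fault it back in */
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, LEN))
                perror("fallocate");
            for (off = 0; off < LEN; off += HPAGE)
                p[off] = 2;

            /* now inspect HugePages_Rsvd in /proc/meminfo */
            return 0;
        }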
      
      This patch modifies the routines vma_needs_reserves and vma_has_reserves
      to examine the reserve map associated with private mappings similar to
      that for shared mappings.  However, the reserve map semantics for
      private and shared mappings are very different.  This results in subtly
      different code that is explained in the comments.
      
      Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67961f9d
  20. 29 May, 2016 1 commit
  21. 21 May, 2016 1 commit
    • /dev/dax, core: file operations and dax-mmap · dee41079
      Dan Williams authored
      The "Device DAX" core enables dax mappings of performance / feature
      differentiated memory.  An open mapping or file handle keeps the backing
      struct device live, but new mappings are only possible while the device
      is enabled.   Faults are handled under rcu_read_lock to synchronize
      with the enabled state of the device.
      
      Similar to the filesystem-dax case the backing memory may optionally
      have struct page entries.  However, unlike fs-dax there is no support
      for private mappings, or mappings that are not backed by media (see
      use of zero-page in fs-dax).
      
      Mappings are always guaranteed to match the alignment of the dax_region.
      If the dax_region is configured to have a 2MB alignment, all mappings
      are guaranteed to be backed by a pmd entry.  Contrast this determinism
      with the fs-dax case where pmd mappings are opportunistic.  If userspace
      attempts to force a misaligned mapping, the driver will fail the mmap
      attempt.  See dax_dev_check_vma() for other scenarios that are rejected,
      like MAP_PRIVATE mappings.
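
      A hedged sketch of what a valid mapping looks like from userspace; the
      device node name and the 2MB alignment are illustrative assumptions:

        int fd = open("/dev/dax0.0", O_RDWR);
        size_t len = 2UL << 20;     /* a multiple of the dax_region alignment */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        /* a MAP_PRIVATE attempt, or a misaligned length/offset, is rejected */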
      
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      dee41079
  22. 20 May, 2016 5 commits
    • mm/hugetlb: add same zone check in pfn_range_valid_gigantic() · f44b2dda
      Joonsoo Kim authored
      This patchset deals with some problematic sites that iterate pfn ranges.
      
      There is a system whose nodes' pfns overlap as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to take care of this overlapping when iterating pfn
      range.
      
      I audited many iterating sites that use pfn_valid(), pfn_valid_within(),
      zone_start_pfn, etc., and the others look safe to me.  This is a
      preparation step for a new CMA implementation, ZONE_CMA
      (https://lkml.org/lkml/2015/2/12/95), because it would easily overlap
      with other zones.  But the zone overlap check is also needed for the
      general case, so I am sending it separately.
      
      This patch (of 5):
      
      alloc_gigantic_page() uses alloc_contig_range() and this requires that
      the requested range is in a single zone.  To satisfy this requirement,
      add this check to pfn_range_valid_gigantic().
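
      A sketch of the change; the zone parameter and the page_zone()
      comparison are the new parts, the remaining checks are condensed:

        static bool pfn_range_valid_gigantic(struct zone *z,
                    unsigned long start_pfn, unsigned long nr_pages)
        {
            unsigned long i, end_pfn = start_pfn + nr_pages;
            struct page *page;

            for (i = start_pfn; i < end_pfn; i++) {
                if (!pfn_valid(i))
                    return false;

                page = pfn_to_page(i);

                /* all pfns handed to alloc_contig_range() must sit in the
                 * zone we are allocating from */
                if (page_zone(page) != z)
                    return false;

                if (PageReserved(page) || page_count(page) > 0 ||
                    PageHuge(page))
                    return false;
            }

            return true;
        }
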
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f44b2dda
    • mm/hugetlb.c: use first_memory_node · 54f18d35
      Andrew Morton authored
      Instead of open-coding it.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54f18d35
    • mm/hugetlb: introduce hugetlb_bad_size() · 9fee021d
      Vaishali Thakkar authored
      When any unsupported hugepage size is specified, 'hugepagesz=' and
      'hugepages=' should be ignored during command line parsing until any
      supported hugepage size is found.  But currently an incorrect number of
      hugepages is allocated when an unsupported size is specified, because
      parsing fails to ignore the 'hugepages=' parameter.
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'hugepagesz=256M hugepages=X'.
      After boot, dmesg output shows that X number of hugepages of the size 2M
      is pre-allocated instead of 0.
      
      So, to handle such command line options, introduce new routine
      hugetlb_bad_size.  The routine hugetlb_bad_size sets the global variable
      parsed_valid_hugepagesz.  We are using parsed_valid_hugepagesz to save
      the state when unsupported hugepagesize is found so that we can ignore
      the 'hugepages=' parameters after that and then reset the variable when
      supported hugepage size is found.
      
      The routine hugetlb_bad_size() can be called while setting the 'hugepagesz='
      parameter in architecture-specific code.
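
      A sketch of the new pieces; names follow the description above and the
      architecture-side call site is abbreviated:

        /* mm/hugetlb.c */
        static bool __initdata parsed_valid_hugepagesz = true;

        void __init hugetlb_bad_size(void)
        {
            /* arch code calls this when it rejects a hugepagesz= value */
            parsed_valid_hugepagesz = false;
        }

        static int __init hugetlb_nrpages_setup(char *s)
        {
            ...
            if (!parsed_valid_hugepagesz) {
                pr_warn("hugepages = %s preceded by "
                    "an unsupported hugepagesz, ignoring\n", s);
                parsed_valid_hugepagesz = true;
                return 1;
            }
            ...
        }
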
      Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fee021d
    • mm/hugetlb: optimize minimum size (min_size) accounting · 09a95e29
      Mike Kravetz authored
      It was observed that minimum size accounting associated with the
      hugetlbfs min_size mount option may not perform optimally and as
      expected.  As huge pages/reservations are released from the filesystem
      and given back to the global pools, they are reserved for subsequent
      filesystem use as long as the subpool reserved count is less than
      subpool minimum size.  It does not take into account used pages within
      the filesystem.  The filesystem size limits are not exceeded and this is
      technically not a bug.  However, better behavior would be to wait for
      the number of used pages/reservations associated with the filesystem to
      drop below the minimum size before taking reservations to satisfy
      minimum size.
      
      An optimization is also made to the hugepage_subpool_get_pages() routine
      which is called when pages/reservations are allocated.  This does not
      change behavior, but simply avoids the accounting if all reservations
      have already been taken (subpool reserved count == 0).
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09a95e29
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton authored
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
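
      The helper itself is small; a rough sketch of its definition:

        /* include/linux/nodemask.h */
        #define next_node_in(n, src) __next_node_in((n), &(src))
        int __next_node_in(int node, const nodemask_t *srcp);

        /* lib/nodemask.c */
        int __next_node_in(int node, const nodemask_t *srcp)
        {
            int ret = __next_node(node, srcp);

            if (ret == MAX_NUMNODES)
                ret = __first_node(srcp);
            return ret;
        }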
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
  23. 04 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  24. 17 Mar, 2016 1 commit
  25. 09 Mar, 2016 2 commits
    • mm/hugetlb: use EOPNOTSUPP in hugetlb sysctl handlers · 86613628
      Jan Stancek authored
      Replace ENOTSUPP with EOPNOTSUPP.  If hugepages are not supported, this
      value is propagated to userspace.  EOPNOTSUPP is part of uapi and is
      widely supported by libc libraries.
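
      The change itself is essentially a one-liner per sysctl handler,
      roughly:

        if (!hugepages_supported())
            return -EOPNOTSUPP;   /* was -ENOTSUPP (524), not a uapi errno */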
      
      It gives a nicer message to the user, rather than:
      
        # cat /proc/sys/vm/nr_hugepages
        cat: /proc/sys/vm/nr_hugepages: Unknown error 524
      
      And also LTP's proc01 test was failing because this ret code (524)
      was unexpected:
      
        proc01      1  TFAIL  :  proc01.c:396: read failed: /proc/sys/vm/nr_hugepages: errno=???(524): Unknown error 524
        proc01      2  TFAIL  :  proc01.c:396: read failed: /proc/sys/vm/nr_hugepages_mempolicy: errno=???(524): Unknown error 524
        proc01      3  TFAIL  :  proc01.c:396: read failed: /proc/sys/vm/nr_overcommit_hugepages: errno=???(524): Unknown error 524
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86613628
    • mm/hugetlb: hugetlb_no_page: rate-limit warning message · 910154d5
      Geoffrey Thomas authored
      The warning message "killed due to inadequate hugepage pool" simply
      indicates that SIGBUS was sent, not that the process was forcibly killed.
      Even if the process has a signal handler installed, that does not fix the
      problem, and this message can rapidly spam the kernel log.
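
      The fix switches the warning to the rate-limited printk variant,
      roughly:

        pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
                            current->pid);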
      
      On my amd64 dev machine that does not have hugepages configured, I can
      reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
      4 megabytes of huge pages) and running something that sets a signal
      handler and forks, like
      
        #include <sys/mman.h>
        #include <signal.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        sig_atomic_t counter = 10;
        void handler(int signal)
        {
            if (counter-- == 0)
               exit(0);
        }
      
        int main(void)
        {
            int status;
            char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (addr == MAP_FAILED) {perror("mmap"); return 1;}
            *addr = 'x';
            switch (fork()) {
               case -1:
                  perror("fork"); return 1;
               case 0:
                  signal(SIGBUS, handler);
                  *addr = 'x';
                  break;
               default:
                  *addr = 'x';
                  wait(&status);
                  if (WIFSIGNALED(status)) {
                     psignal(WTERMSIG(status), "child");
                  }
                  break;
            }
        }
      Signed-off-by: Geoffrey Thomas <geofft@ldpreload.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      910154d5
  26. 19 Feb, 2016 1 commit
    • mm/hugetlb.c: fix incorrect proc nr_hugepages value · f8b74815
      Vaishali Thakkar authored
      Currently an incorrect default hugepage pool size is reported by proc
      nr_hugepages when the number of pages for the default huge page size is
      specified twice.
      
      When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
      indicates the current number of pre-allocated huge pages of the default
      size.  Basically /proc/sys/vm/nr_hugepages displays default_hstate->
      max_huge_pages and after boot time pre-allocation, max_huge_pages should
      equal the number of pre-allocated pages (nr_hugepages).
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'default_hugepagesz=1G
      hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'.  After
      boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
      returns the value X.  However, dmesg output shows that Z huge pages were
      pre-allocated.
      
      So, the root cause of the problem here is that the global variable
      default_hstate_max_huge_pages is set if a default huge page size is
      specified (directly or indirectly) on the command line.  After the command
      line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
      the value is assigned to default_hstate.max_huge_pages.  However,
      default_hstate.max_huge_pages may have already been set based on the
      number of pre-allocated huge pages of default_hstate size.
      
      The solution to this problem is that if hstate->max_huge_pages is already
      set then it should not be overwritten by the global max_huge_pages value.
      Basically, if the value of the variable hugepages is set multiple times on
      the command line for a specific supported hugepagesize, then the proc layer
      should report the last specified value.
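
      A sketch of the guarded assignment in hugetlb_init(), approximating the
      upstream change:

        if (default_hstate_max_huge_pages) {
            /* only if 'hugepages=' for the default size was not already
             * consumed as an hstate-specific count */
            if (!default_hstate.max_huge_pages)
                default_hstate.max_huge_pages =
                        default_hstate_max_huge_pages;
        }
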
      Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8b74815