1. 16 May, 2019 1 commit
  2. 02 May, 2019 1 commit
    • Jan Kara's avatar
      mm: Fix warning in insert_pfn() · f7adeff6
      Jan Kara authored
      commit f2c57d91b0d96aa13ccff4e3b178038f17b00658 upstream.
      In DAX mode a write pagefault can race with write(2) in the following
      CPU0                            CPU1
                                      write fault for mapped zero page (hole)
            - allocates blocks
              - invalidates radix tree entries in given range
                                          - no entry found, creates empty
                                          - finds already allocated block
                                          - WARNs and does nothing because there
                                            is still zero page mapped in PTE
      This race results in WARN_ON from insert_pfn() and is occasionally
      triggered by fstest generic/344. Note that the race is otherwise
      harmless as before write(2) on CPU0 is finished, we will invalidate page
      tables properly and thus user of mmap will see modified data from
      write(2) from that point on. So just restrict the warning only to the
      case when the PFN in PTE is not zero page.
      Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 16 Jan, 2019 1 commit
    • Michal Hocko's avatar
      mm, memcg: fix reclaim deadlock with writeback · 8c4da113
      Michal Hocko authored
      commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.
      Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
      ext4 writeback
          mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
      He adds
       "task1 is waiting for the PageWriteback bit of the page that task2 has
        collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
        LOCKED bit the page which tasks1 has locked"
      More precisely task1 is handling a page fault and it has a page locked
      while it charges a new page table to a memcg.  That in turn hits a
      memory limit reclaim and the memcg reclaim for legacy controller is
      waiting on the writeback but that is never going to finish because the
      writeback itself is waiting for the page locked in the #PF path.  So
      this is essentially ABBA deadlock:
                                              # flush A, B to clear the writeback
      This accumulating of more pages to flush is used by several filesystems
      to generate a more optimal IO patterns.
      Waiting for the writeback in legacy memcg controller is a workaround for
      pre-mature OOM killer invocations because there is no dirty IO
      throttling available for the controller.  There is no easy way around
      that unfortunately.  Therefore fix this specific issue by pre-allocating
      the page table outside of the page lock.  We have that handy
      infrastructure for that already so simply reuse the fault-around pattern
      which already does this.
      There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
      from under a fs page locked but they should be really rare.  I am not
      aware of a better solution unfortunately.
      [akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: enhance comment, per Johannes]
        Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Debugged-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 01 Dec, 2018 1 commit
  5. 23 Sep, 2018 1 commit
    • Philippe Gerum's avatar
      mm: ipipe: disable ondemand memory · 8168e4dd
      Philippe Gerum authored
      Co-kernels cannot bear with the extra latency caused by memory access
      faults involved in COW or
      overcommit. __ipipe_disable_ondemand_mappings() force commits all
      common memory mappings with physical RAM.
      In addition, the architecture code is given a chance to pre-load page
      table entries for ioremap and vmalloc memory, for preventing further
      minor faults accessing such memory due to PTE misses (if that ever
      makes sense for them).
      Revisit: Further COW breaking in copy_user_page() and copy_pte_range()
      may be useless once __ipipe_disable_ondemand_mappings() has run for a
      co-kernel task, since all of its mappings have been populated, and
      unCOWed if applicable.
  6. 09 Sep, 2018 1 commit
  7. 05 Sep, 2018 4 commits
  8. 15 Aug, 2018 1 commit
    • Andi Kleen's avatar
      x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings · a4116334
      Andi Kleen authored
      commit 42e4089c upstream
      For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
      table entry. This sets the high bits in the CPU's address space, thus
      making sure to point to not point an unmapped entry to valid cached memory.
      Some server system BIOSes put the MMIO mappings high up in the physical
      address space. If such an high mapping was mapped to unprivileged users
      they could attack low memory by setting such a mapping to PROT_NONE. This
      could happen through a special device driver which is not access
      protected. Normal /dev/mem is of course access protected.
      To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
      Valid page mappings are allowed because the system is then unsafe anyways.
      It's not expected that users commonly use PROT_NONE on MMIO. But to
      minimize any impact this is only enforced if the mapping actually refers to
      a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
      the check for root.
      For mmaps this is straight forward and can be handled in vm_insert_pfn and
      in remap_pfn_range().
      For mprotect it's a bit trickier. At the point where the actual PTEs are
      accessed a lot of state has been changed and it would be difficult to undo
      on an error. Since this is a uncommon case use a separate early page talk
      walk pass for MMIO PROT_NONE mappings that checks for this condition
      early. For non MMIO and non PROT_NONE there are no changes.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 22 Feb, 2018 1 commit
  10. 04 Oct, 2017 1 commit
  11. 09 Sep, 2017 6 commits
  12. 07 Sep, 2017 5 commits
    • Huang Ying's avatar
      mm: hugetlb: clear target sub-page last when clearing huge page · c79b57e4
      Huang Ying authored
      Huge page helps to reduce TLB miss rate, but it has higher cache
      footprint, sometimes this may cause some issue.  For example, when
      clearing huge page on x86_64 platform, the cache footprint is 2M.  But
      on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
      LLC (last level cache).  That is, in average, there are 2.5M LLC for
      each core and 1.25M LLC for each thread.
      If the cache pressure is heavy when clearing the huge page, and we clear
      the huge page from the begin to the end, it is possible that the begin
      of huge page is evicted from the cache after we finishing clearing the
      end of the huge page.  And it is possible for the application to access
      the begin of the huge page after clearing the huge page.
      To help the above situation, in this patch, when we clear a huge page,
      the order to clear sub-pages is changed.  In quite some situation, we
      can get the address that the application will access after we clear the
      huge page, for example, in a page fault handler.  Instead of clearing
      the huge page from begin to end, we will clear the sub-pages farthest
      from the the sub-page to access firstly, and clear the sub-page to
      access last.  This will make the sub-page to access most cache-hot and
      sub-pages around it more cache-hot too.  If we cannot know the address
      the application will access, the begin of the huge page is assumed to be
      the the address the application will access.
      With this patch, the throughput increases ~28.3% in vm-scalability
      anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case creates 72 processes, each
      process mmap a big anonymous memory area and writes to it from the begin
      to the end.  For each process, other processes could be seen as other
      workload which generates heavy cache pressure.  At the same time, the
      cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per
      cycle) increased from 0.56 to 0.74, and the time spent in user space is
      reduced ~7.9%
      Christopher Lameter suggests to clear bytes inside a sub-page from end
      to begin too.  But tests show no visible performance difference in the
      tests.  May because the size of page is small compared with the cache
      Thanks Andi Kleen to propose to use address to access to determine the
      order of sub-pages to clear.
      The hugetlbfs access address could be improved, will do that in another
      [ying.huang@intel.com: improve readability of clear_huge_page()]
        Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.comSuggested-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Huang Ying's avatar
      mm, swap: VMA based swap readahead · ec560175
      Huang Ying authored
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      In the original swap readahead implementation, the consecutive blocks in
      swap device are readahead based on the global space locality estimation.
      But the consecutive blocks in swap device just reflect the order of page
      reclaiming, don't necessarily reflect the access pattern in virtual
      memory.  And the different tasks in the system may have different access
      patterns, which makes the global space locality estimation incorrect.
      In this patch, when page fault occurs, the virtual pages near the fault
      address will be readahead instead of the swap slots near the fault swap
      slot in swap device.  This avoid to readahead the unrelated swap slots.
      At the same time, the swap readahead is changed to work on per-VMA from
      globally.  So that the different access patterns of the different VMAs
      could be distinguished, and the different readahead policy could be
      applied accordingly.  The original core readahead detection and scaling
      algorithm is reused, because it is an effect algorithm to detect the
      space locality.
      The test and result is as follow,
      Common test condition
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
      NVMe disk
      Micro-benchmark with combined access pattern
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      The original swap readahead algorithm is confused by the background
      random access workload, so readahead hit rate is lower.  The VMA-base
      readahead algorithm works much better.
      The test memory size is bigger than RAM to trigger swapping.
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      The score of base and optimized kernel hasn't visible changes.  But the
      elapsed time reduced and readahead hit rate improved, so the optimized
      kernel runs better for startup and tear down stages.  And the absolute
      value of readahead hit rate is high, shows that the space locality is
      still valid in some practical workloads.
      Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Huang Ying's avatar
      mm, THP, swap: make reuse_swap_page() works for THP swapped out · ba3c4ce6
      Huang Ying authored
      After supporting to delay THP (Transparent Huge Page) splitting after
      swapped out, it is possible that some page table mappings of the THP are
      turned into swap entries.  So reuse_swap_page() need to check the swap
      count in addition to the map count as before.  This patch done that.
      In the huge PMD write protect fault handler, in addition to the page map
      count, the swap count need to be checked too, so the page lock need to
      be acquired too when calling reuse_swap_page() in addition to the page
      table lock.
      [ying.huang@intel.com: silence a compiler warning]
        Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
      Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: always flush VMA ranges affected by zap_page_range · 4647706e
      Mel Gorman authored
      Nadav Amit report zap_page_range only specifies that the caller protect
      the VMA list but does not specify whether it is held for read or write
      with callers using either.  madvise holds mmap_sem for read meaning that
      a parallel zap operation can unmap PTEs which are then potentially
      skipped by madvise which potentially returns with stale TLB entries
      present.  While the API could be extended, it would be a difficult API
      to use.  This patch causes zap_page_range() to always consider flushing
      the full affected range.  For small ranges or sparsely populated
      mappings, this may result in one additional spurious TLB flush.  For
      larger ranges, it is possible that the TLB has already been flushed and
      the overhead is negligible.  Either way, this approach is safer overall
      and avoids stale entries being present when madvise returns.
      This can be illustrated with the following program provided by Nadav
      Amit and slightly modified.  With the patch applied, it has an exit code
      of 0 indicating a stale TLB entry did not leak to userspace.
      volatile int sync_step = 0;
      volatile char *p;
      static inline unsigned long rdtsc()
      	unsigned long hi, lo;
      	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
      	 return lo | (hi << 32);
      static inline void wait_rdtsc(unsigned long cycles)
      	unsigned long tsc = rdtsc();
      	while (rdtsc() - tsc < cycles);
      void *big_madvise_thread(void *ign)
      	sync_step = 1;
      	while (sync_step != 2);
      	madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_DONTNEED);
      int main(void)
      	pthread_t aux_thread;
      	memset((void*)p, 8, PAGE_SIZE * N_PAGES);
      	pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
      	while (sync_step != 1);
      	*p = 8;		// Cache in TLB
      	sync_step = 2;
      	madvise((void*)p, PAGE_SIZE, MADV_DONTNEED);
      	printf("data: %d (%s)\n", *p, (*p == 8 ? "stale, broken" : "cleared, fine"));
      	return *p == 8 ? -1 : 0;
      Link: http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reported-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Ross Zwisler's avatar
      mm: add vm_insert_mixed_mkwrite() · b2770da6
      Ross Zwisler authored
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.  This has three major
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory.
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it.
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
      This series solves these issues by following the lead of the DAX PMD
      code and using a common 4k zero page instead.  This reduces memory usage
      and decreases latencies for some workloads, and it simplifies the DAX
      code, removing over 100 lines in total.
      This patch (of 5):
      To be able to use the common 4k zero page in DAX we need to have our PTE
      fault path look more like our PMD fault path where a PTE entry can be
      marked as dirty and writeable as it is first inserted rather than
      waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
      Right now we can rely on having a dax_pfn_mkwrite() call because we can
      distinguish between these two cases in do_wp_page():
      	case 1: 4k zero page => writable DAX storage
      	case 2: read-only DAX storage => writeable DAX storage
      This distinction is made by via vm_normal_page().  vm_normal_page()
      returns false for the common 4k zero page, though, just as it does for
      DAX ptes.  Instead of special casing the DAX + 4k zero page case we will
      simplify our DAX PTE page fault sequence so that it matches our DAX PMD
      sequence, and get rid of the dax_pfn_mkwrite() helper.  We will instead
      use dax_iomap_fault() to handle write-protection faults.
      This means that insert_pfn() needs to follow the lead of
      insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag.  If 'mkwrite'
      is set insert_pfn() will do the work that was previously done by
      wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
      Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 31 Aug, 2017 1 commit
    • Jérôme Glisse's avatar
      dax: update to new mmu_notifier semantic · a4d1a885
      Jérôme Glisse authored
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      Note that because we can not presume the pmd value or pte value we have
      to assume the worst and unconditionaly report an invalidation as
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  14. 18 Aug, 2017 2 commits
    • Michal Hocko's avatar
      mm, oom: fix potential data corruption when oom_reaper races with writer · 6b31d595
      Michal Hocko authored
      Wenwei Tao has noticed that our current assumption that the oom victim
      is dying and never doing any visible changes after it dies, and so the
      oom_reaper can tear it down, is not entirely true.
      __task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set
      but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
      So there is a race window when some threads won't have
      fatal_signal_pending while the oom_reaper could start unmapping the
      address space.  Moreover some paths might not check for fatal signals
      before each PF/g-u-p/copy_from_user.
      We already have a protection for oom_reaper vs.  PF races by checking
      MMF_UNSTABLE.  This has been, however, checked only for kernel threads
      (use_mm users) which can outlive the oom victim.  A simple fix would be
      to extend the current check in handle_mm_fault for all tasks but that
      wouldn't be sufficient because the current check assumes that a kernel
      thread would bail out after EFAULT from get_user*/copy_from_user and
      never re-read the same address which would succeed because the PF path
      has established page tables already.  This seems to be the case for the
      only existing use_mm user currently (virtio driver) but it is rather
      fragile in general.
      This is even more fragile in general for more complex paths such as
      generic_perform_write which can re-read the same address more times
      (e.g.  iov_iter_copy_from_user_atomic to fail and then
      iov_iter_fault_in_readable on retry).
      Therefore we have to implement MMF_UNSTABLE protection in a robust way
      and never make a potentially corrupted content visible.  That requires
      to hook deeper into the PF path and check for the flag _every time_
      before a pte for anonymous memory is established (that means all
      !VM_SHARED mappings).
      The corruption can be triggered artificially
      but there doesn't seem to be any real life bug report.  The race window
      should be quite tight to trigger most of the time.
      Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarWenwei Tao <wenwei.tww@alibaba-inc.com>
      Tested-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Michal Hocko's avatar
      mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS · 5b53a6ea
      Michal Hocko authored
      Tetsuo Handa has noticed that MMF_UNSTABLE SIGBUS path in
      handle_mm_fault causes a lockdep splat
        Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
        Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
        a.out (1169) used greatest stack depth: 11664 bytes left
        DEBUG_LOCKS_WARN_ON(depth <= 0)
        ------------[ cut here ]------------
        WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
        CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
        RIP: 0010:lock_release+0x172/0x1e0
        Call Trace:
      The reason is that the page fault path might have dropped the mmap_sem
      and returned with VM_FAULT_RETRY.  MMF_UNSTABLE check however rewrites
      the error path to VM_FAULT_SIGBUS and we always expect mmap_sem taken in
      that path.  Fix this by taking mmap_sem when VM_FAULT_RETRY is held in
      the MMF_UNSTABLE path.
      We cannot simply add VM_FAULT_SIGBUS to the existing error code because
      all arch specific page fault handlers and g-u-p would have to learn a
      new error code combination.
      Link: http://lkml.kernel.org/r/20170807113839.16695-2-mhocko@kernel.org
      Fixes: 3f70dc38 ("mm: make sure that kthreads will not refault oom reaped memory")
      Reported-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Wenwei Tao <wenwei.tww@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 10 Aug, 2017 2 commits
    • Minchan Kim's avatar
      mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Minchan Kim authored
      Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
      problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
      This patch aims for solving both problems all at once and is ready for
      other problem with KSM, MADV_FREE and soft-dirty story[3].
      TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending
      and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can
      catch there are parallel threads going on.  In that case, forcefully,
      flush TLB to prevent for user to access memory via stale TLB entry
      although it fail to gather page table entry.
      I confirmed this patch works with [4] test program Nadav gave so this
      patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
      v2" in current mmotm.
      This patch modifies arch-specific TLB gathering interface(x86, ia64,
      s390, sh, um).  It seems most of architecture are straightforward but
      s390 need to be careful because tlb_flush_mmu works only if
      mm->context.flush_mm is set to non-zero which happens only a pte entry
      really is cleared by ptep_get_and_clear and friends.  However, this
      problem never changes the pte entries but need to flush to prevent
      memory access from stale tlb.
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
      Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Minchan Kim's avatar
      mm: refactor TLB gathering API · 56236a59
      Minchan Kim authored
      This patch is a preparatory patch for solving race problems caused by
      TLB batch.  For that, we will increase/decrease TLB flush pending count
      of mm_struct whenever tlb_[gather|finish]_mmu is called.
      Before making it simple, this patch separates architecture specific part
      and rename it to arch_tlb_[gather|finish]_mmu and generic part just
      calls it.
      It shouldn't change any behavior.
      Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 02 Aug, 2017 1 commit
    • Mel Gorman's avatar
      mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Mel Gorman authored
      Nadav Amit identified a theoritical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being held.
      He described the race as follows:
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
                                              user writes using cached RW PTE
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
      theoritically possible so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup as a race with
      reclaim looks just at PTEs.  huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.deReported-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  17. 12 Jul, 2017 1 commit
  18. 10 Jul, 2017 1 commit
  19. 06 Jul, 2017 2 commits
  20. 19 Jun, 2017 1 commit
    • Hugh Dickins's avatar
      mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins authored
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      This will become especially dangerous for suid binaries and the default
      no limit for the stack size limit because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and and the vma tree's subtree_gap support for that.
      Original-patch-by: default avatarOleg Nesterov <oleg@redhat.com>
      Original-patch-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  21. 02 Jun, 2017 1 commit
    • Ross Zwisler's avatar
      mm: avoid spurious 'bad pmd' warning messages · d0f0931d
      Ross Zwisler authored
      When the pmd_devmap() checks were added by 5c7fb56e ("mm, dax:
      dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
      pages, they were all added to the end of if() statements after existing
      pmd_trans_huge() checks.  So, things like:
        -       if (pmd_trans_huge(*pmd))
        +       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
      When further checks were added after pmd_trans_unstable() checks by
      commit 7267ec00 ("mm: postpone page table allocation until we have
      page to map") they were also added at the end of the conditional:
        +       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
      This ordering is fine for pmd_trans_huge(), but doesn't work for
      pmd_trans_unstable().  This is because DAX huge pages trip the bad_pmd()
      check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
      pmd_trans_unstable()), which prints out a warning and returns 1.  So, we
      do end up doing the right thing, but only after spamming dmesg with
      suspicious looking messages:
        mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
      Reorder these checks in a helper so that pmd_devmap() is checked first,
      avoiding the error messages, and add a comment explaining why the
      ordering is important.
      Fixes: commit 7267ec00 ("mm: postpone page table allocation until we have page to map")
      Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  22. 28 Mar, 2017 1 commit
  23. 09 Mar, 2017 2 commits
  24. 02 Mar, 2017 1 commit