1. 13 Dec, 2018 1 commit
    • Tetsuo Handa
      mm: don't warn about allocations which stall for too long · fbb78e97
      Tetsuo Handa authored
      commit 400e2249 upstream.
      
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step toward reducing the possibility of silent hang-up
      problems caused by memory allocation stalls.  But this commit reverts it,
      because it is possible to trigger OOM lockups and/or soft lockups when many
      threads concurrently call warn_alloc() (in order to warn about memory
      allocation stalls) due to the current implementation of printk(), and it is
      difficult to obtain useful information due to the limitations of the
      synchronous warning approach.
      
      The current printk() implementation flushes all pending logs using the
      context of the thread which called console_unlock().  printk() should be
      able to flush all pending logs eventually, unless somebody continues
      appending to the printk() buffer.
      
      Since warn_alloc() started appending to the printk() buffer while waiting
      for oom_kill_process() to make forward progress when oom_kill_process()
      is processing pending logs, it became possible for warn_alloc() to force
      oom_kill_process() to loop inside printk().  As a result, warn_alloc()
      significantly increased the possibility of preventing oom_kill_process()
      from making forward progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds, an
      unbounded number of threads can call waited_for_10seconds() at the same
      time.  Also, since threads doing waited_for_10seconds() keep running an
      almost-busy loop, the thread doing print_one_log() can use only a little
      CPU.  Therefore, this situation can be simplified to
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      when printk() is called faster than print_one_log() can process a log.
      
      One possible mitigation would be to introduce a new lock in order to
      make sure that no other series of printk() (either oom_kill_process() or
      warn_alloc()) can append to the printk() buffer while one series of printk()
      (either oom_kill_process() or warn_alloc()) is already in progress.
      
      Such serialization would also help with obtaining kernel messages in a
      readable form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock);
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go in that direction, for we don't want to
      introduce a new lock dependency, and we are unlikely to be able to obtain
      useful information even if we serialized oom_kill_process() and
      warn_alloc().
      
      The synchronous approach is prone to unexpected results (e.g.  too late [1],
      too frequent [2], overlooked [3]).  As far as I know, warn_alloc() never
      helped with providing information other than "something is going wrong".
      I want to consider an asynchronous approach which can obtain information
      during stalls from possibly relevant threads (e.g.  the owner of
      oom_lock and kswapd-like threads) and serve as a trigger for actions
      (e.g.  turning tracepoints on/off, asking the libvirt daemon to take a
      memory dump of a stalling KVM guest for diagnostic purposes).
      
      This commit temporarily loses the ability to report e.g.  an OOM lockup
      caused by being unable to invoke the OOM killer for a !__GFP_FS allocation
      request.  But an asynchronous approach will be able to detect such a
      situation and emit a warning.  Thus, let's remove warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever")
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      [Resolved backport conflict due to missing 82251963, a8e99259, 9e80c719 and
       9a67f648 in 4.9 -- all of which modified this hunk being removed.]
      Signed-off-by: Amit Shah <amit@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 08 Dec, 2018 3 commits
  3. 05 Dec, 2018 13 commits
  4. 01 Dec, 2018 2 commits
    • Yufen Yu
      tmpfs: make lseek(SEEK_DATA/SEEK_HOLE) return ENXIO with a negative offset · d77eacdb
      Yufen Yu authored
      [ Upstream commit 1a413646931cb14442065cfc17561e50f5b5bb44 ]
      
      Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
      lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.
      
      man 2 lseek says
      
      :      EINVAL whence  is  not  valid.   Or: the resulting file offset would be
      :             negative, or beyond the end of a seekable device.
      :
      :      ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is  beyond
      :             the end of the file.
      
      Make tmpfs return ENXIO under these circumstances as well.  After this,
      tmpfs also passes xfstests's generic/448.
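
      As a quick illustration of the new behaviour, a minimal userspace check
      along the following lines (the tmpfs path, file name and size are
      assumptions for the sketch, not part of the patch) should now see ENXIO
      on tmpfs just as it does on ext4:

        /* Hedged sketch: exercise SEEK_DATA/SEEK_HOLE with a negative offset. */
        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
        	/* Assumed tmpfs mount point; any file on tmpfs will do. */
        	int fd = open("/dev/shm/seek-test", O_RDWR | O_CREAT | O_TRUNC, 0600);

        	if (fd < 0 || ftruncate(fd, 4096) < 0)
        		return 1;
        	errno = 0;
        	if (lseek(fd, -1, SEEK_DATA) == -1 && errno == ENXIO)
        		puts("SEEK_DATA with negative offset: ENXIO as expected");
        	errno = 0;
        	if (lseek(fd, -1, SEEK_HOLE) == -1 && errno == ENXIO)
        		puts("SEEK_HOLE with negative offset: ENXIO as expected");
        	close(fd);
        	return 0;
        }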
      
      [akpm@linux-foundation.org: rewrite changelog]
      Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Dmitry Vyukov
      mm: don't warn about large allocations for slab · 789c6944
      Dmitry Vyukov authored
      commit 61448479a9f2c954cde0cfe778cb6bec5d0a748d upstream.
      
      Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE;
      instead it falls back to kmalloc_large().
      
      For slab, KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
      kmalloc_slab() for all allocations, relying on a NULL return value for
      over-sized allocations.
      
      This inconsistency leads to unwanted warnings from kmalloc_slab() for
      over-sized allocations for slab.  Returning NULL for failed allocations is
      the expected behavior.
      
      Make slub and slab code consistent by checking size >
      KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().
      
      While we are here also fix the check in kmalloc_slab().  We should check
      against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE.  It all kinda
      worked because for slab the constants are the same, and slub always checks
      the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab().  But if we
      get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
      happen.  For example, in case of a newly introduced bug in slub code.
      
      Also move the check in kmalloc_slab() from function entry to the size >
      192 case.  This partially compensates for the additional check in slab
      code and makes slub code a bit faster (at least theoretically).
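
      A hedged sketch of the resulting slab-side flow (illustrative helper name,
      not the literal hunks; kmalloc_slab() is the mm-internal lookup declared in
      mm/slab.h):

        /*
         * Hedged sketch: slab's kmalloc path refuses over-sized requests up
         * front, mirroring slub, so kmalloc_slab() only ever sees sizes it can
         * serve and its warning really does indicate a bug in slab itself.
         */
        static __always_inline void *kmalloc_sketch(size_t size, gfp_t flags)
        {
        	struct kmem_cache *cachep;

        	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
        		return NULL;			/* plain failure, no warning */

        	cachep = kmalloc_slab(size, flags);	/* only servable sizes get here */
        	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
        		return cachep;
        	return kmem_cache_alloc(cachep, flags);
        }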
      
      Also drop __GFP_NOWARN in the warning check.  This warning means a bug in
      slab code itself; user-passed flags have nothing to do with it.
      
      Nothing of this affects slob.
      
      Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
      Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
      Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
      Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
      Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 21 Nov, 2018 4 commits
    • Mike Kravetz
      mm: migration: fix migration of huge PMD shared pages · 9c34ad0c
      Mike Kravetz authored
      commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.
      
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
      This problem can result in data corruption, as writes to the original
      source page can happen after the contents of the page are copied to the
      target page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
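
      A reproducer along those lines might look roughly like the sketch below
      (the hugetlbfs path, the 1GB/PUD_SIZE mapping, the node numbers and the
      iteration counts are assumptions; error handling and explicit PUD_SIZE
      alignment of the mapping are omitted for brevity; build with -lnuma):

        /* Hedged reproducer sketch for a 2-node x86_64 box with 1GB of
         * hugetlbfs space available; not a polished test case. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <numaif.h>
        #include <signal.h>
        #include <sys/mman.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #define MAP_LEN	(1UL << 30)	/* PUD_SIZE on x86_64 */

        int main(void)
        {
        	int fd = open("/dev/hugepages/shared", O_CREAT | O_RDWR, 0600);
        	char *p;
        	pid_t child;

        	ftruncate(fd, MAP_LEN);
        	/* A MAP_SHARED hugetlbfs mapping of PUD_SIZE is what makes the
        	 * PMD page shareable between the parent and the child. */
        	p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        	child = fork();
        	if (child == 0) {
        		for (;;)	/* continually dirty the huge pages */
        			for (unsigned long i = 0; i < MAP_LEN; i += 4096)
        				p[i]++;
        	}

        	for (int i = 0; i < 100; i++) {
        		unsigned long from = 1, to = 2;	/* node masks: node 0 <-> node 1 */

        		/* Bounce the child's pages between the two nodes. */
        		migrate_pages(child, 8 * sizeof(from), &from, &to);
        		migrate_pages(child, 8 * sizeof(from), &to, &from);
        	}
        	kill(child, SIGKILL);
        	waitpid(child, NULL, 0);
        	return 0;
        }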
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Mike Kravetz
      hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! · f8d4c943
      Mike Kravetz authored
      commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.
      
      This bug has been experienced several times by the Oracle DB team.  The
      BUG is in remove_inode_hugepages() as follows:
      
      	/*
      	 * If page is mapped, it was faulted in after being
      	 * unmapped in caller.  Unmap (again) now after taking
      	 * the fault mutex.  The mutex will prevent faults
      	 * until we finish removing the page.
      	 *
      	 * This race can only happen in the hole punch case.
      	 * Getting here in a truncate operation is a bug.
      	 */
      	if (unlikely(page_mapped(page))) {
      		BUG_ON(truncate_op);
      
      In this case, the elevated map count is not the result of a race.
      Rather it was incorrectly incremented as the result of a bug in the huge
      pmd sharing code.  Consider the following:
      
       - Process A maps a hugetlbfs file of sufficient size and alignment
         (PUD_SIZE) that a pmd page could be shared.
      
       - Process B maps the same hugetlbfs file with the same size and
         alignment such that a pmd page is shared.
      
       - Process B then calls mprotect() to change protections for the mapping
         with the shared pmd. As a result, the pmd is 'unshared'.
      
       - Process B then calls mprotect() again to change protections for the
         mapping back to their original value. pmd remains unshared.
      
       - Process B then forks and process C is created. During the fork
         process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
         tables. Copying page tables for hugetlb mappings is done in the
         routine copy_hugetlb_page_range.
      
      In copy_hugetlb_page_range(), the destination pte is obtained by:
      
      	dst_pte = huge_pte_alloc(dst, addr, sz);
      
      If pmd sharing is possible, the returned pointer will be to a pte in an
      existing page table.  In the situation above, process C could share with
      either process A or process B.  Since process A is first in the list,
      the returned pte is a pointer to a pte in process A's page table.
      
      However, the check for pmd sharing in copy_hugetlb_page_range is:
      
      	/* If the pagetables are shared don't copy or take references */
      	if (dst_pte == src_pte)
      		continue;
      
      Since process C is sharing with process A instead of process B, the
      above test fails.  The code in copy_hugetlb_page_range which follows
      assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
      from src_pte to dst_pte and increments the map count of the associated
      page.  This is how we end up with an elevated map count.
      
      To solve, check the dst_pte entry for huge_pte_none.  If !none, this
      implies PMD sharing so do not copy.
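
      Expressed as a hedged sketch (a helper written for clarity here; the real
      patch open-codes the check inside copy_hugetlb_page_range()):

        /*
         * Hedged sketch of the rule above: a dst_pte that is already populated
         * can only mean the destination page table is shared (possibly with a
         * third process), so the copy loop must not copy or take references.
         */
        static bool hugetlb_copy_should_skip(pte_t *dst_pte, pte_t *src_pte)
        {
        	return dst_pte == src_pte ||
        	       !huge_pte_none(huge_ptep_get(dst_pte));
        }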
      
      Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
      Fixes: c5c99429 ("fix hugepages leak due to pagetable page sharing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Andrea Arcangeli
      mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings · 818e5846
      Andrea Arcangeli authored
      commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream.
      
      THP allocation might be really disruptive when allocated on a NUMA system
      with the local node full or hard to reclaim.  Stefan has posted an
      allocation stall report on a 4.12-based SLES kernel which suggests the
      same issue:
      
        kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
        kvm cpuset=/ mems_allowed=0-1
        CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph 0000001 SLE15 (unreleased)
        Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
        Call Trace:
         dump_stack+0x5c/0x84
         warn_alloc+0xe0/0x180
         __alloc_pages_slowpath+0x820/0xc90
         __alloc_pages_nodemask+0x1cc/0x210
         alloc_pages_vma+0x1e5/0x280
         do_huge_pmd_wp_page+0x83f/0xf00
         __handle_mm_fault+0x93d/0x1060
         handle_mm_fault+0xc6/0x1b0
         __do_page_fault+0x230/0x430
         do_page_fault+0x2a/0x70
         page_fault+0x7b/0x80
         [...]
        Mem-Info:
        active_anon:126315487 inactive_anon:1612476 isolated_anon:5
         active_file:60183 inactive_file:245285 isolated_file:0
         unevictable:15657 dirty:286 writeback:1 unstable:0
         slab_reclaimable:75543 slab_unreclaimable:2509111
         mapped:81814 shmem:31764 pagetables:370616 bounce:0
         free:32294031 free_pcp:6233 free_cma:0
        Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      
      The defrag mode is "madvise" and from the above report it is clear that
      the THP has been allocated for a MADV_HUGEPAGE vma.
      
      Andrea has identified that the main source of the problem is
      __GFP_THISNODE usage:
      
      : The problem is that direct compaction combined with the NUMA
      : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
      : hard the local node, instead of failing the allocation if there's no
      : THP available in the local node.
      :
      : Such logic was ok until __GFP_THISNODE was added to the THP allocation
      : path even with MPOL_DEFAULT.
      :
      : The idea behind the __GFP_THISNODE addition, is that it is better to
      : provide local memory in PAGE_SIZE units than to use remote NUMA THP
      : backed memory. That largely depends on the remote latency though, on
      : threadrippers for example the overhead is relatively low in my
      : experience.
      :
      : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
      : extremely slow qemu startup with vfio, if the VM is larger than the
      : size of one host NUMA node. This is because it will try very hard to
      : unsuccessfully swapout get_user_pages pinned pages as result of the
      : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
      : allocations and instead of trying to allocate THP on other nodes (it
      : would be even worse without vfio type1 GUP pins of course, except it'd
      : be swapping heavily instead).
      
      Fix this by removing __GFP_THISNODE for THP requests which are
      requesting direct reclaim.  This effectively reverts 5265047a
      on the grounds that the zone/node reclaim was known to be disruptive due
      to premature reclaim when there was free memory.  While it made sense at
      the time for HPC workloads without NUMA awareness on rare machines, it
      was ultimately harmful in the majority of cases.  The existing behaviour
      is similar, if not as widespread, as it applies to a corner case, but
      crucially, it cannot be tuned around like zone_reclaim_mode can.  The
      default behaviour should always be to cause the least harm for the
      common case.
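
      The gist of the change, as a hedged helper-style sketch (the real patch
      adjusts the gfp mask inline in the mempolicy/THP allocation path):

        /*
         * Hedged sketch, not the verbatim hunk: only pin the THP attempt to the
         * local node when the caller is not willing to direct-reclaim, so a full
         * local node falls back to other nodes or to PAGE_SIZE pages instead of
         * swapping the local node to death.
         */
        static gfp_t thp_local_node_gfp(gfp_t gfp)
        {
        	if (!(gfp & __GFP_DIRECT_RECLAIM))
        		gfp |= __GFP_THISNODE;
        	return gfp;
        }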
      
      If there are specialised use cases out there that want zone_reclaim_mode
      in specific cases, then it can be built on top.  Longterm we should
      consider a memory policy which allows for the node reclaim like behavior
      for the specific memory ranges which would allow a
      
      [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
      
      Mel said:
      
      : Both patches look correct to me but I'm responding to this one because
      : it's the fix.  The change makes sense and moves further away from the
      : severe stalling behaviour we used to see with both THP and zone reclaim
      : mode.
      :
      : I put together a basic experiment with usemem configured to reference a
      : buffer multiple times that is 80% the size of main memory on a 2-socket
      : box with symmetric node sizes and defrag set to "always".  The defrag
      : setting is not the default but it would be functionally similar to
      : accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
      : reference the buffer multiple times and while it's not an interesting
      : workload, it would be expected to complete reasonably quickly as it fits
      : within memory.  The results were;
      :
      : usemem
      :                                   vanilla           noreclaim-v1
      : Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
      : Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
      : Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
      :
      : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
      : threads referencing buffers 80% the size of memory.  With the patches
      : applied, it's 37.18% faster for the single thread and 73% faster with two
      : threads.  Note that 4 threads showing little difference does not indicate
      : the problem is related to thread counts.  It's simply the case that 4
      : threads gets spread so their workload mostly fits in one node.
      :
      : The overall view from /proc/vmstats is more startling
      :
      :                          4.19.0-rc1  4.19.0-rc1
      :                             vanilla  noreclaim-v1r1
      : Minor Faults               35593425      708164
      : Major Faults                 484088          36
      : Swap Ins                    3772837           0
      : Swap Outs                   3932295           0
      :
      : Massive amounts of swap in/out without the patch
      :
      : Direct pages scanned        6013214           0
      : Kswapd pages scanned              0           0
      : Kswapd pages reclaimed            0           0
      : Direct pages reclaimed      4033009           0
      :
      : Lots of reclaim activity without the patch
      :
      : Kswapd efficiency              100%        100%
      : Kswapd velocity               0.000       0.000
      : Direct efficiency               67%        100%
      : Direct velocity           11191.956       0.000
      :
      : Mostly from direct reclaim context as you'd expect without the patch.
      :
      : Page writes by reclaim  3932314.000       0.000
      : Page writes file                 19           0
      : Page writes anon            3932295           0
      : Page reclaim immediate        42336           0
      :
      : Writes from reclaim context is never good but the patch eliminates it.
      :
      : We should never have default behaviour to thrash the system for such a
      : basic workload.  If zone reclaim mode behaviour is ever desired but on a
      : single task instead of a global basis then the sensible option is to build
      : a mempolicy that enforces that behaviour.
      
      This was a severe regression compared to previous kernels that made
      important workloads unusable, and it started when __GFP_THISNODE was
      added to THP allocations under MADV_HUGEPAGE.  It is not a significant
      risk to go back to the previous behavior before __GFP_THISNODE was added;
      it worked like that for years.
      
      This was simply an optimization to some lucky workloads that can fit in
      a single node, but it ended up breaking the VM for others that can't
      possibly fit in a single node, so going back is safe.
      
      [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
      Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
      Fixes: 5265047a ("mm, thp: really limit transparent hugepage allocation to local node")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Stefan Priebe <s.priebe@profihost.ag>
      Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Michal Hocko
      mm: do not bug_on on incorrect length in __mm_populate() · 1a55a71e
      Michal Hocko authored
      commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.
      
      syzbot has noticed that a specially crafted library can easily hit
      VM_BUG_ON in __mm_populate
      
        kernel BUG at mm/gup.c:1242!
        invalid opcode: 0000 [#1] SMP
        CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
        RIP: 0010:__mm_populate+0x1e2/0x1f0
        Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
        Call Trace:
           vm_brk_flags+0xc3/0x100
           vm_brk+0x1f/0x30
           load_elf_library+0x281/0x2e0
           __ia32_sys_uselib+0x170/0x1e0
           do_fast_syscall_32+0xca/0x420
           entry_SYSENTER_compat+0x70/0x7f
      
      The reason is that the length of the new brk is not page aligned when we
      try to populate it.  There is no reason to bug on that though.
      do_brk_flags already aligns the length properly so the mapping is
      expanded as it should.  All we need is to tell mm_populate about it.
      Besides that, there is absolutely no reason to BUG_ON in the first
      place.  The worst thing that could happen is that the last page wouldn't
      get populated, and that is far from putting the system into an
      inconsistent state.
      
      Fix the issue by moving the length sanitization code from do_brk_flags
      up to vm_brk_flags.  The only other caller of do_brk_flags is the brk
      syscall entry, and it makes sure to provide the proper length, so there
      is no need for sanitization and we can use do_brk_flags without it.
      
      Also remove the bogus BUG_ONs.
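
      A hedged sketch of where the sanitization now lives (mainline names; the
      4.9 backport applies the same idea around do_brk(), and the helper here is
      only for illustration):

        /*
         * Hedged sketch, not the verbatim patch: the vm_brk_flags()/vm_brk()
         * entry point aligns the request once, so both the brk expansion and
         * the later mm_populate() see a page-aligned length and the VM_BUG_ON
         * can simply go away.
         */
        static int vm_brk_sanitize(unsigned long request, unsigned long *len)
        {
        	*len = PAGE_ALIGN(request);
        	if (*len < request)
        		return -ENOMEM;	/* PAGE_ALIGN() overflowed */
        	return 0;		/* *len == 0 just means nothing to do */
        }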
      
      [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
      Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 4.9:
       - There is no do_brk_flags() function; update do_brk()
       - Adjust context]
      Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  6. 13 Nov, 2018 1 commit
    • Mike Kravetz
      hugetlbfs: dirty pages as they are added to pagecache · cbf05aa9
      Mike Kravetz authored
      commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream.
      
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When code outside
      hugetlbfs explicitly removes the pages, the appropriate accounting is not
      performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to pagecache.  This
      can easily be reproduced with fallocate as shown above.  Read faulted
      pages will eventually end up being marked dirty.  But there is a window
      where they are clean and could be impacted by code such as drop_caches.
      So, just dirty them all as they are added to the pagecache.
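
      A hedged sketch of the insertion path with the fix applied (hypothetical
      helper name; the real change is essentially one set_page_dirty() call in
      the hugetlbfs add-to-page-cache path):

        /*
         * Hedged sketch: mark the huge page dirty the moment it enters the
         * pagecache, so drop_caches never sees a "clean" hugetlbfs page that it
         * could discard without the hugetlbfs-specific accounting.
         */
        static int hugetlbfs_add_page_sketch(struct page *page,
        				     struct address_space *mapping,
        				     pgoff_t idx)
        {
        	int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);

        	if (err)
        		return err;
        	set_page_dirty(page);	/* close the clean-page window */
        	return 0;
        }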
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 10 Nov, 2018 2 commits
  8. 20 Oct, 2018 1 commit
  9. 18 Oct, 2018 1 commit
  10. 13 Oct, 2018 1 commit
  11. 10 Oct, 2018 1 commit
  12. 04 Oct, 2018 1 commit
  13. 29 Sep, 2018 1 commit
    • Joel Fernandes (Google)
      mm: shmem.c: Correctly annotate new inodes for lockdep · 6b1bd5ea
      Joel Fernandes (Google) authored
      commit b45d71fb89ab8adfe727b9d0ee188ed58582a647 upstream.
      
      Directories and inodes don't necessarily need to be in the same lockdep
      class.  For example, hugetlbfs splits them out too, to prevent false
      positives in lockdep.  Annotate correctly after new inode creation.  If it's
      a directory inode, it will be put into a different class.
      
      This should fix a lockdep splat reported by syzbot:
      
      > ======================================================
      > WARNING: possible circular locking dependency detected
      > 4.18.0-rc8-next-20180810+ #36 Not tainted
      > ------------------------------------------------------
      > syz-executor900/4483 is trying to acquire lock:
      > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
      > include/linux/fs.h:765 [inline]
      > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
      > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
      >
      > but task is already holding lock:
      > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
      > drivers/staging/android/ashmem.c:448
      >
      > which lock already depends on the new lock.
      >
      > -> #2 (ashmem_mutex){+.+.}:
      >        __mutex_lock_common kernel/locking/mutex.c:925 [inline]
      >        __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
      >        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
      >        ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
      >        call_mmap include/linux/fs.h:1844 [inline]
      >        mmap_region+0xf27/0x1c50 mm/mmap.c:1762
      >        do_mmap+0xa10/0x1220 mm/mmap.c:1535
      >        do_mmap_pgoff include/linux/mm.h:2298 [inline]
      >        vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
      >        ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
      >        __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
      >        __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
      >        __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > -> #1 (&mm->mmap_sem){++++}:
      >        __might_fault+0x155/0x1e0 mm/memory.c:4568
      >        _copy_to_user+0x30/0x110 lib/usercopy.c:25
      >        copy_to_user include/linux/uaccess.h:155 [inline]
      >        filldir+0x1ea/0x3a0 fs/readdir.c:196
      >        dir_emit_dot include/linux/fs.h:3464 [inline]
      >        dir_emit_dots include/linux/fs.h:3475 [inline]
      >        dcache_readdir+0x13a/0x620 fs/libfs.c:193
      >        iterate_dir+0x48b/0x5d0 fs/readdir.c:51
      >        __do_sys_getdents fs/readdir.c:231 [inline]
      >        __se_sys_getdents fs/readdir.c:212 [inline]
      >        __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > -> #0 (&sb->s_type->i_mutex_key#9){++++}:
      >        lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
      >        down_write+0x8f/0x130 kernel/locking/rwsem.c:70
      >        inode_lock include/linux/fs.h:765 [inline]
      >        shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
      >        ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
      >        ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
      >        vfs_ioctl fs/ioctl.c:46 [inline]
      >        file_ioctl fs/ioctl.c:501 [inline]
      >        do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
      >        ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
      >        __do_sys_ioctl fs/ioctl.c:709 [inline]
      >        __se_sys_ioctl fs/ioctl.c:707 [inline]
      >        __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
      >        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      >        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
      >
      >  Possible unsafe locking scenario:
      >
      >        CPU0                    CPU1
      >        ----                    ----
      >   lock(ashmem_mutex);
      >                                lock(&mm->mmap_sem);
      >                                lock(ashmem_mutex);
      >   lock(&sb->s_type->i_mutex_key#9);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by syz-executor900/4483:
      >  #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
      > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448
      
      Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Suggested-by: NeilBrown <neilb@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  14. 19 Sep, 2018 3 commits
  15. 15 Sep, 2018 2 commits
  16. 09 Sep, 2018 1 commit
  17. 05 Sep, 2018 2 commits