1. 15 Jun, 2019 1 commit
    • mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE · 27d8fa82
      Linxu Fang authored
      [ Upstream commit 299c83dce9ea3a79bb4b5511d2cb996b6b8e5111 ]
      
      342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
      later patches rewrote the calculation of node spanned pages.
      
      e506b996 ("mem-hotplug: fix node spanned pages when we have a movable
      node"), but the current code still has problems,
      
      When we have a node that contains only ZONE_MOVABLE and its node id is
      not zero, the node spanned pages are counted twice.
      
      That's because the node has an empty normal zone, and zone_start_pfn or
      zone_end_pfn falls outside the range between
      arch_zone_lowest_possible_pfn and arch_zone_highest_possible_pfn, so we
      need to use clamp to constrain the range, just like commit 96e907d1
      ("bootmem: Reimplement __absent_pages_in_range() using
      for_each_mem_pfn_range()") does.
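
      For illustration only (this is not the kernel patch itself; the helper
      name is made up and the pfn values come from the node 1 / DMA32 case in
      the example below), a minimal standalone sketch of what clamping buys:

        #include <stdio.h>

        /* stand-in for the kernel's clamp(): constrain v to [lo, hi] */
        static unsigned long clamp_pfn(unsigned long v, unsigned long lo,
                                       unsigned long hi)
        {
                if (v < lo)
                        return lo;
                if (v > hi)
                        return hi;
                return v;
        }

        int main(void)
        {
                /* a node whose memory lies entirely outside this zone's range */
                unsigned long node_start_pfn = 0x140000, node_end_pfn = 0x240000;
                unsigned long zone_low = 0x1000, zone_high = 0x100000;

                unsigned long zone_start_pfn = clamp_pfn(node_start_pfn, zone_low, zone_high);
                unsigned long zone_end_pfn = clamp_pfn(node_end_pfn, zone_low, zone_high);

                /* both ends collapse to zone_high, so the zone spans 0 pages */
                printf("spanned: 0x%lx\n", zone_end_pfn - zone_start_pfn);
                return 0;
        }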
      
      e.g.
      Zone ranges:
        DMA      [mem 0x0000000000001000-0x0000000000ffffff]
        DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
        Normal   [mem 0x0000000100000000-0x000000023fffffff]
      Movable zone start for each node
        Node 0: 0x0000000100000000
        Node 1: 0x0000000140000000
      Early memory node ranges
        node   0: [mem 0x0000000000001000-0x000000000009efff]
        node   0: [mem 0x0000000000100000-0x00000000bffdffff]
        node   0: [mem 0x0000000100000000-0x000000013fffffff]
        node   1: [mem 0x0000000140000000-0x000000023fffffff]
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0x100000    present:0x100000	absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 2097152
      node_spanned_pages:2097152
      Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
      cma-reserved)
      
      It shows that node 1's memory is counted twice.
      After this patch, the problem is fixed:
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0	    present:0		absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 1048576
      node_spanned_pages:1048576
      memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
      cma-reserved)
      
      Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
      Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      27d8fa82
  2. 05 Apr, 2019 1 commit
    • page_poison: play nicely with KASAN · d49efeb1
      Qian Cai authored
      [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]
      
      KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
      It triggers false positives in the allocation path:
      
        BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
        Read of size 8 at addr ffff88881f800000 by task swapper/0
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         __asan_report_load8_noabort+0x19/0x20
         memchr_inv+0x2ea/0x330
         kernel_poison_pages+0x103/0x3d5
         get_page_from_freelist+0x15e7/0x4d90
      
      because KASAN has not yet unpoisoned the shadow page for the allocation
      when kernel_poison_pages() runs memchr_inv(), so the check only finds a
      stale poison pattern.
      
      There are also false positives in the free path:
      
        BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
        Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
        CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         check_memory_region+0x22d/0x250
         memset+0x28/0x40
         kernel_poison_pages+0x29e/0x3d5
         __free_pages_ok+0x75f/0x13e0
      
      because KASAN adds poisoned redzones around slab objects, while page
      poisoning needs to poison the whole page.
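
      A sketch of the general direction of the fix described above
      (kernel-style pseudocode, not necessarily the exact upstream diff;
      kasan_disable_current()/kasan_enable_current() are the existing KASAN
      interfaces for suppressing reports in the current task):

        static void poison_page(struct page *page)
        {
                void *addr = kmap_atomic(page);

                /*
                 * The whole page is written on purpose, including any KASAN
                 * redzones, so suppress KASAN reports around the access.
                 */
                kasan_disable_current();
                memset(addr, PAGE_POISON, PAGE_SIZE);
                kasan_enable_current();

                kunmap_atomic(addr);
        }

      The check side (memchr_inv() when verifying the poison pattern on
      allocation) would need the same bracketing to avoid the allocation-path
      false positive shown above.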
      
      Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d49efeb1
  3. 23 Mar, 2019 1 commit
    • mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · a9772096
      Jann Horn authored
      [ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ]
      
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
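
      A toy, self-contained model of that accounting argument (names and
      structure simplified; this is not the kernel code):

        #include <stdio.h>

        struct frag_cache {
                int refcount;           /* stands in for the page's atomic refcount */
                unsigned int bias;      /* non-atomic pagecnt_bias */
                unsigned int size;      /* maximum number of fragments per page */
        };

        static void charge(struct frag_cache *nc, unsigned int extra)
        {
                /* pre-charge one reference per fragment plus 'extra' for the cache */
                nc->refcount = nc->size + extra;
                nc->bias = nc->size + extra;
        }

        static int drain(struct frag_cache *nc)
        {
                /* serve size fragments; each consumer later drops its reference */
                for (unsigned int i = 0; i < nc->size; i++) {
                        nc->bias--;     /* bump-allocation fastpath */
                        nc->refcount--; /* consumer frees its fragment */
                }
                return nc->refcount;    /* must stay > 0 while the cache holds the page */
        }

        int main(void)
        {
                struct frag_cache old = { .size = 4 }, fixed = { .size = 4 };

                charge(&old, 0);        /* pre-fix:  bias == nc->size     */
                charge(&fixed, 1);      /* post-fix: bias == nc->size + 1 */
                printf("old:   refcount after drain = %d (0 means the page can go away under us)\n",
                       drain(&old));
                printf("fixed: refcount after drain = %d\n", drain(&fixed));
                return 0;
        }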
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a9772096
  4. 17 Dec, 2018 1 commit
    • mm/page_alloc.c: fix calculation of pgdat->nr_zones · c7aafad0
      Wei Yang authored
      [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]
      
      init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
      'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
      case, but not in the hot-plug situation.
      
      This function is used in two places:
      
        * free_area_init_core()
        * move_pfn_range_to_zone()
      
      In the first case, we are sure the zone index increases monotonically.
      In the second one, the order is under the user's control.
      
      One way to reproduce this is:
      ----------------------------
      
      1. create a virtual machine with empty node1
      
         -m 4G,slots=32,maxmem=32G \
         -smp 4,maxcpus=8          \
         -numa node,nodeid=0,mem=4G,cpus=0-3 \
         -numa node,nodeid=1,mem=0G,cpus=4-7
      
      2. hot-add cpu 3-7
      
         cpu-add [3-7]
      
      3. hot-add memory to node1
      
         object_add memory-backend-ram,id=ram0,size=1G
         device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      
      4. online the memory in the following order
      
         echo online_movable > memory47/state
         echo online > memory40/state
      
      After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
      instead of (ZONE_MOVABLE + 1).
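
      The shape of the fix is roughly the following sketch (placement inside
      init_currently_empty_zone() and the surrounding code are omitted):

        /* only ever grow nr_zones; onlining a lower zone later must not shrink it */
        if (zone_idx(zone) > pgdat->nr_zones - 1)
                pgdat->nr_zones = zone_idx(zone) + 1;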
      
      Michal said:
       "Having an incorrect nr_zones might result in all sorts of problems
        which would be quite hard to debug (e.g. reclaim not considering the
        movable zone). I do not expect many users would suffer from this it
        but still this is trivial and obviously right thing to do so
        backporting to the stable tree shouldn't be harmful (last famous
        words)"
      
      Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c7aafad0
  5. 13 Dec, 2018 1 commit
    • mm: don't warn about allocations which stall for too long · 4515bbc4
      Tetsuo Handa authored
      [ Upstream commit 400e2249 ]
      
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step toward reducing the possibility of silent
      hang-up problems caused by memory allocation stalls.  But this commit
      reverts it, because it is possible to trigger an OOM lockup and/or soft
      lockups when many threads concurrently call warn_alloc() (in order to
      warn about memory allocation stalls) due to the current implementation
      of printk(), and it is difficult to obtain useful information due to
      the limitations of the synchronous warning approach.
      
      Current printk() implementation flushes all pending logs using the
      context of a thread which called console_unlock().  printk() should be
      able to flush all pending logs eventually unless somebody continues
      appending to printk() buffer.
      
      Since warn_alloc() started appending to the printk() buffer while
      waiting for oom_kill_process() to make forward progress, and
      oom_kill_process() was the one processing the pending logs, it became
      possible for warn_alloc() to force oom_kill_process() to loop inside
      printk().  As a result, warn_alloc() significantly increased the
      possibility of preventing oom_kill_process() from making forward
      progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds, an
      unbounded number of threads can call waited_for_10seconds() at the same
      time.  Also, since threads doing waited_for_10seconds() keep running an
      almost-busy loop, the thread doing print_one_log() gets little CPU
      time.  Therefore, this situation can be simplified to
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock)
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      when printk() is called faster than print_one_log() can process a log.
      
      One possible mitigation would be to introduce a new lock in order to
      make sure that no other series of printk() calls (either from
      oom_kill_process() or warn_alloc()) can append to the printk() buffer
      while one series of printk() calls (either from oom_kill_process() or
      warn_alloc()) is already in progress.
      
      Such serialization will also help obtain kernel messages in a readable
      form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock)
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go that direction, because we don't want to
      introduce a new lock dependency, and we are unlikely to be able to
      obtain useful information even if we serialized oom_kill_process() and
      warn_alloc().
      
      The synchronous approach is prone to unexpected results (e.g.  too late
      [1], too frequent [2], overlooked [3]).  As far as I know, warn_alloc()
      never helped with providing information other than "something is going
      wrong".  I want to consider an asynchronous approach which can obtain
      information during stalls with possibly relevant threads (e.g.  the
      owner of oom_lock and kswapd-like threads) and serve as a trigger for
      actions (e.g.  turn on/off tracepoints, ask the libvirt daemon to take
      a memory dump of a stalling KVM guest for diagnostic purposes).
      
      This commit temporarily loses the ability to report e.g.  an OOM lockup
      caused by being unable to invoke the OOM killer for a !__GFP_FS
      allocation request.  But an asynchronous approach will be able to
      detect such a situation and emit a warning.  Thus, let's remove
      warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever")
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4515bbc4
  6. 01 Dec, 2018 1 commit
    • mm, page_alloc: check for max order in hot path · c6a4b3c3
      Michal Hocko authored
      [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]
      
      Konstantin has noticed that kvmalloc might trigger the following
      warning:
      
        WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
        [...]
        Call Trace:
         fragmentation_index+0x76/0x90
         compaction_suitable+0x4f/0xf0
         shrink_node+0x295/0x310
         node_reclaim+0x205/0x250
         get_page_from_freelist+0x649/0xad0
         __alloc_pages_nodemask+0x12a/0x2a0
         kmalloc_large_node+0x47/0x90
         __kmalloc_node+0x22b/0x2e0
         kvmalloc_node+0x3e/0x70
         xt_alloc_table_info+0x3a/0x80 [x_tables]
         do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
         nf_setsockopt+0x44/0x60
         SyS_setsockopt+0x6f/0xc0
         do_syscall_64+0x67/0x120
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The problem is that we only check for an out-of-bounds order in the
      slow path, while node reclaim might already happen from the fast path.
      This is fixable by making sure that kvmalloc never uses kmalloc for
      requests that are larger than KMALLOC_MAX_SIZE, but it also shows that
      the code is rather fragile.  A recent UBSAN report just underlines that
      with the following report
      
        UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
        shift exponent 51 is too large for 32-bit type 'int'
        CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0xd2/0x148 lib/dump_stack.c:113
         ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
         __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
         __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
         zone_watermark_fast mm/page_alloc.c:3216 [inline]
         get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
         __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
         alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
         alloc_pages include/linux/gfp.h:509 [inline]
         __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
         dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
         raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
         raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
         fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
         fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
         __blkdev_driver_ioctl block/ioctl.c:303 [inline]
         blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
         block_ioctl+0x105/0x150 fs/block_dev.c:1883
         vfs_ioctl fs/ioctl.c:46 [inline]
         do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
         ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
         __do_sys_ioctl fs/ioctl.c:709 [inline]
         __se_sys_ioctl fs/ioctl.c:707 [inline]
         __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
         do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note that this is not a kvmalloc path.  It is just that the fast path
      really depends on having a sanitized order as well.  Therefore move the
      order check to the fast path.
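
      Roughly, the check moves to the very beginning of the allocator entry
      point (sketch; surrounding code elided):

        /*
         * There are several places where we assume that the order value is
         * sane, so bail out early if the request is out of bounds.
         */
        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
                return NULL;
        }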
      
      Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reported-by: Kyungtae Kim <kt0755@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Byoungyoung Lee <lifeasageek@gmail.com>
      Cc: "Dae R. Jeong" <threeearcat@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c6a4b3c3
  7. 18 Oct, 2018 1 commit
  8. 26 Jun, 2018 1 commit
    • mm, page_alloc: do not break __GFP_THISNODE by zonelist reset · 1d26c112
      Vlastimil Babka authored
      commit 7810e678 upstream.
      
      In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
      allocations that can ignore memory policies.  The zonelist is obtained
      from current CPU's node.  This is a problem for __GFP_THISNODE
      allocations that want to allocate on a different node, e.g.  because the
      allocating thread has been migrated to a different CPU.
      
      This has been observed to break SLAB in our 4.4-based kernel, because
      there it relies on __GFP_THISNODE working as intended.  If a slab page
      is put on the wrong node's list, then further list manipulations may
      corrupt the list because page_to_nid() is used to determine which
      node's list_lock should be locked, and thus we may take the wrong lock
      and race.
      
      Current SLAB implementation seems to be immune by luck thanks to commit
      511e3a05 ("mm/slab: make cache_grow() handle the page allocated on
      arbitrary node") but there may be others assuming that __GFP_THISNODE
      works as promised.
      
      We can fix it by simply removing the zonelist reset completely.  There
      is actually no reason to reset it, because memory policies and cpusets
      don't affect the zonelist choice in the first place.  This was different
      when commit 183f6371 ("mm: ignore mempolicies when using
      ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
      own restricted zonelists.
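
      For reference, the kind of statement being removed looked roughly like
      the sketch below (not the exact diff): rebuilding the zonelist from the
      current CPU's node defeats a __GFP_THISNODE allocation that targets a
      different node.

        /* problematic reset in __alloc_pages_slowpath(), now dropped */
        ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);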
      
      We might consider this for 4.17 although I don't know if there's
      anything currently broken.
      
      SLAB is currently not affected, but in kernels older than 4.7 that don't
      yet have 511e3a05 ("mm/slab: make cache_grow() handle the page
      allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
      ones I'll have to check.
      
      So stable backports should be more important, but will have to be
      reviewed carefully, as the code went through many changes.  BTW I think
      that also the ac->preferred_zoneref reset is currently useless if we
      don't also reset ac->nodemask from a mempolicy to NULL first (which we
      probably should for the OOM victims etc?), but I would leave that for a
      separate patch.
      
      Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 183f6371 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d26c112
  9. 28 Mar, 2018 2 commits
    • Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · 99b6ead4
      Daniel Vacek authored
      commit f59f1caf upstream.
      
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit was meant as a boot-time
      init speedup, skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mappings on x86_64 (or, more generally,
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
      implementation also skips valid pfns, which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      99b6ead4
    • lockdep: fix fs_reclaim warning · ef006d43
      Tetsuo Handa authored
      commit 2e517d68 upstream.
      
      Dave Jones reported fs_reclaim lockdep warnings.
      
        ============================================
        WARNING: possible recursive locking detected
        4.15.0-rc9-backup-debug+ #1 Not tainted
        --------------------------------------------
        sshd/24800 is trying to acquire lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        but task is already holding lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(fs_reclaim);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        2 locks held by sshd/24800:
         #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
         #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        stack backtrace:
        CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
        Call Trace:
         dump_stack+0xbc/0x13f
         __lock_acquire+0xa09/0x2040
         lock_acquire+0x12e/0x350
         fs_reclaim_acquire.part.102+0x29/0x30
         kmem_cache_alloc+0x3d/0x2c0
         alloc_extent_state+0xa7/0x410
         __clear_extent_bit+0x3ea/0x570
         try_release_extent_mapping+0x21a/0x260
         __btrfs_releasepage+0xb0/0x1c0
         btrfs_releasepage+0x161/0x170
         try_to_release_page+0x162/0x1c0
         shrink_page_list+0x1d5a/0x2fb0
         shrink_inactive_list+0x451/0x940
         shrink_node_memcg.constprop.88+0x4c9/0x5e0
         shrink_node+0x12d/0x260
         try_to_free_pages+0x418/0xaf0
         __alloc_pages_slowpath+0x976/0x1790
         __alloc_pages_nodemask+0x52c/0x5c0
         new_slab+0x374/0x3f0
         ___slab_alloc.constprop.81+0x47e/0x5a0
         __slab_alloc.constprop.80+0x32/0x60
         __kmalloc_track_caller+0x267/0x310
         __kmalloc_reserve.isra.40+0x29/0x80
         __alloc_skb+0xee/0x390
         sk_stream_alloc_skb+0xb8/0x340
         tcp_sendmsg_locked+0x8e6/0x1d30
         tcp_sendmsg+0x27/0x40
         inet_sendmsg+0xd0/0x310
         sock_write_iter+0x17a/0x240
         __vfs_write+0x2ab/0x380
         vfs_write+0xfb/0x260
         SyS_write+0xb6/0x140
         do_syscall_64+0x1e5/0xc05
         entry_SYSCALL64_slow_path+0x25/0x25
      
      This warning is caused by commit d92a8cfc ("locking/lockdep:
      Rework FS_RECLAIM annotation") which replaced the use of
      lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
      and lockdep_trace_alloc() in slab_pre_alloc_hook() with
      fs_reclaim_acquire()/ fs_reclaim_release().
      
      Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
      __GFP_NOWARN to gfp_mask, and all reclaim paths simply propagate
      __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is
      trying to grab the 'fake' lock again when __perform_reclaim() has
      already grabbed the 'fake' lock.
      
      The
      
        /* this guy won't enter reclaim */
        if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                return false;
      
      test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
      was added by commit cf40bd16 ("lockdep: annotate reclaim context
      (__GFP_NOFS)").  But that test is outdated because PF_MEMALLOC thread
      won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
      341ce06f ("page allocator: calculate the alloc_flags for allocation
      only once") added the PF_MEMALLOC safeguard (
      
        /* Avoid recursion of direct reclaim */
        if (p->flags & PF_MEMALLOC)
                goto nopage;
      
      in __alloc_pages_slowpath()).
      
      Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC
      check and allowing __need_fs_reclaim() to return false.
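
      With the __GFP_NOMEMALLOC part dropped, the test in __need_fs_reclaim()
      becomes simply (sketch):

        /* this guy won't enter reclaim */
        if (current->flags & PF_MEMALLOC)
                return false;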
      
      Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Tested-by: Dave Jones <davej@codemonkey.org.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef006d43
  10. 22 Feb, 2018 1 commit
  11. 31 Jan, 2018 1 commit
  12. 25 Dec, 2017 2 commits
  13. 05 Dec, 2017 2 commits
  14. 24 Nov, 2017 1 commit
  15. 04 Oct, 2017 2 commits
  16. 09 Sep, 2017 2 commits
    • mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt · f19360f0
      Tetsuo Handa authored
      We are erroneously initializing alloc_flags before gfp_allowed_mask is
      applied.  This could cause problems after pm_restrict_gfp_mask() is
      called during a suspend operation.  Apply gfp_allowed_mask before
      initializing alloc_flags so that the first allocation attempt uses the
      correct flags.
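
      Roughly, the ordering at the top of the allocator becomes (sketch; the
      variable names follow the ones used in __alloc_pages_nodemask(), the
      rest of the code is elided):

        /* apply the runtime restriction (e.g. from pm_restrict_gfp_mask()) first ... */
        gfp_mask &= gfp_allowed_mask;
        /* ... and only then derive the mask used for the first allocation attempt */
        alloc_mask = gfp_mask;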
      
      Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
      Fixes: 83d4ca81 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19360f0
    • mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang authored
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call
      to zone_statistics().  As discussed at the 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very
      rarely consumed.  The overhead comes mostly from cache bouncing caused
      by the zone counters (NUMA-associated counters) being updated in
      parallel during multi-threaded page allocation (pointed out by Dave
      Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates NUMA statistics from
      the zone statistics framework and updates the NUMA counter threshold to
      a fixed size of MAX_U16 - 2, since a small threshold greatly increases
      the update frequency of the global counter from the local per-cpu
      counter (suggested by Ying Huang).  The rationale is that these
      statistics counters don't need to be read often, unlike other VM
      counters, so it's not a problem to use a large threshold and make
      readers more expensive.
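
      A toy, single-threaded model of why the threshold matters (this is not
      the kernel's vmstat code; it only models how rarely the shared counter
      is touched):

        #include <stdio.h>

        #define THRESHOLD (0xffff - 2)          /* MAX_U16 - 2, as above */

        struct numa_stat {
                long global;                    /* shared, cache-bouncing counter */
                unsigned short percpu_diff;     /* cheap local counter */
        };

        static void inc_numa_stat(struct numa_stat *s)
        {
                unsigned short v = ++s->percpu_diff;

                /* the shared counter is only folded into once per ~64K events */
                if (v > THRESHOLD) {
                        s->global += v;
                        s->percpu_diff = 0;
                }
        }

        int main(void)
        {
                struct numa_stat numa_hit = { 0 };

                for (int i = 0; i < 1000000; i++)
                        inc_numa_stat(&numa_hit);

                printf("global=%ld pending=%u\n", numa_hit.global, numa_hit.percpu_diff);
                return 0;
        }

      With a small threshold the fold in inc_numa_stat() would fire orders of
      magnitude more often, which is exactly the cache-bouncing cost being
      avoided.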
      
      With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369, see
      below) per single page allocation and reclaim on Jesper's page_bench03
      benchmark.  Meanwhile, this patchset keeps the same style of virtual
      memory statistics with little end-user-visible effect (it only moves
      the NUMA stats to show after the zone page stats; see the first patch
      for details).
      
      I did an experiment of concurrent single page allocation and reclaim
      using Jesper's page_bench03 benchmark on a 2-socket Broadwell-based
      server (88 processors with 126G memory) with different sizes of the
      pcp counter threshold.
      
      Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, NUMA statistics are separated from the zone statistics
      framework, and all the call sites of NUMA stats are changed to use
      numa-stats-specific functions.  There is no functional change except
      that the NUMA stats are shown after the zone page stats when users
      *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a321d2a
  17. 07 Sep, 2017 10 commits
    • mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Michal Hocko authored
      For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are a few shortcomings of this
      implementation, though.
      
      First of all and the most serious one is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem while it surely opens up a room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive to starve oom victims completely.
      
      Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
      to half of the reserves.  This makes the access to reserves independent
      of which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well, which will finally make page_alloc.c
      completely TIF_MEMDIE-free.
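
      The watermark treatment can be sketched as follows (simplified; the
      real __zone_watermark_ok() logic has more cases):

        if (alloc_flags & ALLOC_OOM)
                min -= min / 2;         /* oom victims: may use half of the reserve */
        else if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;         /* other "harder" allocations get less */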
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
    • mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko authored
      global_page_state is error prone, as a recent bug report pointed out
      [1].  It only returns proper values for zone-based counters, as the
      enum it gets suggests.  We already have global_node_page_state, so
      let's rename global_page_state to global_zone_page_state to be more
      explicit here.  All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Michal Hocko authored
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably, setup_per_zone_wmarks wants to prevent races between memory
      hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes")).  Let's
      add a private lock for that purpose.  This will not prevent us from
      seeing a halfway-done memory hotplug operation, but that shouldn't be a
      big deal because memory hotplug will update the watermarks explicitly,
      so we will eventually get the full picture.  The lock just makes sure
      we won't race when updating the watermarks, which could lead to weird
      results.
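
      The private lock can be as small as the following sketch (a static lock
      local to the watermark update path; not necessarily the exact upstream
      code):

        static void setup_per_zone_wmarks(void)
        {
                static DEFINE_SPINLOCK(lock);

                spin_lock(&lock);
                __setup_per_zone_wmarks();
                spin_unlock(&lock);
        }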
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b93e0f32
    • mm, page_alloc: remove stop_machine from build_all_zonelists · 11cd8638
      Michal Hocko authored
      build_all_zonelists has been (ab)using stop_machine to make sure that
      zonelists do not change while somebody is looking at them.  This is
      just a gross hack because a) it complicates the context from which we
      can call build_all_zonelists (see 3f906ba2 ("mm/memory-hotplug:
      switch locking to a percpu rwsem")), b) it is not really necessary,
      especially after "mm, page_alloc: simplify zonelist initialization",
      and c) it doesn't really provide the protection it claims (see below).
      
      Updates of the zonelists happen very seldom, basically only when a zone
      becomes populated during memory online or when it loses all the memory
      during offline.  A racing iteration over zonelists could either miss a
      zone or try to work on one zone twice.  Both of these are something we
      can live with occasionally because there will always be at least one
      zone visible so we are not likely to fail allocation too easily for
      example.
      
      Please note that the original stop_machine approach doesn't really
      provide better exclusion, because the iteration might be interrupted
      halfway through (unless the whole iteration is preempt-disabled, which
      is not the case most of the time), so some zones could still be seen
      twice or a zone missed.
      
      I have run the pathological online/offline of the single memblock in the
      movable zone while stressing the same small node with some memory
      pressure.
      
      Node 1, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 943, 943, 943)
      Node 1, zone    DMA32
        pages free     227310
              min      8294
              low      10367
              high     12440
              spanned  262112
              present  262112
              managed  241436
              protection: (0, 0, 0, 0)
      Node 1, zone   Normal
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 0, 1024)
      Node 1, zone  Movable
        pages free     32722
              min      85
              low      117
              high     149
              spanned  32768
              present  32768
              managed  32768
              protection: (0, 0, 0, 0)
      
      root@test1:/sys/devices/system/node/node1# while true
      do
      	echo offline > memory34/state
      	echo online_movable > memory34/state
      done
      
      root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4
      
      and it survived without any unexpected behavior.  While this is not
      really a great testing coverage it should exercise the allocation path
      quite a lot.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11cd8638
    • mm, page_alloc: simplify zonelist initialization · 9d3be21b
      Michal Hocko authored
      build_zonelists gradually builds zonelists from the nearest to the most
      distant node.  As we do not know how many populated zones we will have
      in each node, we rely on the _zoneref to terminate the initialized part
      of the zonelist with a NULL zone.  While this is functionally correct,
      it is quite suboptimal because we cannot allow updaters to race with
      zonelist users: they could see an empty zonelist and fail the
      allocation or hit the OOM killer in the worst case.
      
      We can do much better, though.  We can store the node ordering into an
      already existing node_order array and then give this array to
      build_zonelists_in_node_order and do the whole initialization at once.
      zonelist consumers might still see a halfway-initialized state, but
      that should be much more tolerable because the list will not be empty
      and they would either see some zone twice or skip over some zone(s) in
      the worst case, which shouldn't lead to immediate failures.
      
      While at it, let's simplify build_zonelists_node, which is rather
      confusing now.  It gets an index into the zoneref array and returns the
      updated index for the next iteration.  Let's rename the function to
      build_zonerefs_node to better reflect its purpose and give it the
      zoneref array to update.  The function doesn't return the index
      anymore.  It just returns the number of added zones so that the caller
      can advance the zoneref array start for the next update.
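
      Roughly, the renamed helper takes the shape below (sketch; details may
      differ from the actual patch):

        static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
        {
                struct zone *zone;
                enum zone_type zone_type = MAX_NR_ZONES;
                int nr_zones = 0;

                do {
                        zone_type--;
                        zone = pgdat->node_zones + zone_type;
                        if (managed_zone(zone))
                                zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                } while (zone_type);

                return nr_zones;        /* caller advances its zoneref cursor by this much */
        }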
      
      This patch alone doesn't introduce any functional change yet; it is
      merely preparatory work for later changes.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d3be21b
    • mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Michal Hocko authored
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call
      it from its only user directly.  This also removes a pointless
      zonelists rebuild, which is always good.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72675e13
    • mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization · d9c9a0b9
      Michal Hocko authored
      __build_all_zonelists reinitializes each online cpu's local node for
      CONFIG_HAVE_MEMORYLESS_NODES.  This makes sense because previously
      memoryless nodes could gain some memory during memory hotplug and so
      the local node should be changed for CPUs close to such a node.  It
      makes less sense to do that unconditionally for a newly created NUMA
      node which is still offline and without any memory.
      
      Let's also simplify the cpu loop and use for_each_online_cpu instead of
      an explicit cpu_online check for all possible cpus.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9c9a0b9
    • Michal Hocko's avatar
      mm, page_alloc: remove boot pageset initialization from memory hotplug · afb6ebb3
      Michal Hocko authored
      boot_pageset is a boot time hack which gets superseded by normal
      pagesets later in the boot process.  It makes zero sense to reinitialize
      it again and again during memory hotplug.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      afb6ebb3
    • Michal Hocko's avatar
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko authored
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we
      have.  The primary motivation was bug report [2], which got resolved,
      but the usage of stop_machine is just too ugly to live.  Most patches
      are straightforward but 3 of them need special consideration.
      
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept the sysctl in place and warn into the log if somebody
      tries to configure zone list ordering.  If somebody has a real usecase
      for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS, but care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone ordered zonelists costs us just a lot of code while
      its usefulness is arguable, if existent at all.  Mel has already made
      node ordering the default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fall back to a
      different NUMA node rather than consume precious lowmem zones.
      
      This argument is, however, weakened by the fact that the memory
      reclaim has been reworked to be node rather than zone oriented.  This
      means that lowmem requests have to skip over all highmem pages on LRUs
      already and so zone ordering doesn't save much reclaim time.  So the
      only advantage of the zone ordering is under light memory pressure,
      when highmem requests never fall back into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use a 64b kernel on such HW, I believe we
      should rather care about code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep the sysctl in place and warn if
      somebody tries to set zone ordering either from the kernel command
      line or the sysctl.
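
      A rough sketch of what the warning-only sysctl parsing looks like
      after the change (message text and helper name approximate):

        static int __parse_numa_zonelist_order(char *s)
        {
                /* Only "[Nn]ode" and "[Dd]efault" orderings are recognized now. */
                if (*s != 'd' && *s != 'D' && *s != 'n' && *s != 'N') {
                        pr_warn("Ignoring unsupported numa_zonelist_order value: %s\n", s);
                        return -EINVAL;
                }
                return 0;
        }
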
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • Wei Yang's avatar
      mm/memory_hotplug: just build zonelist for newly added node · c1152583
      Wei Yang authored
      Commit 9adb62a5 ("mm/hotplug: correctly setup fallback zonelists
      when creating new pgdat") tries to build the correct zonelist for a
      newly added node, while it is not necessary to rebuild it for already
      existing nodes.
      
      In build_zonelists(), it will iterate on nodes with memory.  A newly
      added node will not have memory until node_states_set_node() is called
      in online_pages().
      
      This patch avoids rebuilding the zonelists for already existing nodes.
      
      build_zonelists_node() uses managed_zone(zone) checks, so it should not
      include empty zones anyway.  So effectively we avoid some pointless work
      under stop_machine().
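
      Conceptually, the rebuild path only has to special-case a pgdat whose
      node isn't online yet; a sketch with the helpers used in that area
      (exact placement in the code may differ):

        int nid;

        if (self && !node_online(self->node_id)) {
                /* freshly hot-added node: build just its own zonelist */
                build_zonelists(self);
        } else {
                /* otherwise rebuild for all online nodes as before */
                for_each_online_node(nid)
                        build_zonelists(NODE_DATA(nid));
        }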
      
      [akpm@linux-foundation.org: tweak comment text]
      [akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
      Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.comSigned-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1152583
  18. 31 Aug, 2017 1 commit
  19. 25 Aug, 2017 1 commit
    • Chen Yu's avatar
      PM/hibernate: touch NMI watchdog when creating snapshot · 556b969a
      Chen Yu authored
      There is a problem that counting the pages for creating the
      hibernation snapshot can take a significant amount of time, especially
      on systems with large memory.  Since the counting job is performed
      with irqs disabled, this might lead to an NMI lockup.  The following
      warning was found on a system with 1.5TB DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It has taken nearly 20 seconds (2.10GHz CPU), thus the NMI lockup was
      triggered.  In case the timeout of the NMI watchdog has been set to 1
      second, a safe interval should be 6590003/20 = 320k pages in theory.
      However there might also be some platforms running at a lower
      frequency, so feed the watchdog every 100k pages.
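
      A sketch of the resulting pattern in the page-walking loop, assuming a
      power-of-two interval so the check is a cheap decrement rather than a
      modulus (zone is the zone being walked; the loop body is simplified):

        #define WD_PAGE_COUNT   (128 * 1024)

        unsigned long pfn, budget = WD_PAGE_COUNT;

        for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone); pfn++) {
                if (!--budget) {
                        touch_nmi_watchdog();   /* keep the hard-lockup detector quiet */
                        budget = WD_PAGE_COUNT;
                }
                /* per-page snapshot bookkeeping continues as before */
        }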
      
      [yu.c.chen@intel.com: simplification]
        Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.comSigned-off-by: default avatarChen Yu <yu.c.chen@intel.com>
      Reported-by: default avatarJan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      556b969a
  20. 18 Aug, 2017 1 commit
  21. 10 Aug, 2017 3 commits
    • Jonathan Toppins's avatar
      mm: ratelimit PFNs busy info message · 75dddef3
      Jonathan Toppins authored
      The RDMA subsystem can generate several thousand of these messages per
      second, eventually leading to a kernel crash.  Ratelimit these messages
      to prevent this crash.
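
      The fix boils down to switching the message to the ratelimited printk
      helper; a sketch of the call site, where outer_start and end stand for
      the PFN range variables at that spot and the wording is approximate:

        /* was an unconditional pr_info() on every -EBUSY */
        pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
                            __func__, outer_start, end);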
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the printk has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.comSigned-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: default avatarDoug Ledford <dledford@redhat.com>
      Tested-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75dddef3
    • Johannes Weiner's avatar
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner authored
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation-failure info dumps, can cause early -ENOMEM
      from overcommit protection, and can miscalculate image size
      requirements during suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
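
      For example, a global slab read now has to go through the node-level
      aggregate (a sketch; NR_SLAB_* are the existing node stat items):

        /* read the node-level aggregates, not the stale zone-level ones */
        unsigned long slab = global_node_page_state(NR_SLAB_RECLAIMABLE) +
                             global_node_page_state(NR_SLAB_UNRECLAIMABLE);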
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
    • Peter Zijlstra's avatar
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra authored
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc.  I think this should
      be possible, which allows reducing the 'irq' states and will reduce the
      number of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 fewer __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
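
      The 'fake' lock boils down to a static lockdep map that the reclaim
      entry and exit points acquire and release; a sketch with the standard
      lockdep helpers (the wrapper names and the gfp check are illustrative):

        static struct lockdep_map __fs_reclaim_map =
                STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

        void fs_reclaim_acquire(gfp_t gfp_mask)
        {
                if (gfp_mask & __GFP_DIRECT_RECLAIM)
                        lock_map_acquire(&__fs_reclaim_map);
        }

        void fs_reclaim_release(gfp_t gfp_mask)
        {
                if (gfp_mask & __GFP_DIRECT_RECLAIM)
                        lock_map_release(&__fs_reclaim_map);
        }
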
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      d92a8cfc
  22. 03 Aug, 2017 1 commit
    • Heiko Carstens's avatar
      mm: take memory hotplug lock within numa_zonelist_order_handler() · 167d0f25
      Heiko Carstens authored
      Andre Wild reported the following warning:
      
        WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
        Modules linked in:
        CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57 #10
        Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
        task: 00000000701d8100 task.stack: 0000000073594000
        Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
        ...
        Call Trace:
         lockdep_assert_cpus_held+0x42/0x60)
         stop_machine_cpuslocked+0x62/0xf0
         build_all_zonelists+0x92/0x150
         numa_zonelist_order_handler+0x102/0x150
         proc_sys_call_handler.isra.12+0xda/0x118
         proc_sys_write+0x34/0x48
         __vfs_write+0x3c/0x178
         vfs_write+0xbc/0x1a0
         SyS_write+0x66/0xc0
         system_call+0xc4/0x2b0
         locks held by bash/1205:
         #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
         #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
         #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
        Last Breaking-Event-Address:
          lockdep_assert_cpus_held+0x48/0x60
      
      This can be easily triggered with e.g.
      
          echo n > /proc/sys/vm/numa_zonelist_order
      
      In commit 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu
      rwsem") memory hotplug locking was changed to fix a potential deadlock.
      
      This also switched the stop_machine() invocation within
      build_all_zonelists() to stop_machine_cpuslocked() which now expects
      that online cpus are locked when being called.
      
      This assumption is not true if build_all_zonelists() is being called
      from numa_zonelist_order_handler().
      
      In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
      pair to numa_zonelist_order_handler().
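
      Sketched, the handler's write path simply brackets the rebuild with
      the hotplug lock (surrounding handler logic elided):

        /* numa_zonelist_order_handler(), after parsing the new value */
        mem_hotplug_begin();            /* also takes the cpu hotplug lock */
        build_all_zonelists(NULL, NULL);
        mem_hotplug_done();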
      
      Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
      Fixes: 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu rwsem")
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: default avatarAndre Wild <wild@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      167d0f25
  23. 12 Jul, 2017 1 commit
    • Michal Hocko's avatar
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics
      in the page allocator.  This has been true, but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator forever, similarly to __GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of
      success.  This will work independently of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode which
         doesn't even kick the background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
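
      As a usage illustration of the options above, a caller that wants the
      allocator to try hard but has its own fallback might do something like
      this (the buf and size names are made up):

        /* try hard for kmalloc memory, but tolerate failure */
        buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
        if (!buf)
                buf = vmalloc(size);    /* caller-defined fallback path */
        if (!buf)
                return -ENOMEM;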
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had this semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than to keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  24. 10 Jul, 2017 1 commit
    • Thomas Gleixner's avatar
      mm/memory-hotplug: switch locking to a percpu rwsem · 3f906ba2
      Thomas Gleixner authored
      Andrey reported a potential deadlock with the memory hotplug lock and
      the cpu hotplug lock.
      
      The reason is that memory hotplug takes the memory hotplug lock and
      then calls stop_machine() which calls get_online_cpus().  That's the
      reverse lock order to get_online_cpus(); get_online_mems(); in
      mm/slab_common.c

      The problem has been there forever.  The reason why this was never
      reported is that the cpu hotplug locking had this homebrew recursive
      reader/writer semaphore construct which, due to the recursion, evaded
      full lockdep coverage.  The memory hotplug code copied that construct
      verbatim and therefore has similar issues.
      
      Three steps to fix this:
      
      1) Convert the memory hotplug locking to a per cpu rwsem so the
         potential issues get reported proper by lockdep.
      
      2) Lock the online cpus in mem_hotplug_begin() before taking the memory
         hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
         code to avoid recursive locking.
      
      3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
         hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
         by invoking lru_add_drain_all_cpuslocked() instead.
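
      Sketched with the percpu-rwsem and cpu hotplug primitives the steps
      above refer to, the begin/end pair becomes roughly:

        DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);

        void mem_hotplug_begin(void)
        {
                cpus_read_lock();               /* step 2: cpu hotplug lock first */
                percpu_down_write(&mem_hotplug_lock);
        }

        void mem_hotplug_done(void)
        {
                percpu_up_write(&mem_hotplug_lock);
                cpus_read_unlock();
        }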
      
      Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.deReported-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f906ba2