1. 30 Nov, 2017 1 commit
    • Michal Hocko's avatar
      Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" · 90daf306
      Michal Hocko authored
      This reverts commit 0f6d24f8 ("mm/page-writeback.c: print a warning
      if the vm dirtiness settings are illogical") because it causes false
      positive warnings during OOM situations as noticed by Tetsuo Handa:
      
        Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
        Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 2687 3645 3645
        Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 0 958 958
        Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        [...]
        Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
        Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        vm direct limit must be set greater than background limit.
      
      The problem is that both thresh and bg_thresh will be 0 if
      available_memory is less than 4 pages when evaluating
      global_dirtyable_memory.
      
      While this might be worked around the whole point of the warning is
      dubious at best.  We do rely on admins to do sensible things when
      changing tunable knobs.  Dirty memory writeback knobs are not any
      special in that regards so revert the warning rather than adding more
      hacks to work this around.
      
      Debugged by Yafang Shao.
      
      Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz
      Fixes: 0f6d24f8 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90daf306
  2. 21 Nov, 2017 1 commit
    • Kees Cook's avatar
      block/laptop_mode: Convert timers to use timer_setup() · bca237a5
      Kees Cook authored
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      bca237a5
  3. 16 Nov, 2017 8 commits
  4. 14 Oct, 2017 1 commit
  5. 09 Oct, 2017 1 commit
  6. 03 Oct, 2017 2 commits
  7. 07 Sep, 2017 1 commit
  8. 18 Aug, 2017 1 commit
    • Johannes Weiner's avatar
      mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner authored
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but less likely
      to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: default avatarBradley Bolen <bradleybolen@gmail.com>
      Tested-by: default avatarBrad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      739f79fc
  9. 12 Jul, 2017 1 commit
  10. 06 Jul, 2017 1 commit
  11. 05 Jul, 2017 2 commits
  12. 03 May, 2017 3 commits
  13. 28 Apr, 2017 1 commit
    • Theodore Ts'o's avatar
      mm: retry writepages() on ENOMEM when doing an data integrity writeback · 80a2ea9f
      Theodore Ts'o authored
      Currently, file system's writepages() function must not fail with an
      ENOMEM, since if they do, it's possible for buffered data to be lost.
      This is because on a data integrity writeback writepages() gets called
      but once, and if it returns ENOMEM, if you're lucky the error will get
      reflected back to the userspace process calling fsync().  If you
      aren't lucky, the user is unmounting the file system, and the dirty
      pages will simply be lost.
      
      For this reason, file system code generally will use GFP_NOFS, and in
      some cases, will retry the allocation in a loop, on the theory that
      "kernel livelocks are temporary; data loss is forever".
      Unfortunately, this can indeed cause livelocks, since inside the
      writepages() call, the file system is holding various mutexes, and
      these mutexes may prevent the OOM killer from killing its targetted
      victim if it is also holding on to those mutexes.
      
      A better solution would be to allow writepages() to call the memory
      allocator with flags that give greater latitude to the allocator to
      fail, and then release its locks and return ENOMEM, and in the case of
      background writeback, the writes can be retried at a later time.  In
      the case of data-integrity writeback retry after waiting a brief
      amount of time.
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      80a2ea9f
  14. 02 Mar, 2017 1 commit
  15. 28 Feb, 2017 1 commit
  16. 25 Feb, 2017 1 commit
  17. 02 Feb, 2017 1 commit
  18. 15 Dec, 2016 1 commit
  19. 08 Nov, 2016 1 commit
  20. 08 Oct, 2016 2 commits
    • Huang Ying's avatar
      mm: don't use radix tree writeback tags for pages in swap cache · 371a096e
      Huang Ying authored
      File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
      etc.) to accelerate finding the pages with a specific tag in the radix
      tree during inode writeback.  But for anonymous pages in the swap cache,
      there is no inode writeback.  So there is no need to find the pages with
      some writeback tags in the radix tree.  It is not necessary to touch
      radix tree writeback tags for pages in the swap cache.
      
      Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
      introduced for address spaces which don't need to update the writeback
      tags.  The flag is set for swap caches.  It may be used for DAX file
      systems, etc.
      
      With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
      ~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes.
      The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
      simulated PMEM (persistent memory) device.  The improvement comes from
      the reduced contention on the swap cache radix tree lock.  To test
      sequential swapping out, the test case uses 8 processes, which
      sequentially allocate and write to the anonymous pages until RAM and
      part of the swap device is used up.
      
      Details of comparison is as follow,
      
      base             base+patch
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
         2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
         1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
           10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
           10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
           10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
           10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
      
      Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      371a096e
    • Michal Hocko's avatar
      mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Michal Hocko authored
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback therefore LRUs would be full of unreclaimable pages
      until the IO completes and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback which
      signals that that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus unefficient which would
      just make the problem worse.
      
      This three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure so yet another
      throttling point doesn't really seem helpful.  Quite contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently to throttle_vm_writeout.  If there are too many dirty or
      pages under writeback it will get throttled even though it is in fact a
      flusher to clear pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very much
      helpful anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which has introduced that [1].  Small
      virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
      rather slow NFS in a sync mode on the host) with 8 parallel writers each
      writing 1G worth of data.  As soon as the pagecache fills up and the
      direct reclaim hits then I start anon memory consumer in a loop
      (allocating 300M and exiting after populating it) in the background to
      make the memory pressure even stronger as well as to disrupt the steady
      state for the IO.  The direct reclaim is throttled because of the
      congestion as well as kswapd hitting congestion_wait due to nr_immediate
      but throttle_vm_writeout doesn't ever trigger the sleep throughout the
      test.  Dirty+writeback are close to nr_dirty_threshold with some
      fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf484383
  21. 06 Sep, 2016 1 commit
  22. 28 Jul, 2016 7 commits