1. 15 Mar, 2016 2 commits
    • mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
      Joonsoo Kim authored
      There is a report of a performance drop during hugepage allocation,
      where half of the CPU time is spent in pageblock_pfn_to_page() during
      compaction [1].
      
      In that workload, compaction is triggered to make hugepages, but most
      pageblocks are unavailable for compaction due to their pageblock type
      and skip bit, so compaction usually fails.  The most costly operation
      in this case is finding a valid pageblock while scanning the whole zone
      range.  To check whether a pageblock is valid to compact, a valid pfn
      within the pageblock is required, and we can obtain it by calling
      pageblock_pfn_to_page().  This function checks whether the pageblock is
      within a single zone and returns a valid pfn if possible.  The problem
      is that we need to perform this check every time before scanning a
      pageblock, even if we re-visit it, and this turns out to be very
      expensive in this workload.
      
      Although we have no way to skip this pageblock check on systems where
      holes exist at arbitrary positions, we can use a cached value for zone
      contiguity and just do pfn_to_page() on systems where no hole exists.
      This optimization considerably speeds up the above workload.
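
      A rough sketch of the fast path this enables (the field and helper names
      below are illustrative, not necessarily the exact ones in the patch):

        /* Hedged sketch: zone->contiguous is assumed to be a cached
         * "this zone has no holes" flag, set once the zone layout is known. */
        static inline struct page *
        pageblock_pfn_to_page(unsigned long start_pfn, unsigned long end_pfn,
                              struct zone *zone)
        {
                if (zone->contiguous)
                        return pfn_to_page(start_pfn);  /* cheap direct translation */

                /* slow path: verify the whole pageblock lies within one zone */
                return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
        }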
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
        Min:  635 MB/s vs 1015 MB/s
        Avg:  899 MB/s vs 1194 MB/s
      
      Avg is improved by roughly 30% [2].
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • mm, printk: introduce new format string for flags · edf14cdb
      Vlastimil Babka authored
      In mm we use several kinds of flags bitfields that are sometimes printed
      for debugging purposes, or exported to userspace via sysfs.  To make
      them easier to interpret independently of kernel version and config, we
      want to also dump the symbolic flag names.  So far this has been done
      with repeated calls to pr_cont(), which is unreliable on SMP, and not
      usable for e.g. sysfs export.
      
      To get a more reliable and universal solution, this patch extends
      printk() format string for pointers to handle the page flags (%pGp),
      gfp_flags (%pGg) and vma flags (%pGv).  Existing users of
      dump_flag_names() are converted and simplified.
      
      It would be possible to pass the flags by value instead of by pointer,
      but the %p format string for pointers already has extensions for various
      kernel structures, so it's a good fit, and the extra indirection in a
      non-critical path is negligible.
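
      A short usage sketch of the new specifiers (the flags are passed by
      pointer, as noted above; the surrounding variables are assumed context):

        pr_info("page flags: %pGp\n", &page->flags);      /* symbolic page flags */
        pr_info("gfp mask:   %pGg\n", &gfp_mask);         /* symbolic gfp flags */
        pr_info("vma flags:  %pGv\n", &vma->vm_flags);    /* symbolic vma flags */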
      
      [linux@rasmusvillemoes.dk: lots of good implementation suggestions]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edf14cdb
  2. 03 Feb, 2016 2 commits
  3. 16 Jan, 2016 2 commits
    • thp: reintroduce split_huge_page() · e9b61f19
      Kirill A. Shutemov authored
      This patch adds an implementation of split_huge_page() for the new
      refcounting.
      
      Unlike the previous implementation, the new split_huge_page() can fail
      if somebody holds a GUP pin on the page.  It also means that a pin on a
      page will prevent it from being split under you.  It makes the situation
      in many places much cleaner.
      
      The basic scheme of split_huge_page() (a rough pseudocode sketch follows
      the list):

        - Check that the sum of mapcounts of all subpages is equal to
          page_count() plus one (the caller's pin).  Fall off with -EBUSY.
          This way we can avoid useless PMD-splits.
      
        - Freeze the page counters by splitting all PMDs and setting up
          migration PTEs.

        - Re-check the sum of mapcounts against page_count().  The page's
          counts are stable now.  Return -EBUSY if the page is pinned.
      
        - Split compound page.
      
        - Unfreeze the page by removing migration entries.
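
      A rough pseudocode rendering of the scheme above; the helpers
      (page_pins_match_mapcounts(), freeze_page(), unfreeze_page(),
      __split_huge_page()) are illustrative placeholders rather than the
      exact functions used by the patch:

        int split_huge_page(struct page *page)
        {
                /* 1: bail out early if extra GUP pins are present */
                if (!page_pins_match_mapcounts(page))
                        return -EBUSY;

                /* 2: split all PMDs and install migration PTEs */
                freeze_page(page);

                /* 3: re-check now that the counts are stable */
                if (!page_pins_match_mapcounts(page)) {
                        unfreeze_page(page);
                        return -EBUSY;
                }

                /* 4: split the compound page */
                __split_huge_page(page);

                /* 5: remove the migration entries again */
                unfreeze_page(page);
                return 0;
        }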
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9b61f19
    • mm: drop tail page refcounting · ddc58f27
      Kirill A. Shutemov authored
      Tail page refcounting is utterly complicated and painful to support.
      
      It uses ->_mapcount on tail pages to store how many times the page is
      pinned.  get_page() bumps ->_mapcount on the tail page in addition to
      ->_count on the head.  This information is required by split_huge_page()
      to be able to distribute pins from the head of the compound page to the
      tails during the split.

      We will need ->_mapcount to account for PTE mappings of subpages of the
      compound page.  We eliminate the need for the current meaning of
      ->_mapcount in tail pages by forbidding the split entirely if the page
      is pinned.
      
      The only user of tail page refcounting is THP which is marked BROKEN for
      now.
      
      Let's drop all this mess.  It makes get_page() and put_page() much
      simpler.
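
      As a rough illustration of the simplification (debug checks omitted),
      get_page() no longer touches the tail ->_mapcount and just pins the
      compound head:

        static inline void get_page(struct page *page)
        {
                page = compound_head(page);     /* tail pages carry no pin count */
                atomic_inc(&page->_count);      /* single reference on the head */
        }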
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddc58f27
  4. 15 Jan, 2016 2 commits
  5. 07 Nov, 2015 4 commits
  6. 06 Nov, 2015 1 commit
    • mm: page migration fix PageMlocked on migrated pages · 51afb12b
      Hugh Dickins authored
      Commit e6c509f8 ("mm: use clear_page_mlock() in page_remove_rmap()")
      in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
      unmaps the page from userspace before migrating, and that commit clears
      PageMlocked on the final unmap, leaving mlock_migrate_page() with
      nothing to do.  Not a serious bug, the next attempt at reclaiming the
      page would fix it up; but a betrayal of page migration's intent - the
      new page ought to emerge as PageMlocked.
      
      I don't see how to fix it for mlock_migrate_page() itself; but it is
      easily fixed in remove_migration_pte(), by calling mlock_vma_page() when
      the vma is VM_LOCKED - under pte lock as in try_to_unmap_one().
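
      A minimal sketch of that fix inside remove_migration_pte(), while the
      pte lock is held ("new" being the page the migration entry is replaced
      by; treat the exact placement as an assumption):

        if (vma->vm_flags & VM_LOCKED)
                mlock_vma_page(new);    /* new page emerges as PageMlocked */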
      
      Delete mlock_migrate_page()?  Not quite, it does still serve a purpose for
      migrate_misplaced_transhuge_page(): where we could replace it by a test,
      clear_page_mlock(), mlock_vma_page() sequence; but would that be an
      improvement?  mlock_migrate_page() is fairly lean, and let's make it
      leaner by skipping the irq save/restore now clearly not needed.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51afb12b
  7. 08 Sep, 2015 1 commit
    • mm/compaction: correct to flush migrated pages if pageblock skip happens · 1a16718c
      Joonsoo Kim authored
      We cache isolate_start_pfn before entering isolate_migratepages().  If a
      pageblock is skipped in isolate_migratepages() for whatever reason,
      cc->migrate_pfn can end up far from isolate_start_pfn, and hence we
      flush pages that were freed.  For example, the following scenario is
      possible:
      
      - assume order-9 compaction, pageblock order is 9
      - start_isolate_pfn is 0x200
      - isolate_migratepages()
        - skip a number of pageblocks
        - start to isolate from pfn 0x600
        - cc->migrate_pfn = 0x620
        - return
      - last_migrated_pfn is set to 0x200
      - check flushing condition
        - current_block_start is set to 0x600
        - last_migrated_pfn < current_block_start then do useless flush
      
      This wrong flush doesn't help performance or the success rate, so this
      patch tries to fix it.  One simple way to know the exact position where
      we start to isolate migratable pages is to cache it in
      isolate_migratepages() before entering the actual isolation.  This patch
      implements that and fixes the problem.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a16718c
  8. 04 Sep, 2015 2 commits
    • mm: defer flush of writable TLB entries · d950c947
      Mel Gorman authored
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
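
      A minimal sketch of the idea on the unmap side (helper and variable
      names are assumptions modelled on the description): a dirty pte is
      treated as evidence of a writable cached TLB entry, and the batched
      flush is marked as one that must complete before IO or freeing.

        /* while batching the deferred TLB flush during unmap */
        bool writable = pte_dirty(pteval);      /* dirty => recently writable */
        set_tlb_ubc_flush_pending(mm, page, writable);

        /* later, before starting IO on or freeing a potentially dirty page */
        try_to_unmap_flush_dirty();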
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d950c947
    • mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman authored
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accessed by other CPUs.  There are many circumstances where
      this happens, but the obvious one is kswapd reclaiming pages belonging
      to a running process, as kswapd and the task are likely running on
      separate CPUs.
      
      On small machines, this is not a significant problem but as machines get
      larger, with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the
      unmapping is complete, the full TLB is flushed on the assumption that a
      refill cost is lower than flushing individual entries.
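
      A rough sketch of the tracking structure described above (field names
      are illustrative):

        struct tlbflush_unmap_batch {
                struct cpumask cpumask;         /* CPUs that may cache stale entries */
                bool flush_required;            /* at least one page was unmapped */
        };

        /* on unmap: remember the mm's users; on completion: flush them all */
        cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
        batch->flush_required = true;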
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupted 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or
      with relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b252ae
  9. 01 Jul, 2015 5 commits
  10. 15 Apr, 2015 1 commit
  11. 14 Apr, 2015 2 commits
    • mm/compaction: enhance compaction finish condition · 2149cdae
      Joonsoo Kim authored
      Compaction has an anti-fragmentation algorithm: if we don't find any
      freepage in the requested migratetype's buddy list, a freepage larger
      than pageblock order is required to finish the compaction.  This is for
      mitigating fragmentation, but there is a lack of migratetype
      consideration and it is too excessive compared to the page allocator's
      anti-fragmentation algorithm.
      
      Not considering migratetype would cause a premature finish of
      compaction.  For example, if the allocation request is for the unmovable
      migratetype, a freepage with CMA migratetype doesn't help that
      allocation, and compaction should not be stopped.  But the current logic
      regards this situation as compaction being no longer needed, so it
      finishes the compaction.
      
      Secondly, the condition is too excessive compared to the page
      allocator's logic.  The page allocator can steal a freepage from another
      migratetype and change the pageblock migratetype under more relaxed
      conditions.  This is designed to prevent fragmentation and we can use it
      here.  Imposing a hard constraint only on compaction doesn't help much
      in this case since the page allocator would cause fragmentation again.
      
      To solve these problems, this patch borrows the anti-fragmentation logic
      from the page allocator.  It will reduce premature compaction finishes
      in some cases and reduce excessive compaction work.
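
      A rough sketch of the borrowed check in the compaction-finished path
      (find_suitable_fallback() is the page allocator helper factored out by
      this series; treat the exact signature and placement as assumptions):

        /* for each order from cc->order up to MAX_ORDER - 1 */
        struct free_area *area = &zone->free_area[order];
        bool can_steal;

        /* the requested migratetype already has a suitable free page */
        if (!list_empty(&area->free_list[migratetype]))
                return COMPACT_PARTIAL;

        /* otherwise: could the page allocator steal a fallback pageblock
         * here, the same way its anti-fragmentation logic would? */
        if (find_suitable_fallback(area, order, migratetype, true, &can_steal) != -1)
                return COMPACT_PARTIAL;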
      
      A stress-highalloc test in mmtests with non-movable order-7 allocation
      shows a considerable increase of the compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      31.82 : 42.20
      
      I ran 5 non-reboot runs of the stress-highalloc benchmark and found that
      there is no more degradation of the allocation success rate than before.
      That roughly means that this patch doesn't result in more fragmentation.

      Vlastimil suggested an additional idea: only test for fallbacks when the
      migration scanner has scanned a whole pageblock.  It looked good for
      fragmentation because the chance of stealing increases due to making
      more free pages in a certain pageblock.  So I tested it, but it results
      in a decreased compaction success rate, roughly 38.00.  I guess the
      reason is that if the system is in a low-memory condition, the watermark
      check could fail due to not enough order-0 free pages and so, sometimes,
      we can't reach a fallback check although migrate_pfn is aligned to
      pageblock_nr_pages.  I could insert code to cope with this situation but
      it makes the code more complicated, so I don't include his idea in this
      patch.
      
      [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2149cdae
    • mm: rename __mlock_vma_pages_range() to populate_vma_page_range() · fc05f566
      Kirill A. Shutemov authored
      __mlock_vma_pages_range() doesn't necessarily mlock pages.  It depends on
      vma flags.  The same codepath is used for MAP_POPULATE.
      
      Let's rename __mlock_vma_pages_range() to populate_vma_page_range().
      
      This patch also drops mlock_vma_pages_range() references from the
      documentation.  That function was removed in cea10a19 ("mm: directly use
      __mlock_vma_pages_range() in find_extend_vma()").
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc05f566
  12. 13 Feb, 2015 1 commit
    • mm/internal.h: don't split printk call in two · fc5199d1
      Rasmus Villemoes authored
      All users of mminit_dprintk pass a compile-time constant as level, so this
      just makes gcc emit a single printk call instead of two.
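
      Roughly, the macro can expand to a single call per branch when "level"
      is a compile-time constant (the body below is illustrative, not the
      exact patch):

        #define mminit_dprintk(level, prefix, fmt, arg...)                     \
        do {                                                                   \
                if (level < mminit_loglevel) {                                 \
                        if (level <= MMINIT_WARNING)                           \
                                printk(KERN_WARNING "mminit::" prefix " " fmt, \
                                       ##arg);                                 \
                        else                                                   \
                                printk(KERN_DEBUG "mminit::" prefix " " fmt,   \
                                       ##arg);                                 \
                }                                                              \
        } while (0)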
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc5199d1
  13. 12 Feb, 2015 1 commit
    • mm: reduce try_to_compact_pages parameters · 1a6d53a1
      Vlastimil Babka authored
      Expand the usage of the struct alloc_context introduced in the previous
      patch also to calling try_to_compact_pages(), to reduce the number of
      its parameters.  Since the function is in a different compilation unit,
      we need to move the alloc_context definition to the shared mm/internal.h
      header.
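
      For reference, a sketch of roughly what the shared context carries (the
      exact field list is an assumption based on the allocation request
      described in the previous patch):

        /* moved to mm/internal.h so compaction code can see it too */
        struct alloc_context {
                struct zonelist *zonelist;
                nodemask_t *nodemask;
                struct zone *preferred_zone;
                int classzone_idx;
                int migratetype;
                enum zone_type high_zoneidx;
        };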
      
      With this change we get simpler code and small savings in code size and
      stack usage:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
      function                                     old     new   delta
      __alloc_pages_direct_compact                 283     256     -27
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
      function                                     old     new   delta
      try_to_compact_pages                         582     569     -13
      
      Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
      scripts/checkstack.pl).
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a6d53a1
  14. 11 Dec, 2014 2 commits
    • mm, compaction: always update cached scanner positions · 6bace090
      Vlastimil Babka authored
      Compaction caches the migration and free scanner positions between
      compaction invocations, so that the whole zone gets eventually scanned and
      there is no bias towards the initial scanner positions at the
      beginning/end of the zone.
      
      The cached positions are continuously updated as scanners progress and the
      updating stops as soon as a page is successfully isolated.  The reasoning
      behind this is that a pageblock where isolation succeeded is likely to
      succeed again in near future and it should be worth revisiting it.
      
      However, the downside is that potentially many pages are rescanned
      without successful isolation.  At worst, there might be a page where
      isolation from the LRU succeeds but migration fails (potentially
      always).  So upon encountering this page, the cached position would
      always stop being updated for no good reason.  It might have been useful
      to let such a page be
      rescanned with sync compaction after async one failed, but this is now
      handled by caching scanner position for async and sync mode separately
      since commit 35979ef3 ("mm, compaction: add per-zone migration pfn
      cache for async compaction").
      
      After this patch, cached positions are updated unconditionally.  In the
      stress-highalloc benchmark, this has decreased the number of scanned
      pages by a few percent, without affecting allocation success rates.
      
      To prevent free scanner from leaving free pages behind after they are
      returned due to page migration failure, the cached scanner pfn is changed
      to point to the pageblock of the returned free page with the highest pfn,
      before leaving compact_zone().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bace090
    • mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka authored
      Compaction relies on zone watermark checks for decisions such as whether
      it's worth starting to compact in compaction_suitable() or whether
      compaction should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the
      memory allocation request.  But from the context of compaction they are
      currently passed as 0, including in direct compaction, which is invoked
      to satisfy the allocation request and could therefore know the proper
      values.
      
      The lack of proper values can lead to a mismatch between decisions taken
      during compaction and decisions related to the allocation request.  The
      lack of a proper classzone_idx value means that lowmem_reserve is not
      taken into account.  This has manifested (during recent changes to
      deferred compaction) when the DMA zone was used as a fallback for the
      preferred Normal zone.  compaction_suitable() without a proper
      classzone_idx would think that the watermarks are already satisfied, but
      the watermark check in get_page_from_freelist() would fail.  Because of
      this problem, deferring compaction has extra complexity that can be
      removed in the following patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is
      opposite in nature.  For allocations that include ALLOC_HIGH,
      ALLOC_HARDER or ALLOC_CMA in alloc_flags (the last includes all MOVABLE
      allocations on CMA-enabled systems), the watermark checking in
      compaction with 0 passed will be stricter than in
      get_page_from_freelist().  In these cases compaction might be running
      for a longer time than is really needed.
      
      Another issue with compaction_suitable() is that the check for "does the
      zone need compaction at all?" comes only after the check "does the zone
      have enough free pages to succeed compaction".  The latter considers
      extra pages for migration and can therefore in some situations fail and
      return COMPACT_SKIPPED, although the high-order allocation would succeed
      and we should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could be.  This can be fixed later, if it's shown to be an issue.
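
      A sketch of the plumbing (member and parameter names as described
      above; exact types and ordering are assumptions):

        struct compact_control {
                /* ... existing members ... */
                int order;              /* order of the allocation request */
                int alloc_flags;        /* alloc flags of the request */
                int classzone_idx;      /* zone index of the request */
        };

        /* the watermark checks now see the real allocation context */
        unsigned long compaction_suitable(struct zone *zone, int order,
                                          int alloc_flags, int classzone_idx);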
      
      Additionally, the checks in compaction_suitable() are reordered to
      address the second issue described above.
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebff3980
  15. 14 Nov, 2014 1 commit
    • mm/page_alloc: restrict max order of merging on isolated pageblock · 3c605096
      Joonsoo Kim authored
      The current pageblock isolation logic can isolate each pageblock
      individually.  This causes a freepage accounting problem if a freepage
      with pageblock order on an isolated pageblock is merged with another
      freepage on a normal pageblock.  We can prevent merging by restricting
      the max order of merging to pageblock order if the freepage is on an
      isolated pageblock.
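
      A rough sketch of the restriction in the buddy merging loop (placement
      and exact form are illustrative):

        /* in __free_one_page(), before walking up the buddy orders */
        unsigned int max_order = MAX_ORDER;

        if (is_migrate_isolate(migratetype))
                max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);

        while (order < max_order - 1) {
                /* merge with the buddy as usual */
        }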
      
      A side-effect of this change is that there can be non-merged buddy
      freepages even after pageblock isolation is finished, because undoing
      pageblock isolation just moves freepages from the isolate buddy list to
      the normal buddy list without considering merging.  So, the patch also
      makes undoing pageblock isolation consider freepage merging.  When
      un-isolating, freepages with more than pageblock order and their buddies
      are checked.  If they are on a normal pageblock, instead of just moving
      them, we isolate the freepage and free it in order to get it merged.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c605096
  16. 10 Oct, 2014 4 commits
    • mm, compaction: pass gfp mask to compact_control · 6d7ce559
      David Rientjes authored
      struct compact_control currently converts the gfp mask to a migratetype,
      but we need the entire gfp mask in a follow-up patch.
      
      Pass the entire gfp mask as part of struct compact_control.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d7ce559
    • mm, compaction: skip buddy pages by their order in the migrate scanner · 99c0fd5e
      Vlastimil Babka authored
      The migration scanner skips PageBuddy pages, but does not consider their
      order as checking page_order() is generally unsafe without holding the
      zone->lock, and acquiring the lock just for the check wouldn't be a good
      tradeoff.
      
      Still, this could avoid some iterations over the rest of the buddy page,
      and if we are careful, the race window between the PageBuddy() check and
      page_order() is small, and the worst thing that can happen is that we
      skip too much and miss some isolation candidates.  This is not that bad,
      as compaction can already fail for many other reasons like parallel
      allocations, and those have a much larger race window.
      
      This patch therefore makes the migration scanner obtain the buddy page
      order and use it to skip the whole buddy page, if the order appears to be
      in the valid range.
      
      It's important that the page_order() is read only once, so that the value
      used in the checks and in the pfn calculation is the same.  But in theory
      the compiler can replace the local variable by multiple inlines of
      page_order().  Therefore, the patch introduces page_order_unsafe() that
      uses ACCESS_ONCE to prevent this.
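
      A sketch of the helper and of the skip logic in the migration scanner,
      close to the description above (details are illustrative):

        /* single volatile read: the checks and the pfn skip see one value */
        static inline unsigned int page_order_unsafe(struct page *page)
        {
                return ACCESS_ONCE(page_private(page));
        }

        /* in the migration scanner */
        if (PageBuddy(page)) {
                unsigned long freepage_order = page_order_unsafe(page);

                if (freepage_order > 0 && freepage_order < MAX_ORDER)
                        low_pfn += (1UL << freepage_order) - 1; /* skip the buddy */
                continue;
        }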
      
      Testing with stress-highalloc from mmtests shows a 15% reduction in the
      number of pages scanned by the migration scanner.  The reduction is >60%
      with __GFP_NO_KSWAPD allocations, along with success rates better by a
      few percent.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99c0fd5e
    • mm, compaction: khugepaged should not give up due to need_resched() · 1f9efdef
      Vlastimil Babka authored
      Async compaction aborts when it detects zone lock contention or
      need_resched() is true.  David Rientjes has reported that in practice,
      most direct async compactions for THP allocation abort due to
      need_resched().  This means that a second direct compaction is never
      attempted, which might be OK for a page fault, but khugepaged is
      intended to attempt a sync compaction in such cases, and here it won't.
      
      This patch replaces "bool contended" in compact_control with an int that
      distinguishes between aborting due to need_resched() and aborting due to
      lock contention.  This allows propagating the abort through all compaction
      functions as before, but passing the abort reason up to
      __alloc_pages_slowpath() which decides when to continue with direct
      reclaim and another compaction attempt.
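
      A sketch of that replacement (the names follow the COMPACT_CONTENDED_*
      values mentioned below; treat the exact enum as an assumption):

        enum compact_contended {
                COMPACT_CONTENDED_NONE,  /* no contention detected */
                COMPACT_CONTENDED_LOCK,  /* zone lock or lru_lock was contended */
                COMPACT_CONTENDED_SCHED, /* need_resched() or fatal signal pending */
        };

        struct compact_control {
                /* ... */
                enum compact_contended contended;       /* was: bool contended */
        };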
      
      Another problem is that try_to_compact_pages() did not act upon the
      reported contention (both need_resched() and lock contention)
      immediately and would proceed with another zone from the zonelist.  When
      need_resched() is true, that means initializing another zone compaction,
      only to check need_resched() again in isolate_migratepages() and abort.
      For zone lock contention, the unintended consequence is that the lock
      contended status reported back to the allocator is determined from the
      last zone where compaction was attempted, which is rather arbitrary.
      
      This patch fixes the problem in the following way:
      - async compaction of a zone aborting due to need_resched() or fatal signal
        pending means that further zones should not be tried. We report
        COMPACT_CONTENDED_SCHED to the allocator.
      - aborting zone compaction due to lock contention means we can still try
        another zone, since it has a different set of locks. We report back
        COMPACT_CONTENDED_LOCK only if compaction was aborted due to lock
        contention in *all* zones where it was attempted.
      
      As a result of these fixes, khugepaged will proceed with second sync
      compaction as intended, when the preceding async compaction aborted due to
      need_resched().  Page fault compactions aborting due to need_resched()
      will spare some cycles previously wasted by initializing another zone
      compaction only to abort again.  Lock contention will be reported only
      when compaction in all zones aborted due to lock contention, and therefore
      it's not a good idea to try again after reclaim.
      
      In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
      has improved the number of THP collapse allocations by 10%, which shows
      a positive effect on khugepaged.  The benchmark's success rates are
      unchanged as it is not recognized as khugepaged.  Numbers of
      compact_stall and compact_fail events have however decreased by 20%,
      with compact_success still a bit improved, which is good.  With the
      benchmark configured not to use __GFP_NO_KSWAPD, there is a 6%
      improvement in THP collapse allocations, and only a slight improvement
      in stalls and failures.
      
      [akpm@linux-foundation.org: fix warnings]
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f9efdef
    • mm, compaction: move pageblock checks up from isolate_migratepages_range() · edc2ca61
      Vlastimil Babka authored
      isolate_migratepages_range() is the main function of the compaction
      scanner, called either on a single pageblock by isolate_migratepages()
      during regular compaction, or on an arbitrary range by CMA's
      __alloc_contig_migrate_range().  It currently performs two
      pageblock-wide compaction suitability checks, and because of the CMA
      callpath, it tracks whether it crossed a pageblock boundary in order to
      repeat those checks.
      
      However, closer inspection shows that those checks are always true for CMA:
      - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
      - migrate_async_suitable() check is skipped because CMA uses sync compaction
      
      We can therefore move the compaction-specific checks to
      isolate_migratepages() and simplify isolate_migratepages_range().
      Furthermore, we can mimic the freepage scanner family of functions, which
      has isolate_freepages_block() called both by compaction from
      isolate_freepages() and by CMA from isolate_freepages_range(), where each
      use case adds its own specific glue code.  This allows further code
      simplification.
      
      Thus, we rename isolate_migratepages_range() to
      isolate_migratepages_block() and limit its functionality to a single
      pageblock (or its subset).  For CMA, a new different
      isolate_migratepages_range() is created as a CMA-specific wrapper for the
      _block() function.  The checks specific to compaction are moved to
      isolate_migratepages().  As part of the unification of these two families
      of functions, we remove the redundant zone parameter where applicable,
      since zone pointer is already passed in cc->zone.
      
      Furthermore, going back to compact_zone() and compact_finished() when a
      pageblock is found unsuitable (now by isolate_migratepages()) is
      wasteful - the checks are meant to skip pageblocks quickly.  The patch
      therefore also introduces a simple loop into isolate_migratepages() so
      that it does not return immediately on failed pageblock checks, but
      keeps going until isolate_migratepages_block() gets called once.
      Similarly to isolate_freepages(), the function periodically checks
      whether it needs to reschedule or abort async compaction.
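
      A simplified sketch of the new loop (loop bounds and abort conditions
      are illustrative):

        /* in isolate_migratepages(): walk pageblocks until one yields work */
        for (; block_end_pfn <= cc->free_pfn;
             low_pfn = block_end_pfn, block_end_pfn += pageblock_nr_pages) {

                page = pageblock_pfn_to_page(low_pfn, block_end_pfn, zone);
                if (!page)
                        continue;

                /* pageblock checks moved up from the old _range() function */
                if (!isolation_suitable(cc, page))
                        continue;

                low_pfn = isolate_migratepages_block(cc, low_pfn, block_end_pfn,
                                                     isolate_mode);
                if (!low_pfn || cc->contended)
                        return ISOLATE_ABORT;

                break;  /* _block() has run once; let compact_zone() migrate */
        }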
      
      [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edc2ca61
  17. 07 Aug, 2014 1 commit
  18. 04 Jun, 2014 5 commits
  19. 07 Apr, 2014 1 commit