1. 04 Oct, 2017 1 commit
  2. 09 Sep, 2017 1 commit
    • Jérôme Glisse's avatar
      mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      Platform with advance system bus (like CAPI or CCIX) allow device memory
      to be accessible from CPU in a cache coherent fashion.  Add a new type of
      ZONE_DEVICE to represent such memory.  The use case are the same as for
      the un-addressable device memory but without all the corners cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df6ad698
  3. 07 Sep, 2017 3 commits
  4. 10 Jul, 2017 1 commit
  5. 06 Jul, 2017 1 commit
  6. 03 May, 2017 1 commit
  7. 01 May, 2017 1 commit
    • Dan Williams's avatar
      mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
      Dan Williams authored
      The x86 conversion to the generic GUP code included a small change which causes
      crashes and data corruption in the pmem code - not good.
      
      The root cause is that the /dev/pmem driver code implicitly relies on the x86
      get_user_pages() implementation doing a get_page() on the page refcount, because
      get_page() does a get_zone_device_page() which properly refcounts pmem's separate
      page struct arrays that are not present in the regular page struct structures.
      (The pmem driver does this because it can cover huge memory areas.)
      
      But the x86 conversion to the generic GUP code changed the get_page() to
      page_cache_get_speculative() which is faster but doesn't do the
      get_zone_device_page() call the pmem code relies on.
      
      One way to solve the regression would be to change the generic GUP code to use
      get_page(), but that would slow things down a bit and punish other generic-GUP
      using architectures for an x86-ism they did not care about. (Arguably the pmem
      driver was probably not working reliably for them: but nvdimm is an Intel
      feature, so non-x86 exposure is probably still limited.)
      
      So restructure the pmem code's interface with the MM instead: get rid of the
      get/put_zone_device_page() distinction, integrate put_zone_device_page() into
      __put_page() and and restructure the pmem completion-wait and teardown machinery:
      
      Kirill points out that the calls to {get,put}_dev_pagemap() can be
      removed from the mm fast path if we take a single get_dev_pagemap()
      reference to signify that the page is alive and use the final put of the
      page to drop that reference.
      
      This does require some care to make sure that any waits for the
      percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
      since it now maintains its own elevated reference.
      
      This speeds up things while also making the pmem refcounting more robust going
      forward.
      Suggested-by: default avatarKirill Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarKirill Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      71389703
  8. 08 Apr, 2017 1 commit
  9. 25 Feb, 2017 1 commit
    • Johannes Weiner's avatar
      mm: vmscan: move dirty pages out of the way until they're flushed · c55e8d03
      Johannes Weiner authored
      We noticed a performance regression when moving hadoop workloads from
      3.10 kernels to 4.0 and 4.6.  This is accompanied by increased pageout
      activity initiated by kswapd as well as frequent bursts of allocation
      stalls and direct reclaim scans.  Even lowering the dirty ratios to the
      equivalent of less than 1% of memory would not eliminate the issue,
      suggesting that dirty pages concentrate where the scanner is looking.
      
      This can be traced back to recent efforts of thrash avoidance.  Where
      3.10 would not detect refaulting pages and continuously supply clean
      cache to the inactive list, a thrashing workload on 4.0+ will detect and
      activate refaulting pages right away, distilling used-once pages on the
      inactive list much more effectively.  This is by design, and it makes
      sense for clean cache.  But for the most part our workload's cache
      faults are refaults and its use-once cache is from streaming writes.  We
      end up with most of the inactive list dirty, and we don't go after the
      active cache as long as we have use-once pages around.
      
      But waiting for writes to avoid reclaiming clean cache that *might*
      refault is a bad trade-off.  Even if the refaults happen, reads are
      faster than writes.  Before getting bogged down on writeback, reclaim
      should first look at *all* cache in the system, even active cache.
      
      To accomplish this, activate pages that are dirty or under writeback
      when they reach the end of the inactive LRU.  The pages are marked for
      immediate reclaim, meaning they'll get moved back to the inactive LRU
      tail as soon as they're written back and become reclaimable.  But in the
      meantime, by reducing the inactive list to only immediately reclaimable
      pages, we allow the scanner to deactivate and refill the inactive list
      with clean cache from the active list tail to guarantee forward
      progress.
      
      [hannes@cmpxchg.org: update comment]
        Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c55e8d03
  10. 23 Feb, 2017 1 commit
    • Huang, Ying's avatar
      mm/swap: split swap cache into 64MB trunks · 4b3ef9da
      Huang, Ying authored
      The patch is to improve the scalability of the swap out/in via using
      fine grained locks for the swap cache.  In current kernel, one address
      space will be used for each swap device.  And in the common
      configuration, the number of the swap device is very small (one is
      typical).  This causes the heavy lock contention on the radix tree of
      the address space if multiple tasks swap out/in concurrently.
      
      But in fact, there is no dependency between pages in the swap cache.  So
      that, we can split the one shared address space for each swap device
      into several address spaces to reduce the lock contention.  In the
      patch, the shared address space is split into 64MB trunks.  64MB is
      chosen to balance the memory space usage and effect of lock contention
      reduction.
      
      The size of struct address_space on x86_64 architecture is 408B, so with
      the patch, 6528B more memory will be used for every 1GB swap space on
      x86_64 architecture.
      
      One address space is still shared for the swap entries in the same 64M
      trunks.  To avoid lock contention for the first round of swap space
      allocation, the order of the swap clusters in the initial free clusters
      list is changed.  The swap space distance between the consecutive swap
      clusters in the free cluster list is at least 64M.  After the first
      round of allocation, the swap clusters are expected to be freed
      randomly, so the lock contention should be reduced effectively.
      
      Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b3ef9da
  11. 25 Dec, 2016 1 commit
    • Nicholas Piggin's avatar
      mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Nicholas Piggin authored
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clears the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62906027
  12. 08 Oct, 2016 1 commit
    • Aaron Lu's avatar
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu authored
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.comSigned-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
  13. 28 Jul, 2016 3 commits
  14. 26 Jul, 2016 1 commit
  15. 25 Jun, 2016 1 commit
    • Lukasz Odzioba's avatar
      mm/swap.c: flush lru pvecs on compound page arrival · 8f182270
      Lukasz Odzioba authored
      Currently we can have compound pages held on per cpu pagevecs, which
      leads to a lot of memory unavailable for reclaim when needed.  In the
      systems with hundreads of processors it can be GBs of memory.
      
      On of the way of reproducing the problem is to not call munmap
      explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
      that some pages (with THP enabled also huge pages) may end up on
      lru_add_pvec, example below.
      
        void main() {
        #pragma omp parallel
        {
      	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
      	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
      		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
      	if (p != MAP_FAILED)
      		memset(p, 0, size);
      	//munmap(p, size); // uncomment to make the problem go away
        }
        }
      
      When we run it with THP enabled it will leave significant amount of
      memory on lru_add_pvec.  This memory will be not reclaimed if we hit
      OOM, so when we run above program in a loop:
      
      	for i in `seq 100`; do ./a.out; done
      
      many processes (95% in my case) will be killed by OOM.
      
      The primary point of the LRU add cache is to save the zone lru_lock
      contention with a hope that more pages will belong to the same zone and
      so their addition can be batched.  The huge page is already a form of
      batched addition (it will add 512 worth of memory in one go) so skipping
      the batching seems like a safer option when compared to a potential
      excess in the caching which can be quite large and much harder to fix
      because lru_add_drain_all is way to expensive and it is not really clear
      what would be a good moment to call it.
      
      Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
      madvise(p, size, MADV_FREE); after memset.
      
      This patch flushes lru pvecs on compound page arrival making the problem
      less severe - after applying it kill rate of above example drops to 0%,
      due to reducing maximum amount of memory held on pvec from 28MB (with
      THP) to 56kB per CPU.
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Ming Li <mingli199x@qq.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f182270
  16. 09 Jun, 2016 1 commit
    • Wang Sheng-Hui's avatar
      mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all · f3a932ba
      Wang Sheng-Hui authored
      This patch is based on https://patchwork.ozlabs.org/patch/574623/.
      
      Tejun submitted commit 23d11a58 ("workqueue: skip flush dependency
      checks for legacy workqueues") for the legacy create*_workqueue()
      interface.
      
      But some workq created by alloc_workqueue still reports warning on
      memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set:
      
          workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
          ------------[ cut here ]------------
          WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c
          ...
          check_flush_dependency+0xb4/0x10c
          flush_work+0x54/0x140
          lru_add_drain_all+0x138/0x188
          migrate_prep+0xc/0x18
          alloc_contig_range+0xf4/0x350
          cma_alloc+0xec/0x1e4
          dma_alloc_from_contiguous+0x38/0x40
          __dma_alloc+0x74/0x25c
          nvme_alloc_queue+0xcc/0x36c
          nvme_reset_work+0x5c4/0xda8
          process_one_work+0x128/0x2ec
          worker_thread+0x58/0x434
          kthread+0xd4/0xe8
          ret_from_fork+0x10/0x50
      
      That's because lru_add_drain_all() will schedule the drain work on
      system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM.
      
      Introduce a dedicated WQ_MEM_RECLAIM workqueue to do
      lru_add_drain_all(), aiding in getting memory freed.
      
      Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.comSigned-off-by: default avatarWang Sheng-Hui <shhuiw@foxmail.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3a932ba
  17. 21 May, 2016 1 commit
  18. 29 Apr, 2016 1 commit
  19. 04 Apr, 2016 2 commits
    • Kirill A. Shutemov's avatar
      mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov authored
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea1754a0
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  20. 16 Jan, 2016 4 commits
    • Dan Williams's avatar
      mm, x86: get_user_pages() for dax mappings · 3565fce3
      Dan Williams authored
      A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
      has established a devm_memremap_pages() mapping, i.e.  when the pfn_t
      return from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
      encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
      struct dev_pagemap instance to keep the result of pfn_to_page() valid
      until put_page().
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3565fce3
    • Minchan Kim's avatar
      mm: move lazily freed pages to inactive list · 10853a03
      Minchan Kim authored
      MADV_FREE is a hint that it's okay to discard pages if there is memory
      pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
      them so there is no value keeping them in the active anonymous LRU so
      this patch moves them to inactive LRU list's head.
      
      This means that MADV_FREE-ed pages which were living on the inactive
      list are reclaimed first because they are more likely to be cold rather
      than recently active pages.
      
      An arguable issue for the approach would be whether we should put the
      page to the head or tail of the inactive list.  I chose head because the
      kernel cannot make sure it's really cold or warm for every MADV_FREE
      usecase but at least we know it's not *hot*, so landing of inactive head
      would be a comprimise for various usecases.
      
      This fixes suboptimal behavior of MADV_FREE when pages living on the
      active list will sit there for a long time even under memory pressure
      while the inactive list is reclaimed heavily.  This basically breaks the
      whole purpose of using MADV_FREE to help the system to free memory which
      is might not be used.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10853a03
    • Kirill A. Shutemov's avatar
      thp: allow mlocked THP again · e90309c9
      Kirill A. Shutemov authored
      Before THP refcounting rework, THP was not allowed to cross VMA
      boundary.  So, if we have THP and we split it, PG_mlocked can be safely
      transferred to small pages.
      
      With new THP refcounting and naive approach to mlocking we can end up
      with this scenario:
       1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
       2. the process does munlock() on the *part* of the THP:
            - the VMA is split into two, one of them VM_LOCKED;
            - huge PMD split into PTE table;
            - THP is still mlocked;
       3. split_huge_page():
            - it transfers PG_mlocked to *all* small pages regrardless if it
      	blong to any VM_LOCKED VMA.
      
      We probably could munlock() all small pages on split_huge_page(), but I
      think we have accounting issue already on step two.
      
      Instead of forbidding mlocked pages altogether, we just avoid mlocking
      PTE-mapped THPs and munlock THPs on split_huge_pmd().
      
      This means PTE-mapped THPs will be on normal lru lists and will be split
      under memory pressure by vmscan.  After the split vmscan will detect
      unevictable small pages and mlock them.
      
      With this approach we shouldn't hit situation like described above.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e90309c9
    • Kirill A. Shutemov's avatar
      mm: drop tail page refcounting · ddc58f27
      Kirill A. Shutemov authored
      Tail page refcounting is utterly complicated and painful to support.
      
      It uses ->_mapcount on tail pages to store how many times this page is
      pinned.  get_page() bumps ->_mapcount on tail page in addition to
      ->_count on head.  This information is required by split_huge_page() to
      be able to distribute pins from head of compound page to tails during
      the split.
      
      We will need ->_mapcount to account PTE mappings of subpages of the
      compound page.  We eliminate need in current meaning of ->_mapcount in
      tail pages by forbidding split entirely if the page is pinned.
      
      The only user of tail page refcounting is THP which is marked BROKEN for
      now.
      
      Let's drop all this mess.  It makes get_page() and put_page() much
      simpler.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ddc58f27
  21. 07 Nov, 2015 1 commit
    • Kirill A. Shutemov's avatar
      mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov authored
      Hugh has pointed that compound_head() call can be unsafe in some
      context. There's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is pure theoretical. I don't it's possible to trigger it in
      practice. But who knows.
      
      We can fix the race by changing how encode PageTail() and compound_head()
      within struct page to be able to update them in one shot.
      
      The patch introduces page->compound_head into third double word block in
      front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
      the rest bits are pointer to head page if bit zero is set.
      
      The patch moves page->pmd_huge_pte out of word, just in case if an
      architecture defines pgtable_t into something what can have the bit 0
      set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store
      pointer struct hugetlb_cgroup. The patch switch it to use page->private
      in the second tail page instead. The space is free since ->first_page is
      removed from the union.
      
      The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
      limitation, since there's now space in first tail page to store struct
      hugetlb_cgroup pointer. But that's out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long list to be absolutely sure, but looks like nobody uses
      bit 0 of the word.
      
      page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
      call_rcu_lazy() is not allowed as it makes use of the bit and we can
      get false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d798ca3
  22. 10 Sep, 2015 1 commit
    • Vladimir Davydov's avatar
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov authored
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: default avatarAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  23. 25 Jun, 2015 1 commit
    • Kirill A. Shutemov's avatar
      mm: drop bogus VM_BUG_ON_PAGE assert in put_page() codepath · 73933b33
      Kirill A. Shutemov authored
      My commit 8d63d99a ("mm: avoid tail page refcounting on non-THP
      compound pages") which was merged during 4.1 merge window caused
      regression:
      
        page:ffffea0010a15040 count:0 mapcount:1 mapping:          (null) index:0x0
        flags: 0x8000000000008014(referenced|dirty|tail)
        page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) != 0)
        ------------[ cut here ]------------
        kernel BUG at mm/swap.c:134!
      
      The problem can be reproduced by playing *two* audio files at the same
      time and then stopping one of players.  I used two mplayers to trigger
      this.
      
      The VM_BUG_ON_PAGE() which triggers the bug is bogus:
      
      Sound subsystem uses compound pages for its buffers, but unlike most
      __GFP_COMP sound maps compound pages to userspace with PTEs.
      
      In our case with two players map the buffer twice and therefore elevates
      page_mapcount() on tail pages by two.  When one of players exits it
      unmaps the VMA and drops page_mapcount() to one and try to release
      reference on the page with put_page().
      
      My commit changes which path it takes under put_compound_page().  It hits
      put_unrefcounted_compound_page() where VM_BUG_ON_PAGE() is.  It sees
      page_mapcount() == 1.  The function wrongly assumes that subpages of
      compound page cannot be be mapped by itself with PTEs..
      
      The solution is simply drop the VM_BUG_ON_PAGE().
      
      Note: there's no need to move the check under put_page_testzero().
      Allocator will check the mapcount by itself before putting on free list.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73933b33
  24. 15 Apr, 2015 2 commits
  25. 10 Feb, 2015 1 commit
  26. 20 Jan, 2015 1 commit
  27. 10 Oct, 2014 1 commit
    • Michal Hocko's avatar
      mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache · aabfb572
      Michal Hocko authored
      free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
      This is not a big deal for the normal release path but it completely kills
      memcg uncharge batching which reduces res_counter spin_lock contention.
      Dave has noticed this with his page fault scalability test case on a large
      machine when the lock was basically dominating on all CPUs:
      
          80.18%    80.18%  [kernel]               [k] _raw_spin_lock
                        |
                        --- _raw_spin_lock
                           |
                           |--66.59%-- res_counter_uncharge_until
                           |          res_counter_uncharge
                           |          uncharge_batch
                           |          uncharge_list
                           |          mem_cgroup_uncharge_list
                           |          release_pages
                           |          free_pages_and_swap_cache
                           |          tlb_flush_mmu_free
                           |          |
                           |          |--90.12%-- unmap_single_vma
                           |          |          unmap_vmas
                           |          |          unmap_region
                           |          |          do_munmap
                           |          |          vm_munmap
                           |          |          sys_munmap
                           |          |          system_call_fastpath
                           |          |          __GI___munmap
                           |          |
                           |           --9.88%-- tlb_flush_mmu
                           |                     tlb_finish_mmu
                           |                     unmap_region
                           |                     do_munmap
                           |                     vm_munmap
                           |                     sys_munmap
                           |                     system_call_fastpath
                           |                     __GI___munmap
      
      In his case the load was running in the root memcg and that part has been
      handled by reverting 05b84301 ("mm: memcontrol: use root_mem_cgroup
      res_counter") because this is a clear regression, but the problem remains
      inside dedicated memcgs.
      
      There is no reason to limit release_pages to PAGEVEC_SIZE batches other
      than lru_lock held times.  This logic, however, can be moved inside the
      function.  mem_cgroup_uncharge_list and free_hot_cold_page_list do not
      hold any lock for the whole pages_to_free list so it is safe to call them
      in a single run.
      
      The release_pages() code was previously breaking the lru_lock each
      PAGEVEC_SIZE pages (ie, 14 pages).  However this code has no usage of
      pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
      (32) pages.  This means that the lock acquisition frequency is
      approximately halved and the max hold times are approximately doubled.
      
      The now unneeded batching is removed from free_pages_and_swap_cache().
      
      Also update the grossly out-of-date release_pages documentation.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarDave Hansen <dave@sr71.net>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aabfb572
  28. 08 Aug, 2014 3 commits
    • Johannes Weiner's avatar
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner authored
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • Johannes Weiner's avatar
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner authored
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: default avatarJet Chen <jet.chen@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Tested-by: default avatarFelipe Balbi <balbi@ti.com>
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • Johannes Weiner's avatar
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner authored
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  29. 07 Aug, 2014 1 commit
    • Mel Gorman's avatar
      mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated · 24b7e581
      Mel Gorman authored
      This was formerly the series "Improve sequential read throughput" which
      noted some major differences in performance of tiobench since 3.0.
      While there are a number of factors, two that dominated were the
      introduction of the fair zone allocation policy and changes to CFQ.
      
      The behaviour of fair zone allocation policy makes more sense than
      tiobench as a benchmark and CFQ defaults were not changed due to
      insufficient benchmarking.
      
      This series is what's left.  It's one functional fix to the fair zone
      allocation policy when used on NUMA machines and a reduction of overhead
      in general.  tiobench was used for the comparison despite its flaws as
      an IO benchmark as in this case we are primarily interested in the
      overhead of page allocator and page reclaim activity.
      
      On UMA, it makes little difference to overhead
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          383.61      386.77
      System        403.83      401.74
      Elapsed      5411.50     5413.11
      
      On a 4-socket NUMA machine it's a bit more noticable
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          746.94      802.00
      System      65336.22    40852.33
      Elapsed     27553.52    27368.46
      
      This patch (of 6):
      
      The LRU insertion and activate tracepoints take PFN as a parameter
      forcing the overhead to the caller.  Move the overhead to the tracepoint
      fast-assign method to ensure the cost is only incurred when the
      tracepoint is active.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24b7e581