1. 24 Apr, 2012 1 commit
    • Hugh Dickins's avatar
      mm: fix s390 BUG by __set_page_dirty_no_writeback on swap · aca50bd3
      Hugh Dickins authored
      Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
      3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
      tries to transfer dirty flag from s390 storage key to struct page and
      radix_tree.
      
      That would be because of reclaim's shrink_page_list() calling
      add_to_swap() on this page at the same time: first PageSwapCache is set
      (causing page_mapping(page) to appear as &swapper_space), then
      page->private set, then tree_lock taken, then page inserted into
      radix_tree - so there's an interval before taking the lock when the
      radix_tree slot is empty.
      
      We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
      before the SetPageSwapCache.  But a better fix is simply to do what's
      five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
      (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
      overhead, and swap is just the same - it ignores the radix_tree tag, and
      does not participate in dirty page accounting, so should be using
      __set_page_dirty_no_writeback() too.
      
      s390 testing now confirms that this does indeed fix the problem.
      Reported-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aca50bd3
  2. 22 Mar, 2012 1 commit
    • Rik van Riel's avatar
      mm: make swapin readahead skip over holes · 67f96aa2
      Rik van Riel authored
      Ever since abandoning the virtual scan of processes, for scalability
      reasons, swap space has been a little more fragmented than before.  This
      can lead to the situation where a large memory user is killed, swap space
      ends up full of "holes" and swapin readahead is totally ineffective.
      
      On my home system, after killing a leaky firefox it took over an hour to
      page just under 2GB of memory back in, slowing the virtual machines down
      to a crawl.
      
      This patch makes swapin readahead simply skip over holes, instead of
      stopping at them.  This allows the system to swap things back in at rates
      of several MB/second, instead of a few hundred kB/second.
      
      The checks done in valid_swaphandles are already done in
      read_swap_cache_async as well, allowing us to remove a fair amount of
      code.
      
      [akpm@linux-foundation.org: fix it for page_cluster >= 32]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Adrian Drzewiecki <z@drze.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67f96aa2
  3. 05 Mar, 2012 1 commit
    • Hugh Dickins's avatar
      memcg: fix GPF when cgroup removal races with last exit · 7512102c
      Hugh Dickins authored
      When moving tasks from old memcg (with move_charge_at_immigrate on new
      memcg), followed by removal of old memcg, hit General Protection Fault in
      mem_cgroup_lru_del_list() (called from release_pages called from
      free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
      exit_mmap from mmput from exit_mm from do_exit).
      
      Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
      been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
      still trying to update its stats, and take page off lru before freeing.
      
      A task, or a charge, or a page on lru: each secures a memcg against
      removal.  In this case, the last task has been moved out of the old memcg,
      and it is exiting: anonymous pages are uncharged one by one from the
      memcg, as they are zapped from its pagetables, so the charge gets down to
      0; but the pages themselves are queued in an mmu_gather for freeing.
      
      Most of those pages will be on lru (and force_empty is careful to
      lru_add_drain_all, to add pages from pagevec to lru first), but not
      necessarily all: perhaps some have been isolated for page reclaim, perhaps
      some isolated for other reasons.  So, force_empty may find no task, no
      charge and no page on lru, and let the removal proceed.
      
      There would still be no problem if these pages were immediately freed; but
      typically (and the put_page_testzero protocol demands it) they have to be
      added back to lru before they are found freeable, then removed from lru
      and freed.  We don't see the issue when adding, because the
      mem_cgroup_iter() loops keep their own reference to the memcg being
      scanned; but when it comes to mem_cgroup_lru_del_list().
      
      I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
      PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
      of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
      38c5d72f ("memcg: simplify LRU handling by new rule") mercifully
      removed those convolutions, but left this General Protection Fault.
      
      But it's surprisingly easy to restore the old behaviour: just check
      PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
      to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
      change?  just going back to how it worked before; testing, and an audit of
      uses of pc->mem_cgroup, show no problem.
      
      And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
      sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
      longer has any purpose, and we can safely revert 4e5f01c2 ("memcg:
      clear pc->mem_cgroup if necessary").
      
      Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
      is not strictly necessary: the lru_lock there, with RCU before memcg
      structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
      without that; but it seems cleaner to rely on one dependency less.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7512102c
  4. 13 Jan, 2012 1 commit
    • KAMEZAWA Hiroyuki's avatar
      memcg: clear pc->mem_cgroup if necessary. · 4e5f01c2
      KAMEZAWA Hiroyuki authored
      This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
      and reducing atomic ops/complexity in memcg LRU handling.
      
      In some cases, pages are added to lru before charge to memcg and pages
      are not classfied to memory cgroup at lru addtion.  Now, the lru where
      the page should be added is determined a bit in page_cgroup->flags and
      pc->mem_cgroup.  I'd like to remove the check of flag.
      
      To handle the case pc->mem_cgroup may contain stale pointers if pages
      are added to LRU before classification.  This patch resets
      pc->mem_cgroup to root_mem_cgroup before lru additions.
      
      [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
      [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
      [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
      [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
      [hughd@google.com: fix page migration to reset_owner]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e5f01c2
  5. 04 Jan, 2012 1 commit
  6. 31 Oct, 2011 1 commit
  7. 10 Mar, 2011 1 commit
  8. 14 Jan, 2011 1 commit
    • Andrea Arcangeli's avatar
      thp: split_huge_page paging · 3f04f62f
      Andrea Arcangeli authored
      Paging logic that splits the page before it is unmapped and added to swap
      to ensure backwards compatibility with the legacy swap code.  Eventually
      swap should natively pageout the hugepages to increase performance and
      decrease seeking and fragmentation of swap space.  swapoff can just skip
      over huge pmd as they cannot be part of swap yet.  In add_to_swap be
      careful to split the page only if we got a valid swap entry so we don't
      split hugepages with a full swap.
      
      In theory we could split pages before isolating them during the lru scan,
      but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
      or PG_lock taken, so split_huge_page has to run either with mmap_sem
      read/write mode or PG_lock taken.  Calling it from isolate_lru_page would
      make locking more complicated, in addition to that split_huge_page would
      deadlock if called by __isolate_lru_page because it has to take the lru
      lock to add the tail pages.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f04f62f
  9. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Guess-its-ok-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  10. 22 Sep, 2009 2 commits
    • Daisuke Nishimura's avatar
      mm: add_to_swap_cache() does not return -EEXIST · 2ca4532a
      Daisuke Nishimura authored
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      only the context which have set SWAP_HAS_CACHE flag by swapcache_prepare()
      or get_swap_page() would call add_to_swap_cache().  So add_to_swap_cache()
      doesn't return -EEXIST any more.
      
      Even though it doesn't return -EEXIST, it's not good behavior conceptually
      to call swapcache_prepare() in the -EEXIST case, because it means clearing
      SWAP_HAS_CACHE flag while the entry is on swap cache.
      
      This patch removes redundant codes and comments from callers of it, and
      adds VM_BUG_ON() in error path of add_to_swap_cache() and some comments.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ca4532a
    • Daisuke Nishimura's avatar
      mm: add_to_swap_cache() must not sleep · 31a56396
      Daisuke Nishimura authored
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
      read_swap_cache_async() will busy-wait while a entry doesn't exist in swap
      cache but it has SWAP_HAS_CACHE flag.
      
      Such entries can exist on add/delete path of swap cache.  On add path,
      add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and
      on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is
      cleared) soon after __delete_from_swap_cache() is called.  So, the
      busy-wait works well in most cases.
      
      But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and
      read_swap_cache_async() tries to swap-in the same entry on the same cpu.
      
      This patch calls radix_tree_preload() before swapcache_prepare() and
      divides add_to_swap_cache() into two part: radix_tree_preload() part and
      radix_tree_insert() part(define it as __add_to_swap_cache()).
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31a56396
  11. 11 Sep, 2009 1 commit
  12. 17 Jun, 2009 4 commits
  13. 29 May, 2009 1 commit
  14. 08 Jan, 2009 2 commits
    • KAMEZAWA Hiroyuki's avatar
      memcg: mem+swap controller core · 8c7c6e34
      KAMEZAWA Hiroyuki authored
      This patch implements per cgroup limit for usage of memory+swap.  However
      there are SwapCache, double counting of swap-cache and swap-entry is
      avoided.
      
      Mem+Swap controller works as following.
        - memory usage is limited by memory.limit_in_bytes.
        - memory + swap usage is limited by memory.memsw_limit_in_bytes.
      
      This has following benefits.
        - A user can limit total resource usage of mem+swap.
      
          Without this, because memory resource controller doesn't take care of
          usage of swap, a process can exhaust all the swap (by memory leak.)
          We can avoid this case.
      
          And Swap is shared resource but it cannot be reclaimed (goes back to memory)
          until it's used. This characteristic can be trouble when the memory
          is divided into some parts by cpuset or memcg.
          Assume group A and group B.
          After some application executes, the system can be..
      
          Group A -- very large free memory space but occupy 99% of swap.
          Group B -- under memory shortage but cannot use swap...it's nearly full.
      
          Ability to set appropriate swap limit for each group is required.
      
      Maybe someone wonder "why not swap but mem+swap ?"
      
        - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
          to move account from memory to swap...there is no change in usage of
          mem+swap.
      
          In other words, when we want to limit the usage of swap without affecting
          global LRU, mem+swap limit is better than just limiting swap.
      
      Accounting target information is stored in swap_cgroup which is
      per swap entry record.
      
      Charge is done as following.
        map
          - charge  page and memsw.
      
        unmap
          - uncharge page/memsw if not SwapCache.
      
        swap-out (__delete_from_swap_cache)
          - uncharge page
          - record mem_cgroup information to swap_cgroup.
      
        swap-in (do_swap_page)
          - charged as page and memsw.
            record in swap_cgroup is cleared.
            memsw accounting is decremented.
      
        swap-free (swap_free())
          - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
      
      There are people work under never-swap environments and consider swap as
      something bad. For such people, this mem+swap controller extension is just an
      overhead.  This overhead is avoided by config or boot option.
      (see Kconfig. detail is not in this patch.)
      
      TODO:
       - maybe more optimization can be don in swap-in path. (but not very safe.)
         But we just do simple accounting at this stage.
      
      [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
      [hugh@veritas.com: memswap controller core swapcache fixes]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c7c6e34
    • KAMEZAWA Hiroyuki's avatar
      memcg: handle swap caches · d13d1443
      KAMEZAWA Hiroyuki authored
      SwapCache support for memory resource controller (memcg)
      
      Before mem+swap controller, memcg itself should handle SwapCache in proper
      way.  This is cut-out from it.
      
      In current memcg, SwapCache is just leaked and the user can create tons of
      SwapCache.  This is a leak of account and should be handled.
      
      SwapCache accounting is done as following.
      
        charge (anon)
      	- charged when it's mapped.
      	  (because of readahead, charge at add_to_swap_cache() is not sane)
        uncharge (anon)
      	- uncharged when it's dropped from swapcache and fully unmapped.
      	  means it's not uncharged at unmap.
      	  Note: delete from swap cache at swap-in is done after rmap information
      	        is established.
        charge (shmem)
      	- charged at swap-in. this prevents charge at add_to_page_cache().
      
        uncharge (shmem)
      	- uncharged when it's dropped from swapcache and not on shmem's
      	  radix-tree.
      
        at migration, check against 'old page' is modified to handle shmem.
      
      Comparing to the old version discussed (and caused troubles), we have
      advantages of
        - PCG_USED bit.
        - simple migrating handling.
      
      So, situation is much easier than several months ago, maybe.
      
      [hugh@veritas.com: memcg: handle swap caches build fix]
      Reviewed-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d13d1443
  15. 06 Jan, 2009 3 commits
    • Hugh Dickins's avatar
      mm: remove gfp_mask from add_to_swap · ac47b003
      Hugh Dickins authored
      Remove gfp_mask argument from add_to_swap(): it's misleading because its
      only caller, shrink_page_list(), is not atomic at that point; and in due
      course (implementing discard) we'll sometimes want to allocate some memory
      with GFP_NOIO (as is used in swap_writepage) when allocating swap.
      
      No change to the gfp_mask passed down to add_to_swap_cache(): still use
      __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
      though it's not obvious if that's the best combination to ask for here.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac47b003
    • Hugh Dickins's avatar
      mm: try_to_free_swap replaces remove_exclusive_swap_page · a2c43eed
      Hugh Dickins authored
      remove_exclusive_swap_page(): its problem is in living up to its name.
      
      It doesn't matter if someone else has a reference to the page (raised
      page_count); it doesn't matter if the page is mapped into userspace
      (raised page_mapcount - though that hints it may be worth keeping the
      swap): all that matters is that there be no more references to the swap
      (and no writeback in progress).
      
      swapoff (try_to_unuse) has been removing pages from swapcache for years,
      with no concern for page count or page mapcount, and we used to have a
      comment in lookup_swap_cache() recognizing that: if you go for a page of
      swapcache, you'll get the right page, but it could have been removed from
      swapcache by the time you get page lock.
      
      So, give up asking for exclusivity: get rid of
      remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
      remove_exclusive_swap_page_count() which were spawned for the recent LRU
      work: replace them by the simpler try_to_free_swap() which just checks
      page_swapcount().
      
      Similarly, remove the page_count limitation from free_swap_and_count(),
      but assume that it's worth holding on to the swap if page is mapped and
      swap nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
      It would be consistent, but I think we probably have enough for now.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a2c43eed
    • Hugh Dickins's avatar
      mm: replace some BUG_ONs by VM_BUG_ONs · 51726b12
      Hugh Dickins authored
      The swap code is over-provisioned with BUG_ONs on assorted page flags,
      mostly dating back to 2.3.  They're good documentation, and guard against
      developer error, but a waste of space on most systems: change them to
      VM_BUG_ONs, conditional on CONFIG_DEBUG_VM.  Just delete the PagePrivate
      ones: they're later, from 2.5.69, but even less interesting now.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51726b12
  16. 20 Oct, 2008 4 commits
  17. 20 Aug, 2008 1 commit
  18. 05 Aug, 2008 1 commit
  19. 26 Jul, 2008 3 commits
    • Johannes Weiner's avatar
      mm: print swapcache page count in show_swap_cache_info() · 2c97b7fc
      Johannes Weiner authored
      Every arch implements its own show_mem() function.  Most of them share
      quite some code, some of them are completely identical.
      
      This series implements a generic version of this function and migrates
      almost all architectures to it.
      
      This patch:
      
      Most show_mem() implementations calculate the amount of pages within
      the swapcache every time.  Move the output to a more appropriate place
      and use the anyway available total_swapcache_pages variable.
      Signed-off-by: default avatarJohannes Weiner <hannes@saeurebad.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Bryan Wu <cooloney@kernel.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c97b7fc
    • Nick Piggin's avatar
      mm: spinlock tree_lock · 19fd6231
      Nick Piggin authored
      mapping->tree_lock has no read lockers.  convert the lock from an rwlock
      to a spinlock.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      19fd6231
    • Nick Piggin's avatar
      mm: speculative page references · e286781d
      Nick Piggin authored
      If we can be sure that elevating the page_count on a pagecache page will
      pin it, we can speculatively run this operation, and subsequently check to
      see if we hit the right page rather than relying on holding a lock or
      otherwise pinning a reference to the page.
      
      This can be done if get_page/put_page behaves consistently throughout the
      whole tree (ie.  if we "get" the page after it has been used for something
      else, we must be able to free it with a put_page).
      
      Actually, there is a period where the count behaves differently: when the
      page is free or if it is a constituent page of a compound page.  We need
      an atomic_inc_not_zero operation to ensure we don't try to grab the page
      in either case.
      
      This patch introduces the core locking protocol to the pagecache (ie.
      adds page_cache_get_speculative, and tweaks some update-side code to make
      it work).
      
      Thanks to Hugh for pointing out an improvement to the algorithm setting
      page_count to zero when we have control of all references, in order to
      hold off speculative getters.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
      [hugh@veritas.com: fix add_to_page_cache]
      [akpm@linux-foundation.org: repair a comment]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e286781d
  20. 30 Apr, 2008 1 commit
  21. 20 Mar, 2008 1 commit
  22. 07 Feb, 2008 5 commits
    • Hugh Dickins's avatar
      memcgroup: revert swap_state mods · fa1de900
      Hugh Dickins authored
      If we're charging rss and we're charging cache, it seems obvious that we
      should be charging swapcache - as has been done.  But in practice that
      doesn't work out so well: both swapin readahead and swapoff leave the
      majority of pages charged to the wrong cgroup (the cgroup that happened to
      read them in, rather than the cgroup to which they belong).
      
      (Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up
      as a problem: no allocation was ever done there, every page read being
      already charged to the cgroup which initiated the swapoff.)
      
      It all works rather better if we leave the charging to do_swap_page and
      unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to
      what it was before the memory-controller patches.  This also speeds up
      significantly a contained process working at its limit: because it no
      longer needs to keep waiting for swap writeback to complete.
      
      Is it unfair that swap pages become uncharged once they're unmapped, even
      though they're still clearly private to particular cgroups?  For a short
      while, yes; but PageReclaim arranges for those pages to go to the end of
      the inactive list and be reclaimed soon if necessary.
      
      shmem/tmpfs pages are a distinct case: their charging also benefits from
      this change, but their second life on the lists as swapcache pages may
      prove more unfair - that I need to check next.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa1de900
    • Balbir Singh's avatar
      memory controller BUG_ON() · 35c754d7
      Balbir Singh authored
      Move mem_controller_cache_charge() above radix_tree_preload().
      radix_tree_preload() disables preemption, even though the gfp_mask passed
      contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations, thus we
      hit a BUG_ON() in kmem_cache_alloc().
      
      This patch moves mem_controller_cache_charge() to above radix_tree_preload()
      for cache charging.
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      35c754d7
    • Balbir Singh's avatar
      Memory controller: make charging gfp mask aware · e1a1cd59
      Balbir Singh authored
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over it's limit, in which case a suitable error is returned.
      
      This patch was tested on a Powerpc box.  I am still looking at being able
      to test the path, through which allocations happen in non GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1a1cd59
    • Balbir Singh's avatar
      Memory controller: add switch to control what type of pages to limit · 8697d331
      Balbir Singh authored
      Choose if we want cached pages to be accounted or not.  By default both are
      accounted for.  A new set of tunables are added.
      
      echo -n 1 > mem_control_type
      
      switches the accounting to account for only mapped pages
      
      echo -n 3 > mem_control_type
      
      switches the behaviour back
      
      [bunk@kernel.org: mm/memcontrol.c: clenups]
      [akpm@linux-foundation.org: fix sparc32 build]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8697d331
    • Balbir Singh's avatar
      Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh authored
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache(),
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer, this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: default avatarVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a9f3ccd
  23. 05 Feb, 2008 2 commits
    • Nick Piggin's avatar
      mm: fix PageUptodate data race · 0ed361de
      Nick Piggin authored
      After running SetPageUptodate, preceeding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, wheras flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed361de
    • Hugh Dickins's avatar
      tmpfs: move swap swizzling into shmem · 73b1262f
      Hugh Dickins authored
      move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
      between tmpfs page cache and swap cache, to avoid page copying) are only used
      by shmem.c; and our subsequent fix for unionfs needs different treatments in
      the two instances of move_from_swap_cache.  Move them from swap_state.c into
      their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
      add_to_swap_cache externally visible.
      
      shmem.c likes to say set_page_dirty where swap_state.c liked to say
      SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
      makes moot (and implies we should lose that "shift page from clean_pages to
      dirty_pages list" comment: it's on neither).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73b1262f