    memcg, vmscan: integrate soft reclaim tighter with zone shrinking code · 3b38722e
    Michal Hocko authored
    This patchset has been sitting out of tree for quite some time without
    any objections.  I would be really happy if it made it into 3.12.  I do
    not want to push it too hard, but I think this work is basically ready
    and waiting longer doesn't help.
    
    The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
    first step and get rid of the previous soft reclaim infrastructure.
    shrink_zone is now done in two passes.  First it tries to do the soft
    limit reclaim and falls back to reclaim-all mode if no group is over
    the limit or no pages have been scanned.  The second pass happens at
    the same priority, so the only time we waste is the memcg tree walk,
    which the third step reduces to a negligible overhead.
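    
    To make the flow concrete, here is a condensed kernel-style sketch of
    the reworked shrink_zone as it looks after the series (a simplified
    sketch, not verbatim kernel code; mem_cgroup_should_soft_reclaim is the
    series' helper deciding whether the soft pass applies, and the
    scan-accounting details are omitted):
    
        static void shrink_zone(struct zone *zone, struct scan_control *sc)
        {
                unsigned long nr_scanned = sc->nr_scanned;
                bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
    
                /* first pass: visit only groups (transitively) over their soft limit */
                __shrink_zone(zone, sc, do_soft_reclaim);
    
                /*
                 * No group is over the soft limit, or the over-limit groups
                 * have no pages in this zone: reclaim everybody at the same
                 * priority.
                 */
                if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
                        __shrink_zone(zone, sc, false);
        }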
    
    As a bonus we will get rid of a _lot_ of code, and soft reclaim will not
    stand out like before, when it wasn't integrated into the zone shrinking
    code and reclaimed at priority 0 (the testing results show that some
    workloads suffer from such an aggressive reclaim).  The cleanup is in a
    separate patch because I felt it would be easier to review that way.
    
    The second step is soft limit reclaim integration into targeted reclaim.
    It should be rather straightforward.  Soft limit has been used only for
    the global reclaim so far, but it makes sense for any kind of pressure
    coming from up the hierarchy, including targeted reclaim.
    
    The third step (patches 4-8) addresses the tree walk overhead by
    enhancing memcg iterators to allow skipping whole subtrees and by
    tracking the number of children in soft limit excess at each level of
    the hierarchy.  This information is updated the same way the old soft
    limit tree was updated (from memcg_check_events), so we shouldn't see
    any additional overhead.  In fact mem_cgroup_update_soft_limit is much
    simpler than the tree manipulation done previously.
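    
    For illustration, a condensed sketch of that update (the
    children_in_excess and soft_contributed names follow the series;
    locking and the root-cgroup bookkeeping are left out):
    
        static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
        {
                unsigned long long excess = res_counter_soft_limit_excess(&memcg->res);
                struct mem_cgroup *parent;
    
                /* nothing to do if the excess state hasn't flipped */
                if (!!excess == memcg->soft_contributed)
                        return;
                memcg->soft_contributed = !!excess;
    
                /*
                 * Propagate the change up so every ancestor knows how many
                 * of its children are in soft limit excess.
                 */
                for (parent = parent_mem_cgroup(memcg); parent;
                                parent = parent_mem_cgroup(parent))
                        atomic_add(excess ? 1 : -1, &parent->children_in_excess);
        }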
    
    __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
    mem_cgroup_iter, so the decision whether a particular group should be
    visited is done at the iterator level, which also allows us to skip a
    whole subtree (if no child is in excess).  This reduces the tree walk
    overhead considerably.
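    
    Roughly, the predicate has the following shape after the whole series,
    once patches 4-8 turn it into an iterator filter (a condensed sketch;
    the VISIT/SKIP/SKIP_TREE values mirror the filter type the iterator
    patches introduce):
    
        static enum mem_cgroup_filter_t
        mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
                                         struct mem_cgroup *root)
        {
                struct mem_cgroup *parent = memcg;
    
                if (res_counter_soft_limit_excess(&memcg->res))
                        return VISIT;
    
                /* obey any over-limit parent up to the reclaim root */
                while ((parent = parent_mem_cgroup(parent))) {
                        if (res_counter_soft_limit_excess(&parent->res))
                                return VISIT;
                        if (parent == root)
                                break;
                }
    
                /* no child in excess: the whole subtree can be skipped */
                if (!atomic_read(&memcg->children_in_excess))
                        return SKIP_TREE;
                return SKIP;
        }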
    
    * TEST 1
    ========
    
    My primary test case was a parallel kernel build with 2 groups (make
    running with -j8 and a distribution .config, each in a separate cgroup
    without any hard limit) on a 32 CPU machine booted with 1GB memory,
    with both builds bound by taskset to Node 0 CPUs.
    
    I was mostly interested in 2 setups: default, with no soft limit set,
    and with a 0 soft limit set on both groups.  The first one should tell
    us whether the rework regresses the default behavior while the second
    one should show us improvements in an extreme case where both workloads
    are always over the soft limit.
    
    /usr/bin/time -v has been used to collect the statistics and each
    configuration had 3 runs after a fresh boot, without any other load on
    the system.
    
    base is mmotm-2013-07-18-16-40
    rework is all 8 patches applied on top of base
    
    * No-limit
    User
    no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
    no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
    System
    no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
    no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
    Elapsed
    no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
    no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6
    
    The results are within noise. Elapsed time has a bigger variance but the
    average looks good.
    
    * 0-limit
    User
    0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
    0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
    System
    0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
    0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
    Elapsed
    0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
    0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6
    
    The improvement is really huge here (even bigger than with my previous
    testing and I suspect that this highly depends on the storage).  Page
    fault statistics tell us at least part of the story:
    
    Minor
    0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
    0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
    Major
    0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
    0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6
    
    As with my previous testing, Minor faults are more or less within the
    noise but the Major fault count is way below the base kernel.
    
    While this looks like a nice win, it is fair to say that the 0-limit
    configuration is quite artificial, so I was playing with 0-no-limit
    loads as well.
    
    * TEST 2
    ========
    
    The following results are from 2 groups configuration on a 16GB machine
    (single NUMA node).
    
    - A running stream IO (dd if=/dev/zero of=local.file bs=1024) writing
      2*TotalMem, with a 0 soft limit.
    - B running a mem_eater which consumes TotalMem-1G without any limit.
      The mem_eater consumes the memory in 100 chunks with a 1s nap after
      each mmap+populate so that both loads have a chance to fight for the
      memory (see the sketch below).
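    
    The mem_eater itself isn't included in this message; a minimal
    user-space sketch matching the description above (100 chunks, a 1s nap,
    mmap+populate done via MAP_POPULATE; the size argument and the error
    handling are my assumptions) could look like:
    
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <unistd.h>
    
        int main(int argc, char **argv)
        {
                /* bytes to consume, e.g. TotalMem - 1G, given as argv[1] */
                size_t total = argc > 1 ? strtoull(argv[1], NULL, 0) : 1UL << 30;
                size_t chunk = total / 100;
                int i;
    
                for (i = 0; i < 100; i++) {
                        /* MAP_POPULATE pre-faults the chunk (mmap+populate) */
                        char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                                       -1, 0);
                        if (p == MAP_FAILED) {
                                perror("mmap");
                                return 1;
                        }
                        sleep(1);       /* let the other load fight for memory */
                }
                pause();                /* hold the memory until killed */
                return 0;
        }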
    
    The expected result is that B shouldn't be reclaimed and A shouldn't see
    a big drop in elapsed time.
    
    User
    base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
    rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
    System
    base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
    rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
    Elapsed
    base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
    rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3
    
    System time improved slightly, as did Elapsed.  My previous testing
    has shown worse numbers but this again seems to depend on the storage
    speed.
    
    My theory is that writeback doesn't catch up and prio-0 soft reclaim
    ends up waiting on writeback pages too often in the base kernel.  The
    patched kernel doesn't do that because the soft reclaim is done from the
    kswapd/direct reclaim context.  This can be seen nicely in the following
    graph: group A's usage_in_bytes regularly drops really low.
    
    All 3 runs
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
    and a detail of a single run
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png
    
    mem_eater seems to be doing better as well.  It gets to the full
    allocation size faster, as can be seen in the following graph:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png
    
    /proc/meminfo collected during the test also shows that the rework
    kernel hasn't swapped that much (well, almost not at all):
    base: max: 123900 K avg: 56388.29 K
    rework: max: 300 K avg: 128.68 K
    
    kswapd and direct reclaim statistics are of no use unfortunately,
    because soft reclaim is not accounted properly: the counters are hidden
    by global_reclaim() checks in the base kernel.
    
    * TEST 3
    ========
    
    Another test used the same configuration as TEST2, except that the
    stream IO was replaced by a single kbuild (16 parallel jobs bound to
    Node0 cpus, same as in TEST1) and mem_eater allocated TotalMem-200M so
    kbuild had only 200MB left.
    
    Kbuild did better with the rework kernel here as well:
    User
    base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
    rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
    System
    base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
    rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
    Elapsed
    base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
    rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
    Minor
    base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
    rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
    Major
    base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
    rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3
    
    Again we can see a significant improvement in Elapsed (it also seems to
    be more stable), there is a huge drop in the Major page faults, and
    much less swapping:
    base: max: 583736 K avg: 112547.43 K
    rework: max: 4012 K avg: 124.36 K
    
    Graphs from all three runs show the variability of the kbuild quite
    nicely.  It even seems that it took longer with every successive run on
    the base kernel, which would be quite surprising as the source tree for
    the build is removed and caches are dropped after each run, so the build
    operates on freshly extracted sources every time.
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png
    
    My other testing shows that this is just a matter of timing and other
    runs behave differently; the std for Elapsed time is similar, ~50.  An
    example of three other runs:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png
    
    
    
    So to wrap this up: the series is still doing well and improves the
    soft limit reclaim.
    
    The testing results for a bunch of cgroups with both stream IO and kbuild
    loads can be found in "memcg: track children in soft limit excess to
    improve soft limit".
    
    This patch:
    
    Memcg soft reclaim has traditionally been triggered from the global
    reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
    then picked up the group which exceeded the soft limit the most and
    reclaimed it at priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.
    
    The infrastructure requires per-node-zone trees which hold the
    over-limit groups and keep them up-to-date (via memcg_check_events),
    which is not cost free.  Although this overhead hasn't turned out to be
    a bottleneck, the implementation is suboptimal because
    mem_cgroup_update_tree has no idea which zones consumed memory over the
    limit, so we could easily end up with a group on a node-zone tree which
    has only a few pages from that node-zone.
    
    This patch doesn't try to fix the node-zone tree management because
    integrating soft reclaim into zone shrinking seems much easier and more
    appropriate, for several reasons.  First of all, priority-0 reclaim was
    a crude hack which might lead to big stalls if the group's LRUs are big
    and hard to reclaim (e.g.  a lot of dirty/writeback pages).  Second,
    soft reclaim should be applicable also to targeted reclaim, which is
    awkward right now without additional hacks.  Last but not least, the
    whole infrastructure eats quite some code.
    
    After this patch shrink_zone is done in 2 passes.  First it tries to do
    the soft reclaim if appropriate (only for global reclaim for now, to
    stay compatible with the original state) and falls back to ignoring the
    soft limit if no group is eligible for soft reclaim or nothing has been
    scanned during the first pass.  Only groups which are over their soft
    limit, or which have a parent up the hierarchy over the limit, are
    considered eligible during the first pass.
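    
    A condensed sketch of what this first patch does (the memcg walk is
    simplified and the reclaim cookie is omitted; later patches move the
    check into the iterator itself):
    
        static void
        __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
        {
                struct mem_cgroup *root = sc->target_mem_cgroup;
                struct mem_cgroup *memcg = mem_cgroup_iter(root, NULL, NULL);
    
                do {
                        struct lruvec *lruvec;
    
                        /* the first pass only visits eligible groups */
                        if (soft_reclaim &&
                            !mem_cgroup_soft_reclaim_eligible(memcg)) {
                                memcg = mem_cgroup_iter(root, memcg, NULL);
                                continue;
                        }
    
                        lruvec = mem_cgroup_zone_lruvec(zone, memcg);
                        shrink_lruvec(lruvec, sc);
    
                        memcg = mem_cgroup_iter(root, memcg, NULL);
                } while (memcg);
        }
    
        /* over its own soft limit, or below a parent which is over its limit */
        bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *parent = memcg;
    
                if (res_counter_soft_limit_excess(&memcg->res))
                        return true;
    
                while ((parent = parent_mem_cgroup(parent)))
                        if (res_counter_soft_limit_excess(&parent->res))
                                return true;
    
                return false;
        }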
    
    The soft limit tree, which is not necessary anymore, will be removed in
    a follow-up patch to make this patch smaller and easier to review.
    
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Reviewed-by: Glauber Costa <glommer@openvz.org>
    Reviewed-by: Tejun Heo <tj@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Ying Han <yinghan@google.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Michel Lespinasse <walken@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: Glauber Costa <glommer@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>