1. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boilerplate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier should be applied
      to a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files, created by Philippe Ombredanne.  Philippe prepared the
      base worksheet and did an initial spot review of a few thousand files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file-by-file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      should be applied to the file.  She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source.
       - The file already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when neither scanner could find any license traces, the file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note", otherwise it was "GPL-2.0".  Results of that were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or if it had no licensing
         in it (per the prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with the SPDX license identifiers in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types).  Finally Greg ran the script using the .csv files to
      generate the patches.
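
      For reference, the SPDX tag added by these patches is a single comment on
      the first line of the file, and the comment style differs between headers
      and .c sources (the file names below are only illustrative):

        In a header, e.g. a hypothetical include/linux/example.h:

                /* SPDX-License-Identifier: GPL-2.0 */

        In a C source file, e.g. a hypothetical kernel/example.c:

                // SPDX-License-Identifier: GPL-2.0
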
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  2. 06 Jul, 2017 3 commits
  3. 03 May, 2017 1 commit
    • slab: avoid IPIs when creating kmem caches · a87c75fb
      Greg Thelen authored
      Each slab kmem cache has per cpu array caches.  The array caches are
      created when the kmem_cache is created, either via kmem_cache_create()
      or lazily when the first object is allocated in context of a kmem
      enabled memcg.  Array caches are replaced by writing to /proc/slabinfo.
      
      Array caches are protected by holding slab_mutex or disabling
      interrupts.  Array cache allocation and replacement is done by
      __do_tune_cpucache(), which holds slab_mutex and calls
      kick_all_cpus_sync() to interrupt all remote processors; this confirms
      there are no references to the old array caches.
      
      IPIs are needed when replacing array caches.  But when creating a new
      array cache, there's no need to send IPIs because there cannot be any
      references to the new cache.  Outside of memcg kmem accounting these
      IPIs occur at boot time, so they're not a problem.  But with memcg kmem
      accounting each container can create kmem caches, so the IPIs are
      wasteful.
      
      Avoid unnecessary IPIs when creating array caches.
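
      The shape of the fix can be sketched as follows (an illustrative snippet
      only, not the actual kernel diff; the two helpers named here are invented
      for the illustration):

        /*
         * Illustrative sketch only, not the actual kernel diff; the two
         * helpers named here are invented for the illustration.
         */
        static int tune_cpucache_sketch(struct kmem_cache *cachep)
        {
                struct array_cache *old = install_new_array_caches(cachep);

                if (old)
                        /* Replacing existing array caches: remote CPUs may
                         * still reference them, so synchronize with IPIs. */
                        kick_all_cpus_sync();
                /* A freshly created cache has no users yet, so no IPIs. */
                free_old_array_caches(old);
                return 0;
        }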
      
      A test which reports the IPI count when allocating slabs in 10000 memcgs:
      
      	import os
      
      	def ipi_count():
      		with open("/proc/interrupts") as f:
      			for l in f:
      				if 'Function call interrupts' in l:
      					return int(l.split()[1])
      
      	def echo(val, path):
      		with open(path, "w") as f:
      			f.write(val)
      
      	n = 10000
      	os.chdir("/mnt/cgroup/memory")
      	pid = str(os.getpid())
      	a = ipi_count()
      	for i in range(n):
      		os.mkdir(str(i))
      		echo("1G\n", "%d/memory.limit_in_bytes" % i)
      		echo("1G\n", "%d/memory.kmem.limit_in_bytes" % i)
      		echo(pid, "%d/cgroup.procs" % i)
      		open("/tmp/x", "w").close()
      		os.unlink("/tmp/x")
      	b = ipi_count()
      	print "%d loops: %d => %d (+%d ipis)" % (n, a, b, b-a)
      	echo(pid, "cgroup.procs")
      	for i in range(n):
      		os.rmdir(str(i))
      
      patched:   10000 loops: 1069 => 1170 (+101 ipis)
      unpatched: 10000 loops: 1192 => 48933 (+47741 ipis)
      
      Link: http://lkml.kernel.org/r/20170416214544.109476-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a87c75fb
  4. 18 Apr, 2017 1 commit
    • mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney authored
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
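
      The "type safety" the new name refers to means that a reader must still
      re-validate an object after locking it: the memory stays an object of the
      same type across the RCU read-side critical section, but it may have been
      freed and reused for another instance.  A rough sketch of that pattern
      (the lookup and field names are hypothetical):

        rcu_read_lock();
        obj = lockless_lookup(key);            /* hypothetical lookup */
        if (obj) {
                spin_lock(&obj->lock);         /* memory is still a valid object
                                                * of the same type ...          */
                if (obj->key != key) {         /* ... but it may have been freed
                                                * and reused for another instance,
                                                * so re-check its identity       */
                        spin_unlock(&obj->lock);
                        obj = NULL;
                }
        }
        rcu_read_unlock();
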
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: David Rientjes <rientjes@google.com>
      5f0d5a3a
  5. 02 Mar, 2017 1 commit
  6. 23 Feb, 2017 3 commits
    • slab: introduce __kmemcg_cache_deactivate() · c9fc5864
      Tejun Heo authored
      __kmem_cache_shrink() is called with %true @deactivate only for memcg
      caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
      __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
      should implement it and it should deactivate and drain the cache.
      
      This is to allow memcg cache deactivation behavior to further deviate
      from simple shrinking without messing up __kmem_cache_shrink().
      
      This is pure reorganization and doesn't introduce any observable
      behavior changes.
      
      v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9fc5864
    • Revert "slub: move synchronize_sched out of slab_mutex on shrink" · 290b6a58
      Tejun Heo authored
      Patch series "slab: make memcg slab destruction scalable", v3.
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.
      
      I've seen machines which end up with hundreds of thousands of caches and
      many millions of kernfs_nodes.  The current code is O(N^2) on the total
      number of caches and has synchronous rcu_barrier() and
      synchronize_sched() in cgroup offline / release path which is executed
      while holding cgroup_mutex.  Combined, this leads to very expensive and
      slow cache destruction operations which can easily keep running for half
      a day.
      
      This also messes up /proc/slabinfo along with other cache iterating
      operations.  seq_file operates on 4k chunks and on each 4k boundary
      tries to seek to the last position in the list.  With a huge number of
      caches on the list, this becomes very slow and very prone to the list
      content changing underneath it leading to a lot of missing and/or
      duplicate entries.
      
      This patchset addresses the scalability problem.
      
      * Add root and per-memcg lists.  Update each user to use the
        appropriate list.
      
      * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
        and asynchronous.
      
      * For dying empty slub caches, remove the sysfs files after
        deactivation so that we don't end up with millions of sysfs files
        without any useful information on them.
      
      This patchset contains the following ten patches.
      
       0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
       0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
       0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
       0004-slab-reorganize-memcg_cache_params.patch
       0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
       0006-slab-implement-slab_root_caches-list.patch
       0007-slab-introduce-__kmemcg_cache_deactivate.patch
       0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
       0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
       0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
      
      0001 reverts an existing optimization to prepare for the following
      changes.  0002 is a prep patch.  0003 makes rcu_barrier() in release
      path batched and asynchronous.  0004-0006 separate out the lists.
      0007-0008 replace synchronize_sched() in slub destruction path with
      call_rcu_sched().  0009 removes sysfs files early for empty dying
      caches.  0010 makes destruction work items use a workqueue with limited
      concurrency.
      
      This patch (of 10):
      
      Revert 89e364db ("slub: move synchronize_sched out of slab_mutex on
      shrink").
      
      With kmem cgroup support enabled, kmem_caches can be created and destroyed
      frequently and a great number of near empty kmem_caches can accumulate if
      there are a lot of transient cgroups and the system is not under memory
      pressure.  When memory reclaim starts under such conditions, it can lead
      to consecutive deactivation and destruction of many kmem_caches, easily
      hundreds of thousands on moderately large systems, exposing scalability
      issues in the current slab management code.  This is one of the patches to
      address the issue.
      
      Moving synchronize_sched() out of slab_mutex isn't enough as it's still
      inside cgroup_mutex.  The whole deactivation / release path will be
      updated to avoid all synchronous RCU operations.  Revert this insufficient
      optimization in preparation to ease future changes.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jay Vana <jsvana@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      290b6a58
    • mm, slab: rename kmalloc-node cache to kmalloc-<size> · af3b5f87
      Vlastimil Babka authored
      SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
      the kmem_cache_node management structure, and puts it into the generic
      kmalloc cache array (e.g. for 128b objects).  The name of this cache is
      "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
      cache is used for generic allocations (and not just the kmem_cache_node
      struct) and it appears as if the kmalloc-128 cache is missing.
      
      An easy solution is to use the kmalloc-<size> name when pre-creating the
      cache, which we can get from the kmalloc_info array.
      
      Example /proc/slabinfo before the patch:
      
        ...
        kmalloc-256         1647   1984    256   16    1 : tunables  120   60    8 : slabdata    124    124    828
        kmalloc-192         1974   1974    192   21    1 : tunables  120   60    8 : slabdata     94     94    133
        kmalloc-96          1332   1344    128   32    1 : tunables  120   60    8 : slabdata     42     42    219
        kmalloc-64          2505   5952     64   64    1 : tunables  120   60    8 : slabdata     93     93    715
        kmalloc-32          4278   4464     32  124    1 : tunables  120   60    8 : slabdata     36     36    346
        kmalloc-node        1352   1376    128   32    1 : tunables  120   60    8 : slabdata     43     43     53
        kmem_cache           132    147    192   21    1 : tunables  120   60    8 : slabdata      7      7      0
      
      After the patch:
      
        ...
        kmalloc-256         1672   2160    256   16    1 : tunables  120   60    8 : slabdata    135    135    807
        kmalloc-192         1992   2016    192   21    1 : tunables  120   60    8 : slabdata     96     96    203
        kmalloc-96          1159   1184    128   32    1 : tunables  120   60    8 : slabdata     37     37    116
        kmalloc-64          2561   4864     64   64    1 : tunables  120   60    8 : slabdata     76     76    785
        kmalloc-32          4253   4340     32  124    1 : tunables  120   60    8 : slabdata     35     35    270
        kmalloc-128         1256   1280    128   32    1 : tunables  120   60    8 : slabdata     40     40     39
        kmem_cache           125    147    192   21    1 : tunables  120   60    8 : slabdata      7      7      0
      
      [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
        Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
      Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af3b5f87
  7. 11 Jan, 2017 1 commit
  8. 13 Dec, 2016 3 commits
  9. 28 Oct, 2016 2 commits
  10. 17 Sep, 2016 1 commit
    • slab, workqueue: remove keventd_up() usage · eac0337a
      Tejun Heo authored
      Now that workqueue can handle work item queueing from very early
      during boot, there is no need to gate schedule_delayed_work_on() while
      !keventd_up().  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm@kvack.org
      eac0337a
  11. 06 Sep, 2016 1 commit
  12. 02 Aug, 2016 2 commits
  13. 26 Jul, 2016 5 commits
    • mm/slab: use list_move instead of list_del/list_add · de24baec
      Wei Yongjun authored
      Use list_move() instead of list_del() + list_add() to avoid needlessly
      poisoning the next and prev values.
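
      The transformation is mechanical; in list.h terms it amounts to roughly
      the following (field names shown only for illustration):

        /* before: list_del() writes LIST_POISON1/2 into next/prev, only for
         * list_add() to overwrite them again immediately */
        list_del(&page->lru);
        list_add(&page->lru, &n->slabs_free);

        /* after: a single helper, with no intermediate poisoning */
        list_move(&page->lru, &n->slabs_free);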
      
      Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.com
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de24baec
    • slab: do not panic on invalid gfp_mask · 72baeef0
      Michal Hocko authored
      Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
      This is a rather harsh way to announce a non-critical issue.  The
      allocator is free to ignore invalid flags.  Let's simply replace BUG()
      with dump_stack() to tell the offender, and fix up the mask to move on
      with the allocation request.
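
      The replacement is roughly of this shape (a sketch of the idea, not the
      exact diff):

        if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
                gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

                flags &= ~GFP_SLAB_BUG_MASK;    /* fix up the mask and move on */
                pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                        invalid_mask, &invalid_mask, flags, &flags);
                dump_stack();                   /* tell the offender instead of BUG() */
        }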
      
      This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
      module:
      
        Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
        CPU: 0 PID: 2916 Comm: insmod Tainted: G           O    4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
          dump_stack+0x67/0x90
          cache_alloc_refill+0x201/0x617
          kmem_cache_alloc_trace+0xa7/0x24a
          ? 0xffffffffa0005000
          mymodule_init+0x20/0x1000 [test_slab]
          do_one_initcall+0xe7/0x16c
          ? rcu_read_lock_sched_held+0x61/0x69
          ? kmem_cache_alloc_trace+0x197/0x24a
          do_init_module+0x5f/0x1d9
          load_module+0x1a3d/0x1f21
          ? retint_kernel+0x2d/0x2d
          SyS_init_module+0xe8/0x10e
          ? SyS_init_module+0xe8/0x10e
          do_syscall_64+0x68/0x13f
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72baeef0
    • slab: make GFP_SLAB_BUG_MASK information more human readable · bacdcb34
      Michal Hocko authored
      printk has offered %pGg for quite some time, so let's use it to get a
      human-readable list of invalid flags.
      
      The original output would be
        [  429.191962] gfp: 2
      
      after the change
        [  429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)
      
      Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bacdcb34
    • mm: reorganize SLAB freelist randomization · 7c00fce9
      Thomas Garnier authored
      The kernel heap allocators use a sequential freelist, making their
      allocations predictable.  This predictability makes kernel heap overflows
      easier to exploit.  An attacker can carefully prepare the kernel heap so
      that an overflow corrupts the chunk that follows the overflowed one.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      ***Problems that needed solving:
       - Randomize the Freelist (singly linked) used in the SLUB allocator.
       - Ensure good performance to encourage usage.
       - Get best entropy in early boot stage.
      
      ***Parts:
       - 01/02 Reorganize the SLAB Freelist randomization to share elements
         with the SLUB implementation.
       - 02/02 The SLUB Freelist randomization implementation. Similar approach
         to the SLAB one, but tailored to the singly linked freelist used in SLUB.
      
      ***Performance data:
      
      slab_test impact is between 3% and 4% on average for 100000 attempts
      without smp.  It is a very focused test; kernbench shows the overall
      impact on the system is much lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      This patch (of 2):
      
      This commit reorganizes the previous SLAB freelist randomization to
      prepare for the SLUB implementation.  It moves functions that will be
      shared to slab_common.
      
      The entropy functions are changed to align with the SLUB implementation,
      now using get_random_(int|long) functions.  These functions were chosen
      because they provide a bit more entropy early in boot and better
      performance when specific arch instructions are not available.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c00fce9
    • mm: SLAB hardened usercopy support · 04385fc5
      Kees Cook authored
      Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
      SLAB allocator to catch any copies that may span objects.
      
      Based on code from PaX and grsecurity.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      04385fc5
  14. 21 May, 2016 2 commits
    • mm, kasan: don't call kasan_krealloc() from ksize(). · 4ebb31a4
      Alexander Potapenko authored
      Instead of calling kasan_krealloc(), which replaces the memory
      allocation stack ID (if stack depot is used), just unpoison the whole
      memory chunk.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ebb31a4
    • mm: kasan: initial memory quarantine implementation · 55834c59
      Alexander Potapenko authored
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      When the object is freed, its state changes from KASAN_STATE_ALLOC to
      KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
      instead of being returned to the allocator, therefore every subsequent
      access to that object triggers a KASAN error, and the error handler is
      able to say where the object has been allocated and deallocated.
      
      When it's time for the object to leave quarantine, its state becomes
      KASAN_STATE_FREE and it's returned to the allocator.  From now on the
      allocator may reuse it for another allocation.  Before that happens,
      it's still possible to detect a use-after free on that object (it
      retains the allocation/deallocation stacks).
      
      When the allocator reuses this object, the shadow is unpoisoned and the
      old allocation/deallocation stacks are wiped.  Therefore a use of this
      object, even an incorrect one, won't trigger an ASan warning.
      
      Without the quarantine, it's not guaranteed that the objects aren't
      reused immediately; that's why the probability of catching a
      use-after-free is lower than with the quarantine in place.
      
      Freed objects are first added to per-cpu quarantine queues.  When a
      cache is destroyed or memory shrinking is requested, the objects are
      moved into the global quarantine queue.  Whenever a kmalloc call allows
      memory reclaiming, the oldest objects are popped out of the global queue
      until the total size of objects in quarantine is less than 3/4 of the
      maximum quarantine size (which is a fraction of installed physical
      memory).
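
      A rough userspace model of that drain policy (illustrative only, nothing
      like the actual kernel code) is to park freed chunks in a FIFO and only
      really release the oldest ones once the FIFO grows past its limit:

        #include <stdio.h>
        #include <stdlib.h>

        #define QUARANTINE_MAX  8                 /* toy limit on parked chunks */
        #define QUARANTINE_LOW  (QUARANTINE_MAX * 3 / 4)

        static void *quarantine[QUARANTINE_MAX];
        static int q_head, q_count;

        static void quarantine_put(void *p)       /* a "free" parks the chunk here */
        {
                if (q_count >= QUARANTINE_MAX) {
                        /* pop the oldest entries until below 3/4 of the limit */
                        while (q_count > QUARANTINE_LOW) {
                                free(quarantine[q_head]);   /* the real free happens here */
                                q_head = (q_head + 1) % QUARANTINE_MAX;
                                q_count--;
                        }
                }
                quarantine[(q_head + q_count) % QUARANTINE_MAX] = p;
                q_count++;
        }

        int main(void)
        {
                for (int i = 0; i < 32; i++)
                        quarantine_put(malloc(16));
                printf("%d chunks still parked in quarantine\n", q_count);
                /* remaining parked chunks are reclaimed at process exit */
                return 0;
        }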
      
      As long as an object remains in the quarantine, KASAN is able to report
      accesses to it, so the chance of reporting a use-after-free is
      increased.  Once the object leaves quarantine, the allocator may reuse
      it, in which case the object is unpoisoned and KASAN can't detect
      incorrect accesses to it.
      
      Right now quarantine support is only enabled in the SLAB allocator.
      Unification of KASAN features in SLAB and SLUB will be done later.
      
      This patch is based on the "mm: kasan: quarantine" patch originally
      prepared by Dmitry Chernenkov.  A number of improvements have been
      suggested by Andrey Ryabinin.
      
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55834c59
  15. 20 May, 2016 13 commits
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton authored
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
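
      The helper is essentially that snippet wrapped up; its intent can be
      sketched as follows (not the exact kernel definition):

        /* Sketch of the intent only, not the exact kernel definition. */
        static inline int next_node_in(int node, const nodemask_t src)
        {
                node = next_node(node, src);
                if (node == MAX_NUMNODES)
                        node = first_node(src);
                return node;
        }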
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
    • mm: slab: remove ZONE_DMA_FLAG · a3187e43
      Yang Shi authored
      Now we have the IS_ENABLED() helper to check whether a Kconfig option is
      enabled, so ZONE_DMA_FLAG no longer seems useful.

      And, the use of ZONE_DMA_FLAG in slab looks pointless according to the
      comment [1] from Johannes Weiner, so remove it; ORing the passed-in
      flags with the cache gfp flags is done in kmem_getpages().
      
      [1] https://lkml.org/lkml/2014/9/25/553
      
      Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3187e43
    • mm: SLAB freelist randomization · c7ce4f60
      Thomas Garnier authored
      Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
      the SLAB freelist.  The list is randomized during initialization of a
      new set of pages.  The order on different freelist sizes is pre-computed
      at boot for performance.  Each kmem_cache has its own randomized
      freelist.  Before pre-computed lists are available freelists are
      generated dynamically.  This security feature reduces the predictability
      of the kernel SLAB allocator against heap overflows rendering attacks
      much less stable.
      
      For example this attack against SLUB (also applicable against SLAB)
      would be affected:
      
        https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
      
      Also, since v4.6 the freelist has been moved to the end of the SLAB.  This
      means a controllable heap is open to new attacks not yet publicly
      discussed.  A kernel heap overflow can be transformed into multiple
      use-after-frees.  This feature makes this type of attack harder too.
      
      To generate entropy, we use get_random_bytes_arch because 0 bits of
      entropy are available at that stage of boot.  In the worst case this
      function will fall back to the get_random_bytes sub API.  We also generate
      a random shift number used to shift the pre-computed freelist for each new
      set of pages.
      
      The config option name is not specific to the SLAB as this approach will
      be extended to other allocators like SLUB.
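
      The precomputation itself is just a shuffle; a small userspace model of
      the idea (illustrative only, not the kernel implementation) could look
      like this:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define OBJS_PER_SLAB 8        /* toy value; the kernel derives it per cache */

        static unsigned int precomputed[OBJS_PER_SLAB];

        /* Pre-compute one random order (Fisher-Yates) to be reused for every page. */
        static void precompute_order(void)
        {
                for (unsigned int i = 0; i < OBJS_PER_SLAB; i++)
                        precomputed[i] = i;
                for (unsigned int i = OBJS_PER_SLAB - 1; i > 0; i--) {
                        unsigned int j = rand() % (i + 1);
                        unsigned int tmp = precomputed[i];

                        precomputed[i] = precomputed[j];
                        precomputed[j] = tmp;
                }
        }

        int main(void)
        {
                srand((unsigned int)time(NULL));
                precompute_order();

                /* Each new "slab page" reuses the pre-computed order with a fresh
                 * random shift, so pages do not share the exact same layout. */
                unsigned int shift = rand() % OBJS_PER_SLAB;

                printf("freelist order for this page:");
                for (unsigned int i = 0; i < OBJS_PER_SLAB; i++)
                        printf(" %u", precomputed[(i + shift) % OBJS_PER_SLAB]);
                printf("\n");
                return 0;
        }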
      
      Performance results highlighted no major changes:
      
      Hackbench (running 90 10 times):
      
        Before average: 0.0698
        After average: 0.0663 (-5.01%)
      
      slab_test 1 run on boot.  Difference only seen on the 2048 size test
      being the worse case scenario covered by freelist randomization.  New
      slab pages are constantly being created on the 10000 allocations.
      Variance should be mainly due to getting new pages every few
      allocations.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
        10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
        10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
        10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
        10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
        10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
        10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
        10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
        10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
        10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
        10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
        10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 121 cycles
        10000 times kmalloc(64)/kfree -> 121 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 121 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
        10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
        10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
        10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
        10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
        10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
        10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
        10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
        10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
        10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
        10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
        10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 123 cycles
        10000 times kmalloc(64)/kfree -> 142 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 119 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7ce4f60
    • mm/slab: lockless decision to grow cache · 801faf0d
      Joonsoo Kim authored
      To check precisely whether free objects exist or not, we need to grab a
      lock.  But accuracy isn't that important here because the race window is
      small, and if there are too many free objects the cache reaper will reap
      them.  So, this patch makes the check for free object existence lockless.
      This will reduce lock contention in the heavy allocation case.
      
      Note that until now, n->shared could be freed during the processing by a
      write to slabinfo, but, with some trickery in this patch, we can access
      it freely within the interrupt-disabled period.
      
      Below is the result of concurrent allocation/free in slab allocation
      benchmark made by Christoph a long time ago.  I have simplified the output.
      The numbers show cycle counts for alloc/free respectively, so less is
      better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
        * After
        Kmalloc N*alloc N*free(32): Average=344/792
        Kmalloc N*alloc N*free(64): Average=347/882
        Kmalloc N*alloc N*free(128): Average=390/959
        Kmalloc N*alloc N*free(256): Average=393/1067
        Kmalloc N*alloc N*free(512): Average=683/1229
        Kmalloc N*alloc N*free(1024): Average=1295/1325
        Kmalloc N*alloc N*free(2048): Average=2513/1664
        Kmalloc N*alloc N*free(4096): Average=4742/2172
      
      It shows that allocation performance decreases for object sizes up to
      128, which may be due to extra checks in cache_alloc_refill().  But,
      considering the improvement in free performance, the net result looks the
      same.  Results for the other size classes look very promising: roughly a
      50% performance improvement.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      801faf0d
    • mm/slab: refill cpu cache through a new slab without holding a node lock · 213b4695
      Joonsoo Kim authored
      Until now, growing the cache puts a free slab on the node's slab list and
      then we can allocate free objects from it.  This necessarily requires
      holding a node lock, which is heavily contended.  If we refill the cpu
      cache before attaching the slab to the node's slab list, we can avoid
      holding the node lock as much as possible because the newly allocated
      slab is only visible to the current task.  This will reduce lock
      contention.
      
      Below is the result of concurrent allocation/free in slab allocation
      benchmark made by Christoph a long time ago.  I have simplified the output.
      The numbers show cycle counts for alloc/free respectively, so less is
      better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
        * After
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
      It shows that contention is reduced for all the object sizes and
      performance increases by 30 ~ 40%.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      213b4695
    • mm/slab: separate cache_grow() to two parts · 76b342bd
      Joonsoo Kim authored
      This is a preparation step to implement a lockless allocation path for
      when there are no free objects in the kmem_cache.

      What we'd like to do here is to refill the cpu cache without holding a
      node lock.  To accomplish this, the refill should be done after new slab
      allocation but before attaching the slab to the management list.  So,
      this patch separates cache_grow() into two parts, allocation and
      attaching to the list, in order to add some code in between them in the
      following patch.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76b342bd
    • mm/slab: make cache_grow() handle the page allocated on arbitrary node · 511e3a05
      Joonsoo Kim authored
      Currently, cache_grow() assumes that the allocated page's nodeid is the
      same as the nodeid parameter used for the allocation request.  If we
      discard this assumption, we can handle the fallback_alloc() case
      gracefully.  So, this patch makes cache_grow() handle a page allocated on
      an arbitrary node and cleans up the relevant code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      511e3a05
    • mm/slab: racy access/modify the slab color · 03d1d43a
      Joonsoo Kim authored
      The slab color doesn't strictly need to be changed.  Because locking for
      changing the slab color could cause more lock contention, this patch
      implements racy access/modification of the slab color.  This is a
      preparation step to implement a lockless allocation path for when there
      are no free objects in the kmem_cache.
      
      Below is the result of concurrent allocation/free in slab allocation
      benchmark made by Christoph a long time ago.  I have simplified the output.
      The numbers show cycle counts for alloc/free respectively, so less is
      better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=365/806
        Kmalloc N*alloc N*free(64): Average=452/690
        Kmalloc N*alloc N*free(128): Average=736/886
        Kmalloc N*alloc N*free(256): Average=1167/985
        Kmalloc N*alloc N*free(512): Average=2088/1125
        Kmalloc N*alloc N*free(1024): Average=4115/1184
        Kmalloc N*alloc N*free(2048): Average=8451/1748
        Kmalloc N*alloc N*free(4096): Average=16024/2048
      
        * After
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
      It shows that contention is reduced for object size >= 1024 and
      performance increases by roughly 15%.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03d1d43a
    • mm/slab: don't keep free slabs if free_objects exceeds free_limit · 6052b788
      Joonsoo Kim authored
      Currently, the determination whether to free a slab is made whenever each
      freed object is put into the slab.  This has the following problem.
      
      Assume free_limit = 10 and nr_free = 9.
      
      Frees happen in the following sequence, and nr_free changes as follows:

        free #1 (the slab becomes a free slab):          nr_free: 9 -> 10
        free #2 (the slab does not become a free slab):  nr_free: 10 -> 11

      If we check whether we can free the current slab on each object free, we
      can't free any slab in this situation, because the current slab isn't a
      free slab when nr_free exceeds free_limit (at the second free), even
      though there is a free slab.

      However, if we do the check at the end, we can free one free slab.

      This problem can cause the slab subsystem to keep too much memory.  This
      patch tries to fix it by checking the number of free objects after all
      free work is done.  If there is a free slab at that time, we can free
      slabs as much as possible, so we keep the number of free slabs minimal.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6052b788
    • mm/slab: clean-up kmem_cache_node setup · c3d332b6
      Joonsoo Kim authored
      There is mostly the same code for setting up a kmem_cache_node in both
      cpuup_prepare() and alloc_kmem_cache_node().  Factor it out and clean it
      up.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Nishanth Menon <nm@ti.com>
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3d332b6
    • mm/slab: factor out kmem_cache_node initialization code · ded0ecf6
      Joonsoo Kim authored
      It can be reused in other places, so factor it out.  The following patch
      will use it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ded0ecf6
    • mm/slab: drain the free slab as much as possible · a5aa63a5
      Joonsoo Kim authored
      slabs_tofree() implies freeing all free slabs.  We can do that by just
      providing INT_MAX.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5aa63a5
    • mm/slab: remove BAD_ALIEN_MAGIC again · 8888177e
      Joonsoo Kim authored
      The initial attempt to remove BAD_ALIEN_MAGIC was once reverted by 'commit
      edcad250 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' because it
      caused a problem on m68k, which has many nodes but !CONFIG_NUMA.  In that
      case the alien cache isn't used at all, but to cope with some
      initialization paths a garbage value is used, and that is BAD_ALIEN_MAGIC.
      Now this patch sets use_alien_caches to 0 when !CONFIG_NUMA; there is no
      initialization path problem, so we don't need BAD_ALIEN_MAGIC at all.  So
      remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8888177e