1. 13 Feb, 2015 40 commits
    • Andy Lutomirski's avatar
      all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski authored
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger's avatarRichard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f56141e3
    • Andy Shevchenko's avatar
      fs/proc/array.c: convert to use string_escape_str() · edc924e0
      Andy Shevchenko authored
      Instead of custom approach let's use string_escape_str() to escape a given
      string (task_name in this case).
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      edc924e0
    • Rafael Aquini's avatar
      fs: proc: task_mmu: show page size in /proc/<pid>/numa_maps · 198d1597
      Rafael Aquini authored
      The output of /proc/$pid/numa_maps is in terms of number of pages like
      anon=22 or dirty=54.  Here's some output:
      
        7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
        7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
        7fff8d425000 default stack anon=50 dirty=50 N0=50
      
      Looks like we have a stack and a couple of anonymous hugetlbfs
      areas page which both use the same amount of memory.  They don't.
      
      The 'bigfile' uses 1GB pages and takes up ~50GB of space.  The
      anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
      uses normal 4k pages.  You can go over to smaps to figure out what the
      page size _really_ is with KernelPageSize or MMUPageSize.  But, I think
      this is a pretty nasty and counterintuitive interface as it stands.
      
      This patch introduces 'kernelpagesize_kB' line element to
      /proc/<pid>/numa_maps report file in order to help identifying the size of
      pages that are backing memory areas mapped by a given task.  This is
      specially useful to help differentiating between HUGE and GIGANTIC page
      backed VMAs.
      
      This patch is based on Dave Hansen's proposal and reviewer's follow-ups
      taken from the following dicussion threads:
       * https://lkml.org/lkml/2011/9/21/454
       * https://lkml.org/lkml/2014/12/20/66Signed-off-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      198d1597
    • Rafael Aquini's avatar
      Documentation/filesystems/proc.txt: add /proc/pid/numa_maps interface explanation snippet · 0c369711
      Rafael Aquini authored
      Add a small section to proc.txt doc in order to document its
      /proc/pid/numa_maps interface.  It does not introduce any functional
      changes, just documentation.
      Signed-off-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c369711
    • Alexander Kuleshov's avatar
      fs: proc: use PDE() to get proc_dir_entry · 6bee55f9
      Alexander Kuleshov authored
      Use the PDE() helper to get proc_dir_entry instead of coding it directly.
      Signed-off-by: default avatarAlexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6bee55f9
    • Petr Cermak's avatar
      fs/proc/task_mmu.c: add user-space support for resetting mm->hiwater_rss (peak RSS) · 695f0559
      Petr Cermak authored
      Peak resident size of a process can be reset back to the process's
      current rss value by writing "5" to /proc/pid/clear_refs.  The driving
      use-case for this would be getting the peak RSS value, which can be
      retrieved from the VmHWM field in /proc/pid/status, per benchmark
      iteration or test scenario.
      
      [akpm@linux-foundation.org: clarify behaviour in documentation]
      Signed-off-by: default avatarPetr Cermak <petrcermak@chromium.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Primiano Tucci <primiano@chromium.org>
      Cc: Petr Cermak <petrcermak@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      695f0559
    • Rickard Strandqvist's avatar
      arch/frv/mm/extable.c: remove unused function · d170cf46
      Rickard Strandqvist authored
      Remove the function search_one_table() that is not used anywhere.
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: default avatarRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d170cf46
    • Ganesh Mahendran's avatar
      mm/zsmalloc: add statistics support · 0f050d99
      Ganesh Mahendran authored
      Keeping fragmentation of zsmalloc in a low level is our target.  But now
      we still need to add the debug code in zsmalloc to get the quantitative
      data.
      
      This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable the
      statistics collection for developers.  Currently only the objects
      statatitics in each class are collected.  User can get the information via
      debugfs.
      
           cat /sys/kernel/debug/zsmalloc/zram0/...
      
      For example:
      
      After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
       class  size obj_allocated   obj_used pages_used
           0    32             0          0          0
           1    48           256         12          3
           2    64            64         14          1
           3    80            51          7          1
           4    96           128          5          3
           5   112            73          5          2
           6   128            32          4          1
           7   144             0          0          0
           8   160             0          0          0
           9   176             0          0          0
          10   192             0          0          0
          11   208             0          0          0
          12   224             0          0          0
          13   240             0          0          0
          14   256            16          1          1
          15   272            15          9          1
          16   288             0          0          0
          17   304             0          0          0
          18   320             0          0          0
          19   336             0          0          0
          20   352             0          0          0
          21   368             0          0          0
          22   384             0          0          0
          23   400             0          0          0
          24   416             0          0          0
          25   432             0          0          0
          26   448             0          0          0
          27   464             0          0          0
          28   480             0          0          0
          29   496            33          1          4
          30   512             0          0          0
          31   528             0          0          0
          32   544             0          0          0
          33   560             0          0          0
          34   576             0          0          0
          35   592             0          0          0
          36   608             0          0          0
          37   624             0          0          0
          38   640             0          0          0
          40   672             0          0          0
          42   704             0          0          0
          43   720            17          1          3
          44   736             0          0          0
          46   768             0          0          0
          49   816             0          0          0
          51   848             0          0          0
          52   864            14          1          3
          54   896             0          0          0
          57   944            13          1          3
          58   960             0          0          0
          62  1024             4          1          1
          66  1088            15          2          4
          67  1104             0          0          0
          71  1168             0          0          0
          74  1216             0          0          0
          76  1248             0          0          0
          83  1360             3          1          1
          91  1488            11          1          4
          94  1536             0          0          0
         100  1632             5          1          2
         107  1744             0          0          0
         111  1808             9          1          4
         126  2048             4          4          2
         144  2336             7          3          4
         151  2448             0          0          0
         168  2720            15         15         10
         190  3072            28         27         21
         202  3264             0          0          0
         254  4096         36209      36209      36209
      
       Total               37022      36326      36288
      
      We can calculate the overall fragentation by the last line:
          Total               37022      36326      36288
          (37022 - 36326) / 37022 = 1.87%
      
      Also by analysing objects alocated in every class we know why we got so
      low fragmentation: Most of the allocated objects is in <class 254>.  And
      there is only 1 page in class 254 zspage.  So, No fragmentation will be
      introduced by allocating objs in class 254.
      
      And in future, we can collect other zsmalloc statistics as we need and
      analyse them.
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Suggested-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f050d99
    • Ganesh Mahendran's avatar
      mm/zpool: add name argument to create zpool · 3eba0c6a
      Ganesh Mahendran authored
      Currently the underlay of zpool: zsmalloc/zbud, do not know who creates
      them.  There is not a method to let zsmalloc/zbud find which caller they
      belong to.
      
      Now we want to add statistics collection in zsmalloc.  We need to name the
      debugfs dir for each pool created.  The way suggested by Minchan Kim is to
      use a name passed by caller(such as zram) to create the zsmalloc pool.
      
          /sys/kernel/debug/zsmalloc/zram0
      
      This patch adds an argument `name' to zs_create_pool() and other related
      functions.
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3eba0c6a
    • Sergey Senozhatsky's avatar
      zram: remove request_queue from struct zram · ee980160
      Sergey Senozhatsky authored
      `struct zram' contains both `struct gendisk' and `struct request_queue'.
      the latter can be deleted, because zram->disk carries ->queue pointer, and
      ->queue carries zram pointer:
      
      create_device()
      	zram->queue->queuedata = zram
      	zram->disk->queue = zram->queue
      	zram->disk->private_data = zram
      
      so zram->queue is not needed, we can access all necessary data anyway.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee980160
    • Minchan Kim's avatar
      zram: remove init_lock in zram_make_request · 08eee69f
      Minchan Kim authored
      Admin could reset zram during I/O operation going on so we have used
      zram->init_lock as read-side lock in I/O path to prevent sudden zram
      meta freeing.
      
      However, the init_lock is really troublesome.  We can't do call
      zram_meta_alloc under init_lock due to lockdep splat because
      zram_rw_page is one of the function under reclaim path and hold it as
      read_lock while other places in process context hold it as write_lock.
      So, we have used allocation out of the lock to avoid lockdep warn but
      it's not good for readability and fainally, I met another lockdep splat
      between init_lock and cpu_hotplug from kmem_cache_destroy during working
      zsmalloc compaction.  :(
      
      Yes, the ideal is to remove horrible init_lock of zram in rw path.  This
      patch removes it in rw path and instead, add atomic refcount for meta
      lifetime management and completion to free meta in process context.
      It's important to free meta in process context because some of resource
      destruction needs mutex lock, which could be held if we releases the
      resource in reclaim context so it's deadlock, again.
      
      As a bonus, we could remove init_done check in rw path because
      zram_meta_get will do a role for it, instead.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      08eee69f
    • Minchan Kim's avatar
      zram: check bd_openers instead of bd_holders · 2b269ce6
      Minchan Kim authored
      bd_holders is increased only when user open the device file as FMODE_EXCL
      so if something opens zram0 as !FMODE_EXCL and request I/O while another
      user reset zram0, we can see following warning.
      
        zram0: detected capacity change from 0 to 64424509440
        Buffer I/O error on dev zram0, logical block 180823, lost async page write
        Buffer I/O error on dev zram0, logical block 180824, lost async page write
        Buffer I/O error on dev zram0, logical block 180825, lost async page write
        Buffer I/O error on dev zram0, logical block 180826, lost async page write
        Buffer I/O error on dev zram0, logical block 180827, lost async page write
        Buffer I/O error on dev zram0, logical block 180828, lost async page write
        Buffer I/O error on dev zram0, logical block 180829, lost async page write
        Buffer I/O error on dev zram0, logical block 180830, lost async page write
        Buffer I/O error on dev zram0, logical block 180831, lost async page write
        Buffer I/O error on dev zram0, logical block 180832, lost async page write
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
        Modules linked in:
        CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x45/0x57
          warn_slowpath_common+0x8a/0xc0
          warn_slowpath_null+0x1a/0x20
          __blkdev_put+0x1d7/0x210
          blkdev_put+0x50/0x130
          blkdev_close+0x25/0x30
          __fput+0xdf/0x1e0
          ____fput+0xe/0x10
          task_work_run+0xa7/0xe0
          do_notify_resume+0x49/0x60
          int_signal+0x12/0x17
        ---[ end trace 274fbbc5664827d2 ]---
      
      The warning comes from bdev_write_node in blkdev_put path.
      
         static void bdev_write_inode(struct inode *inode)
         {
              spin_lock(&inode->i_lock);
              while (inode->i_state & I_DIRTY) {
                      spin_unlock(&inode->i_lock);
                      WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
                      spin_lock(&inode->i_lock);
              }
              spin_unlock(&inode->i_lock);
         }
      
      The reason is dd process encounters I/O fails due to sudden block device
      disappear so in filemap_check_errors in __writeback_single_inode returns
      -EIO.
      
      If we check bd_openers instead of bd_holders, we could address the
      problem.  When I see the brd, it already have used it rather than
      bd_holders so although I'm not a expert of block layer, it seems to be
      better.
      
      I can make following warning with below simple script.  In addition, I
      added msleep(2000) below set_capacity(zram->disk, 0) after applying your
      patch to make window huge(Kudos to Ganesh!)
      
      script:
      
         echo $((60<<30)) > /sys/block/zram0/disksize
         setsid dd if=/dev/zero of=/dev/zram0 &
         sleep 1
         setsid echo 1 > /sys/block/zram0/reset
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b269ce6
    • Sergey Senozhatsky's avatar
      zram: rework reset and destroy path · a096cafc
      Sergey Senozhatsky authored
      We need to return set_capacity(disk, 0) from reset_store() back to
      zram_reset_device(), a catch by Ganesh Mahendran.  Potentially, we can
      race set_capacity() calls from init and reset paths.
      
      The problem is that zram_reset_device() is also getting called from
      zram_exit(), which performs operations in misleading reversed order -- we
      first create_device() and then init it, while zram_exit() perform
      destroy_device() first and then does zram_reset_device().  This is done to
      remove sysfs group before we reset device, so we can continue with device
      reset/destruction not being raced by sysfs attr write (f.e.  disksize).
      
      Apart from that, destroy_device() releases zram->disk (but we still have
      ->disk pointer), so we cannot acces zram->disk in later
      zram_reset_device() call, which may cause additional errors in the future.
      
      So, this patch rework and cleanup destroy path.
      
      1) remove several unneeded goto labels in zram_init()
      
      2) factor out zram_init() error path and zram_exit() into
         destroy_devices() function, which takes the number of devices to
         destroy as its argument.
      
      3) remove sysfs group in destroy_devices() first, so we can reorder
         operations -- reset device (as expected) goes before disk destroy and
         queue cleanup.  So we can always access ->disk in zram_reset_device().
      
      4) and, finally, return set_capacity() back under ->init_lock.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a096cafc
    • Sergey Senozhatsky's avatar
      zram: fix umount-reset_store-mount race condition · ba6b17d6
      Sergey Senozhatsky authored
      Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
      avoid ->bd_holders race condition:
      
              CPU0                            CPU1
      umount /* zram->init_done is true */
      reset_store()
      bdev->bd_holders == 0                   mount
      ...                                     zram_make_request()
      zram_reset_device()
      
      However, his solution required some considerable amount of code movement,
      which we can avoid.
      
      Apart from using bdev->bd_mutex in reset_store(), this patch also
      simplifies zram_reset_device().
      
      zram_reset_device() has a bool parameter reset_capacity which tells it
      whether disk capacity and itself disk should be reset.  There are two
      zram_reset_device() callers:
      
      -- zram_exit() passes reset_capacity=false
      -- reset_store() passes reset_capacity=true
      
      So we can move reset_capacity-sensitive work out of zram_reset_device()
      and perform it unconditionally in reset_store().  This also lets us drop
      reset_capacity parameter from zram_reset_device() and pass zram pointer
      only.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba6b17d6
    • Ganesh Mahendran's avatar
      zram: free meta table in zram_meta_free · 1fec1172
      Ganesh Mahendran authored
      zram_meta_alloc() and zram_meta_free() are a pair.  In
      zram_meta_alloc(), meta table is allocated.  So it it better to free it
      in zram_meta_free().
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1fec1172
    • Sergey Senozhatsky's avatar
      zram: clean up zram_meta_alloc() · b8179958
      Sergey Senozhatsky authored
      A trivial cleanup of zram_meta_alloc() error handling.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8179958
    • Hugh Dickins's avatar
      mm: fix negative nr_isolated counts · ff59909a
      Hugh Dickins authored
      The vmstat interfaces are good at hiding negative counts (at least when
      CONFIG_SMP); but if you peer behind the curtain, you find that
      nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
      more negative: so they can absorb larger and larger numbers of isolated
      pages, yet still appear to be zero.
      
      I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
      but I guess it's there for a good reason, in which case we ought to get
      too_many_isolated() working again.
      
      The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
      putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
      to call acct_isolated() to increment them.
      
      It is possible that the bug whcih this patch fixes could cause OOM kills
      when the system still has a lot of reclaimable page cache.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff59909a
    • Naoya Horiguchi's avatar
      mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() · 9ab3b598
      Naoya Horiguchi authored
      A race condition starts to be visible in recent mmotm, where a PG_hwpoison
      flag is set on a migration source page *before* it's back in buddy page
      poo= l.
      
      This is problematic because no page flag is supposed to be set when
      freeing (see __free_one_page().) So the user-visible effect of this race
      is that it could trigger the BUG_ON() when soft-offlining is called.
      
      The root cause is that we call lru_add_drain_all() to make sure that the
      page is in buddy, but that doesn't work because this function just
      schedule= s a work item and doesn't wait its completion.
      drain_all_pages() does drainin= g directly, so simply dropping
      lru_add_drain_all() solves this problem.
      
      Fixes: f15bdfa8 ("mm/memory-failure.c: fix memory leak in successful soft offlining")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[3.11+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9ab3b598
    • Yaowei Bai's avatar
      mm/page_alloc: fix comment · 84109e15
      Yaowei Bai authored
      Add a necessary 'leave'.
      Signed-off-by: default avatarYaowei Bai <bywxiaobai@163.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84109e15
    • Grazvydas Ignotas's avatar
      mm/memory.c: actually remap enough memory · 9cb12d7b
      Grazvydas Ignotas authored
      For whatever reason, generic_access_phys() only remaps one page, but
      actually allows to access arbitrary size.  It's quite easy to trigger
      large reads, like printing out large structure with gdb, which leads to a
      crash.  Fix it by remapping correct size.
      
      Fixes: 28b2ee20 ("access_process_vm device memory infrastructure")
      Signed-off-by: default avatarGrazvydas Ignotas <notasas@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cb12d7b
    • Rasmus Villemoes's avatar
      kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __init · 8f4ab07f
      Rasmus Villemoes authored
      The only caller of cpuset_init_current_mems_allowed is the __init
      annotated build_all_zonelists_init, so we can also make the former __init.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f4ab07f
    • Rasmus Villemoes's avatar
      mm/mm_init.c: mark mminit_loglevel __meminitdata · 194e8151
      Rasmus Villemoes authored
      mminit_loglevel is only referenced from __init and __meminit functions, so
      we can mark it __meminitdata.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      194e8151
    • Rasmus Villemoes's avatar
      mm/mm_init.c: park mminit_verify_zonelist as __init · 0e2342c7
      Rasmus Villemoes authored
      The only caller of mminit_verify_zonelist is build_all_zonelists_init,
      which is annotated with __init, so it should be safe to also mark the
      former as __init, saving ~400 bytes of .text.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e2342c7
    • Rasmus Villemoes's avatar
      mm/page_alloc.c: pull out init code from build_all_zonelists · 061f67bc
      Rasmus Villemoes authored
      Pulling the code protected by if (system_state == SYSTEM_BOOTING) into
      its own helper allows us to shrink .text a little. This relies on
      build_all_zonelists already having a __ref annotation. Add a comment
      explaining why so one doesn't have to track it down through git log.
      
      The real saving comes in 3/5, ("mm/mm_init.c: Mark mminit_verify_zonelist
      as __init"), where we save about 400 bytes
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      061f67bc
    • Rasmus Villemoes's avatar
      mm/internal.h: don't split printk call in two · fc5199d1
      Rasmus Villemoes authored
      All users of mminit_dprintk pass a compile-time constant as level, so this
      just makes gcc emit a single printk call instead of two.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc5199d1
    • Kirill A. Shutemov's avatar
      mm: do not use mm->nr_pmds on !MMU configurations · 2d2f5119
      Kirill A. Shutemov authored
      mm->nr_pmds doesn't make sense on !MMU configurations
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d2f5119
    • Vladimir Davydov's avatar
      memcg: cleanup static keys decrement · f48b80a5
      Vladimir Davydov authored
      Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
      from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
      wrapper functions.
      
      Although this patch moves static keys decrement from __mem_cgroup_free to
      mem_cgroup_css_free, it does not introduce any functional changes, because
      the keys are incremented on setting the limit (tcp or kmem), which can
      only happen after successful mem_cgroup_css_online.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f48b80a5
    • Joonsoo Kim's avatar
      mm/compaction: stop the isolation when we isolate enough freepage · 932ff6bb
      Joonsoo Kim authored
      Currently, freepage isolation in one pageblock doesn't consider how many
      freepages we isolate. When I traced flow of compaction, compaction
      sometimes isolates more than 256 freepages to migrate just 32 pages.
      
      In this patch, freepage isolation is stopped at the point that we
      have more isolated freepage than isolated page for migration. This
      results in slowing down free page scanner and make compaction success
      rate higher.
      
      stress-highalloc test in mmtests with non movable order 7 allocation shows
      increase of compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      27.13 : 31.82
      
      pfn where both scanners meets on compaction complete
      (separate test due to enormous tracepoint buffer)
      (zone_start=4096, zone_end=1048576)
      586034 : 654378
      
      In fact, I didn't fully understand why this patch results in such good
      result. There was a guess that not used freepages are released to pcp list
      and on next compaction trial we won't isolate them again so compaction
      success rate would decrease. To prevent this effect, I tested with adding
      pcp drain code on release_freepages(), but, it has no good effect.
      
      Anyway, this patch reduces waste time to isolate unneeded freepages so
      seems reasonable.
      
      Vlastimil said:
      
      : I briefly tried it on top of the pivot-changing series and with order-9
      : allocations it reduced free page scanned counter by almost 10%.  No effect
      : on success rates (maybe because pivot changing already took care of the
      : scanners meeting problem) but the scanning reduction is good on its own.
      :
      : It also explains why e14c720e ("mm, compaction: remember position
      : within pageblock in free pages scanner") had less than expected
      : improvements.  It would only actually stop within pageblock in case of
      : async compaction detecting contention.  I guess that's also why the
      : infinite loop problem fixed by 1d5bfe1f affected so relatively few
      : people.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      932ff6bb
    • Joonsoo Kim's avatar
      mm/compaction: fix wrong order check in compact_finished() · 372549c2
      Joonsoo Kim authored
      What we want to check here is whether there is highorder freepage in buddy
      list of other migratetype in order to steal it without fragmentation.
      But, current code just checks cc->order which means allocation request
      order.  So, this is wrong.
      
      Without this fix, non-movable synchronous compaction below pageblock order
      would not stopped until compaction is complete, because migratetype of
      most pageblocks are movable and high order freepage made by compaction is
      usually on movable type buddy list.
      
      There is some report related to this bug. See below link.
      
        http://www.spinics.net/lists/linux-mm/msg81666.html
      
      Although the issued system still has load spike comes from compaction,
      this makes that system completely stable and responsive according to his
      report.
      
      stress-highalloc test in mmtests with non movable order 7 allocation
      doesn't show any notable difference in allocation success rate, but, it
      shows more compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      18.47 : 28.94
      
      Fixes: 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available")
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      372549c2
    • Vladimir Davydov's avatar
      slub: make dead caches discard free slabs immediately · d6e0b7fa
      Vladimir Davydov authored
      To speed up further allocations SLUB may store empty slabs in per cpu/node
      partial lists instead of freeing them immediately.  This prevents per
      memcg caches destruction, because kmem caches created for a memory cgroup
      are only destroyed after the last page charged to the cgroup is freed.
      
      To fix this issue, this patch resurrects approach first proposed in [1].
      It forbids SLUB to cache empty slabs after the memory cgroup that the
      cache belongs to was destroyed.  It is achieved by setting kmem_cache's
      cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
      that it would drop frozen empty slabs immediately if cpu_partial = 0.
      
      The runtime overhead is minimal.  From all the hot functions, we only
      touch relatively cold put_cpu_partial(): we make it call
      unfreeze_partials() after freezing a slab that belongs to an offline
      memory cgroup.  Since slab freezing exists to avoid moving slabs from/to a
      partial list on free/alloc, and there can't be allocations from dead
      caches, it shouldn't cause any overhead.  We do have to disable preemption
      for put_cpu_partial() to achieve that though.
      
      The original patch was accepted well and even merged to the mm tree.
      However, I decided to withdraw it due to changes happening to the memcg
      core at that time.  I had an idea of introducing per-memcg shrinkers for
      kmem caches, but now, as memcg has finally settled down, I do not see it
      as an option, because SLUB shrinker would be too costly to call since SLUB
      does not keep free slabs on a separate list.  Besides, we currently do not
      even call per-memcg shrinkers for offline memcgs.  Overall, it would
      introduce much more complexity to both SLUB and memcg than this small
      patch.
      
      Regarding to SLAB, there's no problem with it, because it shrinks
      per-cpu/node caches periodically.  Thanks to list_lru reparenting, we no
      longer keep entries for offline cgroups in per-memcg arrays (such as
      memcg_cache_params->memcg_caches), so we do not have to bother if a
      per-memcg cache will be shrunk a bit later than it could be.
      
      [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6e0b7fa
    • Vladimir Davydov's avatar
      slub: fix kmem_cache_shrink return value · ce3712d7
      Vladimir Davydov authored
      It is supposed to return 0 if the cache has no remaining objects and 1
      otherwise, while currently it always returns 0.  Fix it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce3712d7
    • Vladimir Davydov's avatar
      slub: never fail to shrink cache · 832f37f5
      Vladimir Davydov authored
      SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but
      also tries to rearrange the partial lists to place slabs filled up most to
      the head to cope with fragmentation.  To achieve that, it allocates a
      temporary array of lists used to sort slabs by the number of objects in
      use.  If the allocation fails, the whole procedure is aborted.
      
      This is unacceptable for the kernel memory accounting extension of the
      memory cgroup, where we want to make sure that kmem_cache_shrink()
      successfully discarded empty slabs.  Although the allocation failure is
      utterly unlikely with the current page allocator implementation, which
      retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not
      to rely on that.
      
      This patch therefore makes __kmem_cache_shrink() allocate the array on
      stack instead of calling kmalloc, which may fail.  The array size is
      chosen to be equal to 32, because most SLUB caches store not more than 32
      objects per slab page.  Slab pages with <= 32 free objects are sorted
      using the array by the number of objects in use and promoted to the head
      of the partial list, while slab pages with > 32 free objects are left in
      the end of the list without any ordering imposed on them.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      832f37f5
    • Vladimir Davydov's avatar
      memcg: reparent list_lrus and free kmemcg_id on css offline · 2788cf0c
      Vladimir Davydov authored
      Now, the only reason to keep kmemcg_id till css free is list_lru, which
      uses it to distribute elements between per-memcg lists.  However, it can
      be easily sorted out - we only need to change kmemcg_id of an offline
      cgroup to its parent's id, making further list_lru_add()'s add elements to
      the parent's list, and then move all elements from the offline cgroup's
      list to the one of its parent.  It will work, because a racing
      list_lru_del() does not need to know the list it is deleting the element
      from.  It can decrement the wrong nr_items counter though, but the ongoing
      reparenting will fix it.  After list_lru reparenting is done we are free
      to release kmemcg_id saving a valuable slot in a per-memcg array for new
      cgroups.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2788cf0c
    • Vladimir Davydov's avatar
      list_lru: add helpers to isolate items · 3f97b163
      Vladimir Davydov authored
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f97b163
    • Vladimir Davydov's avatar
      memcg: free memcg_caches slot on css offline · 2a4db7eb
      Vladimir Davydov authored
      We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
      on allocations, so there is no need to have the array entries set until
      css free - we can clear them on css offline.  This will allow us to reuse
      array entries more efficiently and avoid costly array relocations.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a4db7eb
    • Vladimir Davydov's avatar
      slab: use css id for naming per memcg caches · f1008365
      Vladimir Davydov authored
      Currently, we use mem_cgroup->kmemcg_id to guarantee kmem_cache->name
      uniqueness.  This is correct, because kmemcg_id is only released on css
      free after destroying all per memcg caches.
      
      However, I am going to change that and release kmemcg_id on css offline,
      because it is not wise to keep it for so long, wasting valuable entries of
      memcg_cache_params->memcg_caches arrays.  Therefore, to preserve cache
      name uniqueness, let us switch to css->id.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1008365
    • Vladimir Davydov's avatar
      cgroup: release css->id after css_free · 01e58659
      Vladimir Davydov authored
      Currently, we release css->id in css_release_work_fn, right before calling
      css_free callback, so that when css_free is called, the id may have
      already been reused for a new cgroup.
      
      I am going to use css->id to create unique names for per memcg kmem
      caches.  Since kmem caches are destroyed only on css_free, I need css->id
      to be freed after css_free was called to avoid name clashes.  This patch
      therefore moves css->id removal to css_free_work_fn.  To prevent
      css_from_id from returning a pointer to a stale css, it makes
      css_release_work_fn replace the css ptr at css_idr:css->id with NULL.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01e58659
    • Vladimir Davydov's avatar
      slab: link memcg caches of the same kind into a list · 426589f5
      Vladimir Davydov authored
      Sometimes, we need to iterate over all memcg copies of a particular root
      kmem cache.  Currently, we use memcg_cache_params->memcg_caches array for
      that, because it contains all existing memcg caches.
      
      However, it's a bad practice to keep all caches, including those that
      belong to offline cgroups, in this array, because it will be growing
      beyond any bounds then.  I'm going to wipe away dead caches from it to
      save space.  To still be able to perform iterations over all memcg caches
      of the same kind, let us link them into a list.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      426589f5
    • Vladimir Davydov's avatar
      slab: embed memcg_cache_params to kmem_cache · f7ce3190
      Vladimir Davydov authored
      Currently, kmem_cache stores a pointer to struct memcg_cache_params
      instead of embedding it.  The rationale is to save memory when kmem
      accounting is disabled.  However, the memcg_cache_params has shrivelled
      drastically since it was first introduced:
      
      * Initially:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct kmem_cache *memcg_caches[0];
      		struct {
      			struct mem_cgroup *memcg;
      			struct list_head list;
      			struct kmem_cache *root_cache;
      			bool dead;
      			atomic_t nr_pages;
      			struct work_struct destroy;
      		};
      	};
      };
      
      * Now:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct {
      			struct rcu_head rcu_head;
      			struct kmem_cache *memcg_caches[0];
      		};
      		struct {
      			struct mem_cgroup *memcg;
      			struct kmem_cache *root_cache;
      		};
      	};
      };
      
      So the memory saving does not seem to be a clear win anymore.
      
      OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
      it results in touching one more cache line on kmem alloc/free hot paths.
      Besides, it makes linking kmem caches in a list chained by a field of
      struct memcg_cache_params really painful due to a level of indirection,
      while I want to make them linked in the following patch.  That said, let
      us embed it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7ce3190
    • Vladimir Davydov's avatar
      fs: shrinker: always scan at least one object of each type · 49e7e7ff
      Vladimir Davydov authored
      In super_cache_scan() we divide the number of objects of particular type
      by the total number of objects in order to distribute pressure among As a
      result, in some corner cases we can get nr_to_scan=0 even if there are
      some objects to reclaim, e.g.  dentries=1, inodes=1, fs_objects=1,
      nr_to_scan=1/3=0.
      
      This is unacceptable for per memcg kmem accounting, because this means
      that some objects may never get reclaimed after memcg death, preventing it
      from being freed.
      
      This patch therefore assures that super_cache_scan() will scan at least
      one object of each type if any.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49e7e7ff