1. 30 May, 2018 2 commits
  2. 24 Apr, 2018 1 commit
    • ipc/shm: fix use-after-free of shm file via remap_file_pages() · 570ef10d
      Eric Biggers authored
      commit 3f05317d9889ab75c7190dcd39491d2a97921984 upstream.
      syzbot reported a use-after-free of shm_file_data(file)->file->f_op in
      shm_get_unmapped_area(), called via sys_remap_file_pages().
      Unfortunately it couldn't generate a reproducer, but I found a bug which
      I think caused it.  When remap_file_pages() is passed a full System V
      shared memory segment, the memory is first unmapped, then a new map is
      created using the ->vm_file.  Between these steps, the shm ID can be
      removed and reused for a new shm segment.  But, shm_mmap() only checks
      whether the ID is currently valid before calling the underlying file's
      ->mmap(); it doesn't check whether it was reused.  Thus it can use the
      wrong underlying file, one that was already freed.
      Fix this by making the "outer" shm file (the one that gets put in
      ->vm_file) hold a reference to the real shm file, and by making
      __shm_open() require that the file associated with the shm ID matches
      the one associated with the "outer" file.
      Taking the reference to the real shm file is needed to fully solve the
      problem, since otherwise sfd->file could point to a freed file, which
      then could be reallocated for the reused shm ID, causing the wrong shm
      segment to be mapped (and without the required permission checks).
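      The two-part check described above can be modelled in user space.  The
      sketch below is not the kernel code: struct seg_file, struct outer_file,
      id_table, outer_attach() and outer_reopen() are hypothetical names, and a
      small array stands in for the IPC ID table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real (inner) shm file, with a refcount. */
struct seg_file {
	int refcount;
};

/* Hypothetical stand-in for the "outer" file stored in ->vm_file. */
struct outer_file {
	int id;                 /* shm ID captured at attach time */
	struct seg_file *file;  /* pinned reference to the real file */
};

#define MAX_IDS 4
static struct seg_file *id_table[MAX_IDS]; /* ID -> current real file */

/* Attach: remember the ID and take a reference on the real file. */
static int outer_attach(struct outer_file *of, int id)
{
	assert(of != NULL);
	if (id < 0 || id >= MAX_IDS || id_table[id] == NULL)
		return -1;
	of->id = id;
	of->file = id_table[id];
	of->file->refcount++;
	return 0;
}

/*
 * Re-open: succeed only if the ID is still valid AND still refers to
 * the same real file.  Checking validity alone would accept a reused ID.
 */
static int outer_reopen(const struct outer_file *of)
{
	assert(of != NULL);
	if (id_table[of->id] == NULL)
		return -1;      /* ID was removed */
	if (id_table[of->id] != of->file)
		return -1;      /* ID was reused for a different segment */
	return 0;
}
```

      Because the outer file holds its own reference, the pointer comparison in
      outer_reopen() cannot be fooled by the old file being freed and its memory
      reallocated for the new segment.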
      Commit 1ac0b6de ("ipc/shm: handle removed segments gracefully in
      shm_mmap()") almost fixed this bug, but it didn't go far enough because
      it didn't consider the case where the shm ID is reused.
      The following program usually reproduces this bug:
      	#include <stdlib.h>
      	#include <sys/shm.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	int main()
      	{
      		int is_parent = (fork() != 0);
      		for (;;) {
      			int id = shmget(0xF00F, 4096, IPC_CREAT|0700);
      			if (is_parent) {
      				void *addr = shmat(id, NULL, 0);
      				usleep(rand() % 50);
      				while (!syscall(__NR_remap_file_pages, addr, 4096, 0, 0, 0));
      			} else {
      				usleep(rand() % 50);
      				shmctl(id, IPC_RMID, NULL);
      			}
      		}
      	}
      It causes the following NULL pointer dereference due to a 'struct file'
      being used while it's being freed.  (I couldn't actually get a KASAN
      use-after-free splat like in the syzbot report.  But I think it's
      possible with this bug; it would just take a more extraordinary race...)
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
      	PGD 0 P4D 0
      	Oops: 0000 [#1] SMP NOPTI
      	CPU: 9 PID: 258 Comm: syz_ipc Not tainted 4.16.0-05140-gf8cf2f16a7c95 #189
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
      	RIP: 0010:d_inode include/linux/dcache.h:519 [inline]
      	RIP: 0010:touch_atime+0x25/0xd0 fs/inode.c:1724
      	Call Trace:
      	 file_accessed include/linux/fs.h:2063 [inline]
      	 shmem_mmap+0x25/0x40 mm/shmem.c:2149
      	 call_mmap include/linux/fs.h:1789 [inline]
      	 shm_mmap+0x34/0x80 ipc/shm.c:465
      	 call_mmap include/linux/fs.h:1789 [inline]
      	 mmap_region+0x309/0x5b0 mm/mmap.c:1712
      	 do_mmap+0x294/0x4a0 mm/mmap.c:1483
      	 do_mmap_pgoff include/linux/mm.h:2235 [inline]
      	 SYSC_remap_file_pages mm/mmap.c:2853 [inline]
      	 SyS_remap_file_pages+0x232/0x310 mm/mmap.c:2769
      	 do_syscall_64+0x64/0x1a0 arch/x86/entry/common.c:287
      [ebiggers@google.com: add comment]
        Link: http://lkml.kernel.org/r/20180410192850.235835-1-ebiggers3@gmail.com
      Link: http://lkml.kernel.org/r/20180409043039.28915-1-ebiggers3@gmail.com
      Reported-by: syzbot+d11f321e7f1923157eac80aa990b446596f46439@syzkaller.appspotmail.com
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 08 Apr, 2018 1 commit
    • ipc/shm.c: add split function to shm_vm_ops · 86c8c892
      Mike Kravetz authored
      commit 3d942ee079b917b24e2a0c5f18d35ac8ec9fee48 upstream.
      If System V shmget/shmat operations are used to create a hugetlbfs
      backed mapping, it is possible to munmap part of the mapping and split
      the underlying vma such that it is not huge page aligned.  This will
      ultimately result in the following BUG:
        kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
        CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G         C  E 4.15.0-10-generic #11-Ubuntu
        NIP:  c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
        REGS: c000003fbcdcf810 TRAP: 0700   Tainted: G         C  E (4.15.0-10-generic)
        MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 20040000
        CFAR: c00000000036ee44 SOFTE: 1
        NIP __unmap_hugepage_range+0xa4/0x760
        LR __unmap_hugepage_range_final+0x28/0x50
        Call Trace:
          0x7115e4e00000 (unreliable)
        ---[ end trace ee88f958a1c62605 ]---
      This bug was introduced by commit 31383c68 ("mm, hugetlbfs:
      introduce ->split() to vm_operations_struct").  A split function was
      added to vm_operations_struct to determine if a mapping can be split.
      This was mostly for device-dax and hugetlbfs mappings which have
      specific alignment constraints.
      Mappings initiated via shmget/shmat have their original vm_ops
      overwritten with shm_vm_ops.  shm_vm_ops functions will call back to the
      original vm_ops if needed.  Add such a split function to shm_vm_ops.
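      The delegation pattern can be illustrated with a user-space sketch.  The
      types below are simplified, hypothetical stand-ins for the kernel's
      vm_area_struct and vm_operations_struct, and huge_split() is an invented
      example of an underlying ->split() that enforces 2 MiB alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel types. */
struct vma;
struct vm_ops {
	/* Return 0 if the mapping may be split at addr, nonzero otherwise. */
	int (*split)(struct vma *vma, unsigned long addr);
};
struct vma {
	const struct vm_ops *orig_ops; /* ops of the underlying file */
};

/* Wrapper: defer to the original ->split() if there is one. */
static int wrapper_split(struct vma *vma, unsigned long addr)
{
	assert(vma != NULL);
	if (vma->orig_ops && vma->orig_ops->split)
		return vma->orig_ops->split(vma, addr);
	return 0; /* no constraint: splitting is allowed */
}

/* Example original ->split(): refuse addresses not aligned to 2 MiB. */
#define HUGE_SIZE (2UL << 20)
static int huge_split(struct vma *vma, unsigned long addr)
{
	(void)vma;
	return (addr & (HUGE_SIZE - 1)) ? -1 : 0;
}
```

      Without the wrapper, the alignment check in the underlying ops would never
      run, which is how an unaligned split could slip through.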
      Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
      Fixes: 31383c68 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 31 Jan, 2018 1 commit
    • ipc: msg, make msgrcv work with LONG_MIN · 542cde0e
      Jiri Slaby authored
      commit 99989835 upstream.
      When LONG_MIN is passed to msgrcv, one would expect to receive any
      message.  But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is
      undefined.  In particular, with my gcc -LONG_MIN produces -LONG_MIN.
      So handle this case properly by assigning LONG_MAX to *msgtyp if
      LONG_MIN was specified as msgtyp to msgrcv.
      This code:
        long msg[] = { 100, 200 };
        int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644);
        msgsnd(m, &msg, sizeof(msg), 0);
        msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0);
      produces currently nothing:
        msgget(IPC_PRIVATE, IPC_CREAT|0644)     = 65538
        msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
        msgrcv(65538, ...
      Except a UBSAN warning:
        UBSAN: Undefined behaviour in ipc/msg.c:745:13
        negation of -9223372036854775808 cannot be represented in type 'long int':
      With the patch, I see what I expect:
        msgget(IPC_PRIVATE, IPC_CREAT|0644)     = 0
        msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
        msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16
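      The special case can be written as a small standalone helper.  This is a
      sketch of the idea rather than the patched convert_mode() itself;
      negate_msgtyp() is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: turn a negative msgtyp into the positive bound
 * used for "lowest type <= bound" matching, without ever evaluating
 * -LONG_MIN (signed overflow, undefined behaviour in C).
 */
static long negate_msgtyp(long msgtyp)
{
	assert(msgtyp < 0);
	if (msgtyp == LONG_MIN)
		return LONG_MAX; /* clamp: any message type matches */
	return -msgtyp;
}
```

      Clamping to LONG_MAX loses nothing in practice, since no message type can
      exceed LONG_MAX anyway.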
      Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 15 Jul, 2017 1 commit
  6. 12 Mar, 2017 1 commit
  7. 28 Oct, 2016 1 commit
  8. 11 Oct, 2016 6 commits
    • ipc/sem.c: add cond_resched in exit_sme · 2a1613a5
      Nikolay Borisov authored
      In a CONFIG_PREEMPT=n kernel, a softlockup was observed while executing
      the for loop in exit_sem.  Apparently it's possible for the loop to take
      quite a long time and it doesn't have a scheduling point in it.  Since the
      code is executing under an rcu read section this may also cause rcu stalls,
      which in turn block synchronize_rcu operations, which more or less
      de-stabilises the whole system.
      Fix this by introducing a cond_resched() at the beginning of the loop.
      So this patch fixes the following:
        NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
        CPU: 10 PID: 18119 Comm: httpd Tainted: G           O    4.4.20-clouder2 #6
        Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
        task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
        RIP: 0010:[<ffffffff81614bc7>]  [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30
        RSP: 0018:ffff881c95553e40  EFLAGS: 00000246
        RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
        RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
        RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
        R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
        R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
        FS:  0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
        Call Trace:
          ? exit_sem+0x7c/0x280
      Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Cc: Herton R. Krzesinski <herton@redhat.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/msg: avoid waking sender upon full queue · ed27f912
      Davidlohr Bueso authored
      Blocked tasks queued in q_senders waiting for their message to fit in the
      queue are blindly awoken every time we think there's a remote chance this
      might happen.  This could cause numerous (and expensive -- thundering
      herd-ish) bogus wakeups if the queue is still really full.  Adding to the
      scheduling cost/overhead, there's also the fact that we need to take the
      ipc object lock and requeue ourselves in the q_senders list.
      By keeping track of the blocked sender's message size, we can know
      in advance whether the wakeup ought to occur or not.  Otherwise, to maintain
      the current wakeup order we just move it to the tail.  This is exactly
      what occurs right now if the sender needs to go back to sleep.
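      A user-space model of the size check might look like the following.  The
      struct and function names are hypothetical, and the sketch leaves out the
      move-to-tail step and all locking; it only shows how recording msgsz lets
      the waker skip senders that would immediately go back to sleep.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a message queue's byte accounting. */
struct queue {
	size_t cbytes; /* bytes currently queued */
	size_t qbytes; /* capacity in bytes */
};

/* Hypothetical model of a blocked sender. */
struct sender {
	size_t msgsz; /* size of the message it is waiting to send */
	int woken;    /* set to 1 when the sender is selected for wakeup */
};

/* Does a message of msgsz bytes fit into the remaining space? */
static int msg_fits(const struct queue *q, size_t msgsz)
{
	return msgsz <= q->qbytes - q->cbytes;
}

/* Wake only senders whose message fits now; return how many were woken. */
static size_t wake_fitting_senders(const struct queue *q,
				   struct sender *s, size_t n)
{
	size_t woken = 0;

	assert(q != NULL && q->cbytes <= q->qbytes);
	for (size_t i = 0; i < n; i++) {
		if (!msg_fits(q, s[i].msgsz))
			continue; /* would only go back to sleep */
		s[i].woken = 1;
		woken++;
	}
	return woken;
}
```

      Each sender is tested against the current free space independently, so
      several woken senders may still compete for the same room; the model only
      filters out wakeups that are certain to be wasted.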
      The case of EIDRM is left completely untouched, as we need to wakeup all
      the tasks, and shouldn't be playing games in the first place.
      This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context
      switches (avg out of ten runs).  Although these tests are really about
      functionality (as opposed to performance), it does show the direct
      benefits of the optimization.
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/msg: make ss_wakeup() kill arg boolean · d0d6a2a9
      Davidlohr Bueso authored
      ... 'tis annoying.
      Link: http://lkml.kernel.org/r/1469748819-19484-4-git-send-email-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/msg: batch queue sender wakeups · e3658538
      Davidlohr Bueso authored
      Currently the use of wake_qs in sysv msg queues is only for the receiver
      tasks that are blocked on the queue.  But blocked sender tasks (due to
      queue size constraints) still are awoken with the ipc object lock held,
      which can be a problem particularly for small sized queues and far from
      gracious for -rt (just like it was for the receiver side).
      The paths that actually wakeup a sender are obviously related to when we
      are either getting rid of the queue or after (some) space is freed-up
      after a receiver takes the msg (msgrcv).  Furthermore, with the exception
      of msgrcv, we can always piggy-back on expunge_all that has its own tasks
      lined-up for waking.  Finally, upon unlinking the message, it should be no
      problem delaying the wakeups a bit until after we've released the lock.
      Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/msg: implement lockless pipelined wakeups · ee51636c
      Sebastian Andrzej Siewior authored
      This patch moves the wake_up_process() invocation so it is not done under
      the ipc global lock by making use of a lockless wake_q.  With this change,
      the waiter is woken up once the message has been assigned and it does not
      need to loop on SMP if the message points to NULL.  In the signal case we
      still need to check the pointer under the lock to verify the state.
      This change should also avoid the introduction of preempt_disable() in -RT
      which avoids a busy-loop which polls for the NULL -> !NULL change if the
      waiter has a higher priority compared to the waker.
      By making use of wake_qs, the logic of sysv msg queues is greatly
      simplified (and very well suited as we can batch lockless wakeups),
      particularly around the lockless receive algorithm.
      This has been tested with Manfred's pmsg-shared tool on a "AMD A10-7800
      Radeon R7, 12 Compute Cores 4C+8G":
      test             |   before   |   after    | diff
      pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 %
      pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 %
      pmsg-shared 2 60 | 22,884,224 | 24,278,200 | +  ~6.09 %
      Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/sem.c: fix complex_count vs. simple op race · 5864a2fd
      Manfred Spraul authored
      Commit 6d07b68c ("ipc/sem.c: optimize sem_lock()") introduced a
      race: sem_lock has a fast path that allows parallel simple operations.
      There are two reasons why a simple operation cannot run in parallel:
       - a non-simple operation is ongoing (sma->sem_perm.lock held)
       - a complex operation is sleeping (sma->complex_count != 0)
      As both facts are stored independently, a thread can bypass the current
      checks by sleeping in the right positions.  See below for more details
      (or kernel bugzilla 105651).
      The patch fixes that by creating one variable (complex_mode)
      that tracks both reasons why parallel operations are not possible.
      The patch also updates stale documentation regarding the locking.
      With regards to stable kernels:
      The patch is required for all kernels that include the
      commit 6d07b68c ("ipc/sem.c: optimize sem_lock()") (3.10?)
      The alternative is to revert the patch that introduced the race.
      The patch is safe for backporting, i.e. it makes no assumptions
      about memory barriers in spin_unlock_wait().
      Here is the race of the current implementation:
      Thread A: (simple op)
      - does the first "sma->complex_count == 0" test
      Thread B: (complex op)
      - does sem_lock(): This includes an array scan. But the scan can't
        find Thread A, because Thread A does not own sem->lock yet.
      - the thread does the operation, increases complex_count,
        drops sem_lock, sleeps
      Thread A:
      - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
      - sleeps before the complex_count test
      Thread C: (complex op)
      - does sem_lock (no array scan, complex_count==1)
      - wakes up Thread B.
      - decrements complex_count
      Thread A:
      - does the complex_count test
      Now both thread A and thread C operate on the same array, without
      any synchronization.
      Fixes: 6d07b68c ("ipc/sem.c: optimize sem_lock()")
      Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
      Reported-by: <felixh@informatik.uni-bremen.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <1vier1@web.de>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 28 Sep, 2016 1 commit
  10. 23 Sep, 2016 1 commit
  11. 22 Sep, 2016 1 commit
  12. 08 Aug, 2016 1 commit
  13. 02 Aug, 2016 2 commits
  14. 26 Jul, 2016 2 commits
  15. 23 Jun, 2016 4 commits
  16. 14 Jun, 2016 2 commits
  17. 24 May, 2016 1 commit
  18. 04 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      This promise never materialized, and it is unlikely that it ever will.
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      Let's stop pretending that pages in page cache are special.  They are
      not.
      The changes are pretty straight-forward:
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - page_cache_get() -> get_page();
       - page_cache_release() -> put_page();
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
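      virtual patch
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)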
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 22 Mar, 2016 1 commit
    • ipc/sem: make semctl setting sempid consistent · a5f4db87
      Davidlohr Bueso authored
      As indicated by bug#112271, Linux sets the sempid value upon semctl, and
      not only for semop calls.  However, within semctl we only do this for
      SETVAL, leaving SETALL without updating the field, and therefore rather
      inconsistent behavior when compared to other Unices.
      There is really no documentation regarding this and therefore users
      should not make assumptions.  With this patch, along with updating
      semctl.2 manpages, this scenario should become less ambiguous.  As such,
      set sempid on SETALL cmd.
      Also update some in-code documentation, specifying where the sempid is
      set.
      Passes ltp and custom testcase where a child (fork) does SETALL to the
      set.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Philip Semanchuk <linux_kernel.20.ick@spamgourmet.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Herton R. Krzesinski <herton@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 19 Feb, 2016 1 commit
  21. 23 Jan, 2016 1 commit
  22. 22 Jan, 2016 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  23. 21 Jan, 2016 1 commit
  24. 15 Jan, 2016 1 commit
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov authored
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 07 Nov, 2015 1 commit
    • ipc,msg: drop dst nil validation in copy_msg · 5f2a2d5d
      Davidlohr Bueso authored
      d0edd852 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the
      nil dst parameter check, originally being a full BUG_ON.  However, this
      check seems quite unnecessary when the only purpose is for
      checkpoint/restore (MSG_COPY flag):
      o The copy variable is set initially to nil, apparently as a way of
        ensuring that prepare_copy is previously called.  Which is in fact done,
        unconditionally at the beginning of do_msgrcv.
      o There is no concurrency with 'copy' (stack allocated in do_msgrcv).
      Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should
      always be handled by the IS_ERR() family.  Therefore remove this check altogether
      as it can never occur with the current users.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 30 Sep, 2015 1 commit
    • Initialize msg/shm IPC objects before doing ipc_addid() · b9a53227
      Linus Torvalds authored
      As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
      having initialized the IPC object state.  Yes, we initialize the IPC
      object in a locked state, but with all the lockless RCU lookup work,
      that IPC object lock no longer means that the state cannot be seen.
      We already did this for the IPC semaphore code (see commit e8577d1f:
      "ipc/sem.c: fully initialize sem_array before making it visible") but we
      clearly forgot about msg and shm.
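      The initialize-then-publish ordering can be shown with C11 atomics in user
      space.  This is only an analogy: the kernel uses its own ID table and RCU,
      whereas struct obj, slot, publish() and lookup() below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical IPC-like object. */
struct obj {
	int key;
	int mode;
};

/* Hypothetical one-slot "ID table" that lockless readers may inspect. */
static _Atomic(struct obj *) slot;

/* Fill in every field first, then publish with release semantics. */
static void publish(struct obj *o, int key, int mode)
{
	assert(o != NULL);
	o->key = key;
	o->mode = mode;
	atomic_store_explicit(&slot, o, memory_order_release);
}

/* Lockless lookup: the acquire load pairs with the release store above. */
static struct obj *lookup(void)
{
	return atomic_load_explicit(&slot, memory_order_acquire);
}
```

      Had publish() stored the pointer before writing the fields, a concurrent
      lookup() could return an object whose key and mode were still
      uninitialized, which is the kind of window the commit closes.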
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 10 Sep, 2015 1 commit
    • ipc: convert invalid scenarios to use WARN_ON · d0edd852
      Davidlohr Bueso authored
      Considering Linus' past rants about the (ab)use of BUG in the kernel, I
      took a look at how we deal with such calls in ipc.  Given that any errors
      or corruption in ipc code are most likely contained within the set of
      processes participating in the broken mechanisms, there aren't really many
      strong fatal system failure scenarios that would require a BUG call.
      Also, if something is seriously wrong, ipc might not be the place for such
      a BUG either.
      1. For example, recently, a customer hit one of these BUG_ONs in shm
         after failing shm_lock().  A busted ID imho does not merit a BUG_ON,
         and WARN would have been better.
      2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
         I don't see how we can hit this anyway -- at least it should be IS_ERR.
         The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
         first and foremost.  We could also probably drop this check altogether.
         Either way, it does not merit a BUG_ON.
      3. No ->fault() callback for the fs getting the corresponding page --
         seems selfish to make the system unusable.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 14 Aug, 2015 1 commit
    • ipc/sem.c: update/correct memory barriers · 3ed1f8a9
      Manfred Spraul authored
      sem_lock() did not properly pair memory barriers:
      !spin_is_locked() and spin_unlock_wait() are both only control barriers.
      The code needs an acquire barrier, otherwise the cpu might perform read
      operations before the lock test.
      As no primitive exists inside <include/spinlock.h> and since it seems
      no one wants another primitive, the code creates a local primitive within
      ipc/sem.c.
      With regards to -stable:
      The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
      nop (just a preprocessor redefinition to improve the readability).  The
      bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
      starting from 3.10).
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>