1. 11 Jun, 2019 27 commits
    • Kirill Smelkov's avatar
      fuse: Add FOPEN_STREAM to use stream_open() · c9696a8f
      Kirill Smelkov authored
      commit bbd84f33652f852ce5992d65db4d020aba21f882 upstream.
      
      Starting from commit 9c225f26 ("vfs: atomic f_pos accesses as per
      POSIX") files opened even via nonseekable_open gate read and write via lock
      and do not allow them to be run simultaneously. This can create read vs
      write deadlock if a filesystem is trying to implement a socket-like file
      which is intended to be simultaneously used for both read and write from
      filesystem client.  See commit 10dce8af3422 ("fs: stream_open - opener for
      stream-like files so that read and write can run simultaneously without
      deadlock") for details and e.g. commit 581d21a2 ("xenbus: fix deadlock
      on writes to /proc/xen/xenbus") for a similar deadlock example on
      /proc/xen/xenbus.
      
      To avoid such deadlock it was tempting to adjust fuse_finish_open to use
      stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
      but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
      and in particular GVFS which actually uses offset in its read and write
      handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
      so if we would do such a change it will break a real user.
      
      Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
      opened handler is having stream-like semantics; does not use file position
      and thus the kernel is free to issue simultaneous read and write request on
      opened file handle.
      
      This patch together with stream_open() should be added to stable kernels
      starting from v3.14+. This will allow to patch OSSPD and other FUSE
      filesystems that provide stream-like files to return FOPEN_STREAM |
      FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all
      kernel versions. This should work because fuse_finish_open ignores unknown
      open flags returned from a filesystem and so passing FOPEN_STREAM to a
      kernel that is not aware of this flag cannot hurt. In turn the kernel that
      is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE
      is sufficient to implement streams without read vs write deadlock.
      
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9696a8f
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · 3bf0c459
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      commit 10dce8af34226d90fa56746a934f8da5dcdba3df upstream.
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      [ backport to 4.4: actually fixed deadlock on /proc/xen/xenbus as 581d21a2 was not backported to 4.4 ]
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      3bf0c459
    • Miklos Szeredi's avatar
      fuse: fallocate: fix return with locked inode · 8061c23f
      Miklos Szeredi authored
      commit 35d6fcbb7c3e296a52136347346a698a35af3fda upstream.
      
      Do the proper cleanup in case the size check fails.
      
      Tested with xfstests:generic/228
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Fixes: 0cbade024ba5 ("fuse: honor RLIMIT_FSIZE in fuse_file_fallocate")
      Cc: Liu Bo <bo.liu@linux.alibaba.com>
      Cc: <stable@vger.kernel.org> # v3.5
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8061c23f
    • Oleg Nesterov's avatar
      userfaultfd: don't pin the user memory in userfaultfd_file_create() · 6ad730b8
      Oleg Nesterov authored
      commit d2005e3f upstream.
      
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
      after that can populate the orphaned mm more.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  Except handle_userfault() path doesn't
      need this, the caller must already have a reference.
      
      The patch adds the new trivial helper, mmget_not_zero(), it can have
      more users.
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.comSigned-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ad730b8
    • Roberto Bergantinos Corpas's avatar
      CIFS: cifs_read_allocate_pages: don't iterate through whole page array on ENOMEM · 336c1662
      Roberto Bergantinos Corpas authored
      commit 31fad7d41e73731f05b8053d17078638cf850fa6 upstream.
      
       In cifs_read_allocate_pages, in case of ENOMEM, we go through
      whole rdata->pages array but we have failed the allocation before
      nr_pages, therefore we may end up calling put_page with NULL
      pointer, causing oops
      Signed-off-by: default avatarRoberto Bergantinos Corpas <rbergant@redhat.com>
      Acked-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      336c1662
    • Filipe Manana's avatar
      Btrfs: fix race updating log root item during fsync · 494447b9
      Filipe Manana authored
      commit 06989c799f04810f6876900d4760c0edda369cf7 upstream.
      
      When syncing the log, the final phase of a fsync operation, we need to
      either create a log root's item or update the existing item in the log
      tree of log roots, and that depends on the current value of the log
      root's log_transid - if it's 1 we need to create the log root item,
      otherwise it must exist already and we update it. Since there is no
      synchronization between updating the log_transid and checking it for
      deciding whether the log root's item needs to be created or updated, we
      end up with a tiny race window that results in attempts to update the
      item to fail because the item was not yet created:
      
                    CPU 1                                    CPU 2
      
        btrfs_sync_log()
      
          lock root->log_mutex
      
          set log root's log_transid to 1
      
          unlock root->log_mutex
      
                                                     btrfs_sync_log()
      
                                                       lock root->log_mutex
      
                                                       sets log root's
                                                       log_transid to 2
      
                                                       unlock root->log_mutex
      
          update_log_root()
      
            sees log root's log_transid
            with a value of 2
      
              calls btrfs_update_root(),
              which fails with -EUCLEAN
              and causes transaction abort
      
      Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
      the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a
      root key") we just abort the current transaction.
      
      A sample trace of the BUG_ON() on a SLE12 kernel:
      
        ------------[ cut here ]------------
        kernel BUG at ../fs/btrfs/root-tree.c:157!
        Oops: Exception in kernel mode, sig: 5 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        (...)
        Supported: Yes, External
        CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
        task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
        NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
        REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
        MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
        CFAR: d000000036aba66c SOFTE: 1
        GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
        GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
        GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
        GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
        GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
        GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
        GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
        GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
        NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
        LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
        Call Trace:
        [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
        [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
        [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
        [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
        [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
        [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
        [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
        Instruction dump:
        7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
        e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
        ---[ end trace 8f2dc8f919cabab8 ]---
      
      So fix this by doing the check of log_transid and updating or creating the
      log root's item while holding the root's log_mutex.
      
      Fixes: 7237f183 ("Btrfs: fix tree logs parallel sync")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      494447b9
    • Chengguang Xu's avatar
      chardev: add additional check for minor range overlap · cb7872f1
      Chengguang Xu authored
      [ Upstream commit de36e16d1557a0b6eb328bc3516359a12ba5c25c ]
      
      Current overlap checking cannot correctly handle
      a case which is baseminor < existing baseminor &&
      baseminor + minorct > existing baseminor + minorct.
      Signed-off-by: default avatarChengguang Xu <cgxu519@gmx.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      cb7872f1
    • Ross Lagerwall's avatar
      gfs2: Fix lru_count going negative · 6948c6bc
      Ross Lagerwall authored
      [ Upstream commit 7881ef3f33bb80f459ea6020d1e021fc524a6348 ]
      
      Under certain conditions, lru_count may drop below zero resulting in
      a large amount of log spam like this:
      
      vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
          negative objects to delete nr=-1
      
      This happens as follows:
      1) A glock is moved from lru_list to the dispose list and lru_count is
         decremented.
      2) The dispose function calls cond_resched() and drops the lru lock.
      3) Another thread takes the lru lock and tries to add the same glock to
         lru_list, checking if the glock is on an lru list.
      4) It is on a list (actually the dispose list) and so it avoids
         incrementing lru_count.
      5) The glock is moved to lru_list.
      5) The original thread doesn't dispose it because it has been re-added
         to the lru list but the lru_count has still decreased by one.
      
      Fix by checking if the LRU flag is set on the glock rather than checking
      if the glock is on some list and rearrange the code so that the LRU flag
      is added/removed precisely when the glock is added/removed from lru_list.
      Signed-off-by: default avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      6948c6bc
    • Mike Kravetz's avatar
      hugetlb: use same fault hash key for shared and private mappings · bf8474c6
      Mike Kravetz authored
      commit 1b426bac66e6cc83c9f2d92b96e4e72acf43419a upstream.
      
      hugetlb uses a fault mutex hash table to prevent page faults of the
      same pages concurrently.  The key for shared and private mappings is
      different.  Shared keys off address_space and file index.  Private keys
      off mm and virtual address.  Consider a private mappings of a populated
      hugetlbfs file.  A fault will map the page from the file and if needed
      do a COW to map a writable page.
      
      Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
      pages.  It uses the address_space file index key.  However, private
      mappings will use a different key and could race with this code to map
      the file page.  This causes problems (BUG) for the page cache remove
      code as it expects the page to be unmapped.  A sample stack is:
      
      page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
      kernel BUG at mm/filemap.c:169!
      ...
      RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
      ...
      Call Trace:
      __delete_from_page_cache+0x39/0x220
      delete_from_page_cache+0x45/0x70
      remove_inode_hugepages+0x13c/0x380
      ? __add_to_page_cache_locked+0x162/0x380
      hugetlbfs_fallocate+0x403/0x540
      ? _cond_resched+0x15/0x30
      ? __inode_security_revalidate+0x5d/0x70
      ? selinux_file_permission+0x100/0x130
      vfs_fallocate+0x13f/0x270
      ksys_fallocate+0x3c/0x80
      __x64_sys_fallocate+0x1a/0x20
      do_syscall_64+0x5b/0x180
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      There seems to be another potential COW issue/race with this approach
      of different private and shared keys as noted in commit 8382d914
      ("mm, hugetlb: improve page-fault scalability").
      
      Since every hugetlb mapping (even anon and private) is actually a file
      mapping, just use the address_space index key for all mappings.  This
      results in potentially more hash collisions.  However, this should not
      be the common case.
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      bf8474c6
    • Tobin C. Harding's avatar
      btrfs: sysfs: don't leak memory when failing add fsid · 5c9a2039
      Tobin C. Harding authored
      commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream.
      
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c9a2039
    • Filipe Manana's avatar
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 0fa88718
      Filipe Manana authored
      commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream.
      
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0fa88718
    • Andreas Gruenbacher's avatar
      gfs2: Fix sign extension bug in gfs2_update_stats · 2f5ac0bd
      Andreas Gruenbacher authored
      commit 5a5ec83d6ac974b12085cd99b196795f14079037 upstream.
      
      Commit 4d207133 changed the types of the statistic values in struct
      gfs2_lkstats from s64 to u64.  Because of that, what should be a signed
      value in gfs2_update_stats turned into an unsigned value.  When shifted
      right, we end up with a large positive value instead of a small negative
      value, which results in an incorrect variance estimate.
      
      Fixes: 4d207133 ("gfs2: Make statistics unsigned, suitable for use with do_div()")
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Cc: stable@vger.kernel.org # v4.4+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f5ac0bd
    • Jan Kara's avatar
      ext4: do not delete unlinked inode from orphan list on failed truncate · 75d63b13
      Jan Kara authored
      commit ee0ed02ca93ef1ecf8963ad96638795d55af2c14 upstream.
      
      It is possible that unlinked inode enters ext4_setattr() (e.g. if
      somebody calls ftruncate(2) on unlinked but still open file). In such
      case we should not delete the inode from the orphan list if truncate
      fails. Note that this is mostly a theoretical concern as filesystem is
      corrupted if we reach this path anyway but let's be consistent in our
      orphan handling.
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      75d63b13
    • Nikolay Borisov's avatar
      btrfs: Honour FITRIM range constraints during free space trim · 7d64186e
      Nikolay Borisov authored
      commit c2d1b3aae33605a61cbab445d8ae1c708ccd2698 upstream.
      
      Up until now trimming the freespace was done irrespective of what the
      arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
      will be entirely ignored. Fix it by correctly handling those paramter.
      This requires breaking if the found freespace extent is after the end of
      the passed range as well as completing trim after trimming
      fstrim_range::len bytes.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      7d64186e
    • Al Viro's avatar
      ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour · 66ee750c
      Al Viro authored
      [ Upstream commit 4e9036042fedaffcd868d7f7aa948756c48c637d ]
      
      To choose whether to pick the GID from the old (16bit) or new (32bit)
      field, we should check if the old gid field is set to 0xffff.  Mainline
      checks the old *UID* field instead - cut'n'paste from the corresponding
      code in ufs_get_inode_uid().
      
      Fixes: 252e211eSigned-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      66ee750c
    • Jeff Layton's avatar
      ceph: flush dirty inodes before proceeding with remount · a7929c94
      Jeff Layton authored
      commit 00abf69dd24f4444d185982379c5cc3bb7b6d1fc upstream.
      
      xfstest generic/452 was triggering a "Busy inodes after umount" warning.
      ceph was allowing the mount to go read-only without first flushing out
      dirty inodes in the cache. Ensure we sync out the filesystem before
      allowing a remount to proceed.
      
      Cc: stable@vger.kernel.org
      Link: http://tracker.ceph.com/issues/39571Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatar"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7929c94
    • Liu Bo's avatar
      fuse: honor RLIMIT_FSIZE in fuse_file_fallocate · 40857ab7
      Liu Bo authored
      commit 0cbade024ba501313da3b7e5dd2a188a6bc491b5 upstream.
      
      fstests generic/228 reported this failure that fuse fallocate does not
      honor what 'ulimit -f' has set.
      
      This adds the necessary inode_newsize_ok() check.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Fixes: 05ba1f08 ("fuse: add FALLOCATE operation")
      Cc: <stable@vger.kernel.org> # v3.5
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      40857ab7
    • Miklos Szeredi's avatar
      fuse: fix writepages on 32bit · 73724958
      Miklos Szeredi authored
      commit 9de5be06d0a89ca97b5ab902694d42dfd2bb77d2 upstream.
      
      Writepage requests were cropped to i_size & 0xffffffff, which meant that
      mmaped writes to any file larger than 4G might be silently discarded.
      
      Fix by storing the file size in a properly sized variable (loff_t instead
      of size_t).
      Reported-by: default avatarAntonio SJ Musumeci <trapexit@spawn.link>
      Fixes: 6eaf4782 ("fuse: writepages: crop secondary requests")
      Cc: <stable@vger.kernel.org> # v3.13
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      73724958
    • ZhangXiaoxu's avatar
      NFS4: Fix v4.0 client state corruption when mount · 4676a07a
      ZhangXiaoxu authored
      commit f02f3755dbd14fb935d24b14650fff9ba92243b8 upstream.
      
      stat command with soft mount never return after server is stopped.
      
      When alloc a new client, the state of the client will be set to
      NFS4CLNT_LEASE_EXPIRED.
      
      When the server is stopped, the state manager will work, and accord
      the state to recover. But the state is NFS4CLNT_LEASE_EXPIRED, it
      will drain the slot table and lead other task to wait queue, until
      the client recovered. Then the stat command is hung.
      
      When discover server trunking, the client will renew the lease,
      but check the client state, it lead the client state corruption.
      
      So, we need to call state manager to recover it when detect server
      ip trunking.
      Signed-off-by: default avatarZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4676a07a
    • Christoph Probst's avatar
      cifs: fix strcat buffer overflow and reduce raciness in smb21_set_oplock_level() · dffc9e5f
      Christoph Probst authored
      commit 6a54b2e002c9d00b398d35724c79f9fe0d9b38fb upstream.
      
      Change strcat to strncpy in the "None" case to fix a buffer overflow
      when cinode->oplock is reset to 0 by another thread accessing the same
      cinode. It is never valid to append "None" to any other message.
      
      Consolidate multiple writes to cinode->oplock to reduce raciness.
      Signed-off-by: default avatarChristoph Probst <kernel@probst.it>
      Reviewed-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dffc9e5f
    • Sriram Rajagopalan's avatar
      ext4: zero out the unused memory region in the extent tree block · 98529ecd
      Sriram Rajagopalan authored
      commit 592acbf16821288ecdc4192c47e3774a4c48bb64 upstream.
      
      This commit zeroes out the unused memory region in the buffer_head
      corresponding to the extent metablock after writing the extent header
      and the corresponding extent node entries.
      
      This is done to prevent random uninitialized data from getting into
      the filesystem when the extent block is synced.
      
      This fixes CVE-2019-11833.
      Signed-off-by: default avatarSriram Rajagopalan <sriramr@arista.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      98529ecd
    • Jiufei Xue's avatar
      fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount · 9ff6372e
      Jiufei Xue authored
      commit ec084de929e419e51bcdafaafe567d9e7d0273b7 upstream.
      
      synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb
      switch may not go to the workqueue after synchronize_rcu().  Thus
      previous scheduled switches was not finished even flushing the
      workqueue, which will cause a NULL pointer dereferenced followed below.
      
        VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds.  Have a nice day...
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
          evict+0xb3/0x180
          iput+0x1b0/0x230
          inode_switch_wbs_work_fn+0x3c0/0x6a0
          worker_thread+0x4e/0x490
          ? process_one_work+0x410/0x410
          kthread+0xe6/0x100
          ret_from_fork+0x39/0x50
      
      Replace the synchronize_rcu() call with a rcu_barrier() to wait for all
      pending callbacks to finish.  And inc isw_nr_in_flight after call_rcu()
      in inode_switch_wbs() to make more sense.
      
      Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.comSigned-off-by: default avatarJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ff6372e
    • Tejun Heo's avatar
      writeback: synchronize sync(2) against cgroup writeback membership switches · bfce20ea
      Tejun Heo authored
      commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 upstream.
      
      sync_inodes_sb() can race against cgwb (cgroup writeback) membership
      switches and fail to writeback some inodes.  For example, if an inode
      switches to another wb while sync_inodes_sb() is in progress, the new
      wb might not be visible to bdi_split_work_to_wbs() at all or the inode
      might jump from a wb which hasn't issued writebacks yet to one which
      already has.
      
      This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
      switch path against sync_inodes_sb() so that sync_inodes_sb() is
      guaranteed to see all the target wbs and inodes can't jump wbs to
      escape syncing.
      
      v2: Fixed misplaced rwsem init.  Spotted by Jiufei.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJiufei Xue <xuejiufei@gmail.com>
      Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.comAcked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bfce20ea
    • Filipe Manana's avatar
      Btrfs: do not start a transaction at iterate_extent_inodes() · 686e4352
      Filipe Manana authored
      commit bfc61c36260ca990937539cd648ede3cd749bc10 upstream.
      
      When finding out which inodes have references on a particular extent, done
      by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
      v1 and v2) ioctl and from scrub we use the transaction join API to grab a
      reference on the currently running transaction, since in order to give
      accurate results we need to inspect the delayed references of the currently
      running transaction.
      
      However, if there is currently no running transaction, the join operation
      will create a new transaction. This is inefficient as the transaction will
      eventually be committed, doing unnecessary IO and introducing a potential
      point of failure that will lead to a transaction abort due to -ENOSPC, as
      recently reported [1].
      
      That's because the join, creates the transaction but does not reserve any
      space, so when attempting to update the root item of the root passed to
      btrfs_join_transaction(), during the transaction commit, we can end up
      failling with -ENOSPC. Users of a join operation are supposed to actually
      do some filesystem changes and reserve space by some means, which is not
      the case of iterate_extent_inodes(), it is a read-only operation for all
      contextes from which it is called.
      
      The reported [1] -ENOSPC failure stack trace is the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      So fix that by using the attach API, which does not create a transaction
      when there is currently no running transaction.
      
      [1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      686e4352
    • Debabrata Banerjee's avatar
      ext4: fix ext4_show_options for file systems w/o journal · b268b6e5
      Debabrata Banerjee authored
      commit 50b29d8f033a7c88c5bc011abc2068b1691ab755 upstream.
      
      Instead of removing EXT4_MOUNT_JOURNAL_CHECKSUM from s_def_mount_opt as
      I assume was intended, all other options were blown away leading to
      _ext4_show_options() output being incorrect.
      
      Fixes: 1e381f60 ("ext4: do not allow journal_opts for fs w/o journal")
      Signed-off-by: default avatarDebabrata Banerjee <dbanerje@akamai.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b268b6e5
    • Kirill Tkhai's avatar
      ext4: actually request zeroing of inode table after grow · f3b9c26f
      Kirill Tkhai authored
      commit 310a997fd74de778b9a4848a64be9cda9f18764a upstream.
      
      It is never possible, that number of block groups decreases,
      since only online grow is supported.
      
      But after a growing occured, we have to zero inode tables
      for just created new block groups.
      
      Fixes: 19c5246d ("ext4: add new online resize interface")
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f3b9c26f
    • Shuning Zhang's avatar
      ocfs2: fix ocfs2 read inode data panic in ocfs2_iget · e3a74fbc
      Shuning Zhang authored
      commit e091eab028f9253eac5c04f9141bbc9d170acab3 upstream.
      
      In some cases, ocfs2_iget() reads the data of inode, which has been
      deleted for some reason.  That will make the system panic.  So We should
      judge whether this inode has been deleted, and tell the caller that the
      inode is a bad inode.
      
      For example, the ocfs2 is used as the backed of nfs, and the client is
      nfsv3.  This issue can be reproduced by the following steps.
      
      on the nfs server side,
      ..../patha/pathb
      
      Step 1: The process A was scheduled before calling the function fh_verify.
      
      Step 2: The process B is removing the 'pathb', and just completed the call
      to function dput.  Then the dentry of 'pathb' has been deleted from the
      dcache, and all ancestors have been deleted also.  The relationship of
      dentry and inode was deleted through the function hlist_del_init.  The
      following is the call stack.
      dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
      
      At this time, the inode is still in the dcache.
      
      Step 3: The process A call the function ocfs2_get_dentry, which get the
      inode from dcache.  Then the refcount of inode is 1.  The following is the
      call stack.
      nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
      
      Step 4: Dirty pages are flushed by bdi threads.  So the inode of 'patha'
      is evicted, and this directory was deleted.  But the inode of 'pathb'
      can't be evicted, because the refcount of the inode was 1.
      
      Step 5: The process A keep running, and call the function
      reconnect_path(in exportfs_decode_fh), which call function
      ocfs2_get_parent of ocfs2.  Get the block number of parent
      directory(patha) by the name of ...  Then read the data from disk by the
      block number.  But this inode has been deleted, so the system panic.
      
      Process A                                             Process B
      1. in nfsd3_proc_getacl                   |
      2.                                        |        dput
      3. fh_to_dentry(ocfs2_get_dentry)         |
      4. bdi flush dirty cache                  |
      5. ocfs2_iget                             |
      
      [283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
      Invalid dinode #580640: OCFS2_VALID_FL not set
      
      [283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
      after error
      
      [283465.546889] CPU: 5 PID: 12416 Comm: nfsd Tainted: G        W
      4.1.12-124.18.6.el6uek.bug28762940v3.x86_64 #2
      [283465.548382] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
      Desktop Reference Platform, BIOS 6.00 09/21/2015
      [283465.549657]  0000000000000000 ffff8800a56fb7b8 ffffffff816e839c
      ffffffffa0514758
      [283465.550392]  000000000008dc20 ffff8800a56fb838 ffffffff816e62d3
      0000000000000008
      [283465.551056]  ffff880000000010 ffff8800a56fb848 ffff8800a56fb7e8
      ffff88005df9f000
      [283465.551710] Call Trace:
      [283465.552516]  [<ffffffff816e839c>] dump_stack+0x63/0x81
      [283465.553291]  [<ffffffff816e62d3>] panic+0xcb/0x21b
      [283465.554037]  [<ffffffffa04e66b0>] ocfs2_handle_error+0xf0/0xf0 [ocfs2]
      [283465.554882]  [<ffffffffa04e7737>] __ocfs2_error+0x67/0x70 [ocfs2]
      [283465.555768]  [<ffffffffa049c0f9>] ocfs2_validate_inode_block+0x229/0x230
      [ocfs2]
      [283465.556683]  [<ffffffffa047bcbc>] ocfs2_read_blocks+0x46c/0x7b0 [ocfs2]
      [283465.557408]  [<ffffffffa049bed0>] ? ocfs2_inode_cache_io_unlock+0x20/0x20
      [ocfs2]
      [283465.557973]  [<ffffffffa049f0eb>] ocfs2_read_inode_block_full+0x3b/0x60
      [ocfs2]
      [283465.558525]  [<ffffffffa049f5ba>] ocfs2_iget+0x4aa/0x880 [ocfs2]
      [283465.559082]  [<ffffffffa049146e>] ocfs2_get_parent+0x9e/0x220 [ocfs2]
      [283465.559622]  [<ffffffff81297c05>] reconnect_path+0xb5/0x300
      [283465.560156]  [<ffffffff81297f46>] exportfs_decode_fh+0xf6/0x2b0
      [283465.560708]  [<ffffffffa062faf0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
      [283465.561262]  [<ffffffff810a8196>] ? prepare_creds+0x26/0x110
      [283465.561932]  [<ffffffffa0630860>] fh_verify+0x350/0x660 [nfsd]
      [283465.562862]  [<ffffffffa0637804>] ? nfsd_cache_lookup+0x44/0x630 [nfsd]
      [283465.563697]  [<ffffffffa063a8b9>] nfsd3_proc_getattr+0x69/0xf0 [nfsd]
      [283465.564510]  [<ffffffffa062cf60>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [283465.565358]  [<ffffffffa05eb892>] ? svc_tcp_adjust_wspace+0x12/0x30
      [sunrpc]
      [283465.566272]  [<ffffffffa05ea652>] svc_process_common+0x412/0x6a0 [sunrpc]
      [283465.567155]  [<ffffffffa05eaa03>] svc_process+0x123/0x210 [sunrpc]
      [283465.568020]  [<ffffffffa062c90f>] nfsd+0xff/0x170 [nfsd]
      [283465.568962]  [<ffffffffa062c810>] ? nfsd_destroy+0x80/0x80 [nfsd]
      [283465.570112]  [<ffffffff810a622b>] kthread+0xcb/0xf0
      [283465.571099]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
      [283465.572114]  [<ffffffff816f11b8>] ret_from_fork+0x58/0x90
      [283465.573156]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
      
      Link: http://lkml.kernel.org/r/1554185919-3010-1-git-send-email-sunny.s.zhang@oracle.comSigned-off-by: default avatarShuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: piaojun <piaojun@huawei.com>
      Cc: "Gang He" <ghe@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3a74fbc
  2. 16 May, 2019 10 commits
    • Mike Kravetz's avatar
      hugetlbfs: fix memory leak for resv_map · 66c57ab1
      Mike Kravetz authored
      [ Upstream commit 58b6e5e8f1addd44583d61b0a03c0f5519527e35 ]
      
      When mknod is used to create a block special file in hugetlbfs, it will
      allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
      inode->i_mapping->private_data will point the newly allocated resv_map.
      However, when the device special file is opened bd_acquire() will set
      inode->i_mapping to bd_inode->i_mapping.  Thus the pointer to the
      allocated resv_map is lost and the structure is leaked.
      
      Programs to reproduce:
              mount -t hugetlbfs nodev hugetlbfs
              mknod hugetlbfs/dev b 0 0
              exec 30<> hugetlbfs/dev
              umount hugetlbfs/
      
      resv_map structures are only needed for inodes which can have associated
      page allocations.  To fix the leak, only allocate resv_map for those
      inodes which could possibly be associated with page allocations.
      
      Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reported-by: default avatarYufen Yu <yuyufen@huawei.com>
      Suggested-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      66c57ab1
    • Al Viro's avatar
      debugfs: fix use-after-free on symlink traversal · 02395682
      Al Viro authored
      [ Upstream commit 93b919da64c15b90953f96a536e5e61df896ca57 ]
      
      symlink body shouldn't be freed without an RCU delay.  Switch debugfs to
      ->destroy_inode() and use of call_rcu(); free both the inode and symlink
      body in the callback.  Similar to solution for bpf, only here it's even
      more obvious that ->evict_inode() can be dropped.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      02395682
    • Al Viro's avatar
      jffs2: fix use-after-free on symlink traversal · 90a015d4
      Al Viro authored
      [ Upstream commit 4fdcfab5b5537c21891e22e65996d4d0dd8ab4ca ]
      
      free the symlink body after the same RCU delay we have for freeing the
      struct inode itself, so that traversal during RCU pathwalk wouldn't step
      into freed memory.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      90a015d4
    • Al Viro's avatar
      ceph: fix use-after-free on symlink traversal · cd2bdca3
      Al Viro authored
      [ Upstream commit daf5cc27eed99afdea8d96e71b89ba41f5406ef6 ]
      
      free the symlink body after the same RCU delay we have for freeing the
      struct inode itself, so that traversal during RCU pathwalk wouldn't step
      into freed memory.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarSasha Levin (Microsoft) <sashal@kernel.org>
      cd2bdca3
    • Tetsuo Handa's avatar
      NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family. · cec54a8e
      Tetsuo Handa authored
      commit 7c2bd9a39845bfb6d72ddb55ce737650271f6f96 upstream.
      
      syzbot is reporting uninitialized value at rpc_sockaddr2uaddr() [1]. This
      is because syzbot is setting AF_INET6 to "struct sockaddr_in"->sin_family
      (which is embedded into user-visible "struct nfs_mount_data" structure)
      despite nfs23_validate_mount_data() cannot pass sizeof(struct sockaddr_in6)
      bytes of AF_INET6 address to rpc_sockaddr2uaddr().
      
      Since "struct nfs_mount_data" structure is user-visible, we can't change
      "struct nfs_mount_data" to use "struct sockaddr_storage". Therefore,
      assuming that everybody is using AF_INET family when passing address via
      "struct nfs_mount_data"->addr, reject if its sin_family is not AF_INET.
      
      [1] https://syzkaller.appspot.com/bug?id=599993614e7cbbf66bc2656a919ab2a95fb5d75cReported-by: default avatarsyzbot <syzbot+047a11c361b872896a4f@syzkaller.appspotmail.com>
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cec54a8e
    • YueHaibing's avatar
      fs/proc/proc_sysctl.c: Fix a NULL pointer dereference · 76c279c7
      YueHaibing authored
      commit 89189557b47b35683a27c80ee78aef18248eefb4 upstream.
      
      Syzkaller report this:
      
        sysctl could not get directory: /net//bridge -12
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 7027 Comm: syz-executor.0 Tainted: G         C        5.1.0-rc3+ #8
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
        RIP: 0010:__write_once_size include/linux/compiler.h:220 [inline]
        RIP: 0010:__rb_change_child include/linux/rbtree_augmented.h:144 [inline]
        RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:186 [inline]
        RIP: 0010:rb_erase+0x5f4/0x19f0 lib/rbtree.c:459
        Code: 00 0f 85 60 13 00 00 48 89 1a 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 75 0c 00 00 4d 85 ed 4c 89 2e 74 ce 4c 89 ea 48
        RSP: 0018:ffff8881bb507778 EFLAGS: 00010206
        RAX: dffffc0000000000 RBX: ffff8881f224b5b8 RCX: ffffffff818f3f6a
        RDX: 000000000000000a RSI: 0000000000000050 RDI: ffff8881f224b568
        RBP: 0000000000000000 R08: ffffed10376a0ef4 R09: ffffed10376a0ef4
        R10: 0000000000000001 R11: ffffed10376a0ef4 R12: ffff8881f224b558
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        FS:  00007f3e7ce13700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fd60fbe9398 CR3: 00000001cb55c001 CR4: 00000000007606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         erase_entry fs/proc/proc_sysctl.c:178 [inline]
         erase_header+0xe3/0x160 fs/proc/proc_sysctl.c:207
         start_unregistering fs/proc/proc_sysctl.c:331 [inline]
         drop_sysctl_table+0x558/0x880 fs/proc/proc_sysctl.c:1631
         get_subdir fs/proc/proc_sysctl.c:1022 [inline]
         __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
         br_netfilter_init+0x68/0x1000 [br_netfilter]
         do_one_initcall+0xbc/0x47d init/main.c:901
         do_init_module+0x1b5/0x547 kernel/module.c:3456
         load_module+0x6405/0x8c10 kernel/module.c:3804
         __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
         do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        Modules linked in: br_netfilter(+) backlight comedi(C) hid_sensor_hub max3100 ti_ads8688 udc_core fddi snd_mona leds_gpio rc_streamzap mtd pata_netcell nf_log_common rc_winfast udp_tunnel snd_usbmidi_lib snd_usb_toneport snd_usb_line6 snd_rawmidi snd_seq_device snd_hwdep videobuf2_v4l2 videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops rc_gadmei_rm008z 8250_of smm665 hid_tmff hid_saitek hwmon_vid rc_ati_tv_wonder_hd_600 rc_core pata_pdc202xx_old dn_rtmsg as3722 ad714x_i2c ad714x snd_soc_cs4265 hid_kensington panel_ilitek_ili9322 drm drm_panel_orientation_quirks ipack cdc_phonet usbcore phonet hid_jabra hid extcon_arizona can_dev industrialio_triggered_buffer kfifo_buf industrialio adm1031 i2c_mux_ltc4306 i2c_mux ipmi_msghandler mlxsw_core snd_soc_cs35l34 snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer ac97_bus snd_compress snd soundcore gpio_da9055 uio ecdh_generic mdio_thunder of_mdio fixed_phy libphy mdio_cavium iptable_security iptable_raw iptable_mangle
         iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ide_pci_generic piix aes_x86_64 crypto_simd cryptd ide_core glue_helper input_leds psmouse intel_agp intel_gtt serio_raw ata_generic i2c_piix4 agpgart pata_acpi parport_pc parport floppy rtc_cmos sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: br_netfilter]
        Dumping ftrace buffer:
           (ftrace buffer empty)
        ---[ end trace 68741688d5fbfe85 ]---
      
      commit 23da9588037e ("fs/proc/proc_sysctl.c: fix NULL pointer
      dereference in put_links") forgot to handle start_unregistering() case,
      while header->parent is NULL, it calls erase_header() and as seen in the
      above syzkaller call trace, accessing &header->parent->root will trigger
      a NULL pointer dereference.
      
      As that commit explained, there is also no need to call
      start_unregistering() if header->parent is NULL.
      
      Link: http://lkml.kernel.org/r/20190409153622.28112-1-yuehaibing@huawei.com
      Fixes: 23da9588037e ("fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links")
      Fixes: 0e47c99d ("sysctl: Replace root_list with links between sysctl_table_sets")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      76c279c7
    • Trond Myklebust's avatar
      nfsd: Don't release the callback slot unless it was actually held · 498e9066
      Trond Myklebust authored
      commit e6abc8caa6deb14be2a206253f7e1c5e37e9515b upstream.
      
      If there are multiple callbacks queued, waiting for the callback
      slot when the callback gets shut down, then they all currently
      end up acting as if they hold the slot, and call
      nfsd4_cb_sequence_done() resulting in interesting side-effects.
      
      In addition, the 'retry_nowait' path in nfsd4_cb_sequence_done()
      causes a loop back to nfsd4_cb_prepare() without first freeing the
      slot, which causes a deadlock when nfsd41_cb_get_slot() gets called
      a second time.
      
      This patch therefore adds a boolean to track whether or not the
      callback did pick up the slot, so that it can do the right thing
      in these 2 cases.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      498e9066
    • Yan, Zheng's avatar
      ceph: fix ci->i_head_snapc leak · b8d15c06
      Yan, Zheng authored
      commit 37659182bff1eeaaeadcfc8f853c6d2b6dbc3f47 upstream.
      
      We missed two places that i_wrbuffer_ref_head, i_wr_ref, i_dirty_caps
      and i_flushing_caps may change. When they are all zeros, we should free
      i_head_snapc.
      
      Cc: stable@vger.kernel.org
      Link: https://tracker.ceph.com/issues/38224Reported-and-tested-by: default avatarLuis Henriques <lhenriques@suse.com>
      Signed-off-by: default avatar"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8d15c06
    • Jeff Layton's avatar
      ceph: ensure d_name stability in ceph_dentry_hash() · 811fb302
      Jeff Layton authored
      commit 76a495d666e5043ffc315695f8241f5e94a98849 upstream.
      
      Take the d_lock here to ensure that d_name doesn't change.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatar"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      811fb302
    • Frank Sorenson's avatar
      cifs: do not attempt cifs operation on smb2+ rename error · fd496074
      Frank Sorenson authored
      commit 652727bbe1b17993636346716ae5867627793647 upstream.
      
      A path-based rename returning EBUSY will incorrectly try opening
      the file with a cifs (NT Create AndX) operation on an smb2+ mount,
      which causes the server to force a session close.
      
      If the mount is smb2+, skip the fallback.
      Signed-off-by: default avatarFrank Sorenson <sorenson@redhat.com>
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Reviewed-by: default avatarRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd496074
  3. 27 Apr, 2019 3 commits
    • Miklos Szeredi's avatar
      ovl: fix uid/gid when creating over whiteout · 54a07fff
      Miklos Szeredi authored
      [ Upstream commit d0e13f5b ]
      
      Fix a regression when creating a file over a whiteout.  The new
      file/directory needs to use the current fsuid/fsgid, not the ones from the
      mounter's credentials.
      
      The refcounting is a bit tricky: prepare_creds() sets an original refcount,
      override_creds() gets one more, which revert_cred() drops.  So
      
        1) we need to expicitly put the mounter's credentials when overriding
           with the updated one
      
        2) we need to put the original ref to the updated creds (and this can
           safely be done before revert_creds(), since we'll still have the ref
           from override_creds()).
      Reported-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Fixes: 3fe6e52f ("ovl: override creds with the ones from the superblock mounter")
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarSasha Levin (Microsoft) <sashal@kernel.org>
      54a07fff
    • Steve French's avatar
      cifs: fallback to older infolevels on findfirst queryinfo retry · 740562f3
      Steve French authored
      [ Upstream commit 3b7960caceafdfc2cdfe2850487f8d091eb41144 ]
      
      In cases where queryinfo fails, we have cases in cifs (vers=1.0)
      where with backupuid mounts we retry the query info with findfirst.
      This doesn't work to some NetApp servers which don't support
      WindowsXP (and later) infolevel 261 (SMB_FIND_FILE_ID_FULL_DIR_INFO)
      so in this case use other info levels (in this case it will usually
      be level 257, SMB_FIND_FILE_DIRECTORY_INFO).
      
      (Also fixes some indentation)
      
      See kernel bugzilla 201435
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      740562f3
    • Chao Yu's avatar
      f2fs: fix to do sanity check with current segment number · 045aac48
      Chao Yu authored
      [ Upstream commit 042be0f849e5fc24116d0afecfaf926eed5cac63 ]
      
      https://bugzilla.kernel.org/show_bug.cgi?id=200219
      
      Reproduction way:
      - mount image
      - run poc code
      - umount image
      
      F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
      ------------[ cut here ]------------
      kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
      invalid opcode: 0000 [#1] PREEMPT SMP
      CPU: 2 PID: 17686 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
      Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      EIP: update_sit_entry+0x459/0x4e0 [f2fs]
      Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
      EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
      ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
      DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
      CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
      Call Trace:
       f2fs_allocate_data_block+0x124/0x580 [f2fs]
       do_write_page+0x78/0x150 [f2fs]
       f2fs_do_write_node_page+0x25/0xa0 [f2fs]
       __write_node_page+0x2bf/0x550 [f2fs]
       f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
       ? sync_inode_metadata+0x2f/0x40
       ? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
       ? up_write+0x1e/0x80
       f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
       ? mark_held_locks+0x5d/0x80
       ? _raw_spin_unlock_irq+0x27/0x50
       kill_f2fs_super+0x68/0x90 [f2fs]
       deactivate_locked_super+0x3d/0x70
       deactivate_super+0x40/0x60
       cleanup_mnt+0x39/0x70
       __cleanup_mnt+0x10/0x20
       task_work_run+0x81/0xa0
       exit_to_usermode_loop+0x59/0xa7
       do_fast_syscall_32+0x1f5/0x22c
       entry_SYSENTER_32+0x53/0x86
      EIP: 0xb7f95c51
      Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
      EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
      ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
      DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
      Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
      ---[ end trace d423f83982cfcdc5 ]---
      
      The reason is, different log headers using the same segment, once
      one log's next block address is used by another log, it will cause
      panic as above.
      
      Main area: 24 segs, 24 secs 24 zones
        - COLD  data: 0, 0, 0
        - WARM  data: 1, 1, 1
        - HOT   data: 20, 20, 20
        - Dir   dnode: 22, 22, 22
        - File   dnode: 22, 22, 22
        - Indir nodes: 21, 21, 21
      
      So this patch adds sanity check to detect such condition to avoid
      this issue.
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      045aac48