1. 21 Jul, 2019 3 commits
  2. 10 Jul, 2019 2 commits
  3. 03 Jul, 2019 4 commits
  4. 25 Jun, 2019 2 commits
  5. 22 Jun, 2019 2 commits
  6. 19 Jun, 2019 1 commit
  7. 15 Jun, 2019 8 commits
    • J. Bruce Fields's avatar
      nfsd: allow fh_want_write to be called twice · 76f53b84
      J. Bruce Fields authored
      [ Upstream commit 0b8f62625dc309651d0efcb6a6247c933acd8b45 ]
      
      A fuzzer recently triggered lockdep warnings about potential sb_writers
      deadlocks caused by fh_want_write().
      
      Looks like we aren't careful to pair each fh_want_write() with an
      fh_drop_write().
      
      It's not normally a problem since fh_put() will call fh_drop_write() for
      us.  And was OK for NFSv3 where we'd do one operation that might call
      fh_want_write(), and then put the filehandle.
      
      But an NFSv4 protocol fuzzer can do weird things like call unlink twice
      in a compound, and then we get into trouble.
      
      I'm a little worried about this approach of just leaving everything to
      fh_put().  But I think there are probably a lot of
      fh_want_write()/fh_drop_write() imbalances so for now I think we need it
      to be more forgiving.
      Signed-off-by: 's avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      76f53b84
    • Kirill Smelkov's avatar
      fuse: retrieve: cap requested size to negotiated max_write · 1e0a2528
      Kirill Smelkov authored
      [ Upstream commit 7640682e67b33cab8628729afec8ca92b851394f ]
      
      FUSE filesystem server and kernel client negotiate during initialization
      phase, what should be the maximum write size the client will ever issue.
      Correspondingly the filesystem server then queues sys_read calls to read
      requests with buffer capacity large enough to carry request header + that
      max_write bytes. A filesystem server is free to set its max_write in
      anywhere in the range between [1*page, fc->max_pages*page]. In particular
      go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages
      corresponds to 128K. Libfuse also allows users to configure max_write, but
      by default presets it to possible maximum.
      
      If max_write is < fc->max_pages*page, and in NOTIFY_RETRIEVE handler we
      allow to retrieve more than max_write bytes, corresponding prepared
      NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
      filesystem server, in full correspondence with server/client contract, will
      be only queuing sys_read with ~max_write buffer capacity, and
      fuse_dev_do_read throws away requests that cannot fit into server request
      buffer. In turn the filesystem server could get stuck waiting indefinitely
      for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is
      understood by clients as that NOTIFY_REPLY was queued and will be sent
      back.
      
      Cap requested size to negotiate max_write to avoid the problem.  This
      aligns with the way NOTIFY_RETRIEVE handler works, which already
      unconditionally caps requested retrieve size to fuse_conn->max_pages.  This
      way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than
      was originally requested.
      
      Please see [1] for context where the problem of stuck filesystem was hit
      for real, how the situation was traced and for more involving patch that
      did not make it into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
      [2] https://github.com/hanwen/go-fuseSigned-off-by: 's avatarKirill Smelkov <kirr@nexedi.com>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: 's avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      1e0a2528
    • YueHaibing's avatar
      configfs: fix possible use-after-free in configfs_register_group · a074466d
      YueHaibing authored
      [ Upstream commit 35399f87e271f7cf3048eab00a421a6519ac8441 ]
      
      In configfs_register_group(), if create_default_group() failed, we
      forget to unlink the group. It will left a invalid item in the parent list,
      which may trigger the use-after-free issue seen below:
      
      BUG: KASAN: use-after-free in __list_add_valid+0xd4/0xe0 lib/list_debug.c:26
      Read of size 8 at addr ffff8881ef61ae20 by task syz-executor.0/5996
      
      CPU: 1 PID: 5996 Comm: syz-executor.0 Tainted: G         C        5.0.0+ #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xa9/0x10e lib/dump_stack.c:113
       print_address_description+0x65/0x270 mm/kasan/report.c:187
       kasan_report+0x149/0x18d mm/kasan/report.c:317
       __list_add_valid+0xd4/0xe0 lib/list_debug.c:26
       __list_add include/linux/list.h:60 [inline]
       list_add_tail include/linux/list.h:93 [inline]
       link_obj+0xb0/0x190 fs/configfs/dir.c:759
       link_group+0x1c/0x130 fs/configfs/dir.c:784
       configfs_register_group+0x56/0x1e0 fs/configfs/dir.c:1751
       configfs_register_default_group+0x72/0xc0 fs/configfs/dir.c:1834
       ? 0xffffffffc1be0000
       iio_sw_trigger_init+0x23/0x1000 [industrialio_sw_trigger]
       do_one_initcall+0xbc/0x47d init/main.c:887
       do_init_module+0x1b5/0x547 kernel/module.c:3456
       load_module+0x6405/0x8c10 kernel/module.c:3804
       __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
       do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462e99
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f494ecbcc58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
      RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
      RBP: 00007f494ecbcc70 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f494ecbd6bc
      R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
      
      Allocated by task 5987:
       set_track mm/kasan/common.c:87 [inline]
       __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:497
       kmalloc include/linux/slab.h:545 [inline]
       kzalloc include/linux/slab.h:740 [inline]
       configfs_register_default_group+0x4c/0xc0 fs/configfs/dir.c:1829
       0xffffffffc1bd0023
       do_one_initcall+0xbc/0x47d init/main.c:887
       do_init_module+0x1b5/0x547 kernel/module.c:3456
       load_module+0x6405/0x8c10 kernel/module.c:3804
       __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
       do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 5987:
       set_track mm/kasan/common.c:87 [inline]
       __kasan_slab_free+0x130/0x180 mm/kasan/common.c:459
       slab_free_hook mm/slub.c:1429 [inline]
       slab_free_freelist_hook mm/slub.c:1456 [inline]
       slab_free mm/slub.c:3003 [inline]
       kfree+0xe1/0x270 mm/slub.c:3955
       configfs_register_default_group+0x9a/0xc0 fs/configfs/dir.c:1836
       0xffffffffc1bd0023
       do_one_initcall+0xbc/0x47d init/main.c:887
       do_init_module+0x1b5/0x547 kernel/module.c:3456
       load_module+0x6405/0x8c10 kernel/module.c:3804
       __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
       do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8881ef61ae00
       which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 32 bytes inside of
       192-byte region [ffff8881ef61ae00, ffff8881ef61aec0)
      The buggy address belongs to the page:
      page:ffffea0007bd8680 count:1 mapcount:0 mapping:ffff8881f6c03000 index:0xffff8881ef61a700
      flags: 0x2fffc0000000200(slab)
      raw: 02fffc0000000200 ffffea0007ca4740 0000000500000005 ffff8881f6c03000
      raw: ffff8881ef61a700 000000008010000c 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8881ef61ad00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff8881ef61ad80: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
      >ffff8881ef61ae00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
       ffff8881ef61ae80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff8881ef61af00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 5cf6a51e ("configfs: allow dynamic group creation")
      Reported-by: 's avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: 's avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: 's avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      a074466d
    • Chao Yu's avatar
      f2fs: fix to do sanity check on valid block count of segment · c32e6a51
      Chao Yu authored
      [ Upstream commit e95bcdb2fefa129f37bd9035af1d234ca92ee4ef ]
      
      As Jungyeon reported in bugzilla:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=203233
      
      - Overview
      When mounting the attached crafted image and running program, following errors are reported.
      Additionally, it hangs on sync after running program.
      
      The image is intentionally fuzzed from a normal f2fs image for testing.
      Compile options for F2FS are as follows.
      CONFIG_F2FS_FS=y
      CONFIG_F2FS_STAT_FS=y
      CONFIG_F2FS_FS_XATTR=y
      CONFIG_F2FS_FS_POSIX_ACL=y
      CONFIG_F2FS_CHECK_FS=y
      
      - Reproduces
      cc poc_13.c
      mkdir test
      mount -t f2fs tmp.img test
      cp a.out test
      cd test
      sudo ./a.out
      sync
      
      - Kernel messages
       F2FS-fs (sdb): Bitmap was wrongly set, blk:4608
       kernel BUG at fs/f2fs/segment.c:2102!
       RIP: 0010:update_sit_entry+0x394/0x410
       Call Trace:
        f2fs_allocate_data_block+0x16f/0x660
        do_write_page+0x62/0x170
        f2fs_do_write_node_page+0x33/0xa0
        __write_node_page+0x270/0x4e0
        f2fs_sync_node_pages+0x5df/0x670
        f2fs_write_checkpoint+0x372/0x1400
        f2fs_sync_fs+0xa3/0x130
        f2fs_do_sync_file+0x1a6/0x810
        do_fsync+0x33/0x60
        __x64_sys_fsync+0xb/0x10
        do_syscall_64+0x43/0xf0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      sit.vblocks and sum valid block count in sit.valid_map may be
      inconsistent, segment w/ zero vblocks will be treated as free
      segment, while allocating in free segment, we may allocate a
      free block, if its bitmap is valid previously, it can cause
      kernel crash due to bitmap verification failure.
      
      Anyway, to avoid further serious metadata inconsistence and
      corruption, it is necessary and worth to detect SIT
      inconsistence. So let's enable check_block_count() to verify
      vblocks and valid_map all the time rather than do it only
      CONFIG_F2FS_CHECK_FS is enabled.
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      c32e6a51
    • Chao Yu's avatar
      f2fs: fix to avoid panic in dec_valid_block_count() · 7765cd4c
      Chao Yu authored
      [ Upstream commit 5e159cd349bf3a31fb7e35c23a93308eb30f4f71 ]
      
      As Jungyeon reported in bugzilla:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=203209
      
      - Overview
      When mounting the attached crafted image and running program, I got this error.
      Additionally, it hangs on sync after the this script.
      
      The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on.
      
      - Reproduces
      cc poc_01.c
      ./run.sh f2fs
      sync
      
       kernel BUG at fs/f2fs/f2fs.h:1788!
       RIP: 0010:f2fs_truncate_data_blocks_range+0x342/0x350
       Call Trace:
        f2fs_truncate_blocks+0x36d/0x3c0
        f2fs_truncate+0x88/0x110
        f2fs_setattr+0x3e1/0x460
        notify_change+0x2da/0x400
        do_truncate+0x6d/0xb0
        do_sys_ftruncate+0xf1/0x160
        do_syscall_64+0x43/0xf0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is dec_valid_block_count() will trigger kernel panic due to
      inconsistent count in between inode.i_blocks and actual block.
      
      To avoid panic, let's just print debug message and set SBI_NEED_FSCK to
      give a hint to fsck for latter repairing.
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      [Jaegeuk Kim: fix build warning and add unlikely]
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      7765cd4c
    • Chao Yu's avatar
      f2fs: fix to clear dirty inode in error path of f2fs_iget() · 35ac00c5
      Chao Yu authored
      [ Upstream commit 546d22f070d64a7b96f57c93333772085d3a5e6d ]
      
      As Jungyeon reported in bugzilla:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=203217
      
      - Overview
      When mounting the attached crafted image and running program, I got this error.
      Additionally, it hangs on sync after running the program.
      
      The image is intentionally fuzzed from a normal f2fs image for testing and I enabled option CONFIG_F2FS_CHECK_FS on.
      
      - Reproduces
      cc poc_test_05.c
      mkdir test
      mount -t f2fs tmp.img test
      sudo ./a.out
      sync
      
      - Messages
       kernel BUG at fs/f2fs/inode.c:707!
       RIP: 0010:f2fs_evict_inode+0x33f/0x3a0
       Call Trace:
        evict+0xba/0x180
        f2fs_iget+0x598/0xdf0
        f2fs_lookup+0x136/0x320
        __lookup_slow+0x92/0x140
        lookup_slow+0x30/0x50
        walk_component+0x1c1/0x350
        path_lookupat+0x62/0x200
        filename_lookup+0xb3/0x1a0
        do_readlinkat+0x56/0x110
        __x64_sys_readlink+0x16/0x20
        do_syscall_64+0x43/0xf0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      During inode loading, __recover_inline_status() can recovery inode status
      and set inode dirty, once we failed in following process, it will fail
      the check in f2fs_evict_inode, result in trigger BUG_ON().
      
      Let's clear dirty inode in error path of f2fs_iget() to avoid panic.
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      35ac00c5
    • Chao Yu's avatar
      f2fs: fix to avoid panic in do_recover_data() · 549f0930
      Chao Yu authored
      [ Upstream commit 22d61e286e2d9097dae36f75ed48801056b77cac ]
      
      As Jungyeon reported in bugzilla:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=203227
      
      - Overview
      When mounting the attached crafted image, following errors are reported.
      Additionally, it hangs on sync after trying to mount it.
      
      The image is intentionally fuzzed from a normal f2fs image for testing.
      Compile options for F2FS are as follows.
      CONFIG_F2FS_FS=y
      CONFIG_F2FS_STAT_FS=y
      CONFIG_F2FS_FS_XATTR=y
      CONFIG_F2FS_FS_POSIX_ACL=y
      CONFIG_F2FS_CHECK_FS=y
      
      - Reproduces
      mkdir test
      mount -t f2fs tmp.img test
      sync
      
      - Messages
       kernel BUG at fs/f2fs/recovery.c:549!
       RIP: 0010:recover_data+0x167a/0x1780
       Call Trace:
        f2fs_recover_fsync_data+0x613/0x710
        f2fs_fill_super+0x1043/0x1aa0
        mount_bdev+0x16d/0x1a0
        mount_fs+0x4a/0x170
        vfs_kern_mount+0x5d/0x100
        do_mount+0x200/0xcf0
        ksys_mount+0x79/0xc0
        __x64_sys_mount+0x1c/0x20
        do_syscall_64+0x43/0xf0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      During recovery, if ofs_of_node is inconsistent in between recovered
      node page and original checkpointed node page, let's just fail recovery
      instead of making kernel panic.
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      549f0930
    • Hou Tao's avatar
      fs/fat/file.c: issue flush after the writeback of FAT · 33440c22
      Hou Tao authored
      [ Upstream commit bd8309de0d60838eef6fb575b0c4c7e95841cf73 ]
      
      fsync() needs to make sure the data & meta-data of file are persistent
      after the return of fsync(), even when a power-failure occurs later.  In
      the case of fat-fs, the FAT belongs to the meta-data of file, so we need
      to issue a flush after the writeback of FAT instead before.
      
      Also bail out early when any stage of fsync fails.
      
      Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.comSigned-off-by: 's avatarHou Tao <houtao1@huawei.com>
      Acked-by: 's avatarOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      33440c22
  8. 11 Jun, 2019 6 commits
    • Kirill Smelkov's avatar
      fuse: Add FOPEN_STREAM to use stream_open() · 585724f8
      Kirill Smelkov authored
      commit bbd84f33652f852ce5992d65db4d020aba21f882 upstream.
      
      Starting from commit 9c225f26 ("vfs: atomic f_pos accesses as per
      POSIX") files opened even via nonseekable_open gate read and write via lock
      and do not allow them to be run simultaneously. This can create read vs
      write deadlock if a filesystem is trying to implement a socket-like file
      which is intended to be simultaneously used for both read and write from
      filesystem client.  See commit 10dce8af3422 ("fs: stream_open - opener for
      stream-like files so that read and write can run simultaneously without
      deadlock") for details and e.g. commit 581d21a2 ("xenbus: fix deadlock
      on writes to /proc/xen/xenbus") for a similar deadlock example on
      /proc/xen/xenbus.
      
      To avoid such deadlock it was tempting to adjust fuse_finish_open to use
      stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
      but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
      and in particular GVFS which actually uses offset in its read and write
      handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
      so if we would do such a change it will break a real user.
      
      Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
      opened handler is having stream-like semantics; does not use file position
      and thus the kernel is free to issue simultaneous read and write request on
      opened file handle.
      
      This patch together with stream_open() should be added to stable kernels
      starting from v3.14+. This will allow to patch OSSPD and other FUSE
      filesystems that provide stream-like files to return FOPEN_STREAM |
      FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all
      kernel versions. This should work because fuse_finish_open ignores unknown
      open flags returned from a filesystem and so passing FOPEN_STREAM to a
      kernel that is not aware of this flag cannot hurt. In turn the kernel that
      is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE
      is sufficient to implement streams without read vs write deadlock.
      
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: 's avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: 's avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      585724f8
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · b673f99c
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      commit 10dce8af34226d90fa56746a934f8da5dcdba3df upstream.
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: 's avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      b673f99c
    • Kees Cook's avatar
      pstore/ram: Run without kernel crash dump region · 08ae2e88
      Kees Cook authored
      commit 8880fa32c557600f5f624084152668ed3c2ea51e upstream.
      
      The ram pstore backend has always had the crash dumper frontend enabled
      unconditionally. However, it was possible to effectively disable it
      by setting a record_size=0. All the machinery would run (storing dumps
      to the temporary crash buffer), but 0 bytes would ultimately get stored
      due to there being no przs allocated for dumps. Commit 89d328f637b9
      ("pstore/ram: Correctly calculate usable PRZ bytes"), however, assumed
      that there would always be at least one allocated dprz for calculating
      the size of the temporary crash buffer. This was, of course, not the
      case when record_size=0, and would lead to a NULL deref trying to find
      the dprz buffer size:
      
      BUG: unable to handle kernel NULL pointer dereference at (null)
      ...
      IP: ramoops_probe+0x285/0x37e (fs/pstore/ram.c:808)
      
              cxt->pstore.bufsize = cxt->dprzs[0]->buffer_size;
      
      Instead, we need to only enable the frontends based on the success of the
      prz initialization and only take the needed actions when those zones are
      available. (This also fixes a possible error in detecting if the ftrace
      frontend should be enabled.)
      Reported-and-tested-by: 's avatarYaro Slav <yaro330@gmail.com>
      Fixes: 89d328f637b9 ("pstore/ram: Correctly calculate usable PRZ bytes")
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarKees Cook <keescook@chromium.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08ae2e88
    • Kees Cook's avatar
      pstore: Convert buf_lock to semaphore · f72ecfe9
      Kees Cook authored
      commit ea84b580b95521644429cc6748b6c2bf27c8b0f3 upstream.
      
      Instead of running with interrupts disabled, use a semaphore. This should
      make it easier for backends that may need to sleep (e.g. EFI) when
      performing a write:
      
      |BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
      |in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
      |Preemption disabled at:
      |[<ffffffff99d60512>] pstore_dump+0x72/0x330
      |CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G      D           4.20.0-rc3 #45
      |Call Trace:
      | dump_stack+0x4f/0x6a
      | ___might_sleep.cold.91+0xd3/0xe4
      | __might_sleep+0x50/0x90
      | wait_for_completion+0x32/0x130
      | virt_efi_query_variable_info+0x14e/0x160
      | efi_query_variable_store+0x51/0x1a0
      | efivar_entry_set_safe+0xa3/0x1b0
      | efi_pstore_write+0x109/0x140
      | pstore_dump+0x11c/0x330
      | kmsg_dump+0xa4/0xd0
      | oops_exit+0x22/0x30
      ...
      Reported-by: 's avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Fixes: 21b3ddd3 ("efi: Don't use spinlocks for efi vars")
      Signed-off-by: 's avatarKees Cook <keescook@chromium.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f72ecfe9
    • Kees Cook's avatar
      pstore: Remove needless lock during console writes · d80d6f65
      Kees Cook authored
      commit b77fa617a2ff4d6beccad3d3d4b3a1f2d10368aa upstream.
      
      Since the console writer does not use the preallocated crash dump buffer
      any more, there is no reason to perform locking around it.
      
      Fixes: 70ad35db ("pstore: Convert console write to use ->write_buf")
      Signed-off-by: 's avatarKees Cook <keescook@chromium.org>
      Reviewed-by: 's avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d80d6f65
    • Miklos Szeredi's avatar
      fuse: fallocate: fix return with locked inode · 7a28b742
      Miklos Szeredi authored
      commit 35d6fcbb7c3e296a52136347346a698a35af3fda upstream.
      
      Do the proper cleanup in case the size check fails.
      
      Tested with xfstests:generic/228
      Reported-by: 's avatarkbuild test robot <lkp@intel.com>
      Reported-by: 's avatarDan Carpenter <dan.carpenter@oracle.com>
      Fixes: 0cbade024ba5 ("fuse: honor RLIMIT_FSIZE in fuse_file_fallocate")
      Cc: Liu Bo <bo.liu@linux.alibaba.com>
      Cc: <stable@vger.kernel.org> # v3.5
      Signed-off-by: 's avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a28b742
  9. 09 Jun, 2019 6 commits
    • Benjamin Coddington's avatar
      Revert "lockd: Show pid of lockd for remote locks" · 3420dcef
      Benjamin Coddington authored
      commit 141731d15d6eb2fd9aaefbf9b935ce86ae243074 upstream.
      
      This reverts most of commit b8eee0e90f97 ("lockd: Show pid of lockd for
      remote locks"), which caused remote locks to not be differentiated between
      remote processes for NLM.
      
      We retain the fixup for setting the client's fl_pid to a negative value.
      
      Fixes: b8eee0e90f97 ("lockd: Show pid of lockd for remote locks")
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarBenjamin Coddington <bcodding@redhat.com>
      Reviewed-by: 's avatarXueWei Zhang <xueweiz@google.com>
      Signed-off-by: 's avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3420dcef
    • Roberto Bergantinos Corpas's avatar
      CIFS: cifs_read_allocate_pages: don't iterate through whole page array on ENOMEM · dea5d380
      Roberto Bergantinos Corpas authored
      commit 31fad7d41e73731f05b8053d17078638cf850fa6 upstream.
      
       In cifs_read_allocate_pages, in case of ENOMEM, we go through
      whole rdata->pages array but we have failed the allocation before
      nr_pages, therefore we may end up calling put_page with NULL
      pointer, causing oops
      Signed-off-by: 's avatarRoberto Bergantinos Corpas <rbergant@redhat.com>
      Acked-by: 's avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: 's avatarSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dea5d380
    • Filipe Manana's avatar
      Btrfs: incremental send, fix file corruption when no-holes feature is enabled · c2f017bb
      Filipe Manana authored
      commit 6b1f72e5b82a5c2a4da4d1ebb8cc01913ddbea21 upstream.
      
      When using the no-holes feature, if we have a file with prealloc extents
      with a start offset beyond the file's eof, doing an incremental send can
      cause corruption of the file due to incorrect hole detection. Such case
      requires that the prealloc extent(s) exist in both the parent and send
      snapshots, and that a hole is punched into the file that covers all its
      extents that do not cross the eof boundary.
      
      Example reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
        $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.snap /mnt/sdb/base
      
        $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
      
        $ md5sum /mnt/sdb/incr/foobar
        816df6f64deba63b029ca19d880ee10a   /mnt/sdb/incr/foobar
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.snap /mnt/sdc
        $ btrfs receive -f /tmp/incr.snap /mnt/sdc
      
        $ md5sum /mnt/sdc/incr/foobar
        cf2ef71f4a9e90c2f6013ba3b2257ed2   /mnt/sdc/incr/foobar
      
          --> Different checksum, because the prealloc extent beyond the
              file's eof confused the hole detection code and it assumed
              a hole starting at offset 0 and ending at the offset of the
              prealloc extent (1200Kb) instead of ending at the offset
              500Kb (the file's size).
      
      Fix this by ensuring we never cross the file's size when issuing the
      write operations for a hole.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      CC: stable@vger.kernel.org # 3.14+
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c2f017bb
    • Filipe Manana's avatar
      Btrfs: fix fsync not persisting changed attributes of a directory · 69e14cf8
      Filipe Manana authored
      commit 60d9f50308e5df19bc18c2fefab0eba4a843900a upstream.
      
      While logging an inode we follow its ancestors and for each one we mark
      it as logged in the current transaction, even if we have not logged it.
      As a consequence if we change an attribute of an ancestor, such as the
      UID or GID for example, and then explicitly fsync it, we end up not
      logging the inode at all despite returning success to user space, which
      results in the attribute being lost if a power failure happens after
      the fsync.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ chown 6007:6007 /mnt/dir
      
        $ sync
      
        $ chown 9003:9003 /mnt/dir
        $ touch /mnt/dir/file
        $ xfs_io -c fsync /mnt/dir/file
      
        # fsync our directory after fsync'ing the new file, should persist the
        # new values for the uid and gid.
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ stat -c %u:%g /mnt/dir
        6007:6007
      
          --> should be 9003:9003, the uid and gid were not persisted, despite
              the explicit fsync on the directory prior to the power failure
      
      Fix this by not updating the logged_trans field of ancestor inodes when
      logging an inode, since we have not logged them. Let only future calls to
      btrfs_log_inode() to mark inodes as logged.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69e14cf8
    • Filipe Manana's avatar
      Btrfs: fix race updating log root item during fsync · 3562d6e2
      Filipe Manana authored
      commit 06989c799f04810f6876900d4760c0edda369cf7 upstream.
      
      When syncing the log, the final phase of a fsync operation, we need to
      either create a log root's item or update the existing item in the log
      tree of log roots, and that depends on the current value of the log
      root's log_transid - if it's 1 we need to create the log root item,
      otherwise it must exist already and we update it. Since there is no
      synchronization between updating the log_transid and checking it for
      deciding whether the log root's item needs to be created or updated, we
      end up with a tiny race window that results in attempts to update the
      item to fail because the item was not yet created:
      
                    CPU 1                                    CPU 2
      
        btrfs_sync_log()
      
          lock root->log_mutex
      
          set log root's log_transid to 1
      
          unlock root->log_mutex
      
                                                     btrfs_sync_log()
      
                                                       lock root->log_mutex
      
                                                       sets log root's
                                                       log_transid to 2
      
                                                       unlock root->log_mutex
      
          update_log_root()
      
            sees log root's log_transid
            with a value of 2
      
              calls btrfs_update_root(),
              which fails with -EUCLEAN
              and causes transaction abort
      
      Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
      the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a
      root key") we just abort the current transaction.
      
      A sample trace of the BUG_ON() on a SLE12 kernel:
      
        ------------[ cut here ]------------
        kernel BUG at ../fs/btrfs/root-tree.c:157!
        Oops: Exception in kernel mode, sig: 5 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        (...)
        Supported: Yes, External
        CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
        task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
        NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
        REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
        MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
        CFAR: d000000036aba66c SOFTE: 1
        GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
        GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
        GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
        GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
        GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
        GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
        GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
        GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
        NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
        LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
        Call Trace:
        [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
        [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
        [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
        [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
        [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
        [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
        [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
        Instruction dump:
        7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
        e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
        ---[ end trace 8f2dc8f919cabab8 ]---
      
      So fix this by doing the check of log_transid and updating or creating the
      log root's item while holding the root's log_mutex.
      
      Fixes: 7237f183 ("Btrfs: fix tree logs parallel sync")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3562d6e2
    • Filipe Manana's avatar
      Btrfs: fix wrong ctime and mtime of a directory after log replay · ab4dfb10
      Filipe Manana authored
      commit 5338e43abbab13791144d37fd8846847062351c6 upstream.
      
      When replaying a log that contains a new file or directory name that needs
      to be added to its parent directory, we end up updating the mtime and the
      ctime of the parent directory to the current time after we have set their
      values to the correct ones (set at fsync time), efectivelly losing them.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/file
      
        # fsync of the directory is optional, not needed
        $ xfs_io -c fsync /mnt/dir
        $ xfs_io -c fsync /mnt/dir/file
      
        $ stat -c %Y /mnt/dir
        1557856079
      
        <power failure>
      
        $ sleep 3
        $ mount /dev/sdb /mnt
        $ stat -c %Y /mnt/dir
        1557856082
      
          --> should have been 1557856079, the mtime is updated to the current
              time when replaying the log
      
      Fix this by not updating the mtime and ctime to the current time at
      btrfs_add_link() when we are replaying a log tree.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: 's avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab4dfb10
  10. 31 May, 2019 6 commits
    • Benjamin Coddington's avatar
      NFS: Fix a double unlock from nfs_match,get_client · de69696b
      Benjamin Coddington authored
      [ Upstream commit c260121a97a3e4df6536edbc2f26e166eff370ce ]
      
      Now that nfs_match_client drops the nfs_client_lock, we should be
      careful
      to always return it in the same condition: locked.
      
      Fixes: 950a578c6128 ("NFS: make nfs_match_client killable")
      Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com
      Signed-off-by: 's avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: 's avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      de69696b
    • Chengguang Xu's avatar
      chardev: add additional check for minor range overlap · ed3f381e
      Chengguang Xu authored
      [ Upstream commit de36e16d1557a0b6eb328bc3516359a12ba5c25c ]
      
      Current overlap checking cannot correctly handle
      a case which is baseminor < existing baseminor &&
      baseminor + minorct > existing baseminor + minorct.
      Signed-off-by: 's avatarChengguang Xu <cgxu519@gmx.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      ed3f381e
    • Qu Wenruo's avatar
      btrfs: Don't panic when we can't find a root key · 72c8b100
      Qu Wenruo authored
      [ Upstream commit 7ac1e464c4d473b517bb784f30d40da1f842482e ]
      
      When we failed to find a root key in btrfs_update_root(), we just panic.
      
      That's definitely not cool, fix it by outputting an unique error
      message, aborting current transaction and return -EUCLEAN. This should
      not normally happen as the root has been used by the callers in some
      way.
      Reviewed-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: 's avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: 's avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      72c8b100
    • Josef Bacik's avatar
      btrfs: fix panic during relocation after ENOSPC before writeback happens · a5bc86ba
      Josef Bacik authored
      [ Upstream commit ff612ba7849964b1898fd3ccd1f56941129c6aab ]
      
      We've been seeing the following sporadically throughout our fleet
      
      panic: kernel BUG at fs/btrfs/relocation.c:4584!
      netversion: 5.0-0
      Backtrace:
       #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
       #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
       #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
       #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
       #4 [ffffc90003adb9c0] do_trap at ffffffff81019114
       #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
       #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
          [exception RIP: btrfs_reloc_cow_block+692]
          RIP: ffffffff8143b614  RSP: ffffc90003adbb68  RFLAGS: 00010246
          RAX: fffffffffffffff7  RBX: ffff8806b9c32000  RCX: ffff8806aad00690
          RDX: ffff880850b295e0  RSI: ffff8806b9c32000  RDI: ffff88084f205bd0
          RBP: ffff880849415000   R8: ffffc90003adbbe0   R9: ffff88085ac90000
          R10: ffff8805f7369140  R11: 0000000000000000  R12: ffff880850b295e0
          R13: ffff88084f205bd0  R14: 0000000000000000  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
       #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
       #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c
      
      The way relocation moves data extents is by creating a reloc inode and
      preallocating extents in this inode and then copying the data into these
      preallocated extents.  Once we've done this for all of our extents,
      we'll write out these dirty pages, which marks the extent written, and
      goes into btrfs_reloc_cow_block().  From here we get our current
      reloc_control, which _should_ match the reloc_control for the current
      block group we're relocating.
      
      However if we get an ENOSPC in this path at some point we'll bail out,
      never initiating writeback on this inode.  Not a huge deal, unless we
      happen to be doing relocation on a different block group, and this block
      group is now rc->stage == UPDATE_DATA_PTRS.  This trips the BUG_ON() in
      btrfs_reloc_cow_block(), because we expect to be done modifying the data
      inode.  We are in fact done modifying the metadata for the data inode
      we're currently using, but not the one from the failed block group, and
      thus we BUG_ON().
      
      (This happens when writeback finishes for extents from the previous
      group, when we are at btrfs_finish_ordered_io() which updates the data
      reloc tree (inode item, drops/adds extent items, etc).)
      
      Fix this by writing out the reloc data inode always, and then breaking
      out of the loop after that point to keep from tripping this BUG_ON()
      later.
      Signed-off-by: 's avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: 's avatarFilipe Manana <fdmanana@suse.com>
      [ add note from Filipe ]
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      a5bc86ba
    • Robbie Ko's avatar
      Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve · b971fb6b
      Robbie Ko authored
      [ Upstream commit 39ad317315887c2cb9a4347a93a8859326ddf136 ]
      
      When doing fallocate, we first add the range to the reserve_list and
      then reserve the quota.  If quota reservation fails, we'll release all
      reserved parts of reserve_list.
      
      However, cur_offset is not updated to indicate that this range is
      already been inserted into the list.  Therefore, the same range is freed
      twice.  Once at list_for_each_entry loop, and once at the end of the
      function.  This will result in WARN_ON on bytes_may_use when we free the
      remaining space.
      
      At the end, under the 'out' label we have a call to:
      
         btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
      
      The start offset, third argument, should be cur_offset.
      
      Everything from alloc_start to cur_offset was freed by the
      list_for_each_entry_safe_loop.
      
      Fixes: 18513091 ("btrfs: update btrfs_space_info's bytes_may_use timely")
      Reviewed-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarRobbie Ko <robbieko@synology.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      b971fb6b
    • Andreas Gruenbacher's avatar
      gfs2: Fix occasional glock use-after-free · c97cbd68
      Andreas Gruenbacher authored
      [ Upstream commit 9287c6452d2b1f24ea8e84bd3cf6f3c6f267f712 ]
      
      This patch has to do with the life cycle of glocks and buffers.  When
      gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata
      object is assigned to track the buffer, and that is queued to various
      lists, including the glock's gl_ail_list to indicate it's on the active
      items list.  Once the page associated with the buffer has been written,
      it is removed from the ail list, but its life isn't over until a revoke
      has been successfully written.
      
      So after the block is written, its bufdata object is moved from the
      glock's gl_ail_list to a file-system-wide list of pending revokes,
      sd_log_le_revoke.  At that point the glock still needs to track how many
      revokes it contributed to that list (in gl_revokes) so that things like
      glock go_sync can ensure all the metadata has been not only written, but
      also revoked before the glock is granted to a different node.  This is
      to guarantee journal replay doesn't replay the block once the glock has
      been granted to another node.
      
      Ross Lagerwall recently discovered a race in which an inode could be
      evicted, and its glock freed after its ail list had been synced, but
      while it still had unwritten revokes on the sd_log_le_revoke list.  The
      evict decremented the glock reference count to zero, which allowed the
      glock to be freed.  After the revoke was written, function
      revoke_lo_after_commit tried to adjust the glock's gl_revokes counter
      and clear its GLF_LFLUSH flag, at which time it referenced the freed
      glock.
      
      This patch fixes the problem by incrementing the glock reference count
      in gfs2_add_revoke when the glock's first bufdata object is moved from
      the glock to the global revokes list. Later, when the glock's last such
      bufdata object is freed, the reference count is decremented. This
      guarantees that whichever process finishes last (the revoke writing or
      the evict) will properly free the glock, and neither will reference the
      glock after it has been freed.
      Reported-by: 's avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: 's avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: 's avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      c97cbd68