1. 27 Apr, 2017 1 commit
    • Dan Williams's avatar
      block: fix del_gendisk() vs blkdev_ioctl crash · 6ddbac9a
      Dan Williams authored
      commit ac34f15e upstream.
      
      When tearing down a block device early in its lifetime, userspace may
      still be performing discovery actions like blkdev_ioctl() to re-read
      partitions.
      
      The nvdimm_revalidate_disk() implementation depends on
      disk->driverfs_dev to be valid at entry.  However, it is set to NULL in
      del_gendisk() and fatally this is happening *before* the disk device is
      deleted from userspace view.
      
      There's no reason for del_gendisk() to clear ->driverfs_dev.  That
      device is the parent of the disk.  It is guaranteed to not be freed
      until the disk, as a child, drops its ->parent reference.
      
      We could also fix this issue locally in nvdimm_revalidate_disk() by
      using disk_to_dev(disk)->parent, but lets fix it globally since
      ->driverfs_dev follows the lifetime of the parent.  Longer term we
      should probably just add a @parent parameter to add_disk(), and stop
      carrying this pointer in the gendisk.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
       CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
       [..]
       Call Trace:
        [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0
        [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70
        [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0
        [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40
        [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0
        [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0
        [<ffffffff812916dd>] block_ioctl+0x3d/0x50
        [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560
        [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100
        [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff81264f09>] SyS_ioctl+0x79/0x90
        [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Reported-by: default avatarRobert Hu <robert.hu@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ddbac9a
  2. 20 Aug, 2016 1 commit
    • Dan Williams's avatar
      block: fix bdi vs gendisk lifetime mismatch · 0d301856
      Dan Williams authored
      commit df08c32c upstream.
      
      The name for a bdi of a gendisk is derived from the gendisk's devt.
      However, since the gendisk is destroyed before the bdi it leaves a
      window where a new gendisk could dynamically reuse the same devt while a
      bdi with the same name is still live.  Arrange for the bdi to hold a
      reference against its "owner" disk device while it is registered.
      Otherwise we can hit sysfs duplicate name collisions like the following:
      
       WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
       sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'
      
       Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
        0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
        ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
        0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
       Call Trace:
        [<ffffffff8134caec>] dump_stack+0x63/0x87
        [<ffffffff8108c351>] __warn+0xd1/0xf0
        [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
        [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
        [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
        [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
        [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
        [<ffffffff8134ff55>] kobject_add+0x75/0xd0
        [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
        [<ffffffff8148b0a5>] device_add+0x125/0x610
        [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
        [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
        [<ffffffff811b775c>] bdi_register+0x8c/0x180
        [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
        [<ffffffff813317f5>] add_disk+0x175/0x4a0
      Reported-by: default avatarYi Zhang <yizhan@redhat.com>
      Tested-by: default avatarYi Zhang <yizhan@redhat.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Fixed up missing 0 return in bdi_register_owner().
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0d301856
  3. 16 Aug, 2016 1 commit
    • Vegard Nossum's avatar
      block: fix use-after-free in seq file · 9a95c0cf
      Vegard Nossum authored
      commit 77da1605 upstream.
      
      I got a KASAN report of use-after-free:
      
          ==================================================================
          BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
          Read of size 8 by task trinity-c1/315
          =============================================================================
          BUG kmalloc-32 (Not tainted): kasan: bad access detected
          -----------------------------------------------------------------------------
      
          Disabling lock debugging due to kernel taint
          INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
                  ___slab_alloc+0x4f1/0x520
                  __slab_alloc.isra.58+0x56/0x80
                  kmem_cache_alloc_trace+0x260/0x2a0
                  disk_seqf_start+0x66/0x110
                  traverse+0x176/0x860
                  seq_read+0x7e3/0x11a0
                  proc_reg_read+0xbc/0x180
                  do_loop_readv_writev+0x134/0x210
                  do_readv_writev+0x565/0x660
                  vfs_readv+0x67/0xa0
                  do_preadv+0x126/0x170
                  SyS_preadv+0xc/0x10
                  do_syscall_64+0x1a1/0x460
                  return_from_SYSCALL_64+0x0/0x6a
          INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
                  __slab_free+0x17a/0x2c0
                  kfree+0x20a/0x220
                  disk_seqf_stop+0x42/0x50
                  traverse+0x3b5/0x860
                  seq_read+0x7e3/0x11a0
                  proc_reg_read+0xbc/0x180
                  do_loop_readv_writev+0x134/0x210
                  do_readv_writev+0x565/0x660
                  vfs_readv+0x67/0xa0
                  do_preadv+0x126/0x170
                  SyS_preadv+0xc/0x10
                  do_syscall_64+0x1a1/0x460
                  return_from_SYSCALL_64+0x0/0x6a
      
          CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
           ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
           ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
          Call Trace:
           [<ffffffff81d6ce81>] dump_stack+0x65/0x84
           [<ffffffff8146c7bd>] print_trailer+0x10d/0x1a0
           [<ffffffff814704ff>] object_err+0x2f/0x40
           [<ffffffff814754d1>] kasan_report_error+0x221/0x520
           [<ffffffff8147590e>] __asan_report_load8_noabort+0x3e/0x40
           [<ffffffff83888161>] klist_iter_exit+0x61/0x70
           [<ffffffff82404389>] class_dev_iter_exit+0x9/0x10
           [<ffffffff81d2e8ea>] disk_seqf_stop+0x3a/0x50
           [<ffffffff8151f812>] seq_read+0x4b2/0x11a0
           [<ffffffff815f8fdc>] proc_reg_read+0xbc/0x180
           [<ffffffff814b24e4>] do_loop_readv_writev+0x134/0x210
           [<ffffffff814b4c45>] do_readv_writev+0x565/0x660
           [<ffffffff814b8a17>] vfs_readv+0x67/0xa0
           [<ffffffff814b8de6>] do_preadv+0x126/0x170
           [<ffffffff814b92ec>] SyS_preadv+0xc/0x10
      
      This problem can occur in the following situation:
      
      open()
       - pread()
          - .seq_start()
             - iter = kmalloc() // succeeds
             - seqf->private = iter
          - .seq_stop()
             - kfree(seqf->private)
       - pread()
          - .seq_start()
             - iter = kmalloc() // fails
          - .seq_stop()
             - class_dev_iter_exit(seqf->private) // boom! old pointer
      
      As the comment in disk_seqf_stop() says, stop is called even if start
      failed, so we need to reinitialise the private pointer to NULL when seq
      iteration stops.
      
      An alternative would be to set the private pointer to NULL when the
      kmalloc() in disk_seqf_start() fails.
      Signed-off-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a95c0cf
  4. 21 Oct, 2015 1 commit
    • Martin K. Petersen's avatar
      block: Inline blk_integrity in struct gendisk · 25520d55
      Martin K. Petersen authored
      Up until now the_integrity profile has been dynamically allocated and
      attached to struct gendisk after the disk has been made active.
      
      This causes problems because NVMe devices need to register the profile
      prior to the partition table being read due to a mandatory metadata
      buffer requirement. In addition, DM goes through hoops to deal with
      preallocating, but not initializing integrity profiles.
      
      Since the integrity profile is small (4 bytes + a pointer), Christoph
      suggested moving it to struct gendisk proper. This requires several
      changes:
      
       - Moving the blk_integrity definition to genhd.h.
      
       - Inlining blk_integrity in struct gendisk.
      
       - Removing the dynamic allocation code.
      
       - Adding helper functions which allow gendisk to set up and tear down
         the integrity sysfs dir when a disk is added/deleted.
      
       - Adding a blk_integrity_revalidate() callback for updating the stable
         pages bdi setting.
      
       - The calls that depend on whether a device has an integrity profile or
         not now key off of the bi->profile pointer.
      
       - Simplifying the integrity support routines in DM (Mike Snitzer).
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Reported-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      25520d55
  5. 17 Jul, 2015 2 commits
  6. 11 Jun, 2015 1 commit
    • Dan Williams's avatar
      block: fix ext_dev_lock lockdep report · 4d66e5e9
      Dan Williams authored
       =================================
       [ INFO: inconsistent lock state ]
       4.1.0-rc7+ #217 Tainted: G           O
       ---------------------------------
       inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
       {SOFTIRQ-ON-W} state was registered at:
         [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
         [<ffffffff810c1947>] lock_acquire+0xb7/0x290
         [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
         [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
      [..]
        [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
        [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
        [<ffffffff810c1947>] lock_acquire+0xb7/0x290
        [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
        [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
        [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
        [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
        [<ffffffff8143bfec>] part_release+0x1c/0x50
        [<ffffffff8158edf6>] device_release+0x36/0xb0
        [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
        [<ffffffff8145aad0>] kobject_put+0x30/0x70
        [<ffffffff8158f147>] put_device+0x17/0x20
        [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
        [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
        [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
        [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
        [<ffffffff81067e2e>] __do_softirq+0xde/0x600
      
      Neil sees this in his tests and it also triggers on pmem driver unbind
      for the libnvdimm tests.  This fix is on top of an initial fix by Keith
      for incorrect usage of mutex_lock() in this path: 2da78092 "block:
      Fix dev_t minor allocation lifetime".  Both this and 2da78092 are
      candidates for -stable.
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: <stable@vger.kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      4d66e5e9
  7. 02 Jun, 2015 1 commit
    • Tejun Heo's avatar
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo authored
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      66114cad
  8. 28 May, 2015 1 commit
  9. 19 Nov, 2014 1 commit
  10. 22 Sep, 2014 1 commit
  11. 09 Sep, 2014 1 commit
  12. 03 Sep, 2014 1 commit
  13. 11 Sep, 2013 1 commit
  14. 03 Jul, 2013 1 commit
  15. 14 May, 2013 1 commit
    • Viresh Kumar's avatar
      block: queue work on power efficient wq · 695588f9
      Viresh Kumar authored
      Block layer uses workqueues for multiple purposes. There is no real dependency
      of scheduling these on the cpu which scheduled them.
      
      On a idle system, it is observed that and idle cpu wakes up many times just to
      service this work. It would be better if we can schedule it on a cpu which the
      scheduler believes to be the most appropriate one.
      
      This patch replaces normal workqueues with power efficient versions.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      695588f9
  16. 11 Apr, 2013 1 commit
  17. 08 Apr, 2013 1 commit
    • Kay Sievers's avatar
      driver core: add uid and gid to devtmpfs · 3c2670e6
      Kay Sievers authored
      Some drivers want to tell userspace what uid and gid should be used for
      their device nodes, so allow that information to percolate through the
      driver core to userspace in order to make this happen.  This means that
      some systems (i.e.  Android and friends) will not need to even run a
      udev-like daemon for their device node manager and can just rely in
      devtmpfs fully, reducing their footprint even more.
      Signed-off-by: default avatarKay Sievers <kay@vrfy.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c2670e6
  18. 28 Feb, 2013 3 commits
    • Tejun Heo's avatar
      block: convert to idr_alloc() · bab998d6
      Tejun Heo authored
      Convert to the much saner new idr interface.  Both bsg and genhd
      protect idr w/ mutex making preloading unnecessary.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bab998d6
    • Tejun Heo's avatar
      block: fix synchronization and limit check in blk_alloc_devt() · ce23bba8
      Tejun Heo authored
      idr allocation in blk_alloc_devt() wasn't synchronized against lookup
      and removal, and its limit check was off by one - 1 << MINORBITS is
      the number of minors allowed, not the maximum allowed minor.
      
      Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
      checking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce23bba8
    • Tomas Henzl's avatar
      block: fix ext_devt_idr handling · 7b74e912
      Tomas Henzl authored
      While adding and removing a lot of disks disks and partitions this
      sometimes shows up:
      
        WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
        Hardware name:
        sysfs: cannot create duplicate filename '/dev/block/259:751'
        Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
        Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
        Call Trace:
          warn_slowpath_common+0x87/0xc0
          warn_slowpath_fmt+0x46/0x50
          sysfs_add_one+0xc9/0x130
          sysfs_do_create_link+0x12b/0x170
          sysfs_create_link+0x13/0x20
          device_add+0x317/0x650
          idr_get_new+0x13/0x50
          add_partition+0x21c/0x390
          rescan_partitions+0x32b/0x470
          sd_open+0x81/0x1f0 [sd_mod]
          __blkdev_get+0x1b6/0x3c0
          blkdev_get+0x10/0x20
          register_disk+0x155/0x170
          add_disk+0xa6/0x160
          sd_probe_async+0x13b/0x210 [sd_mod]
          add_wait_queue+0x46/0x60
          async_thread+0x102/0x250
          default_wake_function+0x0/0x20
          async_thread+0x0/0x250
          kthread+0x96/0xa0
          child_rip+0xa/0x20
          kthread+0x0/0xa0
          child_rip+0x0/0x20
      
      This most likely happens because dev_t is freed while the number is
      still used and idr_get_new() is not protected on every use.  The fix
      adds a mutex where it wasn't before and moves the dev_t free function so
      it is called after device del.
      Signed-off-by: default avatarTomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b74e912
  19. 24 Feb, 2013 1 commit
    • Ming Lei's avatar
      block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices · 25e823c8
      Ming Lei authored
      Apply the introduced pm_runtime_set_memalloc_noio on block device so
      that PM core will teach mm to not allocate memory with GFP_IOFS when
      calling the runtime_resume and runtime_suspend callback for block
      devices and its ancestors.
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25e823c8
  20. 19 Dec, 2012 2 commits
    • Derek Basehore's avatar
      block: prevent race/cleanup · 12c2bdb2
      Derek Basehore authored
      Remove a race condition which causes a warning in disk_clear_events.  This
      is a race between disk_clear_events() and disk_flush_events().
      ev->clearing will be altered by disk_flush_events() even though we are
      blocking event checking through disk_flush_events().  If this happens
      after ev->clearing was cleared for disk_clear_events(), this can cause the
      WARN_ON_ONCE() in that function to be triggered.
      
      This change also has disk_clear_events() not go through a workqueue.
      Since we have to wait for the work to complete, we should just call the
      function directly.  Also, since this work cannot be put on a freezable
      workqueue, it will have to contend with increased demand, so calling the
      function directly avoids this.
      
      [akpm@linux-foundation.org: fix spello in comment]
      Signed-off-by: default avatarDerek Basehore <dbasehore@chromium.org>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      12c2bdb2
    • Derek Basehore's avatar
      block: remove deadlock in disk_clear_events · aea24a8b
      Derek Basehore authored
      In disk_clear_events, do not put work on system_nrt_freezable_wq.
      Instead, put it on system_nrt_wq.
      
      There is a race between probing a usb and suspending the device.  Since
      probing a usb calls disk_clear_events, which puts work on a frozen
      workqueue, probing cannot finish after the workqueue is frozen.  However,
      suspending cannot finish until the usb probe is finished, so we get a
      deadlock, causing the system to reboot.
      
      The way to reproduce this bug is to wake up from suspend with a usb
      storage device plugged in, or plugging in a usb storage device right
      before suspend.  The window of time is on the order of time it takes to
      probe the usb device.  As long as the workqueues are frozen before the
      call to add_disk within sd_probe_async finishes, there will be a deadlock
      (which calls blkdev_get, sd_open, check_disk_change, then
      disk_clear_events).  This is not difficult to reproduce after figuring out
      the timings.
      
      [akpm@linux-foundation.org: fix up comment]
      Signed-off-by: default avatarDerek Basehore <dbasehore@chromium.org>
      Reviewed-by: default avatarMandeep Singh Baines <msb@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      aea24a8b
  21. 23 Nov, 2012 1 commit
    • Stephen Warren's avatar
      block: store partition_meta_info.uuid as a string · 1ad7e899
      Stephen Warren authored
      This will allow other types of UUID to be stored here, aside from true
      UUIDs.  This also simplifies code that uses this field, since it's usually
      constructed from a, used as a, or compared to other, strings.
      
      Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a
      /PARTNROFF option was found to be present.  However, this modifies the
      input string, and causes subsequent calls to devt_from_partuuid() not to
      see the /PARTNROFF option, which causes different results.  In order to
      avoid misleading future maintainers, this parameter is marked const.
      Signed-off-by: Stephen Warren's avatarStephen Warren <swarren@nvidia.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1ad7e899
  22. 10 Nov, 2012 1 commit
  23. 20 Aug, 2012 1 commit
    • Tejun Heo's avatar
      workqueue: deprecate system_nrt[_freezable]_wq · 3b07e9ca
      Tejun Heo authored
      system_nrt[_freezable]_wq are now spurious.  Mark them deprecated and
      convert all users to system[_freezable]_wq.
      
      If you're cc'd and wondering what's going on: Now all workqueues are
      non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
      Please use system[_freezable]_wq instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-By: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      3b07e9ca
  24. 13 Aug, 2012 1 commit
    • Tejun Heo's avatar
      workqueue: use mod_delayed_work() instead of cancel + queue · 41f63c53
      Tejun Heo authored
      Convert delayed_work users doing cancel_delayed_work() followed by
      queue_delayed_work() to mod_delayed_work().
      
      Most conversions are straight-forward.  Ones worth mentioning are,
      
      * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
        use mod_delayed_work() and cancel loop in
        edac_mc_reset_delay_period() is dropped.
      
      * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
        watchdog is active or not.  @fan_watchdog_active and related code
        dropped.
      
      * drivers/power/charger-manager.c: Seemingly a lot of
        delayed_work_pending() abuse going on here.
        [delayed_]work_pending() are unsynchronized and racy when used like
        this.  I converted one instance in fullbatt_handler().  Please
        conver the rest so that it invokes workqueue APIs for the intended
        target state rather than trying to game work item pending state
        transitions.  e.g. if timer should be modified - call
        mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
      
      * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
        simplified.  Note that round_jiffies() calls in this function are
        meaningless.  round_jiffies() work on absolute jiffies not delta
        delay used by delayed_work.
      
      v2: Tomi pointed out that __cancel_delayed_work() users can't be
          safely converted to mod_delayed_work().  They could be calling it
          from irq context and if that happens while delayed_work_timer_fn()
          is running, it could deadlock.  __cancel_delayed_work() users are
          dropped.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarHenrique de Moraes Holschuh <hmh@hmh.eng.br>
      Acked-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: default avatarAnton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      41f63c53
  25. 03 Aug, 2012 1 commit
    • Jianpeng Ma's avatar
      block: Don't use static to define "void *p" in show_partition_start() · 06768067
      Jianpeng Ma authored
      I met a odd prblem:read /proc/partitions may return zero.
      
      I wrote a file test.c:
      int main()
      {
      	char buff[4096];
      	int ret;
      	int fd;
      	printf("pid=%d\n",getpid());
      	while (1) {
      		fd = open("/proc/partitions", O_RDONLY);
      		if (fd < 0) {
      			printf("open error %s\n", strerror(errno));
      			return 0;
      		}
      		ret = read(fd, buff, 4096);
      		if (ret <= 0)
      			printf("ret=%d, %s, %ld\n", ret,
      				strerror(errno), lseek(fd,0,SEEK_CUR));
      		close(fd);
      	}
      	exit(0);
      }
      
      You can reproduce by:
      1:while true;do cat /proc/partitions > /dev/null ;done
      2:./test
      
      I reviewed the code and found:
      
      >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
      >> {
      >> 	static void *p;
      >>
      >> 	p = disk_seqf_start(seqf, pos);
      >> 	if (!IS_ERR_OR_NULL(p) && !*pos)
      >> 		seq_puts(seqf, "major minor  #blocks  name\n\n");
      >> 	return p;
      >> }
      		test								cat /proc/partitions
      	p = disk_seqf_start()(Not NULL)
      									p = disk_seqf_start()(NULL because pos)
      	if (!IS_ERR_OR_NULL(p) && !*pos)
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      06768067
  26. 01 Aug, 2012 1 commit
    • Vivek Goyal's avatar
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal authored
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into sequence counter because
      One might extend a partition while IO is happening to it and update of
      nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
      can lead to issues like reading inconsistent size of a partition. Sequence
      counter have been used so that readers don't have to take bdev mutex lock
      as we call sector_in_part() very frequently.
      
      Now all the access to hd_struct->nr_sects should happen using sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoritically race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarPhillip Susi <psusi@ubuntu.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c83f6bf9
  27. 15 May, 2012 1 commit
    • Tejun Heo's avatar
      block: fix buffer overflow when printing partition UUIDs · 05c69d29
      Tejun Heo authored
      6d1d8050 "block, partition: add partition_meta_info to hd_struct"
      added part_unpack_uuid() which assumes that the passed in buffer has
      enough space for sprintfing "%pU" - 37 characters including '\0'.
      
      Unfortunately, b5af921e "init: add support for root devices
      specified by partition UUID" supplied 33 bytes buffer to the function
      leading to the following panic with stackprotector enabled.
      
        Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
      
        [<ffffffff815e226b>] panic+0xba/0x1c6
        [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
        [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
        [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
        [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
        [<ffffffff81aee0fa>] mount_root+0x57/0x5b
        [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
        [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
        [<ffffffff81aedd60>] kernel_init+0x155/0x15a
        [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
        [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
        [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
        [<ffffffff815f4d20>] ? gs_change+0x13/0x13
      
      Increase the buffer size, remove the dangerous part_unpack_uuid() and
      use snprintf() directly from printk_all_partitions().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarSzymon Gruszczynski <sz.gruszczynski@googlemail.com>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      05c69d29
  28. 02 Mar, 2012 2 commits
    • Alan Stern's avatar
      Block: use a freezable workqueue for disk-event polling · 62d3c543
      Alan Stern authored
      This patch (as1519) fixes a bug in the block layer's disk-events
      polling.  The polling is done by a work routine queued on the
      system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
      polling continues even in the middle of a system sleep transition.
      
      Obviously, polling a suspended drive for media changes and such isn't
      a good thing to do; in the case of USB mass-storage devices it can
      lead to real problems requiring device resets and even re-enumeration.
      
      The patch fixes things by creating a new system-wide, non-reentrant,
      freezable workqueue and using it for disk-events polling.
      Signed-off-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      CC: <stable@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      62d3c543
    • Stanislaw Gruszka's avatar
      block: fix __blkdev_get and add_disk race condition · 9f53d2fe
      Stanislaw Gruszka authored
      The following situation might occur:
      
      __blkdev_get:			add_disk:
      
      				register_disk()
      get_gendisk()
      
      disk_block_events()
      	disk->ev == NULL
      
      				disk_add_events()
      
      __disk_unblock_events()
      	disk->ev != NULL
      	--ev->block
      
      Then we unblock events, when they are suppose to be blocked. This can
      trigger events related block/genhd.c warnings, but also can crash in
      sd_check_events() or other places.
      
      I'm able to reproduce crashes with the following scripts (with
      connected usb dongle as sdb disk).
      
      <snip>
      DEV=/dev/sdb
      ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue
      
      function stop_me()
      {
      	for i in `jobs -p` ; do kill $i 2> /dev/null ; done
      	exit
      }
      
      trap stop_me SIGHUP SIGINT SIGTERM
      
      for ((i = 0; i < 10; i++)) ; do
      	while true; do fdisk -l $DEV  2>&1 > /dev/null ; done &
      done
      
      while true ; do
      echo 1 > $ENABLE
      sleep 1
      echo 0 > $ENABLE
      done
      </snip>
      
      I use the script to verify patch fixing oops in sd_revalidate_disk
      http://marc.info/?l=linux-scsi&m=132935572512352&w=2
      Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
      sd_revalidate_disk" or this one, script easily crash kernel within
      a few seconds. With both patches applied I do not observe crash.
      Unfortunately after some time (dozen of minutes), script will hung in:
      
      [ 1563.906432]  [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
      [ 1563.906437]  [<c04532d5>] msleep+0x15/0x20
      [ 1563.906443]  [<c05d60b2>] blk_drain_queue+0x32/0xd0
      [ 1563.906447]  [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
      [ 1563.906454]  [<c06d278f>] scsi_free_queue+0x3f/0x60
      [ 1563.906459]  [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
      [ 1563.906463]  [<c06d4aff>] scsi_forget_host+0x4f/0x60
      [ 1563.906468]  [<c06cd84a>] scsi_remove_host+0x5a/0xf0
      [ 1563.906482]  [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
      [ 1563.906490]  [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]
      
      Anyway I think this patch is some step forward.
      
      As drawback, I do not teardown on sysfs file create error, because I do
      not know how to nullify disk->ev (since it can be used). However add_disk
      error handling practically does not exist too, and things will work
      without this sysfs file, except events will not be exported to user
      space.
      Signed-off-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9f53d2fe
  29. 04 Jan, 2012 3 commits
  30. 13 Dec, 2011 1 commit
    • Tejun Heo's avatar
      block: misc updates to blk_get_queue() · 09ac46c4
      Tejun Heo authored
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      
      * Make sure the caller checks for the return value.
      
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      09ac46c4
  31. 10 Nov, 2011 1 commit
  32. 24 Oct, 2011 1 commit
    • Tejun Heo's avatar
      block: make gendisk hold a reference to its queue · f992ae80
      Tejun Heo authored
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f992ae80
  33. 19 Oct, 2011 1 commit
    • Tejun Heo's avatar
      block: make gendisk hold a reference to its queue · 523e1d39
      Tejun Heo authored
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      523e1d39