1. 13 Apr, 2018 1 commit
  2. 14 May, 2017 1 commit
    • Ilya Dryomov's avatar
      block: get rid of blk_integrity_revalidate() · 6a762074
      Ilya Dryomov authored
      commit 19b7ccf8 upstream.
      
      Commit 25520d55 ("block: Inline blk_integrity in struct gendisk")
      introduced blk_integrity_revalidate(), which seems to assume ownership
      of the stable pages flag and unilaterally clears it if no blk_integrity
      profile is registered:
      
          if (bi->profile)
                  disk->queue->backing_dev_info->capabilities |=
                          BDI_CAP_STABLE_WRITES;
          else
                  disk->queue->backing_dev_info->capabilities &=
                          ~BDI_CAP_STABLE_WRITES;
      
      It's called from revalidate_disk() and rescan_partitions(), making it
      impossible to enable stable pages for drivers that support partitions
      and don't use blk_integrity: while the call in revalidate_disk() can be
      trivially worked around (see zram, which doesn't support partitions and
      hence gets away with zram_revalidate_disk()), rescan_partitions() can
      be triggered from userspace at any time.  This breaks rbd, where the
      ceph messenger is responsible for generating/verifying CRCs.
      
      Since blk_integrity_{un,}register() "must" be used for (un)registering
      the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
      setting there.  This way drivers that call blk_integrity_register() and
      use integrity infrastructure won't interfere with drivers that don't
      but still want stable pages.
      
      Fixes: 25520d55 ("block: Inline blk_integrity in struct gendisk")
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Tested-by: 's avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: 's avatarIlya Dryomov <idryomov@gmail.com>
      [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a762074
  3. 14 Jun, 2016 1 commit
  4. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: 's avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  5. 30 Mar, 2016 1 commit
  6. 15 Mar, 2016 1 commit
  7. 30 Jan, 2016 1 commit
  8. 26 Nov, 2015 1 commit
  9. 21 Oct, 2015 1 commit
    • Martin K. Petersen's avatar
      block: Inline blk_integrity in struct gendisk · 25520d55
      Martin K. Petersen authored
      Up until now the_integrity profile has been dynamically allocated and
      attached to struct gendisk after the disk has been made active.
      
      This causes problems because NVMe devices need to register the profile
      prior to the partition table being read due to a mandatory metadata
      buffer requirement. In addition, DM goes through hoops to deal with
      preallocating, but not initializing integrity profiles.
      
      Since the integrity profile is small (4 bytes + a pointer), Christoph
      suggested moving it to struct gendisk proper. This requires several
      changes:
      
       - Moving the blk_integrity definition to genhd.h.
      
       - Inlining blk_integrity in struct gendisk.
      
       - Removing the dynamic allocation code.
      
       - Adding helper functions which allow gendisk to set up and tear down
         the integrity sysfs dir when a disk is added/deleted.
      
       - Adding a blk_integrity_revalidate() callback for updating the stable
         pages bdi setting.
      
       - The calls that depend on whether a device has an integrity profile or
         not now key off of the bi->profile pointer.
      
       - Simplifying the integrity support routines in DM (Mike Snitzer).
      Signed-off-by: 's avatarMartin K. Petersen <martin.petersen@oracle.com>
      Reported-by: 's avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: 's avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: 's avatarMike Snitzer <snitzer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: 's avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      25520d55
  10. 17 Jul, 2015 2 commits
  11. 03 Sep, 2014 1 commit
  12. 08 Apr, 2013 1 commit
    • Jens Axboe's avatar
      Revert "loop: cleanup partitions when detaching loop device" · c2fccc1c
      Jens Axboe authored
      This reverts commit 8761a3dc.
      
      There are situations where the destruction path is called
      with the bdev->bd_mutex already held, which then deadlocks in
      loop_clr_fd(). The normal partition cleanup does a trylock()
      on the mutex, but it'd be nice to have a more bullet proof
      method in loop. So punt this more involved fix to the next
      merge window, and just back out this buggy fix for now.
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      c2fccc1c
  13. 22 Mar, 2013 1 commit
    • Phillip Susi's avatar
      loop: cleanup partitions when detaching loop device · 8761a3dc
      Phillip Susi authored
      Any partitions added by user space to the loop device were being
      left in place after detaching the loop device.  This was because
      the detach path issued a BLKRRPART to clean up partitions if
      LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
      scanned on attach.  Replace this BLKRRPART with code that
      unconditionally cleans up partitions on detach instead.
      Signed-off-by: 's avatarPhillip Susi <psusi@ubuntu.com>
      
      Modified by Jens to export delete_partition().
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      8761a3dc
  14. 28 Feb, 2013 2 commits
    • Ming Lei's avatar
      block/partitions: optimize memory allocation in check_partition() · ac2e5327
      Ming Lei authored
      Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so
      it is easy to trigger page allocation failure by check_partition,
      especially in hotplug block device situation(such as, USB mass storage,
      MMC card, ...), and Felipe Balbi has observed the failure.
      
      This patch does below optimizations on the allocation of struct
      parsed_partitions to try to address the issue:
      
      - make parsed_partitions.parts as pointer so that the pointed memory can
        fit in 32KB buffer, then approximate 32KB memory can be saved
      
      - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is
        still a bit big for kmalloc
      
      - given that many devices have the partition count limit, so only
        allocate disk_max_parts() partitions instead of 256 partitions always
      Signed-off-by: 's avatarMing Lei <ming.lei@canonical.com>
      Reported-by: 's avatarFelipe Balbi <balbi@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: 's avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac2e5327
    • Tomas Henzl's avatar
      block: fix ext_devt_idr handling · 7b74e912
      Tomas Henzl authored
      While adding and removing a lot of disks disks and partitions this
      sometimes shows up:
      
        WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
        Hardware name:
        sysfs: cannot create duplicate filename '/dev/block/259:751'
        Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
        Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
        Call Trace:
          warn_slowpath_common+0x87/0xc0
          warn_slowpath_fmt+0x46/0x50
          sysfs_add_one+0xc9/0x130
          sysfs_do_create_link+0x12b/0x170
          sysfs_create_link+0x13/0x20
          device_add+0x317/0x650
          idr_get_new+0x13/0x50
          add_partition+0x21c/0x390
          rescan_partitions+0x32b/0x470
          sd_open+0x81/0x1f0 [sd_mod]
          __blkdev_get+0x1b6/0x3c0
          blkdev_get+0x10/0x20
          register_disk+0x155/0x170
          add_disk+0xa6/0x160
          sd_probe_async+0x13b/0x210 [sd_mod]
          add_wait_queue+0x46/0x60
          async_thread+0x102/0x250
          default_wake_function+0x0/0x20
          async_thread+0x0/0x250
          kthread+0x96/0xa0
          child_rip+0xa/0x20
          kthread+0x0/0xa0
          child_rip+0x0/0x20
      
      This most likely happens because dev_t is freed while the number is
      still used and idr_get_new() is not protected on every use.  The fix
      adds a mutex where it wasn't before and moves the dev_t free function so
      it is called after device del.
      Signed-off-by: 's avatarTomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b74e912
  15. 01 Aug, 2012 1 commit
    • Vivek Goyal's avatar
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal authored
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into sequence counter because
      One might extend a partition while IO is happening to it and update of
      nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
      can lead to issues like reading inconsistent size of a partition. Sequence
      counter have been used so that readers don't have to take bdev mutex lock
      as we call sector_in_part() very frequently.
      
      Now all the access to hd_struct->nr_sects should happen using sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoritically race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: 's avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: 's avatarPhillip Susi <psusi@ubuntu.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      c83f6bf9
  16. 02 Mar, 2012 1 commit
  17. 04 Jan, 2012 2 commits
  18. 13 Jun, 2011 1 commit
  19. 30 May, 2011 1 commit
    • Jens Axboe's avatar
      Revert "block: Remove extra discard_alignment from hd_struct." · a1706ac4
      Jens Axboe authored
      It was not a good idea to start dereferencing disk->queue from
      the fs sysfs strategy for displaying discard alignment. We ran
      into first a NULL pointer deref, and after fixing that we sometimes
      see unvalid disk->queue pointer values.
      
      Since discard is the only one of the bunch actually looking into
      the queue, just revert the change.
      
      This reverts commit 23ceb5b7.
      
      Conflicts:
      	fs/partitions/check.c
      a1706ac4
  20. 26 May, 2011 1 commit
  21. 09 May, 2011 1 commit
    • Jens Axboe's avatar
      fs: fixup warning part_discard_alignment_show() · bbdd304c
      Jens Axboe authored
      Stephen reports:
      
      -----
      
      After merging the block tree, today's linux-next build (x86_64
      allmodconfig) produced this warning:
      
      fs/partitions/check.c: In function 'part_discard_alignment_show':
      fs/partitions/check.c:263: warning: format '%u' expects type 'unsigned int', but argument 3 has type 'long long unsigned int'
      
      Introduced by commit  ("block: Remove extra discard_alignment from
      hd_struct")
      
      -----
      
      Fix it up by just removing the cast, we return an int already.
      Reported-by: 's avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: 's avatarJens Axboe <jaxboe@fusionio.com>
      bbdd304c
  22. 07 May, 2011 1 commit
  23. 31 Mar, 2011 1 commit
  24. 22 Mar, 2011 1 commit
  25. 07 Jan, 2011 1 commit
  26. 05 Jan, 2011 1 commit
    • Jerome Marchand's avatar
      block: fix accounting bug on cross partition merges · 09e099d4
      Jerome Marchand authored
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
      Signed-off-by: 's avatarJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: 's avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: 's avatarJens Axboe <jaxboe@fusionio.com>
      09e099d4
  27. 16 Dec, 2010 1 commit
  28. 13 Nov, 2010 1 commit
    • Tejun Heo's avatar
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo authored
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Acked-by: 's avatarNeil Brown <neilb@suse.de>
      Acked-by: 's avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: 's avatarMike Snitzer <snitzer@redhat.com>
      Acked-by: 's avatarPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
  29. 11 Nov, 2010 1 commit
  30. 24 Oct, 2010 1 commit
  31. 22 Oct, 2010 1 commit
    • Andi Kleen's avatar
      SYSFS: Allow boot time switching between deprecated and modern sysfs layout · e52eec13
      Andi Kleen authored
      I have some systems which need legacy sysfs due to old tools that are
      making assumptions that a directory can never be a symlink to another
      directory, and it's a big hazzle to compile separate kernels for them.
      
      This patch turns CONFIG_SYSFS_DEPRECATED into a run time option
      that can be switched on/off the kernel command line. This way
      the same binary can be used in both cases with just a option
      on the command line.
      
      The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set
      the default. I kept the weird name to not break existing
      config files.
      
      Also the compat code can be still completely disabled by undefining
      CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes
      care of this now instead of lots of ifdefs. This makes the code
      look nicer.
      
      v2: This is an updated version on top of Kay's patch to only
      handle the block devices. I tested it on my old systems
      and that seems to work.
      
      Cc: axboe@kernel.dk
      Signed-off-by: 's avatarAndi Kleen <ak@linux.intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@suse.de>
      e52eec13
  32. 19 Oct, 2010 1 commit
    • Yasuaki Ishimatsu's avatar
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu authored
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: 's avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: 's avatarJens Axboe <jaxboe@fusionio.com>
      7681bfee
  33. 15 Sep, 2010 1 commit
    • Will Drewry's avatar
      block, partition: add partition_meta_info to hd_struct · 6d1d8050
      Will Drewry authored
      I'm reposting this patch series as v4 since there have been no additional
      comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
      are against Linus's tree: 2bfc96a1
      (2.6.36-rc3).
      
      Would this patchset be suitable for inclusion in an mm branch?
      
      This changes adds a partition_meta_info struct which itself contains a
      union of structures that provide partition table specific metadata.
      
      This change leaves the union empty. The subsequent patch includes an
      implementation for CONFIG_EFI_PARTITION-based metadata.
      Signed-off-by: 's avatarWill Drewry <wad@chromium.org>
      Signed-off-by: 's avatarJens Axboe <jaxboe@fusionio.com>
      6d1d8050
  34. 11 Aug, 2010 1 commit
  35. 14 Jun, 2010 1 commit
  36. 21 May, 2010 2 commits
    • Tejun Heo's avatar
      block: improve automatic native capacity unlocking · b403a98e
      Tejun Heo authored
      Currently, native capacity unlocking is initiated only when a
      recognized partition extends beyond the end of the disk.  However,
      there are several other unhandled cases where truncated capacity can
      lead to misdetection of partitions.
      
      * Partition table is fully beyond EOD.
      
      * Partition table is partially beyond EOD (daisy chained ones).
      
      * Recognized partition starts beyond EOD.
      
      This patch updates generic partition check code such that all the
      above three cases are handled too.  For the first two, @state tracks
      whether low level partition check code tried to read beyond EOD during
      partition scan and triggers native capacity unlocking accordingly.
      The third is now handled similarly to the original unlocking case.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Acked-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarJens Axboe <jens.axboe@oracle.com>
      b403a98e
    • Tejun Heo's avatar
      block: use struct parsed_partitions *state universally in partition check code · 1493bf21
      Tejun Heo authored
      Make the following changes to partition check code.
      
      * Add ->bdev to struct parsed_partitions.
      
      * Introduce read_part_sector() which is a simple wrapper around
        read_dev_sector() which takes struct parsed_partitions *state
        instead of @bdev.
      
      * For functions which used to take @state and @bdev, drop @bdev.  For
        functions which used to take @bdev, replace it with @state.
      
      * While updating, drop superflous checks on NULL state/bdev in ldm.c.
      
      This cleans up the API a bit and enables better handling of IO errors
      during partition check as the generic partition check code now has
      much better visibility into what went wrong in the low level code
      paths.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Acked-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarJens Axboe <jens.axboe@oracle.com>
      1493bf21