1. 05 Dec, 2017 1 commit
  2. 28 Sep, 2017 1 commit
  3. 06 Sep, 2017 1 commit
    • md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list · 184a09eb
      Dennis Yang authored
      In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
      set, it indicates that this stripe_head is already in the raid5_plug_cb
      list and release_stripe() would be called instead to drop a reference
      count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
      stripe_head and it will get queued into the raid5_plug_cb list.
      Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
      a stripe could be re-added to the plug list while it is still on that list
      in the following situation. If stripe_head A is added to another
      stripe_head B's batch list, A will have its
      batch_head != NULL and be added to the plug list. After that,
      stripe_head B gets handled and calls break_stripe_batch_list() to
      reset the state of all the batched stripe_heads (including A, which is
      still on the plug list) and reset their batch_head to NULL.
      Before the plug list gets processed, if another write request
      comes in and gets stripe_head A, A will have its batch_head == NULL
      (cleared by calling break_stripe_batch_list() on B) and be added to
      the plug list once again.
      Signed-off-by: Dennis Yang <dennisyang@qnap.com>
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: Shaohua Li <shli@fb.com>
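      The double-add described above can be sketched in userspace C. This is a
      minimal model, not the kernel code: the toy_* names, the bit values and
      the keep_mask parameter are all hypothetical, and the real
      release_stripe_plug() drops a reference where this sketch simply returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified flag bits; not the kernel's real values. */
enum {
    TOY_STRIPE_HANDLE         = 1u << 0,
    TOY_STRIPE_ON_UNPLUG_LIST = 1u << 2,
};

struct toy_stripe {
    unsigned long state;
    int plug_list_entries;  /* how many times the stripe was queued on the plug list */
};

/* Queue the stripe on the plug list only if its "on list" bit is clear. */
static void toy_release_stripe_plug(struct toy_stripe *sh)
{
    assert(sh != NULL);
    if (sh->state & TOY_STRIPE_ON_UNPLUG_LIST)
        return;  /* already queued; the real code would drop a reference here */
    sh->state |= TOY_STRIPE_ON_UNPLUG_LIST;
    sh->plug_list_entries++;
}

/* Reset batched state; only the bits in keep_mask survive the reset. */
static void toy_break_batch(struct toy_stripe *sh, unsigned long keep_mask)
{
    sh->state &= keep_mask;
}

/* Replay: A is queued, its batch is broken, then a new write queues A again. */
static int toy_plug_entries_after_replay(unsigned long keep_mask)
{
    struct toy_stripe a = { TOY_STRIPE_HANDLE, 0 };

    toy_release_stripe_plug(&a);
    toy_break_batch(&a, keep_mask);
    toy_release_stripe_plug(&a);
    return a.plug_list_entries;
}
```

      With keep_mask set to 0 the stripe ends up queued twice; passing
      TOY_STRIPE_ON_UNPLUG_LIST as the mask keeps it at one entry.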
  4. 05 Sep, 2017 1 commit
    • md/raid5: fix a race condition in stripe batch · 3664847d
      Shaohua Li authored
      We have a race condition in the scenario below. Say we have 3 contiguous stripes, sh1,
      sh2 and sh3, and sh1 is the batch head of sh2 and sh3:
      CPU1				CPU2				CPU3
      				-> lock(sh2, sh3)
      				-> lock batch_lock(sh1)
      				-> add sh3 to batch_list of sh1
      				-> unlock batch_lock(sh1)
      								-> lock(sh1) and batch_lock(sh1)
      								-> clear STRIPE_BATCH_READY for all stripes in batch_list
      								-> unlock(sh1) and batch_lock(sh1)
      -->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
      --->return 0 as sh->batch == NULL
      				-> sh3->batch_head = sh1
      				-> unlock (sh2, sh3)
      In CPU1, handle_stripe will continue to handle sh3 even though it's in the batch stripe list
      of sh1. By moving the sh3->batch_head assignment inside batch_lock, we make it
      impossible to clear STRIPE_BATCH_READY before batch_head is set.
      Thanks Stephane for helping debug this tricky issue.
      Reported-and-tested-by: Stephane Thiell <sthiell@stanford.edu>
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: Shaohua Li <shli@fb.com>
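      The interleaving above can be replayed in a single thread to show why the
      ordering matters. This is only a sketch: struct toy_sh and
      toy_replay_race() are hypothetical stand-ins, and the locks are
      represented by comments rather than real locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct stripe_head; only the fields the race touches. */
struct toy_sh {
    struct toy_sh *batch_head;  /* NULL until the stripe is bound to a batch */
    bool batch_ready;           /* models STRIPE_BATCH_READY */
    bool on_batch_list;         /* true once linked into the head's batch_list */
};

/*
 * Replays the CPU1/CPU2/CPU3 interleaving sequentially.
 * Returns true if CPU1 would handle sh3 on its own even though sh3 is
 * already linked into sh1's batch list.
 */
static bool toy_replay_race(bool assign_under_batch_lock)
{
    struct toy_sh sh1 = { NULL, false, false };
    struct toy_sh sh3 = { NULL, true, false };
    bool handled_alone;

    /* CPU2: critical section under batch_lock(sh1). */
    sh3.on_batch_list = true;
    if (assign_under_batch_lock)
        sh3.batch_head = &sh1;  /* fixed ordering: set before unlocking */

    /* CPU3: clears STRIPE_BATCH_READY for the stripes in the batch. */
    sh3.batch_ready = false;

    /* CPU1: sees no batch_head, so it treats sh3 as a standalone stripe. */
    handled_alone = sh3.on_batch_list && sh3.batch_head == NULL;

    /* CPU2: with the old ordering the assignment only happens now. */
    if (!assign_under_batch_lock)
        sh3.batch_head = &sh1;

    /* Either ordering ends with sh3 bound to sh1; only the timing differs. */
    assert(sh3.batch_head == &sh1);
    return handled_alone;
}
```

      Calling toy_replay_race(false) reproduces the window in which sh3 is
      handled alone; toy_replay_race(true) closes it.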
  5. 28 Aug, 2017 1 commit
  6. 25 Aug, 2017 1 commit
  7. 24 Aug, 2017 1 commit
    • md/raid5: release/flush io in raid5_do_work() · 9c72a18e
      Song Liu authored
      In raid5, there are scenarios where some IOs are deferred to a later
      time, and some IOs need a flush to complete. To make sure we make
      progress with these IOs, we need to call the following functions:
        flush_deferred_bios(conf)
        r5l_flush_stripe_to_raid(conf->log)
      Both of these functions are called in raid5d(), but missing in
      raid5_do_work(). As a result, these functions are not called
      when multi-threading (group_thread_cnt > 0) is enabled. This patch
      adds calls to these functions to raid5_do_work().
      Note for stable branches:
        r5l_flush_stripe_to_raid(conf->log) is needed for 4.4+
        flush_deferred_bios(conf) is only needed for 4.11+
      Cc: stable@vger.kernel.org (4.4+)
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  8. 23 Aug, 2017 2 commits
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • raid5: remove a call to get_start_sect · 10433d04
      Christoph Hellwig authored
      The block layer always remaps partitions before calling into the
      ->make_request methods of drivers.  Thus the call to get_start_sect in
      in_chunk_boundary will always return 0 and can be removed.
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
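      The two commits above can be illustrated with a small userspace model of
      a bio that carries a disk pointer plus a partition index. The toy_*
      types and functions are hypothetical; the sketch assumes that remapping
      adds the partition's start sector and then resets the index to the
      whole disk, which is why a driver would always see a start offset of 0.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PARTS 4

/* Hypothetical disk: slot 0 is the whole disk and always starts at sector 0. */
struct toy_gendisk {
    uint64_t part_start[TOY_MAX_PARTS];
};

/* Hypothetical bio: a disk pointer and a partition index instead of a block_device. */
struct toy_bio {
    struct toy_gendisk *disk;
    uint8_t partno;
    uint64_t sector;
};

/* Translate a partition-relative sector into a whole-disk sector. */
static void toy_partition_remap(struct toy_bio *bio)
{
    assert(bio->partno < TOY_MAX_PARTS);
    bio->sector += bio->disk->part_start[bio->partno];
    bio->partno = 0;  /* from here on the bio addresses the whole disk */
}

/* What a driver would get if it asked for the start sector after remapping. */
static uint64_t toy_start_sect_seen_by_driver(const struct toy_bio *bio)
{
    return bio->disk->part_start[bio->partno];
}
```

      After toy_partition_remap() the driver-side lookup returns 0 for every
      bio, so the extra offset in a chunk-boundary check contributes nothing.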
  9. 24 Jul, 2017 1 commit
  10. 21 Jul, 2017 1 commit
  11. 10 Jul, 2017 1 commit
    • Raid5 should update rdev->sectors after reshape · b5d27718
      Xiao Ni authored
      The raid5 md device is created from disks whose total size is not fully used. For example,
      each device is 5G and only 3G of each device is used to create the raid5 device.
      Then change the chunk size and wait for the reshape to finish. After the reshape finishes, stop the array
      and assemble it again. It fails.
      mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
      mdadm /dev/md0 --grow --chunk=64
      wait reshape to finish
      mdadm -S /dev/md0
      mdadm -As
      The error messages:
      [197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
      [197519.821686] md: md_import_device returned -22
      The reshape changes the data offset; in this case it runs in the backwards direction.
      In function super_1_load() the available space of the underlying device is compared with
      sb->data_size. The new data offset is larger after the reshape, so super_1_load() returns -EINVAL.
      rdev->sectors is updated in md_finish_reshape(), and sb->data_size is then set in super_1_sync() based
      on rdev->sectors. So call md_finish_reshape() in end_reshape().
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Acked-by: Guoqing Jiang <gqjiang@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Shaohua Li <shli@fb.com>
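      The failing size check can be approximated as follows. This is a sketch
      under assumptions: toy_super_load_check() is a hypothetical reduction of
      the comparison described above, it ignores everything else the real
      super_1_load() validates, and the sector numbers in the usage note are
      made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical reduction of the check: the space left after data_offset
 * must be able to hold data_size sectors, otherwise the superblock is
 * rejected with -EINVAL.
 */
static int toy_super_load_check(uint64_t dev_sectors, uint64_t data_offset,
                                uint64_t data_size)
{
    assert(data_offset <= dev_sectors);
    if (dev_sectors - data_offset < data_size)
        return -EINVAL;
    return 0;
}
```

      For example, with a 10485760-sector device, a data_size that was computed
      for an old offset of 2048 no longer fits once the offset grows to 4096,
      so the check fails until data_size is refreshed from the updated
      rdev->sectors.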
  12. 18 Jun, 2017 1 commit
  13. 13 Jun, 2017 2 commits
    • md: don't use flush_signals in userspace processes · f9c79bc0
      Mikulas Patocka authored
      The function flush_signals clears all pending signals for the process. It
      may be used by kernel threads when we need to prepare a kernel thread for
      responding to signals. However, using this function for a userspace
      process is incorrect - clearing signals without the program expecting it
      can cause misbehavior.
      The raid1 and raid5 code uses flush_signals in its request routine because
      it wants to prepare for an interruptible wait. This patch drops
      flush_signals and uses sigprocmask instead to block all signals (including
      SIGKILL) around the schedule() call. The signals are not lost, but the
      schedule() call won't respond to them.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
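      The "block, wait, restore" pattern has a close userspace analogue in the
      POSIX sigprocmask() call, which can be used to show that a blocked signal
      stays pending instead of being discarded. This is a userspace sketch, not
      the kernel change itself: toy_handler() and
      toy_blocked_signal_survives() are hypothetical names, raise() stands in
      for a signal arriving during the wait, and unlike the kernel code
      userspace cannot block SIGKILL or SIGSTOP.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t toy_delivered;

static void toy_handler(int signo)
{
    (void)signo;
    toy_delivered = 1;
}

/*
 * Returns 1 if a signal raised while all signals are blocked stays pending
 * during the blocked region and is delivered once the old mask is restored.
 */
static int toy_blocked_signal_survives(void)
{
    struct sigaction sa, old_sa;
    sigset_t full, old, pending;
    int was_pending, held_back;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toy_handler;
    sigemptyset(&sa.sa_mask);
    assert(sigaction(SIGUSR1, &sa, &old_sa) == 0);
    toy_delivered = 0;

    sigfillset(&full);
    sigprocmask(SIG_BLOCK, &full, &old);   /* block everything that can be blocked */
    raise(SIGUSR1);                        /* arrives while blocked */
    sigpending(&pending);
    was_pending = sigismember(&pending, SIGUSR1);
    held_back = (toy_delivered == 0);      /* handler has not run yet */
    sigprocmask(SIG_SETMASK, &old, NULL);  /* restore; the pending signal is delivered */

    sigaction(SIGUSR1, &old_sa, NULL);
    return was_pending == 1 && held_back && toy_delivered == 1;
}
```

      The function returns 1 when SIGUSR1 was not already blocked by the
      caller, which matches the behavior described above: the signal is not
      lost, it is only held back for the duration of the blocked region.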
    • md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown authored
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen when
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: Nix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  14. 09 Jun, 2017 1 commit
  15. 05 Jun, 2017 1 commit
  16. 24 May, 2017 1 commit
    • md: report sector of stripes with check mismatches · e1539036
      Nix authored
      This makes it possible, with appropriate filesystem support, for a
      sysadmin to tell what is affected by the mismatch, and whether
      it should be ignored (if it's inside a swap partition, for
      example).
      We ratelimit to prevent log flooding: if there are so many
      mismatches that ratelimiting is necessary, the individual messages
      are relatively unlikely to be important (either the machine is
      swapping like crazy or something is very wrong with the disk).
      Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
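      The ratelimiting idea can be sketched as a fixed-window counter. This is
      a simplified model with hypothetical names (struct toy_ratelimit,
      toy_ratelimit_ok()), not the kernel's ratelimit implementation; time is
      passed in by the caller so the behavior is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fixed-window limiter: at most `burst` messages per `interval` ticks. */
struct toy_ratelimit {
    unsigned long interval;  /* window length, in caller-defined ticks */
    unsigned int burst;      /* messages allowed per window */
    unsigned long begin;     /* start of the current window */
    unsigned int printed;    /* messages emitted in the current window */
    unsigned int missed;     /* messages suppressed in the current window */
};

/* Returns true if a message may be logged at time `now`. */
static bool toy_ratelimit_ok(struct toy_ratelimit *rs, unsigned long now)
{
    assert(rs->interval > 0);
    if (now - rs->begin >= rs->interval) {
        /* The window has expired: start a new one and reset the counters. */
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}
```

      With a burst of 2 and an interval of 10 ticks, the third mismatch report
      inside one window is dropped, and reporting resumes in the next window.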
  17. 12 May, 2017 2 commits
    • md/r5cache: handle sync with data in write back cache · 5ddf0440
      Song Liu authored
      Currently, sync of raid456 array cannot make progress when hitting
      data in writeback r5cache.
      This patch fixes this issue by flushing cached data of the stripe
      before processing the sync request. This is achieved by:
      1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is
         in write back cache;
      2. In r5c_try_caching_write(), handle the stripe in sync with write
      3. In do_release_stripe(), make stripe in sync write out and send
         it to the state machine.
      Shaohua: explicitly set STRIPE_HANDLE after write out completed
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: gracefully handle journal device errors for writeback mode · 70d466f7
      Song Liu authored
      For raid456 with writeback cache, when the journal device fails during
      normal operation, it is still possible to persist all data, as all
      pending data is still in the stripe cache. However, it is necessary to handle
      journal failure gracefully.
      During journal failures, the following logic handles the graceful shutdown
      of journal:
      1. raid5_error() marks the device as Faulty and schedules async work
      2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
         suspended, set to write through, and then resumed. mddev_suspend()
         flushes all cached stripes;
      3. All cached stripes need to be flushed carefully to the RAID array.
      This patch fixes issues within the process above:
      1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
         journal failures;
      2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
         since raid5_error() updates superblock.
      3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
         to make progress during log_failed;
      4. In delay_towrite(), if log failed only process data in the cache (skip
         new writes in dev->towrite);
      5. In __get_priority_stripe(), process loprio_list during journal device
         failure;
      6. In raid5_remove_disk(), wait for all cached stripes to be flushed before
         calling log_exit().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  18. 08 May, 2017 1 commit
  19. 04 May, 2017 1 commit
    • md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock · 3d05f3ae
      Julia Cartwright authored
      On mainline, there is no functional difference, just less code, and
      symmetric lock/unlock paths.
      On PREEMPT_RT builds, this fixes the following warning, seen by
      Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.
         BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
         in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
         CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G        W       4.9.20-rt16-stand6-686 #1
         Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
         Workqueue: writeback wb_workfn (flush-253:0)
         Call Trace:
          ? migrate_enable+0x4a/0xf0
          add_stripe_bio+0x4e3/0x6c0 [raid456]
          ? preempt_count_add+0x42/0xb0
          raid5_make_request+0x737/0xdd0 [raid456]
      Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
      Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru>
      Signed-off-by: Julia Cartwright <julia@ni.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  20. 25 Apr, 2017 1 commit
  21. 11 Apr, 2017 1 commit
    • md/raid5: make chunk_aligned_read() split bios more cleanly. · dd7a8f5d
      NeilBrown authored
      chunk_aligned_read() currently uses fs_bio_set - which is meant for
      filesystems to use - and loops if multiple splits are needed, which is
      not best practice.
      As this is only used for READ requests, not writes, it is unlikely
      to cause a problem.  However it is best to be consistent in how
      we split bios, and to follow the pattern used in raid1/raid10.
      So create a private bioset, bio_split, and use it to perform a single
      split, submitting the remainder to generic_make_request() for later
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
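      The "single split, resubmit the remainder" pattern can be sketched with
      plain sector arithmetic. The helper names are hypothetical and the sketch
      works on sector counts only; it does not model bios, biosets or
      generic_make_request().

```c
#include <assert.h>
#include <stdint.h>

/* Sectors from `sector` up to the end of the chunk that contains it. */
static uint64_t toy_sectors_to_chunk_end(uint64_t sector, uint32_t chunk_sectors)
{
    assert(chunk_sectors > 0);
    return chunk_sectors - (sector % chunk_sectors);
}

/*
 * Perform at most one split at the chunk boundary.
 * Returns the length of the front piece that is handled now and stores in
 * *remainder the number of sectors that would be resubmitted for later.
 */
static uint32_t toy_split_once(uint64_t sector, uint32_t nr_sectors,
                               uint32_t chunk_sectors, uint32_t *remainder)
{
    uint64_t room = toy_sectors_to_chunk_end(sector, chunk_sectors);

    if (room < nr_sectors) {
        *remainder = nr_sectors - (uint32_t)room;
        return (uint32_t)room;
    }
    *remainder = 0;  /* the request fits inside one chunk, no split needed */
    return nr_sectors;
}
```

      A 300-sector read starting at sector 100 with 128-sector chunks is cut
      once, into a 28-sector front piece and a 272-sector remainder; the
      remainder is split again only when it is resubmitted, so no loop is
      needed at this level.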
  22. 10 Apr, 2017 4 commits
    • raid5-ppl: partial parity calculation optimization · ae1713e2
      Artur Paszkiewicz authored
      In case of read-modify-write, partial parity is the same as the result
      of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into
      sh->ppl_page instead of calculating it again.
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
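      The equality can be checked on a toy stripe. The sketch assumes that
      partial parity is the XOR of the data blocks not modified by the write,
      and that the prexor step XORs the old contents of the blocks being
      written out of the old parity; the block size, data values and toy_*
      names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK 8

/* dst ^= src, byte by byte. */
static void toy_xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/*
 * Three data blocks d0..d2, with a read-modify-write that targets d0 only.
 * Returns 1 if "old parity with old d0 XORed out" equals the partial
 * parity computed directly from the unmodified blocks d1 and d2.
 */
static int toy_prexor_matches_partial_parity(void)
{
    const uint8_t d0[TOY_BLOCK] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 };
    const uint8_t d1[TOY_BLOCK] = { 0xa0, 0xb1, 0xc2, 0xd3, 0xe4, 0xf5, 0x06, 0x17 };
    const uint8_t d2[TOY_BLOCK] = { 0x0f, 0x1e, 0x2d, 0x3c, 0x4b, 0x5a, 0x69, 0x78 };
    uint8_t parity[TOY_BLOCK] = { 0 };
    uint8_t partial[TOY_BLOCK] = { 0 };

    /* Old parity of the full stripe: d0 ^ d1 ^ d2. */
    toy_xor_into(parity, d0, TOY_BLOCK);
    toy_xor_into(parity, d1, TOY_BLOCK);
    toy_xor_into(parity, d2, TOY_BLOCK);

    /* Prexor step: remove the old contents of the block being written. */
    toy_xor_into(parity, d0, TOY_BLOCK);

    /* Partial parity computed from scratch over the unmodified blocks. */
    toy_xor_into(partial, d1, TOY_BLOCK);
    toy_xor_into(partial, d2, TOY_BLOCK);

    return memcmp(parity, partial, TOY_BLOCK) == 0;
}
```

      Because XOR is its own inverse, removing d0 from the full parity leaves
      exactly d1 ^ d2, so the prexor result can be copied instead of being
      recomputed.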
    • raid5-ppl: use resize_stripes() when enabling or disabling ppl · 845b9e22
      Artur Paszkiewicz authored
      Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate
      or free sh->ppl_page at runtime for all stripes in the stripe cache.
      raid5_reset_stripe_cache() required suspending the mddev and could
      deadlock because of GFP_KERNEL allocations.
      Move the 'newsize' check to check_reshape() to allow reallocating the
      stripes with the same number of disks. Allocate sh->ppl_page in
      alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as
      a parameter to alloc_stripe() because it is needed to check whether to
      allocate ppl_page. Add free_stripe() and use it to free stripes rather
      than directly call kmem_cache_free(). Also free sh->ppl_page in
      Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly
      setting it in advance and add another parameter to log_init() to allow
      calling ppl_init_log() without the bit set. Don't try to calculate
      partial parity or add a stripe to log if it does not have ppl_page set.
      Enabling ppl can now be performed without suspending the mddev, because
      the log won't be used until new stripes are allocated with ppl_page.
      Calling mddev_suspend/resume is still necessary when disabling ppl,
      because we want all stripes to finish before stopping the log, but
      resize_stripes() can be called after mddev_resume() when ppl is no
      longer active.
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid6: Fix anomily when recovering a single device in RAID6. · 7471fb77
      NeilBrown authored
      When recovering a single missing/failed device in a RAID6,
      those stripes where the Q block is on the missing device are
      handled a bit differently.  In these cases it is easy to
      check that the P block is correct, so we do.  This results
      in the P block being destroyed.  Consequently the P block needs
      to be read a second time in order to compute Q.  This causes
      lots of seeks and hurts performance.
      It shouldn't be necessary to re-read P as it can be computed
      from the DATA.  But we only compute blocks on missing
      devices, since c337869d ("md: do not compute parity
      unless it is on a failed drive").
      So relax the change made in that commit to allow computing
      of the P block in a RAID6 which it is the only missing that
      This makes RAID6 recovery run much faster as the disk just
      "before" the recovering device is no longer seeking
      Reported-and-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: update slab_cache before releasing new stripes when stripes resizing · 583da48e
      Dennis Yang authored
      When growing a raid5 device on a machine with little memory, there is a chance that
      mdadm will be killed and the following bug report can be observed. The same
      bug could also be reproduced in linux-4.10.6.
      [57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
      [57600.114678] Oops: 0002 [#1] SMP
      [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
      [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
      [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
      [57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
      [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
      [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
      [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
      [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
      [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
      [57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
      [57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
      [57600.274942] Stack:
      [57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
      [57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
      [57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
      [57600.299254] Call Trace:
      [57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
      [57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
      [57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
      [57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
      [57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
      [57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
      [57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
      [57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
      [57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
      [57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
      [57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
      [57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
      [57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
      [57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
      [57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
      [57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
      [57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
      [57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
      [57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
      [57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
      [57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
      [57600.441208]  RSP <ffff880043073810>
      [57600.444690] CR2: 0000000000000000
      [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
      The problem is that resize_stripes() releases new stripe_heads before assigning the new
      slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
      after resize_stripes() starts releasing new stripes but right before the new slab cache
      is assigned, it is possible that these new stripe_heads will be freed with the old
      slab_cache, which has already been destroyed, and that triggers this bug.
      Signed-off-by: Dennis Yang <dennisyang@qnap.com>
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1+)
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  23. 08 Apr, 2017 2 commits
  24. 07 Apr, 2017 1 commit
    • block: trace completion of all bios. · fbbaf700
      NeilBrown authored
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      So move the trace_block_bio_complete() call to bio_endio().
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handled directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this we can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it is re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      So if generic_make_request() is called (which generates a QUEUED
      event), then bi_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
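      The flag protocol for the split case can be modelled with event counters.
      This is a sketch with hypothetical names: a single boolean stands in for
      BIO_TRACE_COMPLETION, and toy_make_request(), toy_split() and toy_endio()
      only mimic the tracing decisions, not any real I/O.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bio: only the tracing flag is modelled. */
struct toy_bio {
    bool trace_completion;
};

struct toy_trace {
    int queue_events;
    int complete_events;
};

/* Emit a 'queue' event only the first time a bio is submitted. */
static void toy_make_request(struct toy_bio *bio, struct toy_trace *t)
{
    if (!bio->trace_completion) {
        t->queue_events++;
        bio->trace_completion = true;
    }
}

/* Split off a front piece; it inherits the tracing flag from the original. */
static struct toy_bio toy_split(const struct toy_bio *orig)
{
    struct toy_bio front = { orig->trace_completion };
    return front;
}

/* Emit a 'complete' event if the flag is set, then clear it. */
static void toy_endio(struct toy_bio *bio, struct toy_trace *t)
{
    if (bio->trace_completion) {
        t->complete_events++;
        bio->trace_completion = false;
    }
}

/* One bio is queued, split once, and both pieces complete. */
static struct toy_trace toy_split_scenario(void)
{
    struct toy_trace t = { 0, 0 };
    struct toy_bio whole = { false };
    struct toy_bio front;

    toy_make_request(&whole, &t);  /* the only 'queue' event */
    front = toy_split(&whole);
    toy_make_request(&whole, &t);  /* remainder resubmitted: flag already set */
    toy_endio(&front, &t);
    toy_endio(&whole, &t);

    assert(!front.trace_completion && !whole.trace_completion);
    return t;
}
```

      The scenario yields one 'queue' event and two 'complete' events, one per
      range, which is the outcome the message above describes as preferable.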
  25. 27 Mar, 2017 1 commit
    • md/raid5: use consistency_policy to remove journal feature · 0bb0c105
      Song Liu authored
      When the journal device of an array fails, the array is forced into read-only
      mode. To make the array normal without adding another journal device, we
      need to remove the journal _feature_ from the array.
      This patch allows removing the journal _feature_ from an array. For this, the
      existing journal should be either missing or faulty.
      To remove the journal feature, it is necessary to remove the journal device first:
        mdadm --fail /dev/md0 /dev/sdb
        mdadm: set /dev/sdb faulty in /dev/md0
        mdadm --remove /dev/md0 /dev/sdb
        mdadm: hot removed /dev/sdb from /dev/md0
      Then the journal feature can be removed by echoing into the sysfs file:
       cat /sys/block/md0/md/consistency_policy
       echo resync > /sys/block/md0/md/consistency_policy
       cat /sys/block/md0/md/consistency_policy
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  26. 24 Mar, 2017 1 commit
  27. 23 Mar, 2017 7 commits
    • md/raid5: don't test ->writes_pending in raid5_remove_disk · 84dd97a6
      NeilBrown authored
      This test on ->writes_pending cannot be safe as the counter
      can be incremented at any moment and cannot be locked against.
      Change it to test conf->active_stripes, which at least
      can be locked against.  More changes are still needed.
      A future patch will change ->writes_pending, and testing it here will
      be very inconvenient.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • Revert "md/raid5: limit request size according to implementation limits" · 97d53438
      NeilBrown authored
      This reverts commit e8d7c332.
      Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
      need to impose these limits.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: remove over-loading of ->bi_phys_segments. · 0472a42b
      NeilBrown authored
      When a read request, which bypassed the cache, fails, we need to retry
      it through the cache.
      This involves attaching it to a sequence of stripe_heads, and it may not
      be possible to get all the stripe_heads we need at once.
      We do what we can, and record how far we got in ->bi_phys_segments so
      we can pick up again later.
      There is only ever one bio which may have a non-zero offset stored in
      ->bi_phys_segments, the one that is either active in the single thread
      which calls retry_aligned_read(), or is in conf->retry_read_aligned
      waiting for retry_aligned_read() to be called again.
      So we only need to store one offset value.  This can be in a local
      variable passed between remove_bio_from_retry() and
      retry_aligned_read(), or in the r5conf structure next to the
      ->retry_read_aligned pointer.
      Storing it there allows the last usage of ->bi_phys_segments to be
      removed from md/raid5.c.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter · 016c76ac
      NeilBrown authored
      md/raid5 needs to keep track of how many stripe_heads are processing a
      bio so that it can delay calling bio_endio() until all stripe_heads
      have completed.  It currently uses 16 bits of ->bi_phys_segments for
      this purpose.
      16 bits is only enough for 256M requests, and it is possible for a
      single bio to be larger than this, which causes problems.  Also, the
      bio struct contains a larger counter, __bi_remaining, which has a
      purpose very similar to the purpose of our counter.  So stop using
      ->bi_phys_segments, and instead use __bi_remaining.
      This means we don't need to initialize the counter, as our caller
      initializes it to '1'.  It also means we can call bio_endio() directly
      as it tests this counter internally.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
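      The 256M figure follows from simple arithmetic, sketched below under the
      assumption that each stripe_head covers 4 KiB of the bio (one page on
      common configurations). toy_max_bio_bytes() is a hypothetical helper and
      treats the counter as able to represent 2^bits stripes, which is a
      rounding of the true maximum count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the bio size that a counter of `counter_bits` bits can
 * track when every counted stripe covers `bytes_per_stripe` bytes.
 */
static uint64_t toy_max_bio_bytes(unsigned int counter_bits, uint64_t bytes_per_stripe)
{
    assert(counter_bits > 0 && counter_bits < 32);
    return ((uint64_t)1 << counter_bits) * bytes_per_stripe;
}
```

      With 16 bits and 4096 bytes per stripe the bound is 268435456 bytes, that
      is 256 MiB; a wider counter raises the bound accordingly.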
    • md/raid5: call bio_endio() directly rather than queueing for later. · bd83d0a2
      NeilBrown authored
      We currently gather bios that need to be returned into a bio_list
      and call bio_endio() on them all together.
      The original reason for this was to avoid making the calls while
      holding a spinlock.
      Locking has changed a lot since then, and that reason is no longer
      valid.
      So discard return_io() and various return_bi lists, and just call
      bio_endio() directly as needed.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: simplfy delaying of writes while metadata is updated. · 16d997b7
      NeilBrown authored
      If a device fails during a write, we must ensure the failure is
      recorded in the metadata before the completion of the write is
      Commit c3cce6cd ("md/raid5: ensure device failure recorded before
      write request returns.")  added code for this, but it was
      unnecessarily complicated.  We already had similar functionality for
      handling updates to the bad-block-list, thanks to Commit de393cde
      ("md: make it easier to wait for bad blocks to be acknowledged.")
      So revert most of the former commit, and instead avoid collecting
      completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
      the metadata and retry the stripe_head.
      As this change can leave a stripe_head ready for handling immediately
      after handle_active_stripes() returns, we change raid5_do_work() to
      pause when MD_CHANGE_PENDING is set, so that it doesn't spin.
      We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
      asynchronously.  After analyse_stripe(), we have collected stable data
      about the state of devices, which will be used to make decisions.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown authored
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it to count stripe_heads that a WRITE bio
      has been attached to.
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start.
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requests with
          raid5_dec_bi_active_stripes(), we now call md_write_end().
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>