1. 31 Aug, 2017 2 commits
    • block, bfq: remove direct switch to an entity in higher class · a02195ce
      Paolo Valente authored
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, and finds out
      that E belongs to a higher-priority class than that of the current
      next-in-service entity, then it sets next_in_service directly to
      E. But this may lead to anomalous schedules, because E may not be
      eligible for service: its virtual start time may be higher than
      the system virtual time for its service tree.
      
      This commit addresses this issue by simply removing this direct
      switch.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a02195ce
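The anomaly described in a02195ce can be illustrated with a toy model. This is a minimal sketch, not BFQ code: `toy_entity`, `toy_eligible` and `toy_direct_switch` are hypothetical names, and the only rule taken from the commit message is that an entity is eligible when its virtual start time does not exceed the virtual time of its service tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a BFQ scheduling entity. */
struct toy_entity {
	int prio_class;   /* lower value means higher-priority class */
	uint64_t start;   /* virtual start time of the entity */
	uint64_t vtime;   /* virtual time of the service tree it belongs to */
};

/* Eligible once the tree's virtual time has reached the start time. */
static bool toy_eligible(const struct toy_entity *e)
{
	assert(e != NULL);
	return e->start <= e->vtime;
}

/* The shortcut in question: compares classes only, ignores eligibility. */
static const struct toy_entity *
toy_direct_switch(const struct toy_entity *next, const struct toy_entity *e)
{
	assert(next != NULL && e != NULL);
	return e->prio_class < next->prio_class ? e : next;
}
```

With a newly activated entity in a higher class whose start time is still ahead of its tree's virtual time, `toy_direct_switch` returns that entity even though `toy_eligible` reports it as not eligible, which is the kind of schedule the commit avoids by dropping the shortcut.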
    • block, bfq: make lookup_next_entity push up vtime on expirations · 80294c3b
      Paolo Valente authored
      To provide a very smooth service, bfq starts to serve a bfq_queue
      only if the queue is 'eligible', i.e., if the same queue would
      have started to be served in the ideal, perfectly fair system that
      bfq simulates internally. This is obtained by associating each
      queue with a virtual start time, and by computing a special system
      virtual time quantity: a queue is eligible only if the system
      virtual time has reached the virtual start time of the
      queue. Finally, bfq guarantees that, when a new queue must be set
      in service, there is always at least one eligible entity for each
      active parent entity in the scheduler. To provide this guarantee,
      the function __bfq_lookup_next_entity pushes up, for each parent
      entity on which it is invoked, the system virtual time to the
      minimum among the virtual start times of the entities in the
      active tree for the parent entity (more precisely, the push up
      occurs if the system virtual time happens to be lower than all
      such virtual start times).
      
      There is however a circumstance in which __bfq_lookup_next_entity
      cannot push up the system virtual time for a parent entity, even
      if the system virtual time is lower than the virtual start times
      of all the child entities in the active tree. It happens if one of
      the child entities is in service. In fact, in such a case, there
      is already an eligible entity, the in-service one, even if it may
      not be present in the active tree (because in-service entities
      may be removed from the active tree).
      
      Unfortunately, in the last re-design of the
      hierarchical-scheduling engine, the reset of the pointer to the
      in-service entity for a given parent entity--reset to be done as a
      consequence of the expiration of the in-service entity--always
      happens after the function __bfq_lookup_next_entity has been
      invoked. This causes the function to think that there is still an
      entity in service for the parent entity, and therefore that the
      system virtual time cannot be pushed up, even though the
      no-longer-in-service entity has already been properly reinserted
      into the active tree (or into some other tree if no longer
      active). Yet, the system virtual time *had* to be pushed up, to be
      ready to correctly choose the next queue to serve. Because of the
      lack of this push up, bfq may wrongly set in service a queue that
      had been speculatively pre-computed as the possible
      next-in-service queue, but that would no longer be the one to serve
      after the expiration and the reinsertion into the active trees of
      the previously in-service entities.
      
      This commit addresses this issue by making
      __bfq_lookup_next_entity properly push up the system virtual time
      if an expiration is occurring.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80294c3b
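The push-up rule from 80294c3b can be sketched as a small pure function. This is an illustration under stated assumptions rather than the kernel implementation: `toy_lookup_vtime` and its `in_service` and `expiration` parameters are hypothetical, and the active tree is reduced to a plain array of virtual start times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the virtual time a lookup should work with.
 * starts[] holds the virtual start times of the entities in the active tree.
 */
static uint64_t toy_lookup_vtime(uint64_t vtime, const uint64_t *starts,
				 size_t n, bool in_service, bool expiration)
{
	uint64_t min;
	size_t i;

	assert(starts != NULL && n > 0);
	min = starts[0];
	for (i = 1; i < n; i++)
		if (starts[i] < min)
			min = starts[i];

	/* An entity genuinely in service is already eligible: no push-up. */
	if (in_service && !expiration)
		return vtime;

	/* Otherwise make sure at least one active entity becomes eligible. */
	return vtime < min ? min : vtime;
}
```

The case the commit is about is `in_service == true` with `expiration == true`: the in-service pointer has not been reset yet, but the entity is expiring, so the virtual time is still pushed up to the smallest start time.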
  2. 28 Aug, 2017 2 commits
  3. 25 Aug, 2017 3 commits
  4. 24 Aug, 2017 4 commits
    • compat_hdio_ioctl: Fix a declaration · 6a934bb8
      Bart Van Assche authored
      This patch prevents sparse from reporting the following warning messages:
      
      block/compat_ioctl.c:85:11: warning: incorrect type in assignment (different address spaces)
      block/compat_ioctl.c:85:11:    expected unsigned long *[noderef] <asn:1>p
      block/compat_ioctl.c:85:11:    got void [noderef] <asn:1>*
      block/compat_ioctl.c:91:21: warning: incorrect type in argument 1 (different address spaces)
      block/compat_ioctl.c:91:21:    expected void const volatile [noderef] <asn:1>*<noident>
      block/compat_ioctl.c:91:21:    got unsigned long *[noderef] <asn:1>p
      block/compat_ioctl.c:87:53: warning: dereference of noderef expression
      block/compat_ioctl.c:91:21: warning: dereference of noderef expression
      
      Fixes: commit d597580d ("generic ...copy_..._user primitives")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6a934bb8
    • block: remove blk_free_devt in add_partition · 47570848
      weiping zhang authored
      put_device(pdev) eventually calls pdev->type->release, and part_release()
      already calls blk_free_devt, so remove the explicit call.
      Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      47570848
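The reasoning in 47570848 rests on the release callback running when the last reference is dropped. A toy reference-counting model, with hypothetical names (`toy_device`, `toy_part_release`, `toy_put_device`) standing in for the driver-core objects, shows why a second explicit free would be redundant:

```c
#include <assert.h>
#include <stddef.h>

struct toy_device {
	int refcount;
	int devt_frees;                        /* times the devt was freed */
	void (*release)(struct toy_device *);  /* runs on the last put */
};

/* Stand-in for the release callback, which frees the devt itself. */
static void toy_part_release(struct toy_device *dev)
{
	dev->devt_frees++;
}

/* Drop one reference; the release callback fires when none are left. */
static void toy_put_device(struct toy_device *dev)
{
	assert(dev != NULL && dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->release(dev);
}
```

After a single `toy_put_device` on a device holding one reference, `devt_frees` is already 1; freeing the devt again on the caller's side would count it twice.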
    • bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer · 50b4d485
      Benjamin Block authored
      Since we split the scsi_request out of struct request, bsg fails to
      provide a reply-buffer for the drivers. This was done via the pointer
      for sense-data, which is not preallocated anymore.
      
      Failing to allocate/assign it results in illegal dereferences because
      LLDs use this pointer without checking it.
      
      An example panic on s390x, using the zFCP driver, looks like this (I had
      debugging on, otherwise NULL-pointer dereferences wouldn't even panic on
      s390x):
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6403
      Fault in home space mode while using kernel ASCE.
      AS:0000000001590007 R3:0000000000000024
      Oops: 0038 ilc:2 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: <Long List>
      CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-bsg-regression+ #3
      Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
      task: 0000000065cb0100 task.stack: 0000000065cb4000
      Krnl PSW : 0704e00180000000 000003ff801e4156 (zfcp_fc_ct_els_job_handler+0x16/0x58 [zfcp])
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000001 000000005fa9d0d0 000000005fa9d078 0000000000e16866
                 000003ff00000290 6b6b6b6b6b6b6b6b 0000000059f78f00 000000000000000f
                 00000000593a0958 00000000593a0958 0000000060d88800 000000005ddd4c38
                 0000000058b50100 07000000659cba08 000003ff801e8556 00000000659cb9a8
      Krnl Code: 000003ff801e4146: e31020500004        lg      %r1,80(%r2)
                 000003ff801e414c: 58402040           l       %r4,64(%r2)
                #000003ff801e4150: e35020200004       lg      %r5,32(%r2)
                >000003ff801e4156: 50405004           st      %r4,4(%r5)
                 000003ff801e415a: e54c50080000       mvhi    8(%r5),0
                 000003ff801e4160: e33010280012       lt      %r3,40(%r1)
                 000003ff801e4166: a718fffb           lhi     %r1,-5
                 000003ff801e416a: 1803               lr      %r0,%r3
      Call Trace:
      ([<000003ff801e8556>] zfcp_fsf_req_complete+0x726/0x768 [zfcp])
       [<000003ff801ea82a>] zfcp_fsf_reqid_check+0x102/0x180 [zfcp]
       [<000003ff801eb980>] zfcp_qdio_int_resp+0x230/0x278 [zfcp]
       [<00000000009b91b6>] qdio_kick_handler+0x2ae/0x2c8
       [<00000000009b9e3e>] __tiqdio_inbound_processing+0x406/0xc10
       [<00000000001684c2>] tasklet_action+0x15a/0x1d8
       [<0000000000bd28ec>] __do_softirq+0x3ec/0x848
       [<00000000001675a4>] irq_exit+0x74/0xf8
       [<000000000010dd6a>] do_IRQ+0xba/0xf0
       [<0000000000bd19e8>] io_int_handler+0x104/0x2d4
       [<00000000001033b6>] enabled_wait+0xb6/0x188
      ([<000000000010339e>] enabled_wait+0x9e/0x188)
       [<000000000010396a>] arch_cpu_idle+0x32/0x50
       [<0000000000bd0112>] default_idle_call+0x52/0x68
       [<00000000001cd0fa>] do_idle+0x102/0x188
       [<00000000001cd41e>] cpu_startup_entry+0x3e/0x48
       [<0000000000118c64>] smp_start_secondary+0x11c/0x130
       [<0000000000bd2016>] restart_int_handler+0x62/0x78
       [<0000000000000000>]           (null)
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
       [<000003ff801e41d6>] zfcp_fc_ct_job_handler+0x3e/0x48 [zfcp]
      
      Kernel panic - not syncing: Fatal exception in interrupt
      
      This patch makes bsg-lib allocate and set up struct bsg_job ahead of
      time, including the allocation of a buffer for the reply-data.
      
      This means struct bsg_job is not allocated separately anymore, but as part
      of struct request allocation - similar to struct scsi_cmd. Reflect this in
      the function names that used to handle creation/destruction of struct
      bsg_job.
      Reported-by: Steffen Maier <maier@linux.vnet.ibm.com>
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
      Fixes: 82ed4db4 ("block: split scsi_request out of struct request")
      Cc: <stable@vger.kernel.org> #4.11+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50b4d485
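The idea in 50b4d485 of setting up the job together with its reply buffer, so that drivers never see an unset pointer, can be sketched in user-space C. The names `toy_job`, `toy_job_init` and `toy_job_exit` are hypothetical, and `calloc` merely stands in for whatever allocator the kernel code uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_job {
	void *reply;       /* buffer the low-level driver writes its reply to */
	size_t reply_len;
};

/* Set the job up ahead of time, reply buffer included. */
static int toy_job_init(struct toy_job *job, size_t reply_len)
{
	assert(job != NULL && reply_len > 0);
	job->reply = calloc(1, reply_len);
	if (job->reply == NULL)
		return -ENOMEM;
	job->reply_len = reply_len;
	return 0;
}

/* Tear the job down and leave no dangling pointer behind. */
static void toy_job_exit(struct toy_job *job)
{
	assert(job != NULL);
	free(job->reply);
	job->reply = NULL;
	job->reply_len = 0;
}
```

Pairing the init and exit steps mirrors the commit's point that the job's lifetime now follows the lifetime of the request it is part of.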
    • bio-integrity: Fix regression if profile verify_fn is NULL · 97e05463
      Milan Broz authored
      In the dm-integrity target we register an integrity profile that has
      both generate_fn and verify_fn callbacks set to NULL.
      
      This is used if dm-integrity is stacked under a dm-crypt device
      for authenticated encryption (integrity payload contains authentication
      tag and IV seed).
      
      In this case the verification is done through dm-crypt's own crypto
      API processing; the integrity profile is only a holder
      of this data. (And memory is owned by dm-crypt as well.)
      
      After the commit (and previous changes)
        Commit 7c20f116
        Author: Christoph Hellwig <hch@lst.de>
        Date:   Mon Jul 3 16:58:43 2017 -0600
      
          bio-integrity: stop abusing bi_end_io
      
      we get this crash:
      
      : BUG: unable to handle kernel NULL pointer dereference at   (null)
      : IP:   (null)
      : *pde = 00000000
      ...
      :
      : Workqueue: kintegrityd bio_integrity_verify_fn
      : task: f48ae180 task.stack: f4b5c000
      : EIP:   (null)
      : EFLAGS: 00210286 CPU: 0
      : EAX: f4b5debc EBX: 00001000 ECX: 00000001 EDX: 00000000
      : ESI: 00001000 EDI: ed25f000 EBP: f4b5dee8 ESP: f4b5dea4
      :  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      : CR0: 80050033 CR2: 00000000 CR3: 32823000 CR4: 001406d0
      : Call Trace:
      :  ? bio_integrity_process+0xe3/0x1e0
      :  bio_integrity_verify_fn+0xea/0x150
      :  process_one_work+0x1c7/0x5c0
      :  worker_thread+0x39/0x380
      :  kthread+0xd6/0x110
      :  ? process_one_work+0x5c0/0x5c0
      :  ? kthread_worker_fn+0x100/0x100
      :  ? kthread_worker_fn+0x100/0x100
      :  ret_from_fork+0x19/0x24
      : Code:  Bad EIP value.
      : EIP:   (null) SS:ESP: 0068:f4b5dea4
      : CR2: 0000000000000000
      
      This patch simply skips the whole verify workqueue if verify_fn is set to NULL.
      
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      [hch: trivial whitespace fix]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      97e05463
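The guard described in 97e05463 amounts to checking the callback before deferring any verification work. The following is a minimal sketch with hypothetical names (`toy_profile`, `toy_should_queue_verify`), not the bio-integrity code itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*toy_process_fn)(void *payload);

/* Stand-in for an integrity profile with optional callbacks. */
struct toy_profile {
	toy_process_fn generate_fn;
	toy_process_fn verify_fn;
};

/* Example callback used only to build a profile that does verify. */
static int toy_verify_ok(void *payload)
{
	(void)payload;
	return 0;
}

/* Queue verification only for reads that have a verify callback. */
static bool toy_should_queue_verify(const struct toy_profile *p, bool is_read)
{
	assert(p != NULL);
	return is_read && p->verify_fn != NULL;
}
```

A profile that only holds data, with both callbacks NULL as in the dm-integrity case, never reaches the deferred step, so the NULL function pointer seen in the trace above is never called.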
  5. 23 Aug, 2017 6 commits
  6. 18 Aug, 2017 7 commits
  7. 15 Aug, 2017 1 commit
  8. 11 Aug, 2017 3 commits
    • cfq: Give a chance for arming slice idle timer in case of group_idle · b3193bc0
      Ritesh Harjani authored
      In the scenario below, blkio cgroups are not served according to their
      assigned weights:
      1. The underlying device is nonrotational, with a single HW queue
      of depth >= CFQ_HW_QUEUE_MIN.
      2. The use case forms two blkio cgroups, cg1 (weight 1000) and
      cg2 (weight 100), with two processes (file1 and file2) doing sync IO in
      their respective blkio cgroups.
      
      fio results for the above use case (without this patch):
      file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan  1 19:41:49 1970
        write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
      <...>
      file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan  1 19:41:49 1970
        write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
      <...>
      // Both processes get equal bandwidth even though they belong to different
      cgroups with weights of 1000 (cg1) and 100 (cg2).
      
      In the above case (for nonrotational NCQ devices),
      as soon as the request from cg1 completes, and even
      though cg1 is given a higher set_slice=10, CFQ expires
      this group when the driver tries to fetch the next request,
      without granting it any idle time or weight priority,
      and schedules another cfq group (cg2 in this case).
      Both cfq groups (cg1 and cg2) thus keep alternating for
      disk time, and the cgroup weight-based scheduling is lost.
      
      This patch gives the CFQ algorithm (cfq_arm_slice_timer) a chance
      to arm the slice timer when group_idle is enabled.
      If group_idle is not required either (including for nonrotational
      NCQ drives), group_idle must be explicitly set to 0 through sysfs
      in such cases.
      
      fio results with this patch (for the above use case):
      file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan  1 00:06:08 1970
        write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
      <..>
      file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan  1 00:06:08 1970
        write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
      <..>
      // Here each process's bandwidth follows its cgroup's weight.
      Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b3193bc0
    • block, bfq: boost throughput with flash-based non-queueing devices · edaf9428
      Paolo Valente authored
      When a queue associated with a process remains empty, there are cases
      where throughput gets boosted if the device is idled to await the
      arrival of a new I/O request for that queue. Currently, BFQ assumes
      that one of these cases is when the device has no internal queueing
      (regardless of the properties of the I/O being served). Unfortunately,
      this condition has proved to be too general. So, this commit refines it
      as "the device has no internal queueing and is rotational".
      
      This refinement provides a significant throughput boost with random
      I/O, on flash-based storage without internal queueing. For example, on
      a HiKey board, throughput increases by up to 125%, growing, e.g., from
      6.9MB/s to 15.6MB/s with two or three random readers in parallel.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      edaf9428
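The refinement in edaf9428 can be written as two predicates, before and after. This is a sketch of that single condition only, with hypothetical names (`toy_disk`, `toy_idle_before`, `toy_idle_after`); the real idling decision involves further cases that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the two device properties the commit mentions. */
struct toy_disk {
	bool internal_queueing;  /* device queues requests internally */
	bool rotational;         /* spinning disk rather than flash */
};

/* Old condition: no internal queueing is enough to favour idling. */
static bool toy_idle_before(const struct toy_disk *d)
{
	assert(d != NULL);
	return !d->internal_queueing;
}

/* Refined condition: no internal queueing and rotational. */
static bool toy_idle_after(const struct toy_disk *d)
{
	assert(d != NULL);
	return !d->internal_queueing && d->rotational;
}
```

For flash storage without internal queueing, the old predicate favours idling and the refined one does not, which is the case where the commit reports the throughput gain.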
    • block,bfq: refactor device-idling logic · d5be3fef
      Paolo Valente authored
      The logic that decides whether to idle the device is scattered across
      three functions. Almost all of the logic is in the function
      bfq_bfqq_may_idle, but (1) part of the decision is made in
      bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
      switch off idling regardless of the output of bfq_bfqq_may_idle. In
      addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
      their decisions as a function of parameters that are used, for similar
      purposes, also in bfq_bfqq_may_idle. This commit addresses these
      issues by moving all the logic into bfq_bfqq_may_idle.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d5be3fef
  9. 10 Aug, 2017 4 commits
    • block: remove unused syncfull/asyncfull queue flags · e743eb1e
      Jens Axboe authored
      We haven't used these in years, but somehow the definitions still
      remained. Kill them, and renumber the QUEUE_FLAG_ space. We had
      a hole in the beginning of the space, too.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e743eb1e
    • block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time · d4acf365
      Bart Van Assche authored
      The blk_mq_delay_kick_requeue_list() function is used by the device
      mapper and only by the device mapper to rerun the queue and requeue
      list after a delay. This function is called once per request that
      gets requeued. Modify this function such that the queue is run once
      per path change event instead of once per request that is requeued.
      
      Fixes: commit 2849450a ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d4acf365
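One way to read "at a quiet time" in d4acf365 is that every requeue pushes a single pending timer further out, so a burst of requeued requests leads to one queue run after the burst ends. The sketch below models only that coalescing behaviour; `toy_delayed_work`, `toy_kick_delayed` and `toy_tick` are hypothetical and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_delayed_work {
	bool pending;
	uint64_t expires;  /* time at which the queue should be rerun */
	unsigned int runs; /* how many times the queue was actually rerun */
};

/* Called once per requeued request: (re)arm the one shared timer. */
static void toy_kick_delayed(struct toy_delayed_work *w, uint64_t now,
			     uint64_t delay)
{
	assert(w != NULL);
	w->pending = true;
	w->expires = now + delay;
}

/* Called as time advances: run the queue once the timer has expired. */
static void toy_tick(struct toy_delayed_work *w, uint64_t now)
{
	assert(w != NULL);
	if (w->pending && now >= w->expires) {
		w->pending = false;
		w->runs++;
	}
}
```

Three kicks at times 0, 1 and 2 with a delay of 5 leave one timer that expires at time 7, so the queue is rerun once rather than three times.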
    • bio-integrity: only verify integrity on the lowest stacked driver · f86e28c4
      Christoph Hellwig authored
      This gets us back to the behavior in 4.12 and earlier.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f86e28c4
    • bio-integrity: Fix regression if profile verify_fn is NULL · c775d209
      Milan Broz authored
      In the dm-integrity target we register an integrity profile that has
      both generate_fn and verify_fn callbacks set to NULL.
      
      This is used if dm-integrity is stacked under a dm-crypt device
      for authenticated encryption (integrity payload contains authentication
      tag and IV seed).
      
      In this case the verification is done through dm-crypt's own crypto
      API processing; the integrity profile is only a holder
      of this data. (And memory is owned by dm-crypt as well.)
      
      After the commit (and previous changes)
        Commit 7c20f116
        Author: Christoph Hellwig <hch@lst.de>
        Date:   Mon Jul 3 16:58:43 2017 -0600
      
          bio-integrity: stop abusing bi_end_io
      
      we get this crash:
      
      : BUG: unable to handle kernel NULL pointer dereference at   (null)
      : IP:   (null)
      : *pde = 00000000
      ...
      :
      : Workqueue: kintegrityd bio_integrity_verify_fn
      : task: f48ae180 task.stack: f4b5c000
      : EIP:   (null)
      : EFLAGS: 00210286 CPU: 0
      : EAX: f4b5debc EBX: 00001000 ECX: 00000001 EDX: 00000000
      : ESI: 00001000 EDI: ed25f000 EBP: f4b5dee8 ESP: f4b5dea4
      :  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      : CR0: 80050033 CR2: 00000000 CR3: 32823000 CR4: 001406d0
      : Call Trace:
      :  ? bio_integrity_process+0xe3/0x1e0
      :  bio_integrity_verify_fn+0xea/0x150
      :  process_one_work+0x1c7/0x5c0
      :  worker_thread+0x39/0x380
      :  kthread+0xd6/0x110
      :  ? process_one_work+0x5c0/0x5c0
      :  ? kthread_worker_fn+0x100/0x100
      :  ? kthread_worker_fn+0x100/0x100
      :  ret_from_fork+0x19/0x24
      : Code:  Bad EIP value.
      : EIP:   (null) SS:ESP: 0068:f4b5dea4
      : CR2: 0000000000000000
      
      This patch simply skips the whole verify workqueue if verify_fn is set to NULL.
      
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      [hch: trivial whitespace fix]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c775d209
  10. 09 Aug, 2017 6 commits
  11. 02 Aug, 2017 2 commits