1. 09 Sep, 2017 2 commits
  2. 28 Aug, 2017 1 commit
  3. 11 Aug, 2017 1 commit
    • Ritesh Harjani's avatar
      cfq: Give a chance for arming slice idle timer in case of group_idle · b3193bc0
      Ritesh Harjani authored
      In below scenario blkio cgroup does not work as per their assigned
      weights :-
      1. When the underlying device is nonrotational with a single HW queue
      with depth of >= CFQ_HW_QUEUE_MIN
      2. When the use case is forming two blkio cgroups cg1(weight 1000) &
      cg2(wight 100) and two processes(file1 and file2) doing sync IO in
      their respective blkio cgroups.
      
      For above usecase result of fio (without this patch):-
      file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan  1 19:41:49 1970
        write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
      <...>
      file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan  1 19:41:49 1970
        write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
      <...>
      // both the process BW is equal even though they belong to diff.
      cgroups with weight of 1000(cg1) and 100(cg2)
      
      In above case (for non rotational NCQ devices),
      as soon as the request from cg1 is completed and even
      though it is provided with higher set_slice=10, because of CFQ
      algorithm when the driver tries to fetch the request, CFQ expires
      this group without providing any idle time nor weight priority
      and schedules another cfq group (in this case cg2).
      And thus both cfq groups(cg1 & cg2) keep alternating to get the
      disk time and hence loses the cgroup weight based scheduling.
      
      Below patch gives a chance to cfq algorithm (cfq_arm_slice_timer)
      to arm the slice timer in case group_idle is enabled.
      In case if group_idle is also not required (including for nonrotational
      NCQ drives), we need to explicitly set group_idle = 0 from sysfs for
      such cases.
      
      With this patch result of fio(for above usecase) :-
      file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan  1 00:06:08 1970
        write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
      <..>
      file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan  1 00:06:08 1970
        write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
      <..>
      // In this processes BW is as per their respective cgroups weight.
      Signed-off-by: 's avatarRitesh Harjani <riteshh@codeaurora.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      b3193bc0
  4. 29 Jul, 2017 1 commit
    • Shaohua Li's avatar
      block: use standard blktrace API to output cgroup info for debug notes · 35fe6d76
      Shaohua Li authored
      Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
      way. Now we have standard blktrace API for this, so convert them to use
      it.
      
      Note, this changes the behavior a little bit. cgroup info isn't output
      by default, we only do this with 'blk_cgroup' option enabled. cgroup
      info isn't output as a string by default too, we only do this with
      'blk_cgname' option enabled. Also cgroup info is output in different
      position of the note string. I think these behavior changes aren't a big
      issue (actually we make trace data shorter which is good), since the
      blktrace note is solely for debugging.
      Signed-off-by: 's avatarShaohua Li <shli@fb.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      35fe6d76
  5. 31 May, 2017 1 commit
    • Hou Tao's avatar
      cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode · 5be6b756
      Hou Tao authored
      When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
      as the delay of cfq_group's vdisktime if there have been other cfq_groups
      already.
      
      When cfq is under iops mode, commit 9a7f38c4 ("cfq-iosched: Convert
      from jiffies to nanoseconds") could result in a large iops delay and
      lead to an abnormal io schedule delay for the added cfq_group. To fix
      it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
      when iops mode is enabled.
      
      Despite having the same value, the delay of a cfq_queue in idle class
      and the delay of cfq_group are different things, so I define two new
      macros for the delay of a cfq_group under time-slice mode and iops mode.
      
      Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds")
      Cc: <stable@vger.kernel.org> # 4.8+
      Signed-off-by: 's avatarHou Tao <houtao1@huawei.com>
      Acked-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      5be6b756
  6. 30 May, 2017 1 commit
  7. 05 Apr, 2017 1 commit
    • Jan Kara's avatar
      cfq: Disable writeback throttling by default · 142bbdfc
      Jan Kara authored
      Writeback throttling does not play well with CFQ since that also tries
      to throttle async writes. As a result async writeback can get starved in
      presence of readers. As an example take a benchmark simulating
      postgreSQL database running over a standard rotating SATA drive. There
      are 16 processes doing random reads from a huge file (2*machine memory),
      1 process doing random writes to the huge file and calling fsync once
      per 50000 writes and 1 process doing sequential 8k writes to a
      relatively small file wrapping around at the end of the file and calling
      fsync every 5 writes. Under this load read latency easily exceeds the
      target latency of 75 ms (just because there are so many reads happening
      against a relatively slow disk) and thus writeback is throttled to a
      point where only 1 write request is allowed at a time. Blktrace data
      then looks like:
      
        8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
        8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
        8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
        8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
        8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
        8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
        8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
        8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
        8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
        8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
        8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
        8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
        8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
        8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
        8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
        8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
        8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
        8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
        8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
        8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
        8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
      ...
        8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
        8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
        8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
        8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
        8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
        8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
        8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
        8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
        8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
        8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
        8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
        8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
      ...
        8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
        8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
      ...
        8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
        8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
        8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
        8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
        8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
        8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
        8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
        8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1
      
      So we can see a Q event for a write request, then IO is blocked by
      writeback throttling and G and I events for the request happen only once
      other writeback IO is completed. Thus CFQ always sees only one write
      request. When it sees it, it queues the async queue behind all the read
      queues and the async queue gets scheduled after about one second. When
      it is scheduled, that one request gets dispatched and async queue is
      expired as it has no more requests to submit. Overall we submit about
      one write request per second.
      
      Although this scheduling is beneficial for read latency, writes are
      heavily starved and this causes large delays all over the system (due to
      processes blocking on page lock, transaction starts, etc.). When
      writeback throttling is disabled, write throughput is about one fifth of
      a read throughput which roughly matches readers/writers ratio and
      overall the system stalls are much shorter.
      
      Mixing writeback throttling logic with CFQ throttling logic is always a
      recipe for surprises as CFQ assumes it sees the big part of the picture
      which is not necessarily true when writeback throttling is blocking
      requests. So disable writeback throttling logic by default when CFQ is
      used as an IO scheduler.
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      142bbdfc
  8. 02 Mar, 2017 1 commit
  9. 16 Feb, 2017 1 commit
  10. 08 Feb, 2017 1 commit
  11. 23 Jan, 2017 2 commits
    • Markus Elfring's avatar
      cfq-iosched: Adjust one function call together with a variable assignment · 1cf41753
      Markus Elfring authored
      The script "checkpatch.pl" pointed information out like the following.
      
      ERROR: do not use assignment in if condition
      
      Thus fix the affected source code place.
      Signed-off-by: 's avatarMarkus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      1cf41753
    • Alexander Potapenko's avatar
      block: Initialize cfqq->ioprio_class in cfq_get_queue() · 4d608baa
      Alexander Potapenko authored
      KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
      uninitialized memory in cfq_init_cfqq():
      
      ==================================================================
      BUG: KMSAN: use of unitialized memory
      ...
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff8202ac97>] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
       [<ffffffff813e9b65>] kmsan_report+0x205/0x360 ??:?
       [<ffffffff813eabbb>] __msan_warning+0x5b/0xb0 ??:?
       [<     inline     >] cfq_init_cfqq block/cfq-iosched.c:3754
       [<ffffffff8201e110>] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
      ...
      origin:
       [<ffffffff8103ab37>] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
       [<ffffffff813e836b>] kmsan_internal_poison_shadow+0xab/0x150 ??:?
       [<ffffffff813e88ab>] kmsan_poison_slab+0xbb/0x120 ??:?
       [<     inline     >] allocate_slab mm/slub.c:1627
       [<ffffffff813e533f>] new_slab+0x3af/0x4b0 mm/slub.c:1641
       [<     inline     >] new_slab_objects mm/slub.c:2407
       [<ffffffff813e0ef3>] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
       [<     inline     >] __slab_alloc mm/slub.c:2606
       [<     inline     >] slab_alloc_node mm/slub.c:2669
       [<ffffffff813dfb42>] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
       [<ffffffff8201d90d>] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
      ...
      ==================================================================
      (the line numbers are relative to 4.8-rc6, but the bug persists
      upstream)
      
      The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
      and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
      before it's initialized.
      Signed-off-by: 's avatarAlexander Potapenko <glider@google.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      4d608baa
  12. 17 Jan, 2017 1 commit
  13. 28 Nov, 2016 1 commit
  14. 22 Nov, 2016 1 commit
  15. 10 Nov, 2016 2 commits
    • Jens Axboe's avatar
      block: hook up writeback throttling · 87760e5e
      Jens Axboe authored
      Enable throttling of buffered writeback to make it a lot
      more smooth, and has way less impact on other system activity.
      Background writeback should be, by definition, background
      activity. The fact that we flush huge bundles of it at the time
      means that it potentially has heavy impacts on foreground workloads,
      which isn't ideal. We can't easily limit the sizes of writes that
      we do, since that would impact file system layout in the presence
      of delayed allocation. So just throttle back buffered writeback,
      unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration in the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to to negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-bw quickly snaps back to it's
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to me met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      87760e5e
    • Tejun Heo's avatar
      block: cfq_cpd_alloc() should use @gfp · ebc4ff66
      Tejun Heo authored
      cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
      incorrectly hard coding GFP_KERNEL instead of using the mask specified
      through the @gfp parameter.  This currently doesn't cause any actual
      issues because all current callers specify GFP_KERNEL.  Fix it.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Reported-by: 's avatarDan Carpenter <dan.carpenter@oracle.com>
      Fixes: e4a9bde9 ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      ebc4ff66
  16. 01 Nov, 2016 2 commits
  17. 28 Oct, 2016 1 commit
    • Christoph Hellwig's avatar
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig authored
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: 's avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      ef295ecf
  18. 23 Sep, 2016 1 commit
    • Glauber Costa's avatar
      cfq: fix starvation of asynchronous writes · 3932a86b
      Glauber Costa authored
      While debugging timeouts happening in my application workload (ScyllaDB), I have
      observed calls to open() taking a long time, ranging everywhere from 2 seconds -
      the first ones that are enough to time out my application - to more than 30
      seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
      under certain circumnstances, and that's confirmed with the following backtrace
      taken by the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
      Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
      high "Dispatch wait" times, on the dozens of seconds range and consistent with
      the open() times I have saw in that run.
      
      Still from the blktrace output, we can after searching a bit, identify the
      request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
      When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
      requests in flight. An incoming request from another queue can also preempt it
      in such situation before we have the chance to execute anything (as seen in the
      trace above).
      
      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. This flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
      the queue for subsequent requests (unless they are themselves expired)
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
      Signed-off-by: 's avatarGlauber Costa <glauber@scylladb.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-block@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      Signed-off-by: 's avatarGlauber Costa <glauber@scylladb.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      3932a86b
  19. 07 Aug, 2016 1 commit
    • Jens Axboe's avatar
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe authored
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      1eff9d32
  20. 21 Jul, 2016 1 commit
    • Tahsin Erdogan's avatar
      block: do not merge requests without consulting with io scheduler · 72ef799b
      Tahsin Erdogan authored
      Before merging a bio into an existing request, io scheduler is called to
      get its approval first. However, the requests that come from a plug
      flush may get merged by block layer without consulting with io
      scheduler.
      
      In case of CFQ, this can cause fairness problems. For instance, if a
      request gets merged into a low weight cgroup's request, high weight cgroup
      now will depend on low weight cgroup to get scheduled. If high weigt cgroup
      needs that io request to complete before submitting more requests, then it
      will also lose its timeslice.
      
      Following script demonstrates the problem. Group g1 has a low weight, g2
      and g3 have equal high weights but g2's requests are adjacent to g1's
      requests so they are subject to merging. Due to these merges, g2 gets
      poor disk time allocation.
      
      cat > cfq-merge-repro.sh << "EOF"
      #!/bin/bash
      set -e
      
      IO_ROOT=/mnt-cgroup/io
      
      mkdir -p $IO_ROOT
      
      if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
      fi
      
      cd $IO_ROOT
      
      for i in g1 g2 g3; do
        if [ -d $i ]; then
          rmdir $i
        fi
      done
      
      mkdir g1 && echo 10 > g1/blkio.weight
      mkdir g2 && echo 495 > g2/blkio.weight
      mkdir g3 && echo 495 > g3/blkio.weight
      
      RUNTIME=10
      
      (echo $BASHPID > g1/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=0k &> /dev/null)&
      
      (echo $BASHPID > g2/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=64k &> /dev/null)&
      
      (echo $BASHPID > g3/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=256k &> /dev/null)&
      
      sleep $((RUNTIME+1))
      
      for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
      done
      
      EOF
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 162
      ---- g2 ----
      8:16 165
      ---- g3 ----
      8:16 686
      
      After applying the patch:
      
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 90
      ---- g2 ----
      8:16 445
      ---- g3 ----
      8:16 471
      Signed-off-by: 's avatarTahsin Erdogan <tahsin@google.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      72ef799b
  21. 28 Jun, 2016 3 commits
    • Jan Kara's avatar
      cfq-iosched: Charge at least 1 jiffie instead of 1 ns · 0b31c10c
      Jan Kara authored
      Commit 9a7f38c4 (cfq-iosched: Convert from jiffies to nanoseconds)
      could result in charging just 1 ns to a cgroup submitting IO instead of 1
      jiffie we always charged before. It is arguable what is the right amount
      to change but for now lets retain the old behavior of always charging at
      least one jiffie.
      
      Fixes: 9a7f38c4Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      0b31c10c
    • Jan Kara's avatar
      cfq-iosched: Fix regression in bonnie++ rewrite performance · 149321a6
      Jan Kara authored
      Commit 9a7f38c4 (cfq-iosched: Convert from jiffies to nanoseconds)
      broke the condition for detecting starved sync IO in
      cfq_completed_request() because rq->start_time remained in jiffies but
      we compared it with nanosecond values. This manifested as a regression
      in bonnie++ rewrite performance because we always ended up considering
      sync IO starved and thus never increased async IO queue depth.
      
      Since rq->start_time is used in a lot of places, converting it to ns
      values would be non-trivial. So just revert the condition in CFQ to use
      comparison with jiffies. This will lead to suboptimal results if
      cfq_fifo_expire[1] will ever come close to 1 jiffie but so far we are
      relatively far from that with the storage used with CFQ (the default
      value is 128 ms).
      
      Fixes: 9a7f38c4Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      149321a6
    • Jan Kara's avatar
      cfq-iosched: Convert slice_resid from u64 to s64 · 93fdf147
      Jan Kara authored
      slice_resid can be both positive and negative. Commit 9a7f38c4
      (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
      long to u64. Although this did not introduce any functional regression
      (the operations just overflow and the result was fine), it is certainly
      wrong and could cause issues in future. So convert the type to more
      appropriate s64.
      
      Fixes: 9a7f38c4Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      93fdf147
  22. 09 Jun, 2016 1 commit
    • Jens Axboe's avatar
      cfq-iosched: temporarily boost queue priority for idle classes · b8269db4
      Jens Axboe authored
      If we're queuing REQ_PRIO IO and the task is running at an idle IO
      class, then temporarily boost the priority. This prevents livelocks
      due to priority inversion, when a low priority task is holding file
      system resources while attempting to do IO.
      
      An example of that is shown below. An ioniced idle task is holding
      the directory mutex, while a normal priority task is trying to do
      a directory lookup.
      
      [478381.198925] ------------[ cut here ]------------
      [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
      [478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
      [478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
      [478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
      [478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
      [478381.206874] Call Trace:
      [478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
      [478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
      [478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
      [478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
      [478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
      [478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
      [478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
      [478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
      [478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
      [478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
      [478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
      [478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
      [478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
      [478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
      [478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
      [478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
      [478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
      [478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
      [478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
      [478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
      [478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
      [478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
      [478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
      [478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
      [478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
      [478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
      [478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
      [478381.239126] ------------[ cut here ]------------
      [478381.239915] ------------[ cut here ]------------
      [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
      [478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
      [478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
      [478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
      [478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
      [478381.248252] Call Trace:
      [478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
      [478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
      [478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
      [478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
      [478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
      [478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
      [478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
      [478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
      [478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
      [478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
      [478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
      [478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
      [478381.264551] ------------[ cut here ]------------
      [478381.265377] ------------[ cut here ]------------
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      Reviewed-by: 's avatarJeff Moyer <jmoyer@redhat.com>
      b8269db4
  23. 08 Jun, 2016 3 commits
  24. 07 Jun, 2016 3 commits
  25. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: 's avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  26. 04 Feb, 2016 4 commits
    • Jan Kara's avatar
      cfq-iosched: Allow parent cgroup to preempt its child · 3984aa55
      Jan Kara authored
      Currently we don't allow sync workload of one cgroup to preempt sync
      workload of any other cgroup. This is because we want to achieve service
      separation between cgroups. However in cases where cgroup preempting is
      ancestor of the current cgroup, there is no need of separation and
      idling introduces unnecessary overhead. This hurts for example the case
      when workload is isolated within a cgroup but journalling threads are in
      root cgroup. Simple way to demostrate the issue is using:
      
      dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
      
      on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make
      difference more visible). When all processes are in the root cgroup,
      reported throughput is 153.132 MB/sec. When dbench process gets its own
      blkio cgroup, reported throughput drops to 26.1006 MB/sec.
      
      Fix the problem by making check in cfq_should_preempt() more benevolent
      and allow preemption by ancestor cgroup. This improves the throughput
      reported by dbench4 to 48.9106 MB/sec.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      3984aa55
    • Jan Kara's avatar
      cfq-iosched: Allow sync noidle workloads to preempt each other · a257ae3e
      Jan Kara authored
      The original idea with preemption of sync noidle queues (introduced in
      commit 718eee05 "cfq-iosched: fairness for sync no-idle queues") was
      that we service all sync noidle queues together, we don't idle on any of
      the queues individually and we idle only if there is no sync noidle
      queue to be served. This intention also matches the original test:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
      	   && new_cfqq->service_tree == cfqq->service_tree)
      		return true;
      
      However since at that time cfqq->service_tree was not set for idling
      queues, this test was unreliable and was replaced in commit e4a22919
      "cfq-iosched: fix no-idle preemption logic" by:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
      	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
      	    new_cfqq->service_tree->count == 1)
      		return true;
      
      That was a reliable test but was actually doing something different -
      now we preempt sync noidle queue only if the new queue is the only one
      busy in the service tree.
      
      These days cfq queue is kept in service tree even if it is idling and
      thus the original check would be safe again. But since we actually check
      that cfq queues are in the same cgroup, of the same priority class and
      workload type (sync noidle), we know that new_cfqq is fine to preempt
      cfqq. So just remove the service tree check.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      a257ae3e
    • Jan Kara's avatar
      cfq-iosched: Reorder checks in cfq_should_preempt() · 6c80731c
      Jan Kara authored
      Move check for preemption by rt class up. There is no functional change
      but it makes arguing about conditions simpler since we can be sure both
      cfq queues are from the same ioprio class.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      6c80731c
    • Jan Kara's avatar
      cfq-iosched: Don't group_idle if cfqq has big thinktime · e795421e
      Jan Kara authored
      There is no point in idling on a cfq group if the only cfq queue that is
      there has too big thinktime.
      Signed-off-by: 's avatarJan Kara <jack@suse.com>
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      e795421e
  27. 18 Sep, 2015 1 commit
    • Tejun Heo's avatar
      cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Tejun Heo authored
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Acked-by: 's avatarZefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130