1. 07 Jun, 2016 2 commits
  2. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  3. 04 Feb, 2016 4 commits
    • Jan Kara's avatar
      cfq-iosched: Allow parent cgroup to preempt its child · 3984aa55
      Jan Kara authored
      Currently we don't allow sync workload of one cgroup to preempt sync
      workload of any other cgroup. This is because we want to achieve service
      separation between cgroups. However in cases where cgroup preempting is
      ancestor of the current cgroup, there is no need of separation and
      idling introduces unnecessary overhead. This hurts for example the case
      when workload is isolated within a cgroup but journalling threads are in
      root cgroup. Simple way to demostrate the issue is using:
      
      dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
      
      on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make
      difference more visible). When all processes are in the root cgroup,
      reported throughput is 153.132 MB/sec. When dbench process gets its own
      blkio cgroup, reported throughput drops to 26.1006 MB/sec.
      
      Fix the problem by making check in cfq_should_preempt() more benevolent
      and allow preemption by ancestor cgroup. This improves the throughput
      reported by dbench4 to 48.9106 MB/sec.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      3984aa55
    • Jan Kara's avatar
      cfq-iosched: Allow sync noidle workloads to preempt each other · a257ae3e
      Jan Kara authored
      The original idea with preemption of sync noidle queues (introduced in
      commit 718eee05 "cfq-iosched: fairness for sync no-idle queues") was
      that we service all sync noidle queues together, we don't idle on any of
      the queues individually and we idle only if there is no sync noidle
      queue to be served. This intention also matches the original test:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
      	   && new_cfqq->service_tree == cfqq->service_tree)
      		return true;
      
      However since at that time cfqq->service_tree was not set for idling
      queues, this test was unreliable and was replaced in commit e4a22919
      "cfq-iosched: fix no-idle preemption logic" by:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
      	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
      	    new_cfqq->service_tree->count == 1)
      		return true;
      
      That was a reliable test but was actually doing something different -
      now we preempt sync noidle queue only if the new queue is the only one
      busy in the service tree.
      
      These days cfq queue is kept in service tree even if it is idling and
      thus the original check would be safe again. But since we actually check
      that cfq queues are in the same cgroup, of the same priority class and
      workload type (sync noidle), we know that new_cfqq is fine to preempt
      cfqq. So just remove the service tree check.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a257ae3e
    • Jan Kara's avatar
      cfq-iosched: Reorder checks in cfq_should_preempt() · 6c80731c
      Jan Kara authored
      Move check for preemption by rt class up. There is no functional change
      but it makes arguing about conditions simpler since we can be sure both
      cfq queues are from the same ioprio class.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6c80731c
    • Jan Kara's avatar
      cfq-iosched: Don't group_idle if cfqq has big thinktime · e795421e
      Jan Kara authored
      There is no point in idling on a cfq group if the only cfq queue that is
      there has too big thinktime.
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e795421e
  4. 18 Sep, 2015 1 commit
    • Tejun Heo's avatar
      cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Tejun Heo authored
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarZefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130
  5. 18 Aug, 2015 28 commits
    • Tejun Heo's avatar
      blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
      Tejun Heo authored
      cgroup is trying to make interface consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
        enabled.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      69d7fde5
    • Tejun Heo's avatar
      blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/ · 3ecca629
      Tejun Heo authored
      blkcg is gonna switch to cgroup common weight range as defined by
      CGROUP_WEIGHT_* on the unified hierarchy.  In preparation, rename
      CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      3ecca629
    • Tejun Heo's avatar
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Tejun Heo authored
      blkcg interface grew to be the biggest of all controllers and
      unfortunately most inconsistent too.  The interface files are
      inconsistent with a number of cloes duplicates.  Some files have
      recursive variants while others don't.  There's distinction between
      normal and leaf weights which isn't intuitive and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too much implementation details to userland.
      
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks rendering the two structural issues
      twisting the current interface.  The interface has to be updated in a
      significant anyway and this is a good chance to revamp it as a whole.
      This patch implements blkcg interface for the unified hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
      * The original interface consisted of 27 files is replaced with the
        following three files.
      
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      2ee867dc
    • Tejun Heo's avatar
      blkcg: misc preparations for unified hierarchy interface · dd165eb3
      Tejun Heo authored
      * Export blkg_dev_name()
      
      * Drop unnecessary @cft from __cfq_set_weight().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      dd165eb3
    • Tejun Heo's avatar
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Tejun Heo authored
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
      To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
      non-const pointer and to accommodate that const is dropped from @input
      too.
      
      This doesn't cause any behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      36aa9e5f
    • Tejun Heo's avatar
      blkcg: mark existing cftypes as legacy · 880f50e2
      Tejun Heo authored
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      880f50e2
    • Tejun Heo's avatar
      blkcg: refine error codes returned during blkcg configuration · 20386ce0
      Tejun Heo authored
      blkcg currently returns -EINVAL for most errors which can be pretty
      confusing given that the failure modes are quite varied.  Update the
      error returns so that
      
      * -EINVAL only for syntactic errors.
      * -ERANGE if the value is out of range.
      * -ENODEV if the target device can't be found.
      * -EOPNOTSUPP if the policy is not enabled on the target device.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      20386ce0
    • Tejun Heo's avatar
      blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device() · 5332dfc3
      Tejun Heo authored
      blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
      enabled are guaranteed to return non-NULL and the counterpart in
      blk-throttle doesn't have these checks either.  Remove the spurious
      NULL checks.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      5332dfc3
    • Tejun Heo's avatar
      blkcg: remove cfqg_stats->sectors · 702747ca
      Tejun Heo authored
      cfq_stats->sectors is a blkg_stat which keeps track of the total
      number of sectors serviced; however, this can be trivially calculated
      from blkcg_gq->stat_bytes.  The only thing necessary is adding up
      READs and WRITEs and then dividing by sector size.
      
      Remove cfqg_stats->sectors and make cfq print "sectors" and
      "sectors_recursive" from stat_bytes.
      
      While this is a bit more code, it removes duplicate stat allocations
      and updates and ensures that the reported stats stay in tune with each
      other.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      702747ca
    • Tejun Heo's avatar
      blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
      Tejun Heo authored
      Currently, both cfq-iosched and blk-throttle keep track of
      io_service_bytes and io_serviced stats.  While keeping track of them
      separately may be useful during development, it doesn't make much
      sense otherwise.  Also, blk-throttle was counting bio's as IOs while
      cfq-iosched request's, which is more confusing than informative.
      
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and let
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      
      v3: Update blkio-controller.txt accordingly.
      
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      77ea7338
    • Tejun Heo's avatar
      blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq · f12c74ca
      Tejun Heo authored
      Currently, blkg_[rw]stat_recursive_sum() assume that the target
      counter is located in pd (blkg_policy_data); however, some counters
      are planned to be moved to blkg (blkcg_gq).
      
      This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
      blkg_policy pointers instead of pd.  If policy is NULL, it indexes
      into blkg.  If non-NULL, into the blkg's pd of the policy.
      
      The existing usages are updated to maintain the current behaviors.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      f12c74ca
    • Tejun Heo's avatar
      blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
      Tejun Heo authored
      blkcg_[rw]stat are used as stat counters for blkcg policies.  It isn't
      per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
      it.  This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
      per-cpu wrapping in blk-throttle.
      
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      
      * Custom per-cpu stat implementation in blk-throttle is removed.
      
      This makes all blkcg stat counters per-cpu without complicating policy
      implmentations.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      24bdb8ef
    • Tejun Heo's avatar
      blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it · e6269c44
      Tejun Heo authored
      cgroup stats are local to each cgroup and doesn't propagate to
      ancestors by default.  When recursive stats are necessary, the sum is
      calculated over all the descendants.  This initially was for backward
      compatibility to support both group-local and recursive stats but this
      mode of operation makes general sense as stat update is much hotter
      thafn reporting those stats.
      
      This however ends up losing recursive stats when a child is removed.
      To work around this, cfq-iosched adds its stats to its parent
      cfq_group->dead_stats which is summed up together when calculating
      recursive stats.
      
      It's planned that the core stats will be moved to blkcg_gq, so we want
      to move the mechanism for keeping track of the stats of dead children
      from cfq to blkcg core.  This patch adds blkg_[rw]stat->aux_cnt which
      are atomic64_t's keeping track of auxiliary counts which are excluded
      when reading local counts but included for recursive.
      
      blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
      are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
      a dead cgroup to the aux counts of parent->stats instead of separate
      ->dead_stats.
      
      This will also help making blkg_[rw]stats per-cpu.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e6269c44
    • Tejun Heo's avatar
      blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
      Tejun Heo authored
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      
      This patch consolidates blkg creation to a new function
      blkcg_bio_issue_check() which is called during bio issue from
      generic_make_request_checks().  blkcg_bio_issue_check() is now the
      only function which tries to create missing blkg's.  The subsequent
      policy and request_list operations just perform blkg_lookup() and if
      missing falls back to the root.
      
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
        cfq_lookup_cfqg()which uses blkg_lookup().
      
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      performed.
      
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
          !CONFIG_BLK_DEV_THROTTLING.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ae118896
    • Tejun Heo's avatar
      blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods · e4a9bde9
      Tejun Heo authored
      Each active policy has a cpd (blkcg_policy_data) on each blkcg.  The
      cpd's were allocated by blkcg core and each policy could request to
      allocate extra space at the end by setting blkcg_policy->cpd_size
      larger than the size of cpd.
      
      This is a bit unusual but blkg (blkcg_gq) policy data used to be
      handled this way too so it made sense to be consistent; however, blkg
      policy data switched to alloc/free callbacks.
      
      This patch makes similar changes to cpd handling.
      blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size.  As
      cpd allocation is now done from policy side, it can simply allocate a
      larger area which embeds cpd at the beginning.
      
      As ->cpd_alloc_fn() may be able to perform all necessary
      initializations, this patch makes ->cpd_init_fn() optional.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e4a9bde9
    • Tejun Heo's avatar
      blkcg: minor updates around blkcg_policy_data · 81437648
      Tejun Heo authored
      * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
        for blkcg_policy_data.
      
      * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
        blkcg.  This makes it consistent with blkg_policy_data methods and
        to-be-added cpd alloc/free methods.
      
      * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
        cpd_init_fn() can determine the associated blkcg from
        blkcg_policy_data.
      
      v2: blkcg_policy_data->blkcg initializations were missing.  Added.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      81437648
    • Tejun Heo's avatar
      blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data · a9520cd6
      Tejun Heo authored
      The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
      (blkg_policy_data) while the older ones use blkg (blkcg_gq).  As using
      blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
      can always be mapped to blkg and given that these are policy-specific
      methods, it makes sense to converge on pd.
      
      This patch makes all methods deal with pd instead of blkg.  Most
      conversions are trivial.  In blk-cgroup.c, a couple method invocation
      sites now test whether pd exists instead of policy state for
      consistency.  This shouldn't cause any behavioral differences.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a9520cd6
    • Tejun Heo's avatar
      blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods · b2ce2643
      Tejun Heo authored
      With the recent addition of alloc and free methods, things became
      messier.  This patch reorganizes them according to the followings.
      
      * ->pd_alloc_fn()
      
        Responsible for allocation and static initializations - the ones
        which can be done independent of where the pd might be attached.
      
      * ->pd_init_fn()
      
        Initializations which require the knowledge of where the pd is
        attached.
      
      * ->pd_free_fn()
      
        The counter part of pd_alloc_fn().  Static de-init and freeing.
      
      This leaves ->pd_exit_fn() without any users.  Removed.
      
      While at it, collapse an one liner function throtl_pd_exit(), which
      has only one user, into its user.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b2ce2643
    • Tejun Heo's avatar
      blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods · 001bea73
      Tejun Heo authored
      A blkg (blkcg_gq) represents the relationship between a cgroup and
      request_queue.  Each active policy has a pd (blkg_policy_data) on each
      blkg.  The pd's were allocated by blkcg core and each policy could
      request to allocate extra space at the end by setting
      blkcg_policy->pd_size larger than the size of pd.
      
      This is a bit unusual but was done this way mostly to simplify error
      handling and all the existing use cases could be handled this way;
      however, this is becoming too restrictive now that percpu memory can
      be allocated without blocking.
      
      This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
      and pd_free_fn() - which are used to allocate and release pd for a
      given policy.  As pd allocation is now done from policy side, it can
      simply allocate a larger area which embeds pd at the beginning.  This
      change makes ->pd_size pointless.  Removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      001bea73
    • Tejun Heo's avatar
      cfq-iosched: charge async IOs to the appropriate blkcg's instead of the root · 60a83707
      Tejun Heo authored
      Up until now, all async IOs were queued to async queues which are
      shared across the whole request_queue, which means that blkcg resource
      control is completely void on async IOs including all writeback IOs.
      It was done this way because writeback didn't support writeback and
      there was no way of telling which writeback IO belonged to which
      cgroup; however, writeback recently became cgroup aware and writeback
      bio's are sent down properly tagged with the blkcg's to charge them
      against.
      
      This patch makes async cfq_queues per-cfq_cgroup instead of
      per-cfq_data so that each async IO is charged to the blkcg that it was
      tagged for instead of unconditionally attributing it to root.
      
      * cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
        and alloc / destroy paths are updated accordingly.
      
      * cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
        queues.
      
      * check_blkcg_changed() now also invalidates async queues as they no
        longer stay the same across cgroups.
      
      After this patch, cfq's proportional IO control through blkio.weight
      works correctly when cgroup writeback is in use.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      60a83707
    • Tejun Heo's avatar
      cfq-iosched: fold cfq_find_alloc_queue() into cfq_get_queue() · d4aad7ff
      Tejun Heo authored
      cfq_find_alloc_queue() checks whether a queue actually needs to be
      allocated, which is unnecessary as its sole caller, cfq_get_queue(),
      only calls it if so.  Also, the oom queue fallback logic is scattered
      between cfq_get_queue() and cfq_find_alloc_queue().  There really
      isn't much going on in the latter and things can be made simpler by
      folding it into cfq_get_queue().
      
      This patch collapses cfq_find_alloc_queue() into cfq_get_queue().  The
      change is fairly straight-forward with one exception - async_cfqq is
      now initialized to NULL and the "!is_sync" test in the last if
      conditional is replaced with "async_cfqq" test.  This is because gcc
      (5.1.1) gets confused for some reason and warns that async_cfqq may be
      used uninitialized otherwise.  Oh well, the code isn't necessarily
      worse this way.
      
      This patch doesn't cause any functional difference.
      
      v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d4aad7ff
    • Tejun Heo's avatar
      cfq-iosched: move cfq_group determination from cfq_find_alloc_queue() to cfq_get_queue() · 322731ed
      Tejun Heo authored
      This is necessary for making async cfq_cgroups per-cfq_group instead
      of per-cfq_data.  While this change makes cfq_get_queue() perform RCU
      locking and look up cfq_group even when it reuses async queue, the
      extra overhead is extremely unlikely to be noticeable given that this
      is already sitting behind cic->cfqq[] cache and the overall cost of
      cfq operation.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      322731ed
    • Tejun Heo's avatar
      cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue() · 2da8de0b
      Tejun Heo authored
      Even when allocations fail, cfq_find_alloc_queue() always returns a
      valid cfq_queue by falling back to the oom cfq_queue.  As such, there
      isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
      is set.  GFP_NOWAIT allocations don't fail often and even when they do
      the degraded behavior is acceptable and temporary.
      
      After all, the only reason get_request(), which ultimately determines
      the gfp_mask, cares about __GFP_WAIT is to guarantee request
      allocation, assuming IO forward progress, for callers which are
      willing to wait.  There's no reason for cfq_find_alloc_queue() to
      behave differently on __GFP_WAIT when it already has a fallback
      mechanism.
      
      Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
      to its callers.  This simplifies the function quite a bit and will
      help making async queues per-cfq_group.
      
      v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      2da8de0b
    • Tejun Heo's avatar
      blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations · d93a11f1
      Tejun Heo authored
      blkcg performs several allocations to track IOs per cgroup and enforce
      resource control.  Most of these allocations are performed lazily on
      demand in the IO path and thus can't involve reclaim path.  Currently,
      these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
      with occassional failures of these allocations by punting IOs to the
      root cgroup and there's no reason to reach into the emergency reserve.
      
      This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
      allocations.
      
      * bdi_writeback_congested and blkcg_gq allocations in blkg_create().
      
      * radix tree node allocations for blkcg->blkg_tree.
      
      * cfq_queue allocation on ioprio changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-and-Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Suggested-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d93a11f1
    • Tejun Heo's avatar
      cfq-iosched: minor cleanups · 563180a4
      Tejun Heo authored
      * Some were accessing cic->cfqq[] directly.  Always use cic_to_cfqq()
        and cic_set_cfqq().
      
      * check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
        return for NULL.  It's always non-NULL.  Simplify accordingly.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      563180a4
    • Tejun Heo's avatar
      cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request() · bce6133b
      Tejun Heo authored
      If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
      replaces it by invoking cfq_get_queue() again without putting the oom
      queue leaking the reference it was holding.  While oom queues are not
      released through reference counting, they're still reference counted
      and this can theoretically lead to the reference count overflowing and
      incorrectly invoke the usual release path on it.
      
      Fix it by making cfq_set_request() put the ref it was holding.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      bce6133b
    • Tejun Heo's avatar
      cfq-iosched: fix async oom queue handling · 95e5d6f6
      Tejun Heo authored
      Async cfqq's (cfq_queue's) are shared across cfq_data.  When
      cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
      stashes the pointer in cfq_data and reuses it from then on; however,
      the function doesn't consider that cfq_find_alloc_queue() may return
      the oom_cfqq under memory pressure and installs the returned queue
      unconditionally.
      
      If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
      continue calling cfq_get_queue() hoping to replace it with a proper
      queue; however, cfq_get_queue() will keep returning the cached queue
      for the slot - the oom_cfqq.
      
      Fix it by skipping caching if the queue is the oom one.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      95e5d6f6
    • Tejun Heo's avatar
      cfq-iosched: simplify control flow in cfq_get_queue() · 4ebc1c61
      Tejun Heo authored
      cfq_get_queue()'s control flow looks like the following.
      
      	async_cfqq = NULL;
      	cfqq = NULL;
      
      	if (!is_sync) {
      		...
      		async_cfqq = ...;
      		cfqq = *async_cfqq;
      	}
      
      	if (!cfqq)
      		cfqq = ...;
      
      	if (!is_sync && !(*async_cfqq))
      		...;
      
      The only thing the local variable init, the second if, and the
      async_cfqq test in the third if achieves is to skip cfqq creation and
      installation if *async_cfqq was already non-NULL.  This is needlessly
      complicated with different tests examining the same condition.
      Simplify it to the following.
      
      	if (!is_sync) {
      		...
      		async_cfqq = ...;
      		cfqq = *async_cfqq;
      		if (cfqq)
      			goto out;
      	}
      
      	cfqq = ...;
      
      	if (!is_sync)
      		...;
       out:
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      4ebc1c61
  6. 20 Jun, 2015 1 commit
  7. 19 Jun, 2015 2 commits
  8. 10 Jun, 2015 1 commit
    • Jens Axboe's avatar
      cfq-iosched: fix the setting of IOPS mode on SSDs · 0bb97947
      Jens Axboe authored
      A previous commit wanted to make CFQ default to IOPS mode on
      non-rotational storage, however it did so when the queue was
      initialized and the non-rotational flag is only set later on
      in the probe.
      
      Add an elevator hook that gets called off the add_disk() path,
      at that point we know that feature probing has finished, and
      we can reliably check for the various flags that drivers can
      set.
      
      Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
      Tested-by: default avatarRomain Francoise <romain@orebokech.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0bb97947