1. 17 Jan, 2017 1 commit
  2. 22 Nov, 2016 1 commit
  3. 30 Sep, 2016 1 commit
  4. 14 Jun, 2016 1 commit
  5. 09 Feb, 2016 1 commit
    • Roman Pen's avatar
      block: fix module reference leak on put_disk() call for cgroups throttle · 39a169b6
      Roman Pen authored
      get_disk(),get_gendisk() calls have non explicit side effect: they
      increase the reference on the disk owner module.
      The following is the correct sequence how to get a disk reference and
      to put it:
          disk = get_gendisk(...);
          /* use disk */
          owner = disk->fops->owner;
      fs/block_dev.c is aware of this required module_put() call, but f.e.
      blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
      a module reference.  To see a leakage in action cgroups throttle config
      can be used.  In the following script I'm removing throttle for /dev/ram0
      (actually this is NOP, because throttle was never set for this device):
          # lsmod | grep brd
          brd                     5175  0
          # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
              /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
          # lsmod | grep brd
          brd                     5175  100
      Now brd module has 100 references.
      The issue is fixed by calling module_put() just right away put_disk().
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  6. 03 Dec, 2015 1 commit
    • Tejun Heo's avatar
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo authored
      Consider the following v2 hierarchy.
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
  7. 22 Oct, 2015 1 commit
    • Tejun Heo's avatar
      blkcg: don't create "io.stat" on the root cgroup · ca0752c5
      Tejun Heo authored
      The stat files on the root cgroup shows stats for the whole system and
      usually don't contain any information which isn't available through
      the usual system monitoring mechanisms.  Some controllers skip
      collecting these duplicate stats to optimize cases where cgroup isn't
      used and later try to emulate the result on demand.
      This leads to complexities and subtle differences in the information
      shown through different channels.  This is entirely unnecessary and
      cgroup v2 is dropping stat files which are duplicate from all
      controllers.  This patch removes "io.stat" from the root hierarchy.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
  8. 11 Sep, 2015 1 commit
    • Tejun Heo's avatar
      block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg · 6fe810bd
      Tejun Heo authored
      While making the root blkg unconditional, ec13b1d6 ("blkcg: always
      create the blkcg_gq for the root blkcg") removed the part which clears
      q->root_blkg and ->root_rl.blkg during q exit.  This leaves the two
      pointers dangling after blkg_destroy_all().  blk-throttle exit path
      performs blkg traversals and dereferences ->root_blkg and can lead to
      the following oops.
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
       IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
       RIP: 0010:[<ffffffff81389746>]  [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       Call Trace:
        [<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
        [<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
        [<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
        [<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
        [<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
        [<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
        [<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
        [<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
      While the bug is a straigh-forward use-after-free bug, it is tricky to
      reproduce because blkg release is RCU protected and the rest of exit
      path usually finishes before RCU grace period.
      This patch fixes the bug by updating blkg_destro_all() to clear
      q->root_blkg and ->root_rl.blkg.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatar"Richard W.M. Jones" <rjones@redhat.com>
      Reported-by: default avatarJosh Boyer <jwboyer@fedoraproject.org>
      Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
      Fixes: ec13b1d6 ("blkcg: always create the blkcg_gq for the root blkcg")
      Cc: stable@vger.kernel.org # v4.2+
      Tested-by: default avatarRichard W.M. Jones <rjones@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  9. 18 Aug, 2015 24 commits
    • Tejun Heo's avatar
      blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
      Tejun Heo authored
      cgroup is trying to make interface consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Tejun Heo authored
      blkcg interface grew to be the biggest of all controllers and
      unfortunately most inconsistent too.  The interface files are
      inconsistent with a number of cloes duplicates.  Some files have
      recursive variants while others don't.  There's distinction between
      normal and leaf weights which isn't intuitive and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too much implementation details to userland.
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks rendering the two structural issues
      twisting the current interface.  The interface has to be updated in a
      significant anyway and this is a good chance to revamp it as a whole.
      This patch implements blkcg interface for the unified hierarchy.
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
      * The original interface consisted of 27 files is replaced with the
        following three files.
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: misc preparations for unified hierarchy interface · dd165eb3
      Tejun Heo authored
      * Export blkg_dev_name()
      * Drop unnecessary @cft from __cfq_set_weight().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Tejun Heo authored
      Currently, blkg_conf_prep() expects input to be of the following form
       MAJ:MIN NUM
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
      non-const pointer and to accommodate that const is dropped from @input
      This doesn't cause any behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: mark existing cftypes as legacy · 880f50e2
      Tejun Heo authored
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      Pure renames.  No functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: rename subsystem name from blkio to io · c165b3e3
      Tejun Heo authored
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      As the first step, this patch renames the external visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: refine error codes returned during blkcg configuration · 20386ce0
      Tejun Heo authored
      blkcg currently returns -EINVAL for most errors which can be pretty
      confusing given that the failure modes are quite varied.  Update the
      error returns so that
      * -EINVAL only for syntactic errors.
      * -ERANGE if the value is out of range.
      * -ENODEV if the target device can't be found.
      * -EOPNOTSUPP if the policy is not enabled on the target device.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: reduce stack usage of blkg_rwstat_recursive_sum() · 3a7faead
      Tejun Heo authored
      The recent percpu conversion of blkg_rwstat triggered the following
      warning in certain configurations.
       block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
      This is because blkg_rwstat now contains four percpu_counter which can
      be pretty big depending on debug options although it shouldn't be a
      problem in production configs.  This patch removes one of the two
      local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
      reduce stack usage.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
      Tejun Heo authored
      Currently, both cfq-iosched and blk-throttle keep track of
      io_service_bytes and io_serviced stats.  While keeping track of them
      separately may be useful during development, it doesn't make much
      sense otherwise.  Also, blk-throttle was counting bio's as IOs while
      cfq-iosched request's, which is more confusing than informative.
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and let
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      v3: Update blkio-controller.txt accordingly.
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq · f12c74ca
      Tejun Heo authored
      Currently, blkg_[rw]stat_recursive_sum() assume that the target
      counter is located in pd (blkg_policy_data); however, some counters
      are planned to be moved to blkg (blkcg_gq).
      This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
      blkg_policy pointers instead of pd.  If policy is NULL, it indexes
      into blkg.  If non-NULL, into the blkg's pd of the policy.
      The existing usages are updated to maintain the current behaviors.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
      Tejun Heo authored
      blkcg_[rw]stat are used as stat counters for blkcg policies.  It isn't
      per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
      it.  This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
      per-cpu wrapping in blk-throttle.
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      * Custom per-cpu stat implementation in blk-throttle is removed.
      This makes all blkcg stat counters per-cpu without complicating policy
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it · e6269c44
      Tejun Heo authored
      cgroup stats are local to each cgroup and doesn't propagate to
      ancestors by default.  When recursive stats are necessary, the sum is
      calculated over all the descendants.  This initially was for backward
      compatibility to support both group-local and recursive stats but this
      mode of operation makes general sense as stat update is much hotter
      thafn reporting those stats.
      This however ends up losing recursive stats when a child is removed.
      To work around this, cfq-iosched adds its stats to its parent
      cfq_group->dead_stats which is summed up together when calculating
      recursive stats.
      It's planned that the core stats will be moved to blkcg_gq, so we want
      to move the mechanism for keeping track of the stats of dead children
      from cfq to blkcg core.  This patch adds blkg_[rw]stat->aux_cnt which
      are atomic64_t's keeping track of auxiliary counts which are excluded
      when reading local counts but included for recursive.
      blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
      are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
      a dead cgroup to the aux counts of parent->stats instead of separate
      This will also help making blkg_[rw]stats per-cpu.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
      Tejun Heo authored
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      This patch consolidates blkg creation to a new function
      blkcg_bio_issue_check() which is called during bio issue from
      generic_make_request_checks().  blkcg_bio_issue_check() is now the
      only function which tries to create missing blkg's.  The subsequent
      policy and request_list operations just perform blkg_lookup() and if
      missing falls back to the root.
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
        cfq_lookup_cfqg()which uses blkg_lookup().
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: inline [__]blkg_lookup() · 24f29046
      Tejun Heo authored
      blkg_lookup() checks whether the target queue is bypassing and, if
      not, calls __blkg_lookup() which first checks the lookup hint and then
      performs radix tree walk.  The operations upto hint checking are
      trivial and there are many users of this function.  This patch inlines
      blkg_lookup() and the fast path part of __blkg_lookup().  The radix
      tree lookup and hint update are now in blkg_lookup_slowpath().
      This will help consolidating blkg handling by easing moving root blkcg
      short-circuit to inlined lookup fast path.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods · e4a9bde9
      Tejun Heo authored
      Each active policy has a cpd (blkcg_policy_data) on each blkcg.  The
      cpd's were allocated by blkcg core and each policy could request to
      allocate extra space at the end by setting blkcg_policy->cpd_size
      larger than the size of cpd.
      This is a bit unusual but blkg (blkcg_gq) policy data used to be
      handled this way too so it made sense to be consistent; however, blkg
      policy data switched to alloc/free callbacks.
      This patch makes similar changes to cpd handling.
      blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size.  As
      cpd allocation is now done from policy side, it can simply allocate a
      larger area which embeds cpd at the beginning.
      As ->cpd_alloc_fn() may be able to perform all necessary
      initializations, this patch makes ->cpd_init_fn() optional.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: minor updates around blkcg_policy_data · 81437648
      Tejun Heo authored
      * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
        for blkcg_policy_data.
      * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
        blkcg.  This makes it consistent with blkg_policy_data methods and
        to-be-added cpd alloc/free methods.
      * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
        cpd_init_fn() can determine the associated blkcg from
      v2: blkcg_policy_data->blkcg initializations were missing.  Added.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data · a9520cd6
      Tejun Heo authored
      The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
      (blkg_policy_data) while the older ones use blkg (blkcg_gq).  As using
      blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
      can always be mapped to blkg and given that these are policy-specific
      methods, it makes sense to converge on pd.
      This patch makes all methods deal with pd instead of blkg.  Most
      conversions are trivial.  In blk-cgroup.c, a couple method invocation
      sites now test whether pd exists instead of policy state for
      consistency.  This shouldn't cause any behavioral differences.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods · b2ce2643
      Tejun Heo authored
      With the recent addition of alloc and free methods, things became
      messier.  This patch reorganizes them according to the followings.
      * ->pd_alloc_fn()
        Responsible for allocation and static initializations - the ones
        which can be done independent of where the pd might be attached.
      * ->pd_init_fn()
        Initializations which require the knowledge of where the pd is
      * ->pd_free_fn()
        The counter part of pd_alloc_fn().  Static de-init and freeing.
      This leaves ->pd_exit_fn() without any users.  Removed.
      While at it, collapse an one liner function throtl_pd_exit(), which
      has only one user, into its user.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods · 001bea73
      Tejun Heo authored
      A blkg (blkcg_gq) represents the relationship between a cgroup and
      request_queue.  Each active policy has a pd (blkg_policy_data) on each
      blkg.  The pd's were allocated by blkcg core and each policy could
      request to allocate extra space at the end by setting
      blkcg_policy->pd_size larger than the size of pd.
      This is a bit unusual but was done this way mostly to simplify error
      handling and all the existing use cases could be handled this way;
      however, this is becoming too restrictive now that percpu memory can
      be allocated without blocking.
      This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
      and pd_free_fn() - which are used to allocate and release pd for a
      given policy.  As pd allocation is now done from policy side, it can
      simply allocate a larger area which embeds pd at the beginning.  This
      change makes ->pd_size pointless.  Removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fn · 3e418710
      Tejun Heo authored
      blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
      doesn't.  As both in-kernel policies implement ->pd_init_fn, it
      currently doesn't break anything.  Update blkcg_activate_policy() so
      that its behavior is consistent with blkg_create().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy() · 4c55f4f9
      Tejun Heo authored
      When a policy gets activated, it needs to allocate and install its
      policy data on all existing blkg's (blkcg_gq's).  Because blkg
      iteration is protected by a spinlock, it currently counts the total
      number of blkg's in the system, allocates the matching number of
      policy data on a list and installs them during a single iteration.
      This can be simplified by using speculative GFP_NOWAIT allocations
      while iterating and falling back to a preallocated policy data on
      failure.  If the preallocated one has already been consumed, it
      releases the lock, preallocate with GFP_KERNEL and then restarts the
      iteration.  This can be a bit more expensive than before but policy
      activation is a very cold path and shouldn't matter.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: remove unnecessary blkcg_root handling from css_alloc/free paths · bc915e61
      Tejun Heo authored
      blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
      bypasses policy data and blkcg freeing for blkcg_root.  There's no
      reason to to treat policy data any differently for blkcg_root.  If the
      root css gets allocated after policies are registered, policy
      registration path will add policy data; otherwise, the alloc path
      will.  The free path isn't never invoked for root csses.
      This patch removes the unnecessary special handling of blkcg_root from
      css_alloc/free paths.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: use blkg_free() in blkcg_init_queue() failure path · 994b7832
      Tejun Heo authored
      When blkcg_init_queue() fails midway after creating a new blkg, it
      performs kfree() directly; however, this doesn't free the policy data
      areas.  Make it use blkg_free() instead.  In turn, blkg_free() is
      updated to handle root request_list special case.
      While this fixes a possible memory leak, it's on an unlikely failure
      path of an already cold path and the size leaked per occurrence is
      miniscule too.  I don't think it needs to be tagged for -stable.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations · d93a11f1
      Tejun Heo authored
      blkcg performs several allocations to track IOs per cgroup and enforce
      resource control.  Most of these allocations are performed lazily on
      demand in the IO path and thus can't involve reclaim path.  Currently,
      these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
      with occassional failures of these allocations by punting IOs to the
      root cgroup and there's no reason to reach into the emergency reserve.
      This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
      * bdi_writeback_congested and blkcg_gq allocations in blkg_create().
      * radix tree node allocations for blkcg->blkg_tree.
      * cfq_queue allocation on ioprio changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-and-Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Suggested-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  10. 22 Jul, 2015 1 commit
  11. 09 Jul, 2015 4 commits
    • Tejun Heo's avatar
      blkcg: fix blkcg_policy_data allocation bug · 06b285bd
      Tejun Heo authored
      e48453c3 ("block, cgroup: implement policy-specific per-blkcg
      data") updated per-blkcg policy data to be dynamically allocated.
      When a policy is registered, its policy data aren't created.  Instead,
      when the policy is activated on a queue, the policy data are allocated
      if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
      This is buggy.  Consider the following scenario.
      1. A blkcg is created.  No blkg's attached yet.
      2. The policy is registered.  No policy data is allocated.
      3. The policy is activated on a queue.  As the above blkcg doesn't
         have any blkg's, it won't allocate the matching blkcg_policy_data.
      4. An IO is issued from the blkcg and blkg is created and the blkcg
         still doesn't have the matching policy data allocated.
      With cfq-iosched, this leads to an oops.
      It also doesn't free policy data on policy unregistration assuming
      that freeing of all policy data on blkcg destruction should take care
      of it; however, this also is incorrect.
      1. A blkcg has policy data.
      2. The policy gets unregistered but the policy data remains.
      3. Another policy gets registered on the same slot.
      4. Later, the new policy tries to allocate policy data on the previous
         blkcg but the slot is already occupied and gets skipped.  The
         policy ends up operating on the policy data of the previous policy.
      There's no reason to manage blkcg_policy_data lazily.  The reason we
      do lazy allocation of blkg's is that the number of all possible blkg's
      is the product of cgroups and block devices which can reach a
      surprising level.  blkcg_policy_data is contrained by the number of
      cgroups and shouldn't be a problem.
      This patch makes blkcg_policy_data to be allocated for all existing
      blkcg's on policy registration and freed on unregistration and removes
      blkcg_policy_data handling from policy [de]activation paths.  This
      makes that blkcg_policy_data are created and removed with the policy
      they belong to and fixes the above described problems.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: implement all_blkcgs list · 7876f930
      Tejun Heo authored
      Add all_blkcgs list goes through blkcg->all_blkcgs_node and is
      protected by blkcg_pol_mutex.  This will be used to fix
      blkcg_policy_data allocation bug.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating blkcg_policy[] · 144232b3
      Tejun Heo authored
      An entry in blkcg_policy[] is stable while there are non-bypassing
      in-flight IOs on a request_queue which has the policy activated.  This
      is why most derefs of blkcg_policy[] don't need explicit locking;
      however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't
      have this protection and may race policies being added and removed.
      Fix it by adding explicit blkcg_pol_mutex protection around
      blkcg_policy[] iteration in blkcg_css_alloc().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methods · 838f13bf
      Tejun Heo authored
      blkcg_pol_mutex primarily protects the blkcg_policy array.  It also
      protects cgroup file type [un]registration during policy addition /
      removal.  This puts blkcg_pol_mutex outside cgroup internal
      synchronization and in turn makes it impossible to grab from blkcg's
      cgroup methods as that leads to cyclic dependency.
      Another problematic dependency arising from this is through cgroup
      interface file deactivation.  Removing a cftype requires removing all
      files of the type which in turn involves draining all on-going
      invocations of the file methods.  This means that an interface file
      implementation can't grab blkcg_pol_mutex as draining can lead to AA
      blkcg_reset_stats() is already in this situation.  It currently
      trylocks blkcg_pol_mutex and then unwinds and retries the whole
      operation on failure, which is cumbersome at best.  It has a lengthy
      comment explaining how cgroup internal synchronization is involved and
      expected to be updated but as explained above this doesn't need cgroup
      internal locking to deadlock.  It's a self-contained AA deadlock.
      The described circular dependencies can be easily broken by moving
      cftype [un]registration out of blkcg_pol_mutex and protect them with
      an outer mutex.  This patch introduces blkcg_pol_register_mutex which
      wraps entire policy [un]registration including cftype operations and
      shrinks blkcg_pol_mutex critical section.  This also makes the trylock
      dancing in blkcg_reset_stats() unnecessary.  Removed.
      This patch is necessary for the following blkcg_policy_data allocation
      bug fixes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  12. 07 Jul, 2015 1 commit
  13. 07 Jun, 2015 1 commit
    • Arianna Avanzini's avatar
      block, cgroup: implement policy-specific per-blkcg data · e48453c3
      Arianna Avanzini authored
      The block IO (blkio) controller enables the block layer to provide service
      guarantees in a hierarchical fashion. Specifically, service guarantees
      are provided by registered request-accounting policies. As of now, a
      proportional-share and a throttling policy are available. They are
      implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
      subsystem. Unfortunately, as for adding new policies, the current
      implementation of the block IO controller is only halfway ready to allow
      new policies to be plugged in. This commit provides a solution to make
      the block IO controller fully ready to handle new policies.
      In what follows, we first describe briefly the current state, and then
      list the changes made by this commit.
      The throttling policy does not need any per-cgroup information to perform
      its task. In contrast, the proportional share policy uses, for each cgroup,
      both the weight assigned by the user to the cgroup, and a set of dynamically-
      computed weights, one for each device.
      The first, user-defined weight is stored in the blkcg data structure: the
      block IO controller allocates a private blkcg data structure for each
      cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
      In other words, the block IO controller internally mirrors the blkio cgroups
      with private blkcg data structures.
      On the other hand, for each cgroup and device, the corresponding dynamically-
      computed weight is maintained in the following, different way. For each device,
      the block IO controller keeps a private blkcg_gq structure for each cgroup in
      blkio. In other words, block IO also keeps one private mirror copy of the blkio
      cgroups hierarchy for each device, made of blkcg_gq structures.
      Each blkcg_gq structure keeps per-policy information in a generic array of
      dynamically-allocated 'dedicated' data structures, one for each registered
      policy (so currently the array contains two elements). To be inserted into the
      generic array, each dedicated data structure embeds a generic blkg_policy_data
      structure. Consider now the array contained in the blkcg_gq structure
      corresponding to a given pair of cgroup and device: one of the elements
      of the array contains the dedicated data structure for the proportional-share
      policy, and this dedicated data structure contains the dynamically-computed
      weight for that pair of cgroup and device.
      The generic strategy adopted for storing per-policy data in blkcg_gq structures
      is already capable of handling new policies, whereas the one adopted with blkcg
      structures is not, because per-policy data are hard-coded in the blkcg
      structures themselves (currently only data related to the proportional-
      share policy).
      This commit addresses the above issues through the following changes:
      . It generalizes blkcg structures so that per-policy data are stored in the same
        way as in blkcg_gq structures.
        Specifically, it lets also the blkcg structure store per-policy data in a
        generic array of dynamically-allocated dedicated data structures. We will
        refer to these data structures as blkcg dedicated data structures, to
        distinguish them from the dedicated data structures inserted in the generic
        arrays kept by blkcg_gq structures.
        To allow blkcg dedicated data structures to be inserted in the generic array
        inside a blkcg structure, this commit also introduces a new blkcg_policy_data
        structure, which is the equivalent of blkg_policy_data for blkcg dedicated
        data structures.
      . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
        cpd_size field and a cpd_init field, to be initialized by the policy with,
        respectively, the size of the blkcg dedicated data structures, and the
        address of a constructor function for blkcg dedicated data structures.
      . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
        the fields related to the proportional-share policy), into a new blkcg
        dedicated data structure called cfq_group_data.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@unimore.it>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  14. 02 Jun, 2015 1 commit
    • Tejun Heo's avatar
      writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested · ce7acfea
      Tejun Heo authored
      A blkg (blkcg_gq) can be congested and decongested independently from
      other blkgs on the same request_queue.  Accordingly, for cgroup
      writeback support, the congestion status at bdi (backing_dev_info)
      should be split and updated separately from matching blkg's.
      This patch prepares by adding blkg->wb_congested and associating a
      blkg with its matching per-blkcg bdi_writeback_congested on creation.
      v2: Updated to associate bdi_writeback_congested instead of
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>