Commit 34a9304a authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'for-4.5' of git://

Pull cgroup updates from Tejun Heo:

 - cgroup v2 interface is now official.  It's no longer hidden behind a
   devel flag and can be mounted using the new cgroup2 fs type.

   Unfortunately, cpu v2 interface hasn't made it yet due to the
   discussion around in-process hierarchical resource distribution and
   only memory and io controllers can be used on the v2 interface at the

 - The existing documentation which has always been a bit of mess is
   relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
   is added as the authoritative documentation for the v2 interface.

 - Some features are added through for-4.5-ancestor-test branch to
   enable netfilter xt_cgroup match to use cgroup v2 paths.  The actual
   netfilter changes will be merged through the net tree which pulled in
   the said branch.

 - Various cleanups

* 'for-4.5' of git://
  cgroup: rename cgroup documentations
  cgroup: fix a typo.
  cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
  cgroup: demote subsystem init messages to KERN_DEBUG
  cgroup: Fix uninitialized variable warning
  cgroup: put controller Kconfig options in meaningful order
  cgroup: clean up the kernel configuration menu nomenclature
  cgroup_pids: fix a typo.
  Subject: cgroup: Fix incomplete dd command in blkio documentation
  cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
  cpuset: Replace all instances of time_t with time64_t
  cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
  cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
  cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
parents aee3bfa3 6255c46f
......@@ -24,7 +24,5 @@ net_prio.txt
- Network priority cgroups details and usages.
- Process number cgroups details and usages.
- Resource Counter API.
- Description the new/next cgroup interface.
......@@ -84,8 +84,7 @@ Throttling/Upper Limit policy
- Run dd to read a file and see if rate is throttled to 1MB/s or not.
# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
# iflag=direct
# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
......@@ -374,82 +373,3 @@ One can experience an overall throughput drop if you have created multiple
groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.
Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.
On traditional cgroup hierarchies, relationships between different
controllers cannot be established making it impossible for writeback
to operate accounting for cgroup resource restrictions and all
writeback IOs are attributed to the root cgroup.
If both the blkio and memory controllers are used on the v2 hierarchy
and the filesystem supports cgroup writeback, writeback operations
correctly follow the resource restrictions imposed by both memory and
blkio controllers.
Writeback examines both system-wide and per-cgroup dirty memory status
and enforces the more restrictive of the two. Also, writeback control
parameters which are absolute values - vm.dirty_bytes and
vm.dirty_background_bytes - are distributed across cgroups according
to their current writeback bandwidth.
There's a peculiarity stemming from the discrepancy in ownership
granularity between memory controller and writeback. While memory
controller tracks ownership per page, writeback operates on inode
basis. cgroup writeback bridges the gap by tracking ownership by
inode but migrating ownership if too many foreign pages, pages which
don't match the current inode ownership, have been encountered while
writing back the inode.
This is a conscious design choice as writeback operations are
inherently tied to inodes making strictly following page ownership
complicated and inefficient. The only use case which suffers from
this compromise is multiple cgroups concurrently dirtying disjoint
regions of the same inode, which is an unlikely use case and decided
to be unsupported. Note that as memory controller assigns page
ownership on the first use and doesn't update it until the page is
released, even if cgroup writeback strictly follows page ownership,
multiple cgroups dirtying overlapping areas wouldn't work as expected.
In general, write-sharing an inode across multiple cgroups is not well
Filesystem support for cgroup writeback
A filesystem can make writeback IOs cgroup-aware by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.
* wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and associates
the bio with the inode's owner cgroup. Can be called anytime
between bio allocation and submission.
* wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out. While
this function doesn't care exactly when it's called during the
writeback session, it's the easiest and most natural to call it as
data segments are added to a bio.
With writeback bio's annotated, cgroup support can be enabled per
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
This diff is collapsed.
This diff is collapsed.
......@@ -34,17 +34,12 @@ struct seq_file;
/* define the enumeration of all cgroup subsystems */
#define SUBSYS(_x) _x ## _cgrp_id,
#define SUBSYS_TAG(_t) CGROUP_ ## _t, \
__unused_tag_ ## _t = CGROUP_ ## _t - 1,
enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
#undef SUBSYS
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_NO_REF = (1 << 0), /* no reference counting for this css */
......@@ -66,7 +61,6 @@ enum {
/* cgroup_root->flags */
enum {
CGRP_ROOT_SANE_BEHAVIOR = (1 << 0), /* __DEVEL__sane_behavior specified */
CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
......@@ -439,9 +433,9 @@ struct cgroup_subsys {
int (*can_attach)(struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup_taskset *tset);
void (*attach)(struct cgroup_taskset *tset);
int (*can_fork)(struct task_struct *task, void **priv_p);
void (*cancel_fork)(struct task_struct *task, void *priv);
void (*fork)(struct task_struct *task, void *priv);
int (*can_fork)(struct task_struct *task);
void (*cancel_fork)(struct task_struct *task);
void (*fork)(struct task_struct *task);
void (*exit)(struct task_struct *task);
void (*free)(struct task_struct *task);
void (*bind)(struct cgroup_subsys_state *root_css);
......@@ -527,7 +521,6 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
#else /* CONFIG_CGROUPS */
static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {}
......@@ -97,12 +97,9 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *tsk);
void cgroup_fork(struct task_struct *p);
extern int cgroup_can_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT]);
extern void cgroup_cancel_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT]);
extern void cgroup_post_fork(struct task_struct *p,
void *old_ss_priv[CGROUP_CANFORK_COUNT]);
extern int cgroup_can_fork(struct task_struct *p);
extern void cgroup_cancel_fork(struct task_struct *p);
extern void cgroup_post_fork(struct task_struct *p);
void cgroup_exit(struct task_struct *p);
void cgroup_free(struct task_struct *p);
......@@ -562,13 +559,9 @@ static inline int cgroupstats_build(struct cgroupstats *stats,
struct dentry *dentry) { return -EINVAL; }
static inline void cgroup_fork(struct task_struct *p) {}
static inline int cgroup_can_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT])
{ return 0; }
static inline void cgroup_cancel_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT]) {}
static inline void cgroup_post_fork(struct task_struct *p,
void *ss_priv[CGROUP_CANFORK_COUNT]) {}
static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
static inline void cgroup_cancel_fork(struct task_struct *p) {}
static inline void cgroup_post_fork(struct task_struct *p) {}
static inline void cgroup_exit(struct task_struct *p) {}
static inline void cgroup_free(struct task_struct *p) {}
......@@ -6,14 +6,8 @@
* This file *must* be included with SUBSYS() defined.
* SUBSYS_TAG() is a noop if undefined.
#ifndef SUBSYS_TAG
#define __TMP_SUBSYS_TAG
#define SUBSYS_TAG(_x)
......@@ -58,17 +52,10 @@ SUBSYS(net_prio)
* Subsystems that implement the can_fork() family of callbacks.
* The following subsystems are not supported on the default hierarchy.
......@@ -76,11 +63,6 @@ SUBSYS_TAG(CANFORK_END)
......@@ -54,6 +54,7 @@
#define SMB_SUPER_MAGIC 0x517B
#define CGROUP_SUPER_MAGIC 0x27e0eb
#define CGROUP2_SUPER_MAGIC 0x63677270
#define STACK_END_MAGIC 0x57AC6E9D
This diff is collapsed.
......@@ -211,6 +211,7 @@ static unsigned long have_free_callback __read_mostly;
/* Ditto for the can_fork callback. */
static unsigned long have_canfork_callback __read_mostly;
static struct file_system_type cgroup2_fs_type;
static struct cftype cgroup_dfl_base_files[];
static struct cftype cgroup_legacy_base_files[];
......@@ -1623,10 +1624,6 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
all_ss = true;
if (!strcmp(token, "__DEVEL__sane_behavior")) {
if (!strcmp(token, "noprefix")) {
opts->flags |= CGRP_ROOT_NOPREFIX;
......@@ -1693,15 +1690,6 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
return -ENOENT;
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
pr_err("sane_behavior: no other mount options allowed\n");
return -EINVAL;
return 0;
* If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were
......@@ -1981,6 +1969,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL;
struct cgroup_subsys *ss;
struct cgroup_root *root;
......@@ -1997,6 +1986,17 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (!use_task_css_set_links)
if (is_v2) {
if (data) {
pr_err("cgroup2: unknown option \"%s\"\n", (char *)data);
return ERR_PTR(-EINVAL);
cgrp_dfl_root_visible = true;
root = &cgrp_dfl_root;
goto out_mount;
/* First find the desired set of subsystems */
......@@ -2004,15 +2004,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
if (ret)
goto out_unlock;
/* look for a matching existing root */
if (opts.flags & CGRP_ROOT_SANE_BEHAVIOR) {
cgrp_dfl_root_visible = true;
root = &cgrp_dfl_root;
ret = 0;
goto out_unlock;
* Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the
......@@ -2123,9 +2114,10 @@ out_free:
if (ret)
return ERR_PTR(ret);
dentry = kernfs_mount(fs_type, flags, root->kf_root,
if (IS_ERR(dentry) || !new_sb)
......@@ -2168,6 +2160,12 @@ static struct file_system_type cgroup_fs_type = {
.kill_sb = cgroup_kill_sb,
static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
* task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
* @task: target task
......@@ -4039,7 +4037,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
goto out_err;
* Migrate tasks one-by-one until @form is empty. This fails iff
* Migrate tasks one-by-one until @from is empty. This fails iff
* ->can_attach() fails.
do {
......@@ -5171,7 +5169,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
struct cgroup_subsys_state *css;
printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);
pr_debug("Initializing cgroup subsys %s\n", ss->name);
......@@ -5329,6 +5327,7 @@ int __init cgroup_init(void)
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(!proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations));
return 0;
......@@ -5472,19 +5471,6 @@ static const struct file_operations proc_cgroupstats_operations = {
.release = single_release,
static void **subsys_canfork_priv_p(void *ss_priv[CGROUP_CANFORK_COUNT], int i)
return &ss_priv[i - CGROUP_CANFORK_START];
return NULL;
static void *subsys_canfork_priv(void *ss_priv[CGROUP_CANFORK_COUNT], int i)
void **private = subsys_canfork_priv_p(ss_priv, i);
return private ? *private : NULL;
* cgroup_fork - initialize cgroup related fields during copy_process()
* @child: pointer to task_struct of forking parent process.
......@@ -5507,14 +5493,13 @@ void cgroup_fork(struct task_struct *child)
* returns an error, the fork aborts with that error code. This allows for
* a cgroup subsystem to conditionally allow or deny new forks.
int cgroup_can_fork(struct task_struct *child,
void *ss_priv[CGROUP_CANFORK_COUNT])
int cgroup_can_fork(struct task_struct *child)
struct cgroup_subsys *ss;
int i, j, ret;
for_each_subsys_which(ss, i, &have_canfork_callback) {
ret = ss->can_fork(child, subsys_canfork_priv_p(ss_priv, i));
ret = ss->can_fork(child);
if (ret)
goto out_revert;
......@@ -5526,7 +5511,7 @@ out_revert:
if (j >= i)
if (ss->cancel_fork)
ss->cancel_fork(child, subsys_canfork_priv(ss_priv, j));
return ret;
......@@ -5539,15 +5524,14 @@ out_revert:
* This calls the cancel_fork() callbacks if a fork failed *after*
* cgroup_can_fork() succeded.
void cgroup_cancel_fork(struct task_struct *child,
void *ss_priv[CGROUP_CANFORK_COUNT])
void cgroup_cancel_fork(struct task_struct *child)
struct cgroup_subsys *ss;
int i;
for_each_subsys(ss, i)
if (ss->cancel_fork)
ss->cancel_fork(child, subsys_canfork_priv(ss_priv, i));
......@@ -5560,8 +5544,7 @@ void cgroup_cancel_fork(struct task_struct *child,
* cgroup_task_iter_start() - to guarantee that the new task ends up on its
* list.
void cgroup_post_fork(struct task_struct *child,
void *old_ss_priv[CGROUP_CANFORK_COUNT])
void cgroup_post_fork(struct task_struct *child)
struct cgroup_subsys *ss;
int i;
......@@ -5605,7 +5588,7 @@ void cgroup_post_fork(struct task_struct *child,
* and addition to css_set.
for_each_subsys_which(ss, i, &have_fork_callback)
ss->fork(child, subsys_canfork_priv(old_ss_priv, i));
......@@ -200,7 +200,7 @@ static void freezer_attach(struct cgroup_taskset *tset)
* to do anything as freezer_attach() will put @task into the appropriate
* state.
static void freezer_fork(struct task_struct *task, void *private)
static void freezer_fork(struct task_struct *task)
struct freezer *freezer;
......@@ -134,7 +134,7 @@ static void pids_charge(struct pids_cgroup *pids, int num)
* This function follows the set limit. It will fail if the charge would cause
* the new value to exceed the hierarchical limit. Returns 0 if the charge
* succeded, otherwise -EAGAIN.
* succeeded, otherwise -EAGAIN.
static int pids_try_charge(struct pids_cgroup *pids, int num)
......@@ -209,7 +209,7 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
* task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies
* on threadgroup_change_begin() held by the copy_process().
static int pids_can_fork(struct task_struct *task, void **priv_p)
static int pids_can_fork(struct task_struct *task)
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
......@@ -219,7 +219,7 @@ static int pids_can_fork(struct task_struct *task, void **priv_p)
return pids_try_charge(pids, 1);
static void pids_cancel_fork(struct task_struct *task, void *priv)
static void pids_cancel_fork(struct task_struct *task)
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
......@@ -51,6 +51,7 @@
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/time.h>
#include <linux/time64.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
......@@ -68,7 +69,7 @@ struct static_key cpusets_enabled_key __read_mostly = STATIC_KEY_INIT_FALSE;
struct fmeter {
int cnt; /* unprocessed events count */
int val; /* most recent output value */
time_t time; /* clock (secs) when val computed */
time64_t time; /* clock (secs) when val computed */
spinlock_t lock; /* guards read or write of above */
......@@ -1374,7 +1375,7 @@ out:
#define FM_COEF 933 /* coefficient for half-life of 10 secs */
#define FM_MAXTICKS ((time_t)99) /* useless computing more ticks than this */
#define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */
#define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */
#define FM_SCALE 1000 /* faux fixed point scale */
......@@ -1390,8 +1391,11 @@ static void fmeter_init(struct fmeter *fmp)
/* Internal meter update - process cnt events and update value */
static void fmeter_update(struct fmeter *fmp)
time_t now = get_seconds();
time_t ticks = now - fmp->time;
time64_t now;
u32 ticks;
now = ktime_get_seconds();
ticks = now - fmp->time;
if (ticks == 0)
......@@ -1250,7 +1250,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
void *cgrp_ss_priv[CGROUP_CANFORK_COUNT] = {};
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
......@@ -1527,7 +1526,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
* between here and cgroup_post_fork() if an organisation operation is in
* progress.
retval = cgroup_can_fork(p, cgrp_ss_priv);
retval = cgroup_can_fork(p);
if (retval)
goto bad_fork_free_pid;
......@@ -1609,7 +1608,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
cgroup_post_fork(p, cgrp_ss_priv);
......@@ -1619,7 +1618,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
return p;
cgroup_cancel_fork(p, cgrp_ss_priv);
if (pid != &init_struct_pid)
......@@ -8342,7 +8342,7 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
static void cpu_cgroup_fork(struct task_struct *task, void *private)
static void cpu_cgroup_fork(struct task_struct *task)
......@@ -4813,7 +4813,7 @@ static void mem_cgroup_clear_mc(void)
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg;
struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
struct mem_cgroup *from;
struct task_struct *leader, *p;
struct mm_struct *mm;
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment