Commit 029190c5 authored by Paul Jackson, committed by Linus Torvalds

cpuset sched_load_balance flag

Add a new per-cpuset flag called 'sched_load_balance'.

When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.

When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.

Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.

If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.

Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same sched_load_balance enabled cpuset in the same element
of the partition.

This serves two purposes:
 1) It provides a mechanism for real time isolation of some CPUs, and
 2) it can be used to improve performance on systems with many CPUs
    by supporting configurations in which load balancing is not done
    across all CPUs at once, but rather only done in several smaller
    disjoint sets of CPUs.

This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets

See further the Documentation and comments in the code itself.

[ don't be weird]
Signed-off-by: Paul Jackson <>
Acked-by: Ingo Molnar <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>
parent 2f2a3a46
@@ -19,7 +19,8 @@ CONTENTS:
1.4 What are exclusive cpusets ?
1.5 What is memory_pressure ?
1.6 What is memory spread ?
-1.7 How do I use cpusets ?
+1.7 What is sched_load_balance ?
+1.8 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
@@ -359,8 +360,144 @@ policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.
-1.7 How do I use cpusets ?
+1.7 What is sched_load_balance ?
The kernel scheduler (kernel/sched.c) automatically load balances
tasks. If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.
The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced. So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.
Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.
By default, there is one sched domain covering all CPUs, except those
marked isolated using the kernel boot time "isolcpus=" argument.
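As a rough illustration of the default described above, the following sketch computes the CPUs covered by the single default sched domain. It is a user-space approximation, not kernel code: the type cpu_set64 and the function default_domain() are hypothetical names, and a 64-bit integer stands in for cpumask_t.

```c
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n is in the set. */
typedef uint64_t cpu_set64;

/*
 * CPUs covered by the default sched domain: every CPU in 'cpu_map'
 * that is not marked isolated. This has the same and-not shape as
 * the cpus_andnot() call in the patch (result = src1 & ~src2).
 */
static cpu_set64 default_domain(cpu_set64 cpu_map, cpu_set64 isolated)
{
	return cpu_map & ~isolated;
}
```

For example, with eight CPUs present and CPUs 2 and 3 isolated, the default domain would cover CPUs 0-1 and 4-7.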
This default load balancing across all CPUs is not well suited for
the following two situations:
1) On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary.
2) Systems supporting realtime on some CPUs need to minimize
system overhead on those CPUs, including avoiding task load
balancing if that is not needed.
When the per-cpuset flag "sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.
When the per-cpuset flag "sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.
So, for example, if the top cpuset has the flag "sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.
Therefore in the above two situations, the top cpuset flag
"sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.
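A minimal sketch of how a user-space tool might toggle the flag follows. The helper name write_cpuset_flag() is hypothetical, and the path used in the usage note assumes the cpuset filesystem is mounted at /dev/cpuset; neither is specified by this patch.

```c
#include <stdio.h>

/*
 * Write "1" (enabled) or "0" (disabled) to a cpuset control file,
 * such as a cpuset's sched_load_balance file.
 * Returns 0 on success, -1 on failure.
 */
static int write_cpuset_flag(const char *path, int enabled)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(enabled ? "1" : "0", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a write error can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Under the assumed mount point, disabling the flag in the top cpuset would look like write_cpuset_flag("/dev/cpuset/sched_load_balance", 0), followed by enabling it in each child cpuset that should still be balanced.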
When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendent cpusets. Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.
Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "sched_load_balance" as those tasks aren't going anywhere
else anyway.
There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest. Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.
It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding. So if each of two partially
overlapping cpusets enables the flag 'sched_load_balance', then we
form a single sched domain that is a superset of both. We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.
This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "sched_load_balance" enabled,
and the sched domain configuration. If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.
If two cpusets have partially overlapping 'cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above. In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.
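The partial balancing case can be pictured with bitmasks. The sketch below is an illustration only: cpu_set64 and balanced_cpus() are hypothetical names, and it assumes that a CPU is load balanced exactly when it belongs to at least one cpuset that has the flag enabled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n is in the set. */
typedef uint64_t cpu_set64;

/*
 * For a cpuset with allowed CPUs 'cpus' and the flag disabled, return
 * the subset of its CPUs that are still load balanced because they
 * also belong to some cpuset in 'enabled' (cpusets with the flag set).
 */
static cpu_set64 balanced_cpus(cpu_set64 cpus, const cpu_set64 *enabled,
			       size_t n)
{
	cpu_set64 covered = 0;
	size_t i;

	for (i = 0; i < n; i++)
		covered |= enabled[i];
	return cpus & covered;
}
```

With a flag-disabled cpuset on CPUs 0-3 and a flag-enabled cpuset on CPUs 2-5, only CPUs 2-3 of the first cpuset are balanced; a task sitting on CPU 0 or 1 is not moved.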
1.7.1 sched_load_balance implementation details.
The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
to most cpuset flags.) When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)
If two overlapping cpusets both have 'sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.
If, as is the default, the top cpuset has 'sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.
The kernel commits to user space that it will avoid load balancing
where it can. It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
the CPUs that must be load balanced.
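One way to picture how such a partition could be derived is to start from the 'cpus' masks of all cpusets with the flag enabled and merge any two masks that overlap until none do. The sketch below does this in user space. It is not the cpuset code from this patch (that part of the diff is not shown here); cpu_set64 and build_partition() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n is in the set. */
typedef uint64_t cpu_set64;

/*
 * Merge overlapping masks in place until the remaining masks are
 * pairwise disjoint. Empty masks are dropped. Returns the number of
 * masks left at the front of 'doms'.
 */
static size_t build_partition(cpu_set64 *doms, size_t n)
{
	size_t i, j;
	int merged;

	/* Drop empty masks. */
	for (i = 0, j = 0; i < n; i++)
		if (doms[i])
			doms[j++] = doms[i];
	n = j;

	/* Repeat until a full pass finds no overlapping pair. */
	do {
		merged = 0;
		for (i = 0; i < n; i++) {
			for (j = i + 1; j < n; j++) {
				if (doms[i] & doms[j]) {
					doms[i] |= doms[j];
					doms[j] = doms[n - 1];	/* delete slot j */
					n--;
					merged = 1;
					j--;	/* re-examine the moved mask */
				}
			}
		}
	} while (merged);

	return n;
}
```

If the top cpuset (all CPUs) is among the inputs, every other mask overlaps it and the result collapses to a single element, which matches the single sched domain case described earlier.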
Whenever the 'sched_load_balance' flag changes, or CPUs come or go
from a cpuset with this flag enabled, or a cpuset with this flag
enabled is removed, the cpuset code builds a new such partition and
passes it to the scheduler sched domain setup code, to have the sched
domains rebuilt as necessary.
This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (cpumask_t) in the partition.
The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.
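The comparison step can be sketched in user space as two passes over the old and new partitions. This is an approximation of the idea only: cpu_set64, struct update_plan, has_mask() and plan_update() are hypothetical names, and the sketch counts the changes instead of tearing down or building real sched domains.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n is in the set. */
typedef uint64_t cpu_set64;

struct update_plan {
	size_t destroyed;	/* current domains absent from the new partition */
	size_t built;		/* new domains absent from the current partition */
};

/* Return 1 if 'mask' appears in doms[0..n-1], else 0. */
static int has_mask(const cpu_set64 *doms, size_t n, cpu_set64 mask)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (doms[i] == mask)
			return 1;
	return 0;
}

static struct update_plan plan_update(const cpu_set64 *cur, size_t ncur,
				      const cpu_set64 *next, size_t nnext)
{
	struct update_plan plan = { 0, 0 };
	size_t i;

	/* Destroy pass: current domains with no equal mask in 'next'. */
	for (i = 0; i < ncur; i++)
		if (!has_mask(next, nnext, cur[i]))
			plan.destroyed++;

	/* Build pass: new domains with no equal mask in 'cur'. */
	for (i = 0; i < nnext; i++)
		if (!has_mask(cur, ncur, next[i]))
			plan.built++;

	return plan;
}
```

A domain whose mask appears in both partitions is left alone, so splitting one of two domains touches only the domain that changed.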
1.8 How do I use cpusets ?
In order to minimize the impact of cpusets on critical kernel
@@ -737,6 +737,8 @@ struct sched_domain {
+extern void partition_sched_domains(int ndoms_new, cpumask_t *doms_new);
 #endif /* CONFIG_SMP */
@@ -6376,26 +6376,31 @@ error:
 	return -ENOMEM;
+static cpumask_t *doms_cur;	/* current sched domains */
+static int ndoms_cur;		/* number of sched domains in 'doms_cur' */
+
+/*
+ * Special case: If a kmalloc of a doms_cur partition (array of
+ * cpumask_t) fails, then fallback to a single sched domain,
+ * as determined by the single cpumask_t fallback_doms.
+ */
+static cpumask_t fallback_doms;
+
 /*
  * Set up scheduler domains and groups. Callers must hold the hotplug lock.
+ * For now this just excludes isolated cpus, but could be used to
+ * exclude other special cases in the future.
  */
 static int arch_init_sched_domains(const cpumask_t *cpu_map)
 {
-	cpumask_t cpu_default_map;
-	int err;
-	/*
-	 * Setup mask for cpus without special case scheduling requirements.
-	 * For now this just excludes isolated cpus, but could be used to
-	 * exclude other special cases in the future.
-	 */
-	cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
-	err = build_sched_domains(&cpu_default_map);
+	ndoms_cur = 1;
+	doms_cur = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
+	if (!doms_cur)
+		doms_cur = &fallback_doms;
+	cpus_andnot(*doms_cur, *cpu_map, cpu_isolated_map);
-	return err;
+	return build_sched_domains(doms_cur);
 }
 
 static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
@@ -6419,6 +6424,68 @@ static void detach_destroy_domains(const cpumask_t *cpu_map)
+/*
+ * Partition sched domains as specified by the 'ndoms_new'
+ * cpumasks in the array doms_new[] of cpumasks. This compares
+ * doms_new[] to the current sched domain partitioning, doms_cur[].
+ * It destroys each deleted domain and builds each new domain.
+ *
+ * 'doms_new' is an array of cpumask_t's of length 'ndoms_new'.
+ * The masks don't intersect (don't overlap.) We should setup one
+ * sched domain for each mask. CPUs not in any of the cpumasks will
+ * not be load balanced. If the same cpumask appears both in the
+ * current 'doms_cur' domains and in the new 'doms_new', we can leave
+ * it as it is.
+ *
+ * The passed in 'doms_new' should be kmalloc'd. This routine takes
+ * ownership of it and will kfree it when done with it. If the caller
+ * failed the kmalloc call, then it can pass in doms_new == NULL,
+ * and partition_sched_domains() will fallback to the single partition
+ * 'fallback_doms'.
+ *
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_t *doms_new)
+{
+	int i, j;
+
+	if (doms_new == NULL) {
+		ndoms_new = 1;
+		doms_new = &fallback_doms;
+		cpus_andnot(doms_new[0], cpu_online_map, cpu_isolated_map);
+	}
+
+	/* Destroy deleted domains */
+	for (i = 0; i < ndoms_cur; i++) {
+		for (j = 0; j < ndoms_new; j++) {
+			if (cpus_equal(doms_cur[i], doms_new[j]))
+				goto match1;
+		}
+		/* no match - a current sched domain not in new doms_new[] */
+		detach_destroy_domains(doms_cur + i);
+match1:
+		;
+	}
+
+	/* Build new domains */
+	for (i = 0; i < ndoms_new; i++) {
+		for (j = 0; j < ndoms_cur; j++) {
+			if (cpus_equal(doms_new[i], doms_cur[j]))
+				goto match2;
+		}
+		/* no match - add a new doms_new */
+		build_sched_domains(doms_new + i);
+match2:
+		;
+	}
+
+	/* Remember the new sched domains */
+	if (doms_cur != &fallback_doms)
+		kfree(doms_cur);
+	doms_cur = doms_new;
+	ndoms_cur = ndoms_new;
+}
+
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
 static int arch_reinit_sched_domains(void)