    cgroup: keep zombies associated with their original cgroups · 2e91fa7f
    Tejun Heo authored
    
    
    cgroup_exit() is called when a task exits; it disassociates the
    exiting task from its cgroups and half-attaches it to the root cgroup.
    This is unnecessary and undesirable.
    
    No controller actually needs an exiting task to be disassociated from
    non-root cgroups.  Both cpu and perf_event controllers update the
    association to the root cgroup from their exit callbacks just to keep
    consistent with the cgroup core behavior.
    
    Also, this disassociation makes it difficult to track resources held
    by zombies or to determine where the zombies came from.  Currently,
    the pids controller is completely broken as it uncharges on exit and
    zombies always escape the resource restriction.  With cgroup
    association being reset on exit, fixing it is pretty painful.
    
    There's no reason to reset cgroup membership on exit.  The zombie can
    be removed from its css_set so that it doesn't show up on
    "cgroup.procs" and thus can't be migrated or interfere with cgroup
    removal.  It can still pin and point to the css_set so that its cgroup
    membership is maintained.  This patch makes cgroup core keep zombies
    associated with their cgroups at the time of exit.
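
The lifetime rule described above can be sketched as a toy userspace
model.  This is not kernel code: every toy_* name is hypothetical and
the real css_set handling differs.  It only shows that exit hides the
task from the listing while the pointer and reference survive until
the task is freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a css_set (hypothetical, not the kernel struct). */
struct toy_css_set {
	int refcount;   /* pins held by tasks, whether listed or not */
	int nr_listed;  /* tasks that would show up in "cgroup.procs" */
};

/* Toy stand-in for a task (hypothetical). */
struct toy_task {
	struct toy_css_set *cset; /* valid until toy_task_free() */
	bool listed;              /* linked on the css_set's task list? */
};

/* Attach: take a reference and become visible in the listing. */
void toy_task_attach(struct toy_task *t, struct toy_css_set *cset)
{
	t->cset = cset;
	t->listed = true;
	cset->refcount++;
	cset->nr_listed++;
}

/* Exit: unlink from the listing only; pointer and reference stay. */
void toy_task_exit(struct toy_task *t)
{
	if (t->listed) {
		t->listed = false;
		t->cset->nr_listed--;
	}
}

/* Free: the only place where the css_set reference is dropped. */
void toy_task_free(struct toy_task *t)
{
	assert(t->cset != NULL && t->cset->refcount > 0);
	t->cset->refcount--;
	t->cset = NULL;
}
```

After toy_task_exit(), a zombie still answers "which css_set did I
belong to?" but no longer counts as a listed member, which is the
property the migration and removal paths rely on.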
    
    * Previous patches decoupled populated_cnt tracking from css_set
      lifetime, so a dying task can be simply unlinked from its css_set
      while pinning and pointing to the css_set.  This keeps css_set
      association from task side alive while hiding it from "cgroup.procs"
      and populated_cnt tracking.  The css_set reference is dropped when
      the task_struct is freed.
    
    * ->exit() callback no longer needs the css arguments as the
      associated css never changes once PF_EXITING is set.  Removed.
    
    * cpu and perf_event controllers no longer need ->exit() callbacks.
      There's no reason to explicitly switch away on exit.  The final
      schedule out is enough.  The callbacks are removed.
    
    * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
      still reports "/" for all zombies.  On the default hierarchy,
      "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
      to at the time of exit.  If the cgroup gets removed before the task
      is reaped, " (deleted)" is appended.
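
A minimal userspace sketch of how a monitoring tool might detect the
" (deleted)" marker described in the last item.  The helper name is
hypothetical, and it assumes the usual "ID:controllers:path" line
layout of "/proc/PID/cgroup".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if one line read from
 * /proc/PID/cgroup ends with " (deleted)".  A trailing newline,
 * if present, is ignored.
 */
bool cgroup_line_is_deleted(const char *line)
{
	static const char suffix[] = " (deleted)";
	const size_t slen = sizeof(suffix) - 1;
	size_t len;

	assert(line != NULL);
	len = strlen(line);

	/* Strip the newline that fgets()/getline() would leave behind. */
	if (len > 0 && line[len - 1] == '\n')
		len--;

	return len >= slen && memcmp(line + len - slen, suffix, slen) == 0;
}
```

For example, a line such as "0::/test/child (deleted)" (an
illustrative path) would be reported as deleted, while "0::/test/child"
would not.  A cgroup whose own name ends in " (deleted)" cannot be
told apart by this check.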
    
    v2: Build breakage due to missing dummy cgroup_free() when
        !CONFIG_CGROUPS fixed.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>