Skip to content
  • Konstantin Khlebnikov's avatar
    mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
    Konstantin Khlebnikov authored
    Currently the kernel sets mm->exe_file during sys_execve() and then tracks
    number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
    as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
    resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
    
    VMA with VM_EXECUTABLE flag appears after mapping file with flag
    MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
    splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
    mm->exe_file and mmaps executable vmas with this file, they hold
    mm->exe_file while task is running.
    
    comment from v2.6.25-6245-g925d1c40
    
     ("procfs task exe symlink"),
    where all this stuff was introduced:
    
    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA.  Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs.  So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped.  This avoids pinning the mounted filesystem.
    
    exe_file's vma accounting is hooked into every file mmap/unmmap and vma
    split/merge just to fix some hypothetical pinning fs from umounting by mm,
    which already unmapped all its executable files, but still alive.
    
    Seems like currently nobody depends on this behaviour.  We can try to
    remove this logic and keep mm->exe_file until final mmput().
    
    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
    task can change its mm->exe_file and unpin mountpoint explicitly.
    
    Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Carsten Otte <cotte@de.ibm.com>
    Cc: Chris Metcalf <cmetcalf@tilera.com>
    Cc: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: Eric Paris <eparis@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: James Morris <james.l.morris@oracle.com>
    Cc: Jason Baron <jbaron@redhat.com>
    Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
    Cc: Matt Helsley <matthltc@us.ibm.com>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Robert Richter <robert.richter@amd.com>
    Cc: Suresh Siddha <suresh.b.siddha@intel.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Venkatesh Pallipadi <venki@google.com>
    Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    e9714acf