    sched/wait: Break up long wake list walk · 2554db91
    Tim Chen authored
    We encountered workloads that have very long wake-up lists on large
    systems. A waker takes a long time to traverse the entire wake list and
    execute all the wake functions.
    
    We saw page wait lists with 3700+ entries in tests of large 4 and 8
    socket systems. It took 0.8 sec to traverse such a list during wake up.
    Any other CPU that contends for the list spin lock will spin for a long
    time. Such long lists are a result of the NUMA balancing migration of
    hot pages that are shared by many threads.
    
    Multiple waking CPUs queue up behind the lock, and the last one queued
    has to wait until all CPUs ahead of it have done all their wakeups.
    
    The page wait list is traversed with interrupts disabled, which caused
    various problems. This was the original cause that triggered the NMI
    watchdog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
    extending the NMI watchdog timer there helped.
    
    This patch bookmarks the waker's scan position in the wake list and
    breaks the wake up walk, to allow access to the list before the waker
    resumes its walk down the rest of the wait list. It lowers the interrupt
    and rescheduling latency.
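
The bookmark idea can be illustrated with a minimal userspace sketch. This is not the kernel code from this patch: the list helpers, `FLAG_BOOKMARK`, `WALK_BREAK_CNT` (set to 64 here as an assumption), `wake_batch()`, `wake_all()` and `demo()` are illustrative names, locking is only indicated by comments, and "waking" an entry is reduced to marking it and unlinking it.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch only: all names and the batch size are illustrative. */
#define FLAG_BOOKMARK   0x1
#define WALK_BREAK_CNT  64
#define MAX_WAITERS     4096

struct entry {
    struct entry *prev, *next;
    unsigned int flags;
    int woken;
};

static void list_init(struct entry *head)
{
    head->prev = head->next = head;
    head->flags = 0;
    head->woken = 0;
}

/* Insert 'e' immediately before 'pos' (a tail insert when pos is the head). */
static void list_insert_before(struct entry *e, struct entry *pos)
{
    e->prev = pos->prev;
    e->next = pos;
    pos->prev->next = e;
    pos->prev = e;
}

static void list_remove(struct entry *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->prev = e->next = e;
}

/*
 * Wake at most WALK_BREAK_CNT entries. If waiters remain, park 'bookmark'
 * in front of the next one and return with FLAG_BOOKMARK set, so that the
 * caller can drop the lock and call again to resume from that position.
 */
static int wake_batch(struct entry *head, struct entry *bookmark)
{
    struct entry *curr = head->next;
    int cnt = 0;

    if (bookmark->flags & FLAG_BOOKMARK) {
        /* Resume: continue after the bookmark and take it off the list. */
        curr = bookmark->next;
        list_remove(bookmark);
        bookmark->flags = 0;
    }

    while (curr != head) {
        struct entry *next = curr->next;

        if (curr->flags & FLAG_BOOKMARK) {
            /* Another waker's bookmark: not a real waiter, skip it. */
            curr = next;
            continue;
        }

        curr->woken = 1;        /* stand-in for calling the wake function */
        list_remove(curr);
        cnt++;

        if (cnt >= WALK_BREAK_CNT && next != head) {
            bookmark->flags = FLAG_BOOKMARK;
            list_insert_before(bookmark, next);
            break;
        }
        curr = next;
    }
    return cnt;
}

/* Wake everything; returns the number of separate lock hold periods. */
static int wake_all(struct entry *head, int *total)
{
    struct entry bookmark = { 0 };
    int batches = 0;

    bookmark.prev = bookmark.next = &bookmark;
    *total = 0;
    do {
        /* lock(head) would be taken here */
        *total += wake_batch(head, &bookmark);
        /* unlock(head): other CPUs may now add or remove waiters */
        batches++;
    } while (bookmark.flags & FLAG_BOOKMARK);
    return batches;
}

/* Queue n waiters, wake them all, report woken count and batch count. */
static int demo(int n, int *batches)
{
    static struct entry waiters[MAX_WAITERS];
    struct entry head;
    int total = 0;

    assert(n >= 0 && n <= MAX_WAITERS);
    list_init(&head);
    for (int i = 0; i < n; i++) {
        waiters[i].flags = 0;
        waiters[i].woken = 0;
        list_insert_before(&waiters[i], &head);
    }
    *batches = wake_all(&head, &total);
    return total;
}
```

Under these assumptions, a list of 3700 waiters is woken in 58 separate lock hold periods instead of one long one, so a contending CPU waits for at most one batch of 64 wakeups before it can take the lock.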
    
    This patch also provides a performance boost when combined with the next
    patch to break up the page wakeup list walk. We saw a 22% improvement in
    the will-it-scale file pread2 test on a Xeon Phi system running 256
    threads.
    
    [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
      simplify access to flags. ]
    
    Reported-by: Kan Liang <kan.liang@intel.com>
    Tested-by: Kan Liang <kan.liang@intel.com>
    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>