    debug_core: refactor locking for master/slave cpus · dfee3a7b
    Jason Wessel authored
    
    
    For quite some time there have been problems with memory barriers and
    various races with NMI on multiprocessor systems using the kernel
    debugger.  The algorithm for entering the kernel debug core and
    resuming kernel execution was racy and had several known edge-case
    problems when debugging a heavily loaded system with breakpoints
    that are hit repeatedly and in quick succession.
    
    In the prior "locking" design, entry worked as follows:
    
      * The atomic counter kgdb_active was used with atomic exchange in
        order to elect a master cpu out of all the cpus that may have
        taken a debug exception.
      * The master cpu increments all elements of passive_cpu_wait[].
      * The master cpu issues the round up cpus message.
      * Each "slave cpu" that enters the debug core increments its own
        element in cpu_in_kgdb[].
      * Each "slave cpu" spins on passive_cpu_wait[] until it becomes 0.
      * The master cpu debugs the system.
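
    The prior scheme can be pictured with a small user-space model.  This
    is only a sketch under assumptions, not the kernel code: it uses C11
    atomics, a compare-and-exchange for the master election, a fixed
    NR_CPUS of 4, and hypothetical old_* helper names.  Only
    passive_cpu_wait[], cpu_in_kgdb[] and kgdb_active (modelled here as
    old_kgdb_active) come from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4	/* assumption: small fixed cpu count for the model */

/* Models kgdb_active: cpu id of the master, or -1 when nobody is debugging. */
static atomic_int old_kgdb_active = -1;
static atomic_int passive_cpu_wait[NR_CPUS];
static atomic_int cpu_in_kgdb[NR_CPUS];

/* Elect a master: only the cpu that swaps -1 for its own id wins. */
static bool old_elect_master(int cpu)
{
	int expected = -1;

	return atomic_compare_exchange_strong(&old_kgdb_active, &expected, cpu);
}

/* Master: mark every cpu as "must wait" before sending the round-up. */
static void old_master_hold_cpus(void)
{
	for (int i = 0; i < NR_CPUS; i++)
		atomic_fetch_add(&passive_cpu_wait[i], 1);
}

/* Slave: announce arrival in its own per-cpu slot. */
static void old_slave_enter(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	atomic_fetch_add(&cpu_in_kgdb[cpu], 1);
}

/* Slave: keeps spinning for as long as this returns true. */
static bool old_slave_must_wait(int cpu)
{
	return atomic_load(&passive_cpu_wait[cpu]) != 0;
}
```

    Every cpu owns a pair of counters in this model, so entry and exit
    involve many independent atomic updates whose ordering has to be
    reasoned about one by one.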
    
    The new scheme removes the two arrays of atomic counters and replaces
    them with two single counters.  One counter is used to count the number
    of cpus waiting to become a master cpu (because one or more hit an
    exception). The second counter is used to indicate how many cpus have
    entered as slave cpus.
    
    The new entry logic works as follows:
    
      * One or more cpus enter via kgdb_handle_exception() and increment
        masters_in_kgdb. Each cpu attempts to get the spin lock called
        dbg_master_lock; the cpu that gets it becomes the master cpu.
      * The master cpu sets kgdb_active to the current cpu.
      * The master cpu takes the spinlock dbg_slave_lock.
      * The master cpu asks to round up all the other cpus.
      * Each slave cpu that is not already in kgdb_handle_exception()
        will enter and increment slaves_in_kgdb.  Each slave will now spin
        try_locking on dbg_slave_lock.
      * The master cpu waits for the sum of masters_in_kgdb and slaves_in_kgdb
        to be equal to the number of online cpus.
      * The master cpu debugs the system.
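
    The new entry steps can be sketched the same way.  Again this is a
    user-space model under assumptions rather than the kernel
    implementation: the spinlocks are modelled as atomic booleans, and
    model_trylock(), exception_entry(), try_become_master(), slave_entry(),
    slave_must_wait() and roundup_complete() are hypothetical names.
    masters_in_kgdb, slaves_in_kgdb, kgdb_active, dbg_master_lock and
    dbg_slave_lock are taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space model; names follow the commit message, bodies are assumptions. */
static atomic_int  kgdb_active = -1;	/* cpu id of the master, -1 if none */
static atomic_int  masters_in_kgdb;	/* cpus that entered via an exception */
static atomic_int  slaves_in_kgdb;	/* cpus that entered via the round-up */
static atomic_bool dbg_master_lock;	/* true while held */
static atomic_bool dbg_slave_lock;	/* true while held */

/* Hypothetical helper: returns true if the caller acquired the lock. */
static bool model_trylock(atomic_bool *lock)
{
	return !atomic_exchange(lock, true);
}

/* A cpu that hit a debug exception is counted once on entry. */
static void exception_entry(void)
{
	atomic_fetch_add(&masters_in_kgdb, 1);
}

/* Called repeatedly by each candidate until one of them wins. */
static bool try_become_master(int cpu)
{
	bool got_slave_lock;

	if (!model_trylock(&dbg_master_lock))
		return false;

	/* kgdb_active is written only while dbg_master_lock is held. */
	assert(atomic_load(&kgdb_active) == -1);
	atomic_store(&kgdb_active, cpu);

	/* Taking dbg_slave_lock parks every slave that arrives later. */
	got_slave_lock = model_trylock(&dbg_slave_lock);
	assert(got_slave_lock);
	(void)got_slave_lock;
	return true;
}

/* A rounded-up cpu is counted once on entry. */
static void slave_entry(void)
{
	atomic_fetch_add(&slaves_in_kgdb, 1);
}

/* Slaves keep spinning for as long as the master holds dbg_slave_lock. */
static bool slave_must_wait(void)
{
	return atomic_load(&dbg_slave_lock);
}

/* Master: every online cpu is accounted for in one of the two counters. */
static bool roundup_complete(int online_cpus)
{
	return atomic_load(&masters_in_kgdb) +
	       atomic_load(&slaves_in_kgdb) == online_cpus;
}
```

    In this model the master needs only two counter reads to know that
    the round-up has finished, instead of walking a per-cpu array.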
    
    In the new design, kgdb_active can only be changed while holding
    dbg_master_lock.  Stress testing has not turned up any of the
    entry/exit races that existed in the prior locking design.  The prior
    locking design suffered from atomic variables not being truly atomic
    (in the capacity as used by kgdb) along with memory barrier races.
    
    Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
    Acked-by: Dongdong Deng <dongdong.deng@windriver.com>