    smp: print more useful debug info upon receiving IPI on an offline CPU · a219ccf4
    Srivatsa S. Bhat authored
    
    
    There is a longstanding problem related to CPU hotplug which causes IPIs
    to be delivered to offline CPUs, and the smp-call-function IPI handler
    code prints out a warning whenever this is detected.  Every once in a
    while this (usually harmless) warning gets reported on LKML, but so far
    it has not been completely fixed.  Usually the solution involves finding
    out the IPI sender and fixing it by adding appropriate synchronization
    with CPU hotplug.
    
    However, while going through one such internal bug report, I found that
    there is a significant bug in the receiver side itself (more
    specifically, in stop-machine) that can lead to this problem even when
    the sender code is perfectly fine.  This patchset fixes that
    synchronization problem in the CPU hotplug stop-machine code.
    
    Patch 1 adds some additional debug code to the smp-call-function
    framework, to help debug such issues easily.
    
    Patch 2 modifies the stop-machine code to ensure that any IPIs sent
    while the target CPU was online are noticed and handled by that CPU
    before it goes offline.  This avoids scenarios where IPIs are received
    on offline CPUs (as long as the sender uses proper hotplug
    synchronization).
    
    In fact, I debugged the problem by using Patch 1, and found that the
    payload of the IPI was always the block layer's trigger_softirq()
    function.  But I was not able to find anything wrong with the block
    layer code.  That's when I started looking at the stop-machine code and
    realized that there is a race-window which makes the IPI _receiver_ the
    culprit, not the sender.  Patch 2 fixes that race and hence this should
    put an end to most of the hard-to-debug IPI-to-offline-CPU issues.
    
    This patch (of 2):
    
    Today the smp-call-function code just prints a warning if we get an IPI
    on an offline CPU.  This info is sufficient to let us know that
    something went wrong, but often it is very hard to debug exactly who
    sent the IPI and why, from this info alone.
    
    In most cases, the warning about the IPI to an offline CPU appears
    immediately after the outgoing CPU leaves the stop-machine phase and
    re-enables interrupts.  Since all online CPUs participate in
    stop-machine, the information regarding the sender of the IPI is already
    lost by the time we exit the stop-machine loop.  So even if we dump the
    stack on each CPU at this point, we won't find anything useful since all
    of them will show the stack-trace of the stopper thread.  So we need a
    better way to figure out who sent the IPI and why.
    
    To achieve this, when we detect an IPI targeted to an offline CPU, loop
    through the call-single-data linked list and print out the payload
    (i.e., the name of the function which was supposed to be executed by the
    target CPU).  This would give us an insight as to who might have sent
    the IPI and help us debug this further.
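
    The reporting step described above can be sketched in userspace C as
    follows.  This is only an illustration, not the kernel change itself:
    struct csd_entry, its name field, report_ipi_on_offline_cpu() and the
    two demo_* payloads are hypothetical stand-ins, and the queue is a
    plain singly linked list rather than the kernel's own list type.  The
    sketch also assumes the warn-once behaviour mentioned in the akpm note
    below, so only the first occurrence produces output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a queued call-single-data entry. */
struct csd_entry {
	struct csd_entry *next;     /* next queued request, NULL at the end */
	void (*func)(void *info);   /* payload the target CPU was meant to run */
	const char *name;           /* assumption: carried explicitly, since
				     * userspace cannot resolve symbol names */
};

/* Hypothetical payloads, present only so the list has something to name. */
void demo_trigger_softirq(void *info) { (void)info; }
void demo_other_payload(void *info) { (void)info; }

/*
 * Called when an IPI is noticed on 'cpu'.  If the CPU is offline and
 * requests are queued, print one warning plus the name of every queued
 * payload.  Returns the number of payload lines printed.  Only the first
 * offline occurrence prints anything; later ones return 0.
 */
int report_ipi_on_offline_cpu(int cpu, bool cpu_is_online,
			      const struct csd_entry *head, FILE *out)
{
	static bool warned;
	int printed = 0;

	if (cpu_is_online || warned || head == NULL)
		return 0;

	warned = true;
	fprintf(out, "IPI on offline CPU %d\n", cpu);

	/* The list is only read here; no callback is invoked. */
	for (const struct csd_entry *e = head; e != NULL; e = e->next) {
		fprintf(out, "IPI callback %s sent to offline CPU\n", e->name);
		printed++;
	}

	return printed;
}
```

    Calling report_ipi_on_offline_cpu() with a two-entry queue on an
    offline CPU prints the warning followed by one line per payload; a
    second call prints nothing.  The payload names are what point back to
    the likely sender.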
    
    [akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
    Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Mike Galbraith <mgalbraith@suse.de>
    Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>