    watchdog: update watchdog_thresh properly · 9809b18f
    Michal Hocko authored
    
    
    watchdog_thresh controls how often the NMI perf event counter checks the
    per-cpu hrtimer_interrupts counter and blows up if the counter hasn't
    changed since the last check.  The counter is updated by the per-cpu
    watchdog_hrtimer hrtimer, which is scheduled with a period of 2/5 of
    watchdog_thresh; this guarantees that the hrtimer fires 2 times per
    main period.  Both the hrtimer and the perf event are started together
    when the watchdog is enabled.
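
The 2/5 relation above can be checked with a small user-space sketch. It is
not kernel code: the function names are made up for illustration, and the NMI
check window is modelled as plain wall-clock time, which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* hrtimer period: 2/5 of watchdog_thresh (seconds), in nanoseconds. */
uint64_t hrtimer_period_ns(unsigned int watchdog_thresh)
{
    assert(watchdog_thresh > 0);
    return (uint64_t)watchdog_thresh * 2 * (NSEC_PER_SEC / 5);
}

/*
 * Number of hrtimer expirations that fit into one NMI check window,
 * assuming the window is watchdog_thresh seconds long.
 */
unsigned int hrtimer_fires_per_check(unsigned int watchdog_thresh)
{
    uint64_t window_ns = (uint64_t)watchdog_thresh * NSEC_PER_SEC;

    return (unsigned int)(window_ns / hrtimer_period_ns(watchdog_thresh));
}
```

With watchdog_thresh = 10 the hrtimer period comes out as 4 seconds, so two
expirations fit into each 10 second check window.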
    
    So far so good.  But what happens when watchdog_thresh is updated from
    the sysctl handler?
    
    proc_dowatchdog will set a new sampling period and the hrtimer callback
    (watchdog_timer_fn) will use the new value in the next round.  The
    problem, however, is that nobody tells the perf event that the sampling
    period has changed, so it keeps ticking with the period configured when
    it was set up.
    
    This might result in an ear-ripping dissonance between the perf and
    hrtimer parts if watchdog_thresh is increased: the hrtimer slows down
    while the perf event keeps checking at the old, shorter period.  Even
    worse, it might lead to KABOOM if the watchdog is configured to panic
    on such a spurious lockup.
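
A toy model of the mismatch, as a sketch only: time advances in whole
seconds, the hrtimer bumps a counter every hrtimer_period seconds, and the
NMI check runs every nmi_period seconds. The function name and the
whole-second granularity are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if some NMI check finds the interrupt counter unchanged
 * since the previous check, i.e. the detector would report a hard lockup
 * although the CPU is fine.
 */
bool spurious_lockup(unsigned int nmi_period, unsigned int hrtimer_period,
                     unsigned int duration)
{
    unsigned long interrupts = 0, saved = 0;

    assert(nmi_period > 0 && hrtimer_period > 0);

    for (unsigned int t = 1; t <= duration; t++) {
        if (t % hrtimer_period == 0)
            interrupts++;              /* hrtimer callback ran */
        if (t % nmi_period == 0) {
            if (interrupts == saved)
                return true;           /* counter did not move */
            saved = interrupts;
        }
    }
    return false;
}
```

Raising watchdog_thresh from 10 to 60 moves the hrtimer period from 4 to 24
seconds. In this model a perf event left at the old 10 second period trips on
its first check, while a consistent 60/24 pair never does.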
    
    This patch fixes the issue by updating both the NMI perf event counter
    and the hrtimers if the threshold value has changed.
    
    The NMI one is disabled and then reinitialized from scratch.  This has
    an unpleasant side effect: the allocation of the new event might
    theoretically fail, in which case the hard lockup detector would be
    disabled for such cpus.  On the other hand, such a memory allocation
    failure is very unlikely because the original event is deallocated
    right before.
    
    It would be much nicer if we just changed the perf event period, but
    there doesn't seem to be any API to do that right now.  It is also
    unfortunate that perf_event_alloc uses GFP_KERNEL allocation
    unconditionally, so we cannot use on_each_cpu() and do the same thing
    from the per-cpu context.  The update from the current CPU should be
    safe because perf_event_disable removes the event atomically before it
    clears the per-cpu watchdog_ev, so it cannot change anything under the
    running handler's feet.
    
    The hrtimer is simply restarted (thanks to Don Zickus, who pointed
    this out) if it is queued, because we cannot rely on it to fire and
    adapt to the new sampling period before a new NMI event triggers (when
    the threshold is decreased).
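
Why the restart matters when the threshold is decreased can be sketched with
a little arithmetic. This is a simplified model, not the patch itself: the
helper names are hypothetical, and it assumes the first NMI check happens one
new threshold after the update and that a restarted hrtimer fires one new
period after the update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* New hrtimer period in milliseconds: 2/5 of the threshold (seconds). */
uint64_t new_hrtimer_period_ms(unsigned int thresh_sec)
{
    assert(thresh_sec > 0);
    return (uint64_t)thresh_sec * 1000 * 2 / 5;
}

/*
 * Does the hrtimer tick before the first NMI check after the update?
 * pending_expiry_ms is the time left on the already queued hrtimer,
 * which was armed with the old, longer period.
 */
bool tick_before_first_check(uint64_t pending_expiry_ms,
                             unsigned int new_thresh_sec, bool restart)
{
    uint64_t first_check_ms = (uint64_t)new_thresh_sec * 1000;
    uint64_t next_tick_ms = restart ? new_hrtimer_period_ms(new_thresh_sec)
                                    : pending_expiry_ms;

    return next_tick_ms < first_check_ms;
}
```

For a drop from 60 to 10 seconds with 20 seconds left on the queued hrtimer,
the model shows the NMI check arriving first unless the hrtimer is restarted
with the new 4 second period.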
    
    [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: Don Zickus <dzickus@redhat.com>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Fabio Estevam <festevam@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>