1. 01 Nov, 2018 26 commits
    • Philippe Gerum's avatar
      stop_machine: fix build in !SMP case · 74985847
      Philippe Gerum authored
      74985847
    • Philippe Gerum's avatar
      PM: ipipe: converge to Dovetail's CPUIDLE management · cb5702e0
      Philippe Gerum authored
      Handle requests for transitioning to deeper C-states the way Dovetail
      does, which prevents us from losing the timer when grabbed by a
      co-kernel, in presence of a CPUIDLE driver.
      cb5702e0
    • Philippe Gerum's avatar
      genirq: ipipe: add local_irq_{enable, disable}_full() helpers · c116c682
      Philippe Gerum authored
      Those helpers affect both the real (in CPU) and virtual interrupt
      states for the root stage, reconciling them.
      c116c682
    • Philippe Gerum's avatar
      ipipe: tick: cap timer_set op to device supported max · 78580ae1
      Philippe Gerum authored
      At this chance, switch the min_delay_tick value to unsigned long to
      match the corresponding clockevent definition.
      78580ae1
    • Philippe Gerum's avatar
      stop_machine: ipipe: ensure atomic stop-context operations · aed125e2
      Philippe Gerum authored
      stop_machine() guarantees that all online CPUs are spinning
      non-preemptible in a known code location before a subset of them may
      safely run a stop-context function. This service is typically useful
      for live patching the kernel code, or changing global memory mappings,
      so that no activity could run in parallel until the system has
      returned to a stable state after all stop-context operations have
      completed.
      
      When interrupt pipelining is enabled, we have to provide the same
      guarantee by restoring hard interrupt disabling where virtualizing the
      interrupt disable flag would defeat it.
      aed125e2
    • Philippe Gerum's avatar
      irqchip: gic: ipipe: enable interrupt pipelining · f64b3af9
      Philippe Gerum authored
      Fix up the ARM generic interrupt controller driver in order to channel
      interrupts through the interrupt pipeline.
      f64b3af9
    • Gilles Chanteperdrix's avatar
      clocksource: dw_apb: ipipe: add support for user-visible TSC · 1a190d8b
      Gilles Chanteperdrix authored
      Expose the clocksource to userland processes via the ARM-specific
      user-TSC interface if enabled.
      1a190d8b
    • Gilles Chanteperdrix's avatar
      clocksource: sp804: ipipe: add support for user-visible TSC · 75012af4
      Gilles Chanteperdrix authored
      Expose the clocksource to userland processes via the ARM-specific
      user-TSC interface.
      75012af4
    • Philippe Gerum's avatar
      ipipe: add cpuidle control interface · caf90a26
      Philippe Gerum authored
      Add a kernel interface for sharing CPU idling control between the host
      kernel and a co-kernel. The former invokes ipipe_cpuidle_control()
      which the latter should implement, for determining whether entering a
      sleep state is ok. This hook should return boolean true if so.
      
      The co-kernel may veto such entry if need be, in order to prevent
      latency spikes, as exiting sleep states might be costly depending on
      the CPU idling operation being used.
      caf90a26
    • Philippe Gerum's avatar
      gpio: ipipe: convert to hard locking · 6fd84340
      Philippe Gerum authored
      The generic GPIO chip needs to be hard locked as its handlers may be
      called from out-of-band code running in the head domain. To this end,
      convert the regular spinlock protecting the gpio_chip descriptor to a
      hard lock.
      6fd84340
    • Philippe Gerum's avatar
      ftrace: ipipe: enable tracing from the head domain · 1430346b
      Philippe Gerum authored
      Enabling ftrace for a co-kernel running in the head domain of a
      pipelined interrupt context means to:
      
      - make sure that ftrace's live kernel code patching still runs
        unpreempted by any head domain activity (so that the latter can't
        tread on invalid or half-baked changes in the .text section).
      
      - allow the co-kernel code running in the head domain to traverse
        ftrace's tracepoints safely.
      
      The changes introduced by this commit ensure this by fixing up some
      key critical sections so that interrupts are still disabled in the
      CPU, undoing the interrupt flag virtualization in those particular
      cases.
      1430346b
    • Philippe Gerum's avatar
      mm: ipipe: disable ondemand memory · 8ade150c
      Philippe Gerum authored
      Co-kernels cannot bear with the extra latency caused by memory access
      faults involved in COW or
      overcommit. __ipipe_disable_ondemand_mappings() force commits all
      common memory mappings with physical RAM.
      
      In addition, the architecture code is given a chance to pre-load page
      table entries for ioremap and vmalloc memory, for preventing further
      minor faults accessing such memory due to PTE misses (if that ever
      makes sense for them).
      
      Revisit: Further COW breaking in copy_user_page() and copy_pte_range()
      may be useless once __ipipe_disable_ondemand_mappings() has run for a
      co-kernel task, since all of its mappings have been populated, and
      unCOWed if applicable.
      8ade150c
    • Philippe Gerum's avatar
      KVM: ipipe: keep hypervisor state consistent across domain preemption · 1db0d31d
      Philippe Gerum authored
      In order for the hypervisor to operate properly in presence of a
      co-kernel, we need:
      
      - the virtualization core to know when the hypervisor stalls due
        to a preemption by the co-kernel.
      
      - to know when the VM enters and leaves guest mode.
      1db0d31d
    • Philippe Gerum's avatar
      sched: ipipe: add domain debug checks to common scheduling paths · 5de5c14f
      Philippe Gerum authored
      Catch invalid calls of root-only code from the head domain from common
      paths which may lead to blocking the current task linux-wise. Checks
      are enabled by CONFIG_IPIPE_DEBUG_CONTEXT.
      5de5c14f
    • Philippe Gerum's avatar
      sched: ipipe: enable task migration between domains · 957ac4c9
      Philippe Gerum authored
      This is the basic code enabling alternate control of tasks between the
      regular kernel and an embedded co-kernel. The changes cover the
      following aspects:
      
      - extend the per-thread information block with a private area usable
        by the co-kernel for storing additional state information
      
      - provide the API enabling a scheduler exchange mechanism, so that
        tasks can run under the control of either kernel alternatively. This
        includes a service to move the current task to the head domain under
        the control of the co-kernel, and the converse service to re-enter
        the root domain once the co-kernel has released such task.
      
      - ensure the generic context switching code can be used from any
        domain, serializing execution as required.
      
      These changes have to be paired with arch-specific code further
      enabling context switching from the head domain.
      957ac4c9
    • Philippe Gerum's avatar
      clockevents: ipipe: connect clock chips to abstract tick device · 1befe95a
      Philippe Gerum authored
      Announce all clock event chips as they are registered to the
      out-of-band tick device infrastructure, so that we can interpose on
      key handlers in their descriptors.
      1befe95a
    • Philippe Gerum's avatar
      timekeeping: ipipe: forward clock shift value to DSO helpers · 9e2ee43a
      Philippe Gerum authored
      In order to propagate the "host real-time update" event to a co-kernek
      (IPIPE_KEVT_HOSTRT), we need the clock shift value of the monotonic
      clock to be passed to the legacy vDSO handler, for (re)calculating the
      new wall clock time which is eventually announced to the co-kernel.
      
      Only architectures which still implement the legacy
      update_vsyscall_old() interface need this change.
      9e2ee43a
    • Philippe Gerum's avatar
      ipipe: add kernel event notifiers · 96df83cb
      Philippe Gerum authored
      Add the core API for enabling (regular) kernel event notifications to
      a co-kernel running over the head domain. For instance, such a
      co-kernel may need to know when a task is about to be resumed upon
      signal receipt, or when it gets an access fault trap.
      
      This commit adds the client-side API for enabling such notification
      for class of events, but does not provide the notification points per
      se, which comes later.
      96df83cb
    • Philippe Gerum's avatar
      printk: ipipe: add raw console channel · 4cd03bbb
      Philippe Gerum authored
      A raw output handler (.write_raw) is added to the console descriptor
      for writing (short) text output unmodified, without any logging,
      header or preparation whatsoever, usable from any pipeline domain.
      
      The dedicated raw_printk() variant formats the output message then
      passes it on to the handler holding a hard spinlock, irqs off.
      
      This is a very basic debug channel for situations when resorting to
      the fairly complex printk() handling is not an option. Unlike early
      consoles, regular consoles can provide a raw output service past the
      boot sequence. Raw output handlers are typically provided by serial
      console devices.
      4cd03bbb
    • Philippe Gerum's avatar
      atomic: ipipe: keep atomic when pipelining IRQs · 76498343
      Philippe Gerum authored
      Because of the virtualization of interrupt masking for the regular
      kernel code when the pipeline is enabled, atomic helpers relying on
      common interrupt disabling helpers such as local_irq_save/restore
      pairs would not be atomic anymore, leading to data corruption.
      
      This commit restores true atomicity for the atomic helpers that would
      be otherwise affected by interrupt virtualization.
      76498343
    • Philippe Gerum's avatar
      preempt: ipipe: : add preemption-safe hard_preempt_{enable, disable}() ops · 0c4d0889
      Philippe Gerum authored
      Some inner code of the interrupt pipeline may have to traverse regular
      kernel code which manipulates the preemption count, expecting full
      serialization including with out-of-band contexts.
      
      The hard_preempt_*() variants are substituted to the original
      preempt_{enable, disable}() calls in these cases.
      0c4d0889
    • Philippe Gerum's avatar
      genirq: ipipe: protect generic chip against domain preemption · 0a80a322
      Philippe Gerum authored
      As described in Documentation/ipipe.rst, irq_chip drivers need to be
      specifically adapted for dealing with interrupt pipelining safely.
      
      The basic issue to address is proper serialization between some
      irq_chip handlers which may be called from out-of-band context
      immediately upon receipt of an IRQ, and the rest of the driver which
      may access the same data / IO registers from the regular - in-band -
      context on the same CPU.
      
      This commit converts the generic irq_chip lock to a hard spinlock,
      which ensures such serialization.
      0a80a322
    • Philippe Gerum's avatar
      ipipe: add latency tracer · 20700661
      Philippe Gerum authored
      The latency tracer is a variant of ftrace's 'function' tracer
      providing detailed information about the current interrupt state at
      each function entry (i.e. virtual interrupt flag and CPU interrupt
      disable bit). This commit introduces the generic tracer code, which
      builds upon the regular ftrace API.
      
      The arch-specific code should provide for ipipe_read_tsc(), a helper
      routine returning a 64bit monotonic time value for timestamping
      purpose. HAVE_IPIPE_TRACER_SUPPORT should be selected by the
      arch-specific code for enabling the tracer, which in turn makes
      CONFIG_IPIPE_TRACE available from the Kconfig interface.
      20700661
    • Philippe Gerum's avatar
      ipipe: add out-of-band tick device · 88f3dede
      Philippe Gerum authored
      The out-of-band tick device manages the timer hardware by interposing
      on selected clockevent handlers transparently, so that a client domain
      (e.g. a co-kernel) eventually controls such hardware for scheduling
      the high-precision timer events it needs to. Those events are
      delivered to out-of-hand activities running on the head stage,
      unimpeded by (only virtually) interrupt-free sections of the regular
      kernel code.
      
      This commit introduces the generic API for controlling the out-of-band
      tick device from a co-kernel. It also provides for the internal API
      clock event chip drivers should use for enabling high-precision
      timing for their hardware.
      88f3dede
    • Philippe Gerum's avatar
      locking: ipipe: add hard lock alternative to regular spinlocks · 7c28f350
      Philippe Gerum authored
      Hard spinlocks manipulate the CPU interrupt mask, without affecting
      the kernel preemption state in locking/unlocking operations.
      
      This type of spinlock is useful for implementing a critical section to
      serialize concurrent accesses from both in-band and out-of-band
      contexts, i.e. from root and head stages.
      
      Hard spinlocks exclusively depend on the pre-existing arch-specific
      bits which implement regular spinlocks. They can be seen as basic
      spinlocks still affecting the CPU's interrupt state when all other
      spinlock types only deal with the virtual interrupt flag managed by
      the pipeline core - i.e. only disable interrupts for the regular
      in-band kernel activity.
      7c28f350
    • Philippe Gerum's avatar
      genirq: add generic I-pipe core · d9f057db
      Philippe Gerum authored
      This commit provides the arch-independent bits for implementing the
      interrupt pipeline core, a lightweight layer introducing a separate,
      high-priority execution stage for handling all IRQs in pseudo-NMI
      mode, which cannot be delayed by the regular kernel code. See
      Documentation/ipipe.rst for details about interrupt pipelining.
      
      Architectures which support interrupt pipelining should select
      HAVE_IPIPE_SUPPORT, along with implementing the required arch-specific
      code. In such a case, CONFIG_IPIPE becomes available to the user via
      the Kconfig interface for enabling the feature.
      d9f057db
  2. 20 Oct, 2018 1 commit
  3. 18 Oct, 2018 9 commits
    • Roman Gushchin's avatar
      mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES · c605894c
      Roman Gushchin authored
      commit eb592546 upstream.
      
      Patch series "indirectly reclaimable memory", v2.
      
      This patchset introduces the concept of indirectly reclaimable memory
      and applies it to fix the issue of when a big number of dentries with
      external names can significantly affect the MemAvailable value.
      
      This patch (of 3):
      
      Introduce a concept of indirectly reclaimable memory and adds the
      corresponding memory counter and /proc/vmstat item.
      
      Indirectly reclaimable memory is any sort of memory, used by the kernel
      (except of reclaimable slabs), which is actually reclaimable, i.e.  will
      be released under memory pressure.
      
      The counter is in bytes, as it's not always possible to count such
      objects in pages.  The name contains BYTES by analogy to
      NR_KERNEL_STACK_KB.
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.comSigned-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c605894c
    • Will Deacon's avatar
      arm64: perf: Reject stand-alone CHAIN events for PMUv3 · 3e6275d9
      Will Deacon authored
      commit ca2b4972 upstream.
      
      It doesn't make sense for a perf event to be configured as a CHAIN event
      in isolation, so extend the arm_pmu structure with a ->filter_match()
      function to allow the backend PMU implementation to reject CHAIN events
      early.
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarSuzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e6275d9
    • Tejun Heo's avatar
      cgroup: Fix dom_cgrp propagation when enabling threaded mode · bc183079
      Tejun Heo authored
      commit 479adb89 upstream.
      
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendant should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
      	  a [dt]
      	      b [t]
      		  c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
      	  a [t]
      	      b [t]
      		  c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the opencoded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that mulitple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatar"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: default avatarAmin Jamali <ajamali@pivotal.io>
      Reported-by: default avatarJoao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc183079
    • Yu Zhao's avatar
      sound: don't call skl_init_chip() to reset intel skl soc · 37ca1cc8
      Yu Zhao authored
      [ Upstream commit 75383f8d ]
      
      Internally, skl_init_chip() calls snd_hdac_bus_init_chip() which
      1) sets bus->chip_init to prevent multiple entrances before device
      is stopped; 2) enables interrupt.
      
      We shouldn't use it for the purpose of resetting device only because
      1) when we really want to initialize device, we won't be able to do
      so; 2) we are ready to handle interrupt yet, and kernel crashes when
      interrupt comes in.
      
      Rename azx_reset() to snd_hdac_bus_reset_link(), and use it to reset
      device properly.
      
      Fixes: 60767abc ("ASoC: Intel: Skylake: Reset the controller in probe")
      Reviewed-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      37ca1cc8
    • Eric Dumazet's avatar
      inet: make sure to grab rcu_read_lock before using ireq->ireq_opt · c5df5813
      Eric Dumazet authored
      [ Upstream commit 2ab2ddd3 ]
      
      Timer handlers do not imply rcu_read_lock(), so my recent fix
      triggered a LOCKDEP warning when SYNACK is retransmit.
      
      Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
      usages instead of guessing what is done by callers, since it is
      not worth the pain.
      
      Get rid of ireq_opt_deref() helper since it hides the logic
      without real benefit, since it is now a standard rcu_dereference().
      
      Fixes: 1ad98e9d ("tcp/dccp: fix lockdep issue when SYN is backlogged")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5df5813
    • Eric Dumazet's avatar
      tcp/dccp: fix lockdep issue when SYN is backlogged · 17af5475
      Eric Dumazet authored
      [ Upstream commit 1ad98e9d ]
      
      In normal SYN processing, packets are handled without listener
      lock and in RCU protected ingress path.
      
      But syzkaller is known to be able to trick us and SYN
      packets might be processed in process context, after being
      queued into socket backlog.
      
      In commit 06f877d6 ("tcp/dccp: fix other lockdep splats
      accessing ireq_opt") I made a very stupid fix, that happened
      to work mostly because of the regular path being RCU protected.
      
      Really the thing protecting ireq->ireq_opt is RCU read lock,
      and the pseudo request refcnt is not relevant.
      
      This patch extends what I did in commit 449809a6 ("tcp/dccp:
      block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
      pair in the paths that might be taken when processing SYN from
      socket backlog (thus possibly in process context)
      
      Fixes: 06f877d6 ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      17af5475
    • Jianfeng Tan's avatar
      net/packet: fix packet drop as of virtio gso · 5150140b
      Jianfeng Tan authored
      [ Upstream commit 9d2f67e4 ]
      
      When we use raw socket as the vhost backend, a packet from virito with
      gso offloading information, cannot be sent out in later validaton at
      xmit path, as we did not set correct skb->protocol which is further used
      for looking up the gso function.
      
      To fix this, we set this field according to virito hdr information.
      
      Fixes: e858fae2 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: default avatarJianfeng Tan <jianfeng.tan@linux.alibaba.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5150140b
    • Sabrina Dubroca's avatar
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · 9b4869cf
      Sabrina Dubroca authored
      [ Upstream commit af7d6cce ]
      
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9b4869cf
    • Mahesh Bandewar's avatar
      bonding: avoid possible dead-lock · 94402f23
      Mahesh Bandewar authored
      [ Upstream commit d4859d74 ]
      
      Syzkaller reported this on a slightly older kernel but it's still
      applicable to the current kernel -
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.18.0-next-20180823+ #46 Not tainted
      ------------------------------------------------------
      syz-executor4/26841 is trying to acquire lock:
      00000000dd41ef48 ((wq_completion)bond_dev->name){+.+.}, at: flush_workqueue+0x2db/0x1e10 kernel/workqueue.c:2652
      
      but task is already holding lock:
      00000000768ab431 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:77 [inline]
      00000000768ab431 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x412/0xc30 net/core/rtnetlink.c:4708
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (rtnl_mutex){+.+.}:
             __mutex_lock_common kernel/locking/mutex.c:925 [inline]
             __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
             mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
             rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77
             bond_netdev_notify drivers/net/bonding/bond_main.c:1310 [inline]
             bond_netdev_notify_work+0x44/0xd0 drivers/net/bonding/bond_main.c:1320
             process_one_work+0xc73/0x1aa0 kernel/workqueue.c:2153
             worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
             kthread+0x35a/0x420 kernel/kthread.c:246
             ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415
      
      -> #1 ((work_completion)(&(&nnw->work)->work)){+.+.}:
             process_one_work+0xc0b/0x1aa0 kernel/workqueue.c:2129
             worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
             kthread+0x35a/0x420 kernel/kthread.c:246
             ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415
      
      -> #0 ((wq_completion)bond_dev->name){+.+.}:
             lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
             flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
             drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
             destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
             __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
             bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
             register_netdevice+0x337/0x1100 net/core/dev.c:8410
             bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
             rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
             rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
             netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
             rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
             netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
             netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
             netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
             sock_sendmsg_nosec net/socket.c:622 [inline]
             sock_sendmsg+0xd5/0x120 net/socket.c:632
             ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
             __sys_sendmsg+0x11d/0x290 net/socket.c:2153
             __do_sys_sendmsg net/socket.c:2162 [inline]
             __se_sys_sendmsg net/socket.c:2160 [inline]
             __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
             do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)bond_dev->name --> (work_completion)(&(&nnw->work)->work) --> rtnl_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(rtnl_mutex);
                                     lock((work_completion)(&(&nnw->work)->work));
                                     lock(rtnl_mutex);
        lock((wq_completion)bond_dev->name);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor4/26841:
      
      stack backtrace:
      CPU: 1 PID: 26841 Comm: syz-executor4 Not tainted 4.18.0-next-20180823+ #46
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       print_circular_bug.isra.34.cold.55+0x1bd/0x27d kernel/locking/lockdep.c:1222
       check_prev_add kernel/locking/lockdep.c:1862 [inline]
       check_prevs_add kernel/locking/lockdep.c:1975 [inline]
       validate_chain kernel/locking/lockdep.c:2416 [inline]
       __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3412
       lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
       flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
       drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
       destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
       __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
       bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
       register_netdevice+0x337/0x1100 net/core/dev.c:8410
       bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
       rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
       rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
       netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
       sock_sendmsg_nosec net/socket.c:622 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:632
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
       __sys_sendmsg+0x11d/0x290 net/socket.c:2153
       __do_sys_sendmsg net/socket.c:2162 [inline]
       __se_sys_sendmsg net/socket.c:2160 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457089
      Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f2df20a5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f2df20a66d4 RCX: 0000000000457089
      RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
      RBP: 0000000000930140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000004d40b8 R14: 00000000004c8ad8 R15: 0000000000000001
      Signed-off-by: default avatarMahesh Bandewar <maheshb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      94402f23
  4. 13 Oct, 2018 2 commits
    • Michael S. Tsirkin's avatar
      virtio_balloon: fix deadlock on OOM · 7f42eada
      Michael S. Tsirkin authored
      commit c7cdff0e upstream.
      
      fill_balloon doing memory allocations under balloon_lock
      can cause a deadlock when leak_balloon is called from
      virtballoon_oom_notify and tries to take same lock.
      
      To fix, split page allocation and enqueue and do allocations outside the lock.
      
      Here's a detailed analysis of the deadlock by Tetsuo Handa:
      
      In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to
      serialize against fill_balloon(). But in fill_balloon(),
      alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is
      called with vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE]
      implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY
      is specified, this allocation attempt might indirectly depend on somebody
      else's __GFP_DIRECT_RECLAIM memory allocation. And such indirect
      __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via
      virtballoon_oom_notify() via blocking_notifier_call_chain() callback via
      out_of_memory() when it reached __alloc_pages_may_oom() and held oom_lock
      mutex. Since vb->balloon_lock mutex is already held by fill_balloon(), it
      will cause OOM lockup.
      
        Thread1                                       Thread2
          fill_balloon()
            takes a balloon_lock
            balloon_page_enqueue()
              alloc_page(GFP_HIGHUSER_MOVABLE)
                direct reclaim (__GFP_FS context)       takes a fs lock
                  waits for that fs lock                  alloc_page(GFP_NOFS)
                                                            __alloc_pages_may_oom()
                                                              takes the oom_lock
                                                              out_of_memory()
                                                                blocking_notifier_call_chain()
                                                                  leak_balloon()
                                                                    tries to take that balloon_lock and deadlocks
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarSudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f42eada
    • Mike Kravetz's avatar
      mm: migration: fix migration of huge PMD shared pages · 5f4f5b1f
      Mike Kravetz authored
      commit 017b1660 upstream.
      
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
      This problem can result is data corruption as writes to the original
      source page can happen after contents of the page are copied to the target
      page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f4f5b1f
  5. 04 Oct, 2018 2 commits
    • Sakari Ailus's avatar
      media: v4l: event: Prevent freeing event subscriptions while accessed · d61ba341
      Sakari Ailus authored
      commit ad608fbc upstream.
      
      The event subscriptions are added to the subscribed event list while
      holding a spinlock, but that lock is subsequently released while still
      accessing the subscription object. This makes it possible to unsubscribe
      the event --- and freeing the subscription object's memory --- while
      the subscription object is simultaneously accessed.
      
      Prevent this by adding a mutex to serialise the event subscription and
      unsubscription. This also gives a guarantee to the callback ops that the
      add op has returned before the del op is called.
      
      This change also results in making the elems field less special:
      subscriptions are only added to the event list once they are fully
      initialised.
      Signed-off-by: default avatarSakari Ailus <sakari.ailus@linux.intel.com>
      Reviewed-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: default avatarLaurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: stable@vger.kernel.org # for 4.14 and up
      Fixes: c3b5b024 ("V4L/DVB: V4L: Events: Add backend")
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d61ba341
    • Marc Zyngier's avatar
      arm/arm64: smccc-1.1: Handle function result as parameters · 647b6d4f
      Marc Zyngier authored
      [ Upstream commit 755a8bf5 ]
      
      If someone has the silly idea to write something along those lines:
      
      	extern u64 foo(void);
      
      	void bar(struct arm_smccc_res *res)
      	{
      		arm_smccc_1_1_smc(0xbad, foo(), res);
      	}
      
      they are in for a surprise, as this gets compiled as:
      
      	0000000000000588 <bar>:
      	 588:   a9be7bfd        stp     x29, x30, [sp, #-32]!
      	 58c:   910003fd        mov     x29, sp
      	 590:   f9000bf3        str     x19, [sp, #16]
      	 594:   aa0003f3        mov     x19, x0
      	 598:   aa1e03e0        mov     x0, x30
      	 59c:   94000000        bl      0 <_mcount>
      	 5a0:   94000000        bl      0 <foo>
      	 5a4:   aa0003e1        mov     x1, x0
      	 5a8:   d4000003        smc     #0x0
      	 5ac:   b4000073        cbz     x19, 5b8 <bar+0x30>
      	 5b0:   a9000660        stp     x0, x1, [x19]
      	 5b4:   a9010e62        stp     x2, x3, [x19, #16]
      	 5b8:   f9400bf3        ldr     x19, [sp, #16]
      	 5bc:   a8c27bfd        ldp     x29, x30, [sp], #32
      	 5c0:   d65f03c0        ret
      	 5c4:   d503201f        nop
      
      The call to foo "overwrites" the x0 register for the return value,
      and we end up calling the wrong secure service.
      
      A solution is to evaluate all the parameters before assigning
      anything to specific registers, leading to the expected result:
      
      	0000000000000588 <bar>:
      	 588:   a9be7bfd        stp     x29, x30, [sp, #-32]!
      	 58c:   910003fd        mov     x29, sp
      	 590:   f9000bf3        str     x19, [sp, #16]
      	 594:   aa0003f3        mov     x19, x0
      	 598:   aa1e03e0        mov     x0, x30
      	 59c:   94000000        bl      0 <_mcount>
      	 5a0:   94000000        bl      0 <foo>
      	 5a4:   aa0003e1        mov     x1, x0
      	 5a8:   d28175a0        mov     x0, #0xbad
      	 5ac:   d4000003        smc     #0x0
      	 5b0:   b4000073        cbz     x19, 5bc <bar+0x34>
      	 5b4:   a9000660        stp     x0, x1, [x19]
      	 5b8:   a9010e62        stp     x2, x3, [x19, #16]
      	 5bc:   f9400bf3        ldr     x19, [sp, #16]
      	 5c0:   a8c27bfd        ldp     x29, x30, [sp], #32
      	 5c4:   d65f03c0        ret
      Reported-by: default avatarJulien Grall <julien.grall@arm.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      647b6d4f