1. 21 Jul, 2019 5 commits
  2. 03 Jul, 2019 3 commits
    • Will Deacon's avatar
      futex: Update comments and docs about return values of arch futex code · b5800872
      Will Deacon authored
      commit 427503519739e779c0db8afe876c1b33f3ac60ae upstream.
      The architecture implementations of 'arch_futex_atomic_op_inuser()' and
      'futex_atomic_cmpxchg_inatomic()' are permitted to return only -EFAULT,
      -EAGAIN or -ENOSYS in the case of failure.
      Update the comments in the asm-generic/ implementation and also a stray
      reference in the robust futex documentation.
      Signed-off-by: 's avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Sasha Levin's avatar
      Revert "compiler.h: update definition of unreachable()" · c6975e41
      Sasha Levin authored
      This reverts commit 82017e26, which is
      upstream commit fe0640eb30b7da261ae84d252ed9ed3c7e68dfd8.
      On Fri, Jun 28, 2019 at 8:53 AM Tony Battersby <tonyb@cybernetics.com> wrote:
      > Old versions of gcc cannot compile 4.14 since 4.14.113:
      > ./include/asm-generic/fixmap.h:37: error: implicit declaration of function ‘__builtin_unreachable’
      > The stable commit that caused the problem is 82017e26 ("compiler.h:
      > update definition of unreachable()") (upstream commit fe0640eb30b7).
      > Reverting the commit fixes the problem.
      > Kernel 4.17 dropped support for older versions of gcc in upstream commit
      > cafa0010 ("Raise the minimum required gcc version to 4.6").  This
      > was not backported to 4.14 since that would go against the stable kernel
      > rules.
      > Upstream commit 815f0ddb ("include/linux/compiler*.h: make
      > compiler-*.h mutually exclusive") was a fix for cafa0010.  This was
      > not backported to 4.14.
      > Upstream commit fe0640eb30b7 ("compiler.h: update definition of
      > unreachable()") was a fix for 815f0ddb.  This is the commit that was
      > backported to 4.14.  But it only fixed a problem introduced in the other
      > commits, and without those commits, it ends up introducing a problem
      > instead of fixing one.  So I recommend reverting that patch in 4.14,
      > which will enable old gcc to compile 4.14 again.  If I understand
      > correctly, I believe that clang will still be able to compile 4.14 with
      > the patch reverted, although I haven't tried to compile with clang.
      > The problematic commit is not present in 4.9.x, 4.4.x, 3.18.x, or 3.16.x.
      CC: Nick Desaulniers <ndesaulniers@google.com>
      CC: Tony Battersby <tonyb@cybernetics.com>,
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
    • Christoph Hellwig's avatar
      block: add a lower-level bio_add_page interface · 515e2f3e
      Christoph Hellwig authored
      [ Upstream commit 0aa69fd3 ]
      For the upcoming removal of buffer heads in XFS we need to keep track of
      the number of outstanding writeback requests per page.  For this we need
      to know if bio_add_page merged a region with the previous bvec or not.
      Instead of adding additional arguments this refactors bio_add_page to
      be implemented using three lower level helpers which users like XFS can
      use directly if they care about the merge decisions.
      Signed-off-by: 's avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: 's avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: 's avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: 's avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: 's avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
  3. 25 Jun, 2019 1 commit
  4. 22 Jun, 2019 1 commit
    • Andrea Arcangeli's avatar
      coredump: fix race condition between collapse_huge_page() and core dumping · dc30d2ce
      Andrea Arcangeli authored
      commit 59ea6d06cfa9247b586a695c21f94afa7183af74 upstream.
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: 's avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 19 Jun, 2019 2 commits
    • Borislav Petkov's avatar
      x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback · 9e701842
      Borislav Petkov authored
      commit 78f4e932f7760d965fb1569025d1576ab77557c5 upstream.
      Adric Blake reported the following warning during suspend-resume:
        Enabling non-boot CPUs ...
        x86: Booting SMP configuration:
        smpboot: Booting Node 0 Processor 1 APIC 0x2
        unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
         at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
        Call Trace:
         ? x86_pmu_dead_cpu
         ? _raw_spin_lock_irqsave
        microcode: sig=0x806ea, pf=0x80, revision=0x96
        microcode: updated to revision 0xb4, date = 2019-04-01
        CPU1 is up
      The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
      by microcode. The log above shows that the microcode loader callback
      happens after the PMU restoration, leading to the conjecture that
      because the microcode hasn't been updated yet, that MSR is not present
      yet, leading to the #GP.
      Add a microcode loader-specific hotplug vector which comes before
      the PERF vectors and thus executes earlier and makes sure the MSR is
      Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
      Reported-by: 's avatarAdric Blake <promarbler14@gmail.com>
      Signed-off-by: 's avatarBorislav Petkov <bp@suse.de>
      Reviewed-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: x86@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tejun Heo's avatar
      cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() · 0b1c102e
      Tejun Heo authored
      commit 18fa84a2db0e15b02baa5d94bdb5bd509175d2f6 upstream.
      A PF_EXITING task can stay associated with an offline css.  If such
      task calls task_get_css(), it can get stuck indefinitely.  This can be
      triggered by BSD process accounting which writes to a file with
      PF_EXITING set when racing against memcg disable as in the backtrace
      at the end.
      After this change, task_get_css() may return a css which was already
      offline when the function was called.  None of the existing users are
      affected by this change.
        INFO: rcu_sched self-detected stall on CPU
        INFO: rcu_sched detected stalls on CPUs/tasks:
        NMI backtrace for cpu 0
        Call Trace:
        RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.2+
      Fixes: ec438699 ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 17 Jun, 2019 3 commits
    • Eric Dumazet's avatar
      tcp: add tcp_min_snd_mss sysctl · cd6f35b8
      Eric Dumazet authored
      commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream.
      Some TCP peers announce a very small MSS option in their SYN and/or
      SYN/ACK messages.
      This forces the stack to send packets with a very high network/cpu
      Linux has enforced a minimal value of 48. Since this value includes
      the size of TCP options, and that the options can consume up to 40
      bytes, this means that each segment can include only 8 bytes of payload.
      In some cases, it can be useful to increase the minimal value
      to a saner value.
      We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
      Note that TCP_MAXSEG socket option enforces a minimal value
      of (TCP_MIN_MSS). David Miller increased this minimal value
      in commit c39508d6 ("tcp: Make TCP_MAXSEG minimum more correct.")
      from 64 to 88.
      We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
      CVE-2019-11479 -- tcp mss hardcoded to 48
      Signed-off-by: 's avatarEric Dumazet <edumazet@google.com>
      Suggested-by: 's avatarJonathan Looney <jtl@netflix.com>
      Acked-by: 's avatarNeal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric Dumazet's avatar
      tcp: tcp_fragment() should apply sane memory limits · 9daf226f
      Eric Dumazet authored
      commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream.
      Jonathan Looney reported that a malicious peer can force a sender
      to fragment its retransmit queue into tiny skbs, inflating memory
      usage and/or overflow 32bit counters.
      TCP allows an application to queue up to sk_sndbuf bytes,
      so we need to give some allowance for non malicious splitting
      of retransmit queue.
      A new SNMP counter is added to monitor how many times TCP
      did not allow to split an skb if the allowance was exceeded.
      Note that this counter might increase in the case applications
      use SO_SNDBUF socket option to lower sk_sndbuf.
      CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
      	socket is already using more than half the allowed space
      Signed-off-by: 's avatarEric Dumazet <edumazet@google.com>
      Reported-by: 's avatarJonathan Looney <jtl@netflix.com>
      Acked-by: 's avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: 's avatarYuchung Cheng <ycheng@google.com>
      Reviewed-by: 's avatarTyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric Dumazet's avatar
      tcp: limit payload size of sacked skbs · d6329205
      Eric Dumazet authored
      commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      Backport notes, provided by Joao Martins <joao.m.martins@oracle.com>
      v4.15 or since commit 737ff314 ("tcp: use sequence distance to
      detect reordering") had switched from the packet-based FACK tracking and
      switched to sequence-based.
      v4.14 and older still have the old logic and hence on
      tcp_skb_shift_data() needs to retain its original logic and have
      @fack_count in sync. In other words, we keep the increment of pcount with
      tcp_skb_pcount(skb) to later used that to update fack_count. To make it
      more explicit we track the new skb that gets incremented to pcount in
      @next_pcount, and we get to avoid the constant invocation of
      tcp_skb_pcount(skb) all together.
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: 's avatarEric Dumazet <edumazet@google.com>
      Reported-by: 's avatarJonathan Looney <jtl@netflix.com>
      Acked-by: 's avatarNeal Cardwell <ncardwell@google.com>
      Reviewed-by: 's avatarTyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 15 Jun, 2019 3 commits
    • Helen Koike's avatar
      drm: don't block fb changes for async plane updates · ff8386d9
      Helen Koike authored
      commit 89a4aac0ab0e6f5eea10d7bf4869dd15c3de2cd4 upstream.
      In the case of a normal sync update, the preparation of framebuffers (be
      it calling drm_atomic_helper_prepare_planes() or doing setups with
      drm_framebuffer_get()) are performed in the new_state and the respective
      cleanups are performed in the old_state.
      In the case of async updates, the preparation is also done in the
      new_state but the cleanups are done in the new_state (because updates
      are performed in place, i.e. in the current state).
      The current code blocks async udpates when the fb is changed, turning
      async updates into sync updates, slowing down cursor updates and
      introducing regressions in igt tests with errors of type:
      "CRITICAL: completed 97 cursor updated in a period of 30 flips, we
      expect to complete approximately 15360 updates, with the threshold set
      at 7680"
      Fb changes in async updates were prevented to avoid the following scenario:
      - Async update, oldfb = NULL, newfb = fb1, prepare fb1, cleanup fb1
      - Async update, oldfb = fb1, newfb = fb2, prepare fb2, cleanup fb2
      - Non-async commit, oldfb = fb2, newfb = fb1, prepare fb1, cleanup fb2 (wrong)
      Where we have a single call to prepare fb2 but double cleanup call to fb2.
      To solve the above problems, instead of blocking async fb changes, we
      place the old framebuffer in the new_state object, so when the code
      performs cleanups in the new_state it will cleanup the old_fb and we
      will have the following scenario instead:
      - Async update, oldfb = NULL, newfb = fb1, prepare fb1, no cleanup
      - Async update, oldfb = fb1, newfb = fb2, prepare fb2, cleanup fb1
      - Non-async commit, oldfb = fb2, newfb = fb1, prepare fb1, cleanup fb2
      Where calls to prepare/cleanup are balanced.
      Cc: <stable@vger.kernel.org> # v4.14+
      Fixes: 25dc194b34dd ("drm: Block fb changes for async plane updates")
      Suggested-by: 's avatarBoris Brezillon <boris.brezillon@collabora.com>
      Signed-off-by: 's avatarHelen Koike <helen.koike@collabora.com>
      Reviewed-by: 's avatarBoris Brezillon <boris.brezillon@collabora.com>
      Reviewed-by: 's avatarNicholas Kazlauskas <nicholas.kazlauskas@amd.com>
      Signed-off-by: 's avatarBoris Brezillon <boris.brezillon@collabora.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190603165610.24614-6-helen.koike@collabora.comSigned-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Greg Kroah-Hartman's avatar
      Revert "Bluetooth: Align minimum encryption key size for LE and BR/EDR connections" · e9d38b08
      Greg Kroah-Hartman authored
      This reverts commit 2fa7a155 which is
      commit d5bb334a8e171b262e48f378bd2096c0ea458265 upstream.
      Lots of people have reported issues with this patch, and as there does
      not seem to be a fix going into Linus's kernel tree any time soon,
      revert the commit in the stable trees so as to get people's machines
      working properly again.
      Reported-by: 's avatarVasily Khoruzhick <anarsoul@gmail.com>
      Reported-by: 's avatarHans de Goede <hdegoede@redhat.com>
      Cc: Jeremy Cline <jeremy@jcline.org>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Cc: Johan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Phong Hoang's avatar
      pwm: Fix deadlock warning when removing PWM device · 83b6d80c
      Phong Hoang authored
      [ Upstream commit 347ab9480313737c0f1aaa08e8f2e1a791235535 ]
      This patch fixes deadlock warning if removing PWM device
      when CONFIG_PROVE_LOCKING is enabled.
      This issue can be reproceduced by the following steps on
      the R-Car H3 Salvator-X board if the backlight is disabled:
       # cd /sys/class/pwm/pwmchip0
       # echo 0 > export
       # ls
       device  export  npwm  power  pwm0  subsystem  uevent  unexport
       # cd device/driver
       # ls
       bind  e6e31000.pwm  uevent  unbind
       # echo e6e31000.pwm > unbind
      [   87.659974] ======================================================
      [   87.666149] WARNING: possible circular locking dependency detected
      [   87.672327] 5.0.0 #7 Not tainted
      [   87.675549] ------------------------------------------------------
      [   87.681723] bash/2986 is trying to acquire lock:
      [   87.686337] 000000005ea0e178 (kn->count#58){++++}, at: kernfs_remove_by_name_ns+0x50/0xa0
      [   87.694528]
      [   87.694528] but task is already holding lock:
      [   87.700353] 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c
      [   87.707405]
      [   87.707405] which lock already depends on the new lock.
      [   87.707405]
      [   87.715574]
      [   87.715574] the existing dependency chain (in reverse order) is:
      [   87.723048]
      [   87.723048] -> #1 (pwm_lock){+.+.}:
      [   87.728017]        __mutex_lock+0x70/0x7e4
      [   87.732108]        mutex_lock_nested+0x1c/0x24
      [   87.736547]        pwm_request_from_chip.part.6+0x34/0x74
      [   87.741940]        pwm_request_from_chip+0x20/0x40
      [   87.746725]        export_store+0x6c/0x1f4
      [   87.750820]        dev_attr_store+0x18/0x28
      [   87.754998]        sysfs_kf_write+0x54/0x64
      [   87.759175]        kernfs_fop_write+0xe4/0x1e8
      [   87.763615]        __vfs_write+0x40/0x184
      [   87.767619]        vfs_write+0xa8/0x19c
      [   87.771448]        ksys_write+0x58/0xbc
      [   87.775278]        __arm64_sys_write+0x18/0x20
      [   87.779721]        el0_svc_common+0xd0/0x124
      [   87.783986]        el0_svc_compat_handler+0x1c/0x24
      [   87.788858]        el0_svc_compat+0x8/0x18
      [   87.792947]
      [   87.792947] -> #0 (kn->count#58){++++}:
      [   87.798260]        lock_acquire+0xc4/0x22c
      [   87.802353]        __kernfs_remove+0x258/0x2c4
      [   87.806790]        kernfs_remove_by_name_ns+0x50/0xa0
      [   87.811836]        remove_files.isra.1+0x38/0x78
      [   87.816447]        sysfs_remove_group+0x48/0x98
      [   87.820971]        sysfs_remove_groups+0x34/0x4c
      [   87.825583]        device_remove_attrs+0x6c/0x7c
      [   87.830197]        device_del+0x11c/0x33c
      [   87.834201]        device_unregister+0x14/0x2c
      [   87.838638]        pwmchip_sysfs_unexport+0x40/0x4c
      [   87.843509]        pwmchip_remove+0xf4/0x13c
      [   87.847773]        rcar_pwm_remove+0x28/0x34
      [   87.852039]        platform_drv_remove+0x24/0x64
      [   87.856651]        device_release_driver_internal+0x18c/0x21c
      [   87.862391]        device_release_driver+0x14/0x1c
      [   87.867175]        unbind_store+0xe0/0x124
      [   87.871265]        drv_attr_store+0x20/0x30
      [   87.875442]        sysfs_kf_write+0x54/0x64
      [   87.879618]        kernfs_fop_write+0xe4/0x1e8
      [   87.884055]        __vfs_write+0x40/0x184
      [   87.888057]        vfs_write+0xa8/0x19c
      [   87.891887]        ksys_write+0x58/0xbc
      [   87.895716]        __arm64_sys_write+0x18/0x20
      [   87.900154]        el0_svc_common+0xd0/0x124
      [   87.904417]        el0_svc_compat_handler+0x1c/0x24
      [   87.909289]        el0_svc_compat+0x8/0x18
      [   87.913378]
      [   87.913378] other info that might help us debug this:
      [   87.913378]
      [   87.921374]  Possible unsafe locking scenario:
      [   87.921374]
      [   87.927286]        CPU0                    CPU1
      [   87.931808]        ----                    ----
      [   87.936331]   lock(pwm_lock);
      [   87.939293]                                lock(kn->count#58);
      [   87.945120]                                lock(pwm_lock);
      [   87.950599]   lock(kn->count#58);
      [   87.953908]
      [   87.953908]  *** DEADLOCK ***
      [   87.953908]
      [   87.959821] 4 locks held by bash/2986:
      [   87.963563]  #0: 00000000ace7bc30 (sb_writers#6){.+.+}, at: vfs_write+0x188/0x19c
      [   87.971044]  #1: 00000000287991b2 (&of->mutex){+.+.}, at: kernfs_fop_write+0xb4/0x1e8
      [   87.978872]  #2: 00000000f739d016 (&dev->mutex){....}, at: device_release_driver_internal+0x40/0x21c
      [   87.988001]  #3: 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c
      [   87.995481]
      [   87.995481] stack backtrace:
      [   87.999836] CPU: 0 PID: 2986 Comm: bash Not tainted 5.0.0 #7
      [   88.005489] Hardware name: Renesas Salvator-X board based on r8a7795 ES1.x (DT)
      [   88.012791] Call trace:
      [   88.015235]  dump_backtrace+0x0/0x190
      [   88.018891]  show_stack+0x14/0x1c
      [   88.022204]  dump_stack+0xb0/0xec
      [   88.025514]  print_circular_bug.isra.32+0x1d0/0x2e0
      [   88.030385]  __lock_acquire+0x1318/0x1864
      [   88.034388]  lock_acquire+0xc4/0x22c
      [   88.037958]  __kernfs_remove+0x258/0x2c4
      [   88.041874]  kernfs_remove_by_name_ns+0x50/0xa0
      [   88.046398]  remove_files.isra.1+0x38/0x78
      [   88.050487]  sysfs_remove_group+0x48/0x98
      [   88.054490]  sysfs_remove_groups+0x34/0x4c
      [   88.058580]  device_remove_attrs+0x6c/0x7c
      [   88.062671]  device_del+0x11c/0x33c
      [   88.066154]  device_unregister+0x14/0x2c
      [   88.070070]  pwmchip_sysfs_unexport+0x40/0x4c
      [   88.074421]  pwmchip_remove+0xf4/0x13c
      [   88.078163]  rcar_pwm_remove+0x28/0x34
      [   88.081906]  platform_drv_remove+0x24/0x64
      [   88.085996]  device_release_driver_internal+0x18c/0x21c
      [   88.091215]  device_release_driver+0x14/0x1c
      [   88.095478]  unbind_store+0xe0/0x124
      [   88.099048]  drv_attr_store+0x20/0x30
      [   88.102704]  sysfs_kf_write+0x54/0x64
      [   88.106359]  kernfs_fop_write+0xe4/0x1e8
      [   88.110275]  __vfs_write+0x40/0x184
      [   88.113757]  vfs_write+0xa8/0x19c
      [   88.117065]  ksys_write+0x58/0xbc
      [   88.120374]  __arm64_sys_write+0x18/0x20
      [   88.124291]  el0_svc_common+0xd0/0x124
      [   88.128034]  el0_svc_compat_handler+0x1c/0x24
      [   88.132384]  el0_svc_compat+0x8/0x18
      The sysfs unexport in pwmchip_remove() is completely asymmetric
      to what we do in pwmchip_add_with_polarity() and commit 0733424c
      ("pwm: Unexport children before chip removal") is a strong indication
      that this was wrong to begin with. We should just move
      pwmchip_sysfs_unexport() where it belongs, which is right after
      pwmchip_sysfs_unexport_children(). In that case, we do not need
      separate functions anymore either.
      We also really want to remove sysfs irrespective of whether or not
      the chip will be removed as a result of pwmchip_remove(). We can only
      assume that the driver will be gone after that, so we shouldn't leave
      any dangling sysfs files around.
      This warning disappears if we move pwmchip_sysfs_unexport() to
      the top of pwmchip_remove(), pwmchip_sysfs_unexport_children().
      That way it is also outside of the pwm_lock section, which indeed
      doesn't seem to be needed.
      Moving the pwmchip_sysfs_export() call outside of that section also
      seems fine and it'd be perfectly symmetric with pwmchip_remove() again.
      So, this patch fixes them.
      Signed-off-by: 's avatarPhong Hoang <phong.hoang.wz@renesas.com>
      [shimoda: revise the commit log and code]
      Fixes: 76abbdde ("pwm: Add sysfs interface")
      Fixes: 0733424c ("pwm: Unexport children before chip removal")
      Signed-off-by: 's avatarYoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Tested-by: 's avatarHoan Nguyen An <na-hoan@jinso.co.jp>
      Reviewed-by: 's avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: 's avatarSimon Horman <horms+renesas@verge.net.au>
      Reviewed-by: 's avatarUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: 's avatarThierry Reding <thierry.reding@gmail.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
  8. 11 Jun, 2019 8 commits
    • David Ahern's avatar
      ipv4: Define __ipv4_neigh_lookup_noref when CONFIG_INET is disabled · badd8e31
      David Ahern authored
      commit 9b3040a6aafd7898ece7fc7efcbca71e42aa8069 upstream.
      Define __ipv4_neigh_lookup_noref to return NULL when CONFIG_INET is disabled.
      Fixes: 4b2a2bfeb3f0 ("neighbor: Call __ipv4_neigh_lookup_noref in neigh_xmit")
      Reported-by: 's avatarkbuild test robot <lkp@intel.com>
      Signed-off-by: 's avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Cc: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kirill Smelkov's avatar
      fuse: Add FOPEN_STREAM to use stream_open() · 585724f8
      Kirill Smelkov authored
      commit bbd84f33652f852ce5992d65db4d020aba21f882 upstream.
      Starting from commit 9c225f26 ("vfs: atomic f_pos accesses as per
      POSIX") files opened even via nonseekable_open gate read and write via lock
      and do not allow them to be run simultaneously. This can create read vs
      write deadlock if a filesystem is trying to implement a socket-like file
      which is intended to be simultaneously used for both read and write from
      filesystem client.  See commit 10dce8af3422 ("fs: stream_open - opener for
      stream-like files so that read and write can run simultaneously without
      deadlock") for details and e.g. commit 581d21a2 ("xenbus: fix deadlock
      on writes to /proc/xen/xenbus") for a similar deadlock example on
      To avoid such deadlock it was tempting to adjust fuse_finish_open to use
      stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
      but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
      and in particular GVFS which actually uses offset in its read and write
      so if we would do such a change it will break a real user.
      Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the
      opened handler is having stream-like semantics; does not use file position
      and thus the kernel is free to issue simultaneous read and write request on
      opened file handle.
      This patch together with stream_open() should be added to stable kernels
      starting from v3.14+. This will allow to patch OSSPD and other FUSE
      filesystems that provide stream-like files to return FOPEN_STREAM |
      FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all
      kernel versions. This should work because fuse_finish_open ignores unknown
      open flags returned from a filesystem and so passing FOPEN_STREAM to a
      kernel that is not aware of this flag cannot hurt. In turn the kernel that
      is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE
      is sufficient to implement streams without read vs write deadlock.
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: 's avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: 's avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · b673f99c
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      commit 10dce8af34226d90fa56746a934f8da5dcdba3df upstream.
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      for historic context.
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      Let's fix this regression. The plan is:
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
         so if we would do such a change it will break a real user.
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: 's avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Chris Wilson's avatar
      drm/i915: Fix I915_EXEC_RING_MASK · 66bc1b25
      Chris Wilson authored
      commit d90c06d57027203f73021bb7ddb30b800d65c636 upstream.
      This was supposed to be a mask of all known rings, but it is being used
      by execbuffer to filter out invalid rings, and so is instead mapping high
      unused values onto valid rings. Instead of a mask of all known rings,
      we need it to be the mask of all possible rings.
      Fixes: 549f7365 ("drm/i915: Enable SandyBridge blitter ring")
      Fixes: de1add36 ("drm/i915: Decouple execbuf uAPI from internal implementation")
      Signed-off-by: 's avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: <stable@vger.kernel.org> # v4.6+
      Reviewed-by: 's avatarTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190301140404.26690-21-chris@chris-wilson.co.ukSigned-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jiri Kosina's avatar
      x86/power: Fix 'nosmt' vs hibernation triple fault during resume · b9284404
      Jiri Kosina authored
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      As explained in
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      That means that whenever 'nosmt' is supplied on the kernel command-line,
      all the HT siblings are as a result sitting in mwait or cpudile after
      going through the online-offline cycle at least once.
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      That results in triple fault shortly after cr3 is switched, and machine
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: 's avatarJiri Kosina <jkosina@suse.cz>
      Acked-by: 's avatarPavel Machek <pavel@ucw.cz>
      Reviewed-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: 's avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: 's avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kees Cook's avatar
      pstore: Convert buf_lock to semaphore · f72ecfe9
      Kees Cook authored
      commit ea84b580b95521644429cc6748b6c2bf27c8b0f3 upstream.
      Instead of running with interrupts disabled, use a semaphore. This should
      make it easier for backends that may need to sleep (e.g. EFI) when
      performing a write:
      |BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
      |in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
      |Preemption disabled at:
      |[<ffffffff99d60512>] pstore_dump+0x72/0x330
      |CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G      D           4.20.0-rc3 #45
      |Call Trace:
      | dump_stack+0x4f/0x6a
      | ___might_sleep.cold.91+0xd3/0xe4
      | __might_sleep+0x50/0x90
      | wait_for_completion+0x32/0x130
      | virt_efi_query_variable_info+0x14e/0x160
      | efi_query_variable_store+0x51/0x1a0
      | efivar_entry_set_safe+0xa3/0x1b0
      | efi_pstore_write+0x109/0x140
      | pstore_dump+0x11c/0x330
      | kmsg_dump+0xa4/0xd0
      | oops_exit+0x22/0x30
      Reported-by: 's avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Fixes: 21b3ddd3 ("efi: Don't use spinlocks for efi vars")
      Signed-off-by: 's avatarKees Cook <keescook@chromium.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Linus Torvalds's avatar
      rcu: locking and unlocking need to always be at least barriers · cfa2e34f
      Linus Torvalds authored
      commit 66be4e66a7f422128748e3c3ef6ee72b20a6197b upstream.
      Herbert Xu pointed out that commit bb73c52b ("rcu: Don't disable
      preemption for Tiny and Tree RCU readers") was incorrect in making the
      preempt_disable/enable() be conditional on CONFIG_PREEMPT_COUNT.
      If CONFIG_PREEMPT_COUNT isn't enabled, the preemption enable/disable is
      a no-op, but still is a compiler barrier.
      And RCU locking still _needs_ that compiler barrier.
      It is simply fundamentally not true that RCU locking would be a complete
      no-op: we still need to guarantee (for example) that things that can
      trap and cause preemption cannot migrate into the RCU locked region.
      The way we do that is by making it a barrier.
      See for example commit 386afc91 ("spinlocks and preemption points
      need to be at least compiler barriers") from back in 2013 that had
      similar issues with spinlocks that become no-ops on UP: they must still
      constrain the compiler from moving other operations into the critical
      Now, it is true that a lot of RCU operations already use READ_ONCE() and
      WRITE_ONCE() (which in practice likely would never be re-ordered wrt
      anything remotely interesting), but it is also true that that is not
      globally the case, and that it's not even necessarily always possible
      (ie bitfields etc).
      Reported-by: 's avatarHerbert Xu <herbert@gondor.apana.org.au>
      Fixes: bb73c52b ("rcu: Don't disable preemption for Tiny and Tree RCU readers")
      Cc: stable@kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Xin Long's avatar
      ipv6: fix the check before getting the cookie in rt6_get_cookie · 5a30ca96
      Xin Long authored
      [ Upstream commit b7999b07726c16974ba9ca3bb9fe98ecbec5f81c ]
      In Jianlin's testing, netperf was broken with 'Connection reset by peer',
      as the cookie check failed in rt6_check() and ip6_dst_check() always
      returned NULL.
      It's caused by Commit 93531c67 ("net/ipv6: separate handling of FIB
      entries from dst based routes"), where the cookie can be got only when
      'c1'(see below) for setting dst_cookie whereas rt6_check() is called
      when !'c1' for checking dst_cookie, as we can see in ip6_dst_check().
      Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check()
      (!c1) will check the 'from' cookie, this patch is to remove the c1 check
      in rt6_get_cookie(), so that the dst_cookie can always be set properly.
        (rt->rt6i_flags & RTF_PCPU || unlikely(!list_empty(&rt->rt6i_uncached)))
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: 's avatarJianlin Shi <jishi@redhat.com>
      Signed-off-by: 's avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 09 Jun, 2019 6 commits
    • Miguel Ojeda's avatar
      include/linux/module.h: copy __init/__exit attrs to init/cleanup_module · 08aaa79b
      Miguel Ojeda authored
      commit a6e60d84989fa0e91db7f236eda40453b0e44afa upstream.
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      In particular, it triggers for all the init/cleanup_module
      aliases in the kernel (defined by the module_init/exit macros),
      ending up being very noisy.
      These aliases point to the __init/__exit functions of a module,
      which are defined as __cold (among other attributes). However,
      the aliases themselves do not have the __cold attribute.
      Since the compiler behaves differently when compiling a __cold
      function as well as when compiling paths leading to calls
      to __cold functions, the warning is trying to point out
      the possibly-forgotten attribute in the alias.
      In order to keep the warning enabled, we decided to silence
      this case. Ideally, we would mark the aliases directly
      as __init/__exit. However, there are currently around 132 modules
      in the kernel which are missing __init/__exit in their init/cleanup
      functions (either because they are missing, or for other reasons,
      e.g. the functions being called from somewhere else); and
      a section mismatch is a hard error.
      A conservative alternative was to mark the aliases as __cold only.
      However, since we would like to eventually enforce __init/__exit
      to be always marked,  we chose to use the new __copy function
      attribute (introduced by GCC 9 as well to deal with this).
      With it, we copy the attributes used by the target functions
      into the aliases. This way, functions that were not marked
      as __init/__exit won't have their aliases marked either,
      and therefore there won't be a section mismatch.
      Note that the warning would go away marking either the extern
      declaration, the definition, or both. However, we only mark
      the definition of the alias, since we do not want callers
      (which only see the declaration) to be compiled as if the function
      was __cold (and therefore the paths leading to those calls
      would be assumed to be unlikely).
      Link: https://lore.kernel.org/lkml/20190123173707.GA16603@gmail.com/
      Link: https://lore.kernel.org/lkml/20190206175627.GA20399@gmail.com/Suggested-by: 's avatarMartin Sebor <msebor@gcc.gnu.org>
      Acked-by: 's avatarJessica Yu <jeyu@kernel.org>
      Signed-off-by: 's avatarMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: 's avatarStefan Agner <stefan@agner.ch>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Miguel Ojeda's avatar
      Compiler Attributes: add support for __copy (gcc >= 9) · b00c958c
      Miguel Ojeda authored
      commit c0d9782f5b6d7157635ae2fd782a4b27d55a6013 upstream.
      From the GCC manual:
          The copy attribute applies the set of attributes with which function
          has been declared to the declaration of the function to which
          the attribute is applied. The attribute is designed for libraries
          that define aliases or function resolvers that are expected
          to specify the same set of attributes as their targets. The copy
          attribute can be used with functions, variables, or types. However,
          the kind of symbol to which the attribute is applied (either
          function or variable) must match the kind of symbol to which
          the argument refers. The copy attribute copies only syntactic and
          semantic attributes but not attributes that affect a symbol’s
          linkage or visibility such as alias, visibility, or weak.
          The deprecated attribute is also not copied.
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target, e.g.:
          void __cold f(void) {}
          void __alias("f") g(void);
          warning: 'g' specifies less restrictive attribute than
          its target 'f': 'cold' [-Wmissing-attributes]
      Using __copy(f) we can copy the __cold attribute from f to g:
          void __cold f(void) {}
          void __copy(f) __alias("f") g(void);
      This attribute is most useful to deal with situations where an alias
      is declared but we don't know the exact attributes the target has.
      For instance, in the kernel, the widely used module_init/exit macros
      define the init/cleanup_module aliases, but those cannot be marked
      always as __init/__exit since some modules do not have their
      functions marked as such.
      Suggested-by: 's avatarMartin Sebor <msebor@gcc.gnu.org>
      Reviewed-by: 's avatarNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: 's avatarMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: 's avatarStefan Agner <stefan@agner.ch>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jiri Slaby's avatar
      memcg: make it work on sparse non-0-node systems · 1bd33537
      Jiri Slaby authored
      commit 3e8589963773a5c23e2f1fe4bcad0e9a90b7f471 upstream.
      We have a single node system with node 0 disabled:
        Scanning NUMA topology in Northbridge 24
        Number of physical nodes 2
        Skipping disabled node 0
        Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
        NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]
      This causes crashes in memcg when system boots:
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        RIP: 0010:list_lru_add+0x94/0x170
        Call Trace:
      It is reproducible as far as 4.12.  I did not try older kernels.  You have
      to have a new enough systemd, e.g.  241 (the reason is unknown -- was not
      investigated).  Cannot be reproduced with systemd 234.
      The system crashes because the size of lru array is never updated in
      memcg_update_all_list_lrus and the reads are past the zero-sized array,
      causing dereferences of random memory.
      The root cause are list_lru_memcg_aware checks in the list_lru code.  The
      test in list_lru_memcg_aware is broken: it assumes node 0 is always
      present, but it is not true on some systems as can be seen above.
      So fix this by avoiding checks on node 0.  Remember the memcg-awareness by
      a bool flag in struct list_lru.
      Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Signed-off-by: 's avatarJiri Slaby <jslaby@suse.cz>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: 's avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: 's avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: 's avatarShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Rasmus Villemoes's avatar
      include/linux/bitops.h: sanitize rotate primitives · 03f6cbbd
      Rasmus Villemoes authored
      commit ef4d6f6b275c498f8e5626c99dbeefdc5027f843 upstream.
      The ror32 implementation (word >> shift) | (word << (32 - shift) has
      undefined behaviour if shift is outside the [1, 31] range.  Similarly
      for the 64 bit variants.  Most callers pass a compile-time constant
      (naturally in that range), but there's an UBSAN report that these may
      actually be called with a shift count of 0.
      Instead of special-casing that, we can make them DTRT for all values of
      shift while also avoiding UB.  For some reason, this was already partly
      done for rol32 (which was well-defined for [0, 31]).  gcc 8 recognizes
      these patterns as rotates, so for example
        __u32 rol32(__u32 word, unsigned int shift)
      	return (word << (shift & 31)) | (word >> ((-shift) & 31));
      compiles to
      0000000000000020 <rol32>:
        20:   89 f8                   mov    %edi,%eax
        22:   89 f1                   mov    %esi,%ecx
        24:   d3 c0                   rol    %cl,%eax
        26:   c3                      retq
      Older compilers unfortunately do not do as well, but this only affects
      the small minority of users that don't pass constants.
      Due to integer promotions, ro[lr]8 were already well-defined for shifts
      in [0, 8], and ro[lr]16 were mostly well-defined for shifts in [0, 16]
      (only mostly - u16 gets promoted to _signed_ int, so if bit 15 is set,
      word << 16 is undefined).  For consistency, update those as well.
      Link: http://lkml.kernel.org/r/20190410211906.2190-1-linux@rasmusvillemoes.dkSigned-off-by: 's avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: 's avatarIdo Schimmel <idosch@mellanox.com>
      Tested-by: 's avatarIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: 's avatarWill Deacon <will.deacon@arm.com>
      Cc: Vadim Pasternak <vadimp@mellanox.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarMatthias Kaehlcke <mka@chromium.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Chris Packham's avatar
      tipc: Avoid copying bytes beyond the supplied data · 39c2bc5a
      Chris Packham authored
      TLV_SET is called with a data pointer and a len parameter that tells us
      how many bytes are pointed to by data. When invoking memcpy() we need
      to careful to only copy len bytes.
      Previously we would copy TLV_LENGTH(len) bytes which would copy an extra
      4 bytes past the end of the data pointer which newer GCC versions
      complain about.
       In file included from test.c:17:
       In function 'TLV_SET',
           inlined from 'test' at test.c:186:5:
       warning: 'memcpy' forming offset [33, 36] is out of the bounds [0, 32]
       of object 'bearer_name' with type 'char[32]' [-Warray-bounds]
           memcpy(TLV_DATA(tlv_ptr), data, tlv_len);
       test.c: In function 'test':
       test.c::161:10: note:
       'bearer_name' declared here
           char bearer_name[TIPC_MAX_BEARER_NAME];
      We still want to ensure any padding bytes at the end are initialised, do
      this with a explicit memset() rather than copy bytes past the end of
      data. Apply the same logic to TCM_SET.
      Signed-off-by: 's avatarChris Packham <chris.packham@alliedtelesis.co.nz>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric Dumazet's avatar
      inet: switch IP ID generator to siphash · e10789ac
      Eric Dumazet authored
      [ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ]
      According to Amit Klein and Benny Pinkas, IP ID generation is too weak
      and might be used by attackers.
      Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
      having 64bit key and Jenkins hash is risky.
      It is time to switch to siphash and its 128bit keys.
      Signed-off-by: 's avatarEric Dumazet <edumazet@google.com>
      Reported-by: 's avatarAmit Klein <aksecurity@gmail.com>
      Reported-by: 's avatarBenny Pinkas <benny@pinkas.net>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 31 May, 2019 6 commits
  11. 25 May, 2019 2 commits
    • Daniel Borkmann's avatar
      bpf: add map_lookup_elem_sys_only for lookups from syscall side · 9f229230
      Daniel Borkmann authored
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      Add a callback map_lookup_elem_sys_only() that map implementations
      could use over map_lookup_elem() from system call side in case the
      map implementation needs to handle the latter differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, this will
      be preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for LRU map, but once development window
      opens, we can convert other map types from map_lookup_elem() (here,
      the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
      the callback to simplify and clean up the latter.
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Stefan Mätje's avatar
      PCI: Work around Pericom PCIe-to-PCI bridge Retrain Link erratum · 914168e1
      Stefan Mätje authored
      commit 4ec73791a64bab25cabf16a6067ee478692e506d upstream.
      Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode
      (conventional PCI on primary side, PCIe on downstream side), the Retrain
      Link bit needs to be cleared manually to allow the link training to
      complete successfully.
      If it is not cleared manually, the link training is continuously restarted
      and no devices below the PCI-to-PCIe bridge can be accessed.  That means
      drivers for devices below the bridge will be loaded but won't work and may
      even crash because the driver is only reading 0xffff.
      See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for
      details.  Devices known as affected so far are: PI7C9X110, PI7C9X111SL,
      Add a new flag, clear_retrain_link, in struct pci_dev.  Quirks for affected
      devices set this bit.
      Note that pcie_retrain_link() lives in aspm.c because that's currently the
      only place we use it, but this erratum is not specific to ASPM, and we may
      retrain links for other reasons in the future.
      Signed-off-by: 's avatarStefan Mätje <stefan.maetje@esd.eu>
      [bhelgaas: apply regardless of CONFIG_PCIEASPM]
      Signed-off-by: 's avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: 's avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      CC: stable@vger.kernel.org
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>