      L2TP:Adjust intf MTU, add underlay L3, L2 hdrs. · b784e7eb
      Existing L2TP kernel code does not derive the optimal MTU for Ethernet
      pseudowires and instead leaves this to a userspace L2TP daemon or
      operator. If an MTU is not specified, the existing kernel code chooses
      an MTU that does not take account of all tunnel header overheads, which
      can lead to unwanted IP fragmentation. When L2TP is used without a
      control plane (userspace daemon), we would prefer that the kernel does a
      better job of choosing a default pseudowire MTU, taking account of all
      tunnel header overheads, including IP header options, if any. This patch
      addresses this.
      This change set uses the new kernel function, kernel_sock_ip_overhead(),
      to factor in the outer IP overhead on the L2TP tunnel socket (including
      IP options, if any) when calculating the default MTU for an Ethernet
      pseudowire, and it also accounts for the inner Ethernet header.
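      The arithmetic described above can be sketched in userspace C. This is
      only an illustration, not the kernel's code: pw_default_mtu() and its
      parameters are hypothetical names, the outer IP overhead is passed in
      by the caller (in the kernel it would come from
      kernel_sock_ip_overhead()), and the L2TP header length is likewise
      supplied by the caller rather than derived.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_HDR_LEN 14u /* inner Ethernet header carried by the pseudowire */
#define UDP_HDR_LEN 8u  /* only present when the tunnel uses UDP encapsulation */

/*
 * Hypothetical helper: subtract every tunnel header from the underlay MTU.
 * ip_overhead is the outer IP header size including IP options, if any.
 */
static uint32_t pw_default_mtu(uint32_t underlay_mtu, uint32_t ip_overhead,
                               int udp_encap, uint32_t l2tp_hdr_len)
{
    uint32_t overhead = ip_overhead + l2tp_hdr_len + ETH_HDR_LEN;

    if (udp_encap)
        overhead += UDP_HDR_LEN;

    /* The underlay must be able to carry at least the headers. */
    assert(overhead < underlay_mtu);
    return underlay_mtu - overhead;
}
```

      With a 1500-byte underlay, a 20-byte IPv4 header, UDP encapsulation and
      an assumed 8-byte L2TP header, the helper yields 1450; any IP options
      reduce the result by their length.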
      l2tp: take a reference on sessions used in genetlink handlers · 2777e2ab
      Callers of l2tp_nl_session_find() need to hold a reference on the
      returned session since there's no guarantee that it isn't going to
      disappear from under them.
      Relying on the fact that no l2tp netlink message may be processed
      concurrently isn't enough: sessions can be deleted by other means
      (e.g. by closing the PPPOL2TP socket of a ppp pseudowire).
      l2tp_nl_cmd_session_delete() is a bit special: it runs a callback
      function that may require a previous call to session->ref(). In
      particular, for ppp pseudowires, the callback is l2tp_session_delete(),
      which then calls pppol2tp_session_close() and dereferences the PPPOL2TP
      socket. The socket might already be gone at the moment
      l2tp_session_delete() calls session->ref(), so we need to take a
      reference during the session lookup. So we need to pass the do_ref
      variable down to l2tp_session_get() and l2tp_session_get_by_ifname().
      Since all callers have to be updated, l2tp_session_find_by_ifname() and
      l2tp_nl_session_find() are renamed to reflect their new behaviour.
      l2tp: hold session while sending creation notifications · 5e6a9e5a
      l2tp_session_find() doesn't take any reference on the returned session.
      Therefore, the session may disappear while sending the notification.
      Use l2tp_session_get() instead and decrement session's refcount once
      the notification is sent.
      l2tp: fix duplicate session creation · dbdbc73b
      l2tp_session_create() relies on its caller for checking for duplicate
      sessions. This is racy since a session can be concurrently inserted
      after the caller's verification.
      Fix this by letting l2tp_session_create() verify sessions uniqueness
      upon insertion. Callers need to be adapted to check
      l2tp_session_create()'s return code instead of checking for duplicates
      themselves beforehand.
      pppol2tp_connect() is a bit special because it has to work on existing
      sessions (if they're not connected) or to create a new session if none
      is found. When acting on a preexisting session, a reference must be
      held or it could go away on us. So we have to use l2tp_session_get()
      instead of l2tp_session_find() and drop the reference before exiting.
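      A minimal userspace sketch of checking uniqueness upon insertion,
      assuming a simple linked list protected by a mutex. The names
      (dup_session, dup_table, session_add_unique) are hypothetical and the
      kernel uses its own hash tables and locks; the point is only that the
      duplicate check and the insertion happen in the same critical section,
      so no other inserter can slip in between them.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct dup_session {
    uint32_t id;
    struct dup_session *next;
};

struct dup_table {
    pthread_mutex_t lock;
    struct dup_session *head;
};

/* Insert s unless a session with the same id is already present. */
static int session_add_unique(struct dup_table *t, struct dup_session *s)
{
    int ret = 0;

    assert(t != NULL && s != NULL);

    pthread_mutex_lock(&t->lock);
    for (struct dup_session *it = t->head; it != NULL; it = it->next) {
        if (it->id == s->id) {
            ret = -EEXIST; /* duplicate found: do not insert */
            goto out;
        }
    }
    /* Still holding the lock, so the check above cannot be stale. */
    s->next = t->head;
    t->head = s;
out:
    pthread_mutex_unlock(&t->lock);
    return ret;
}
```

      Callers then only need to test the return value; a separate lookup
      before the call would reintroduce the race.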
      l2tp: ensure session can't get removed during pppol2tp_session_ioctl() · 57377d63
      Holding a reference on session is required before calling
      pppol2tp_session_ioctl(). The session could get freed while processing the
      ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
      we also need to take a reference on it in l2tp_session_get().
      l2tp: fix race in l2tp_recv_common() · 61b9a047
      Taking a reference on sessions in l2tp_recv_common() is racy; this
      has to be done by the callers.
      To this end, a new function is required (l2tp_session_get()) to
      atomically lookup a session and take a reference on it. Callers then
      have to manually drop this reference.
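      The lookup-and-hold pattern can be sketched as follows. This is a
      userspace illustration under simplifying assumptions: the names
      (ref_session, ref_table, ref_session_get, ref_session_put) are
      hypothetical, and the reference count is protected by the table mutex
      here, whereas the kernel has its own refcounting and locking
      primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct ref_session {
    uint32_t id;
    int refcnt; /* protected by ref_table.lock in this sketch */
    struct ref_session *next;
};

struct ref_table {
    pthread_mutex_t lock;
    struct ref_session *head;
};

/* Look up a session and take a reference before the lock is released. */
static struct ref_session *ref_session_get(struct ref_table *t, uint32_t id)
{
    struct ref_session *s;

    assert(t != NULL);

    pthread_mutex_lock(&t->lock);
    for (s = t->head; s != NULL; s = s->next) {
        if (s->id == id) {
            s->refcnt++; /* held on behalf of the caller */
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return s; /* NULL if no session matched */
}

/* Drop a reference taken by ref_session_get(); returns the remaining count. */
static int ref_session_put(struct ref_table *t, struct ref_session *s)
{
    int remaining;

    pthread_mutex_lock(&t->lock);
    remaining = --s->refcnt;
    pthread_mutex_unlock(&t->lock);
    return remaining;
}
```

      Because the reference is taken while the lookup lock is still held,
      there is no window in which the session can be freed between being
      found and being pinned.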
      l2tp: Avoid schedule while atomic in exit_net · 12d656af
      While destroying a network namespace that contains an L2TP tunnel, a
      "BUG: scheduling while atomic" can be observed.
      Enabling lockdep shows that this is happening because l2tp_exit_net()
      is calling l2tp_tunnel_closeall() (via l2tp_tunnel_delete()) from
      within an RCU critical section.
      l2tp_exit_net() takes rcu_read_lock_bh()
        << list_for_each_entry_rcu() >>
        l2tp_tunnel_delete()
          l2tp_tunnel_closeall()
            __l2tp_session_unhash()
              synchronize_rcu() << Illegal inside RCU critical section >>
      BUG: sleeping function called from invalid context
      in_atomic(): 1, irqs_disabled(): 0, pid: 86, name: kworker/u16:2
      INFO: lockdep is turned off.
      CPU: 2 PID: 86 Comm: kworker/u16:2 Tainted: G        W  O    4.4.6-at1 #2
      Hardware name: Xen HVM domU, BIOS 4.6.1-xs125300 05/09/2016
      Workqueue: netns cleanup_net
       0000000000000000 ffff880202417b90 ffffffff812b0013 ffff880202410ac0
       ffffffff81870de8 ffff880202417bb8 ffffffff8107aee8 ffffffff81870de8
       0000000000000c51 0000000000000000 ffff880202417be0 ffffffff8107b024
      Call Trace:
       [<ffffffff812b0013>] dump_stack+0x85/0xc2
       [<ffffffff8107aee8>] ___might_sleep+0x148/0x240
       [<ffffffff8107b024>] __might_sleep+0x44/0x80
       [<ffffffff810b21bd>] synchronize_sched+0x2d/0xe0
       [<ffffffff8109be6d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff8105c7bb>] ? __local_bh_enable_ip+0x6b/0xc0
       [<ffffffff816a1b00>] ? _raw_spin_unlock_bh+0x30/0x40
       [<ffffffff81667482>] __l2tp_session_unhash+0x172/0x220
       [<ffffffff81667397>] ? __l2tp_session_unhash+0x87/0x220
       [<ffffffff8166888b>] l2tp_tunnel_closeall+0x9b/0x140
       [<ffffffff81668c74>] l2tp_tunnel_delete+0x14/0x60
       [<ffffffff81668dd0>] l2tp_exit_net+0x110/0x270
       [<ffffffff81668d5c>] ? l2tp_exit_net+0x9c/0x270
       [<ffffffff815001c3>] ops_exit_list.isra.6+0x33/0x60
       [<ffffffff81501166>] cleanup_net+0x1b6/0x280
      This bug can easily be reproduced with a few steps:
       $ sudo unshare -n bash  # Create a shell in a new namespace
       # ip link set lo up
       # ip addr add dev lo
       # ip l2tp add tunnel remote local tunnel_id 1 \
          peer_tunnel_id 1 udp_sport 50000 udp_dport 50000
       # ip l2tp add session name foo tunnel_id 1 session_id 1 \
          peer_session_id 1
       # ip link set foo up
       # exit  # Exit the shell, in turn exiting the namespace
       $ dmesg
       [942121.089216] BUG: scheduling while atomic: kworker/u16:3/13872/0x00000200
      To fix this, move the call to l2tp_tunnel_closeall() out of the RCU
      critical section, and instead call it from l2tp_tunnel_del_work(), which
      is running from the l2tp_wq workqueue.
      l2tp: fix address test in __l2tp_ip6_bind_lookup() · 31e2f21f
      The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
      matches if addr is NULL or if addr != laddr.
      But the intent of __l2tp_ip6_bind_lookup() is to find sockets with
      the same address, so the ipv6_addr_equal() condition needs to be
      inverted.
      For better clarity and consistency with the rest of the expression, the
      (!X || X == Y) notation is used instead of !(X && X != Y).
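      The difference between the two forms can be checked with a small
      userspace sketch. struct addr6 and addr6_equal() are stand-ins assumed
      here for struct in6_addr and ipv6_addr_equal(); the two predicates
      mirror the expression quoted above and the (!X || X == Y) form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct addr6 {
    unsigned char b[16]; /* stand-in for struct in6_addr */
};

static bool addr6_equal(const struct addr6 *a, const struct addr6 *b)
{
    return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}

/* Old test: true when addr is NULL or when addr DIFFERS from laddr. */
static bool addr_match_old(const struct addr6 *addr, const struct addr6 *laddr)
{
    assert(laddr != NULL);
    return !(addr && addr6_equal(addr, laddr));
}

/* Fixed test: true when addr is NULL or when addr EQUALS laddr. */
static bool addr_match_fixed(const struct addr6 *addr, const struct addr6 *laddr)
{
    assert(laddr != NULL);
    return !addr || addr6_equal(addr, laddr);
}
```

      For two identical addresses the old predicate returns false and the
      fixed one returns true; for two different addresses the results are
      swapped. Both treat a NULL addr as a match.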
      l2tp: fix lookup for sockets not bound to a device in l2tp_ip · df90e688
      When looking up an l2tp socket, we must consider a null netdevice id as
      a wildcard. There are currently two problems caused by
      __l2tp_ip_bind_lookup() not considering 'dif' as a wildcard when set to 0:
        * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
          never receives any packet. Since __l2tp_ip_bind_lookup() is called
          with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
          different from 'dif' so the socket doesn't match.
        * Two sockets, one bound to a device but not the other, can be bound
          to the same address. If the first socket binding to the address is
          the one that is also bound to a device, the second socket can bind
          to the same address without __l2tp_ip_bind_lookup() noticing the
          overlap.
      To fix this issue, we need to consider that any null device index, be
      it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
      We also need to pass the input device index to __l2tp_ip_bind_lookup()
      on reception so that sockets bound to a device never receive packets
      from other devices.
      This patch fixes l2tp_ip6 in the same way.
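      The wildcard rule stated above ("any null device index matches with
      any other value") can be written as a small predicate. dev_match() is
      a hypothetical name used for illustration; it is not claimed to be the
      exact expression used in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A device index of 0 on either side acts as a wildcard; otherwise the
 * socket's bound device and the lookup device must be identical.
 */
static bool dev_match(int bound_dev_if, int dif)
{
    /* Interface indexes are non-negative; 0 means "not bound / any". */
    assert(bound_dev_if >= 0 && dif >= 0);
    return !bound_dev_if || !dif || bound_dev_if == dif;
}
```

      With this rule a lookup with dif == 0 finds a socket bound to any
      device, an unbound socket matches packets from any device, and a
      socket bound to device 2 does not match a packet received on device 3.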
      l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind() · d5e3a190
      It's not enough to check for sockets bound to the same address at the
      beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
      time, a socket with the same address could be bound before we take
      the l2tp lock again.
      This patch moves the lookup right before inserting the new socket, so
      that no change can ever happen to the list between address lookup and
      socket insertion.
      Care is taken to avoid side effects on the socket in case of failure.
      That is, modifications of the socket are done after the lookup, when
      binding is guaranteed to succeed, and before releasing the l2tp lock,
      so that concurrent lookups will always see fully initialised sockets.
      For l2tp_ip, 'ret' is set to -EINVAL before checking the SOCK_ZAPPED
      bit. Error code was mistakenly set to -EADDRINUSE on error by commit
      32c23116 ("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").
      Using -EINVAL restores original behaviour.
      For l2tp_ip6, the lookup is now always done with the correct bound
      device. Before this patch, when binding to a link-local address, the
      lookup was done with the original sk->sk_bound_dev_if, which was later
      overwritten with addr->l2tp_scope_id. Lookup is now performed with the
      final sk->sk_bound_dev_if value.
      Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
      dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
      already been checked at this point (this part of the code seems to have
      been copy-pasted from net/ipv6/raw.c).
      l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv() · a3c18422
      Socket must be held while under the protection of the l2tp lock; there
      is no guarantee that sk remains valid after the read_unlock_bh() call.
      Same issue for l2tp_ip and l2tp_ip6.
      l2tp: lock socket before checking flags in connect() · 0382a25a
      Socket flags aren't updated atomically, so the socket must be locked
      while reading the SOCK_ZAPPED flag.
      This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
      also brings error handling for __ip6_datagram_connect() failures.
      l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() · 32c23116
      Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
      Without lock, a concurrent call could modify the socket flags between
      the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
      a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
      would then leave a stale pointer there, generating use-after-free
      errors when walking through the list or modifying adjacent entries.
      BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8
      Write of size 8 by task syz-executor/10987
      CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
       ffff880031d97838 ffffffff829f835b ffff88001b5a1640 ffff8800081b0ec0
       ffff8800081b15a0 ffff8800081b6d20 ffff880031d97860 ffffffff8174d3cc
       ffff880031d978f0 ffff8800081b0e80 ffff88001b5a1640 ffff880031d978e0
      Call Trace:
       [<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15
       [<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
       [<     inline     >] print_address_description mm/kasan/report.c:194
       [<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
       [<     inline     >] kasan_report mm/kasan/report.c:303
       [<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329
       [<     inline     >] __write_once_size ./include/linux/compiler.h:249
       [<     inline     >] __hlist_del ./include/linux/list.h:622
       [<     inline     >] hlist_del_init ./include/linux/list.h:637
       [<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
       [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
       [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
       [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
       [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
       [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
       [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
       [<ffffffff813774f9>] task_work_run+0xf9/0x170
       [<ffffffff81324aae>] do_exit+0x85e/0x2a00
       [<ffffffff81326dc8>] do_group_exit+0x108/0x330
       [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
       [<ffffffff811b49af>] do_signal+0x7f/0x18f0
       [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
       [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
       [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
       [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
      Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448
      PID = 10987
       [ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
       [ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
       [ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0
       [ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20
       [ 1116.897025] [<     inline     >] slab_post_alloc_hook mm/slab.h:417
       [ 1116.897025] [<     inline     >] slab_alloc_node mm/slub.c:2708
       [ 1116.897025] [<     inline     >] slab_alloc mm/slub.c:2716
       [ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
       [ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326
       [ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388
       [ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182
       [ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153
       [ 1116.897025] [<     inline     >] sock_create net/socket.c:1193
       [ 1116.897025] [<     inline     >] SYSC_socket net/socket.c:1223
       [ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203
       [ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6
      PID = 10987
       [ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
       [ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
       [ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0
       [ 1116.897025] [<     inline     >] slab_free_hook mm/slub.c:1352
       [ 1116.897025] [<     inline     >] slab_free_freelist_hook mm/slub.c:1374
       [ 1116.897025] [<     inline     >] slab_free mm/slub.c:2951
       [ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
       [ 1116.897025] [<     inline     >] sk_prot_free net/core/sock.c:1369
       [ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444
       [ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452
       [ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460
       [ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471
       [ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589
       [ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243
       [ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
       [ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
       [ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
       [ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
       [ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
       [ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
       [ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170
       [ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00
       [ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330
       [ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
       [ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0
       [ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
       [ 1116.897025] [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
       [ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
       [ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
      Memory state around the buggy address:
       ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
       ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      The same issue exists with l2tp_ip_bind() and l2tp_ip_bind_table.
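      A minimal userspace sketch of the locking order described in this
      commit message, assuming a mutex in place of lock_sock() and a boolean
      in place of the SOCK_ZAPPED flag. fake_sock and fake_bind() are
      hypothetical names, and bind_count merely stands in for insertion into
      the bind table.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_sock {
    pthread_mutex_t lock;
    bool zapped;    /* true while the socket has not been bound yet */
    int bind_count; /* how many times the socket was "inserted" */
};

/* Take the lock first, then test the flag, then insert. */
static int fake_bind(struct fake_sock *sk)
{
    int ret = -EINVAL;

    assert(sk != NULL);

    pthread_mutex_lock(&sk->lock);
    if (!sk->zapped)
        goto out; /* already bound: refuse a second insertion */

    sk->bind_count++;   /* stands for adding the socket to the bind table */
    sk->zapped = false; /* cleared before the lock is released */
    ret = 0;
out:
    pthread_mutex_unlock(&sk->lock);
    return ret;
}
```

      Since the flag is read and cleared under the same lock, two concurrent
      callers cannot both observe it as set, so the socket is inserted at
      most once.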
