1. 18 Oct, 2018 2 commits
    • inet: frags: break the 2GB limit for frags storage · 7f617068
      Eric Dumazet authored
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error: if high_thresh was set
      to ~2GB, this test in inet_frag_alloc() was never true:
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
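
      As a minimal sketch of the shape of the change (field and helper names
      follow include/net/inet_frag.h of that era; illustrative, not the
      verbatim patch):

          #include <linux/atomic.h>

          struct netns_frags {
                  atomic_long_t   mem;            /* was atomic_t: capped at INT_MAX */
                  long            high_thresh;    /* was int */
                  long            low_thresh;     /* was int */
          };

          static inline long frag_mem_limit(struct netns_frags *nf)
          {
                  return atomic_long_read(&nf->mem);
          }

          static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
          {
                  atomic_long_add(val, &nf->mem);
          }

          static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
          {
                  atomic_long_sub(val, &nf->mem);
          }

      On 64bit arches atomic_long_t is 64 bits wide, so thresholds beyond 2GB
      (such as the 16GB one used in the test below) compare correctly; on
      32bit arches long is still 32 bits and nothing changes.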
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • inet: frags: use rhashtables for reassembly units · 23ce9c5c
      Eric Dumazet authored
      Some applications still rely on IP fragmentation and, to be fair, the
      Linux reassembly unit does not hold up under any serious load.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      
      A work queue is supposed to garbage-collect items when the host is under
      memory pressure, and to do a hash rebuild, changing the seed used in the
      hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash
      rebuild, occurring every 5 seconds if the host is under fire.
      
      Then there is the problem of this hash table being shared among all netns.
      
      It is time to switch to rhashtables, and to allocate one of them per
      netns to speed up netns dismantle, since this is a critical metric
      these days.
      
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
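
      As a rough sketch of the new shape (names illustrative, not the verbatim
      patch -- see the cherry-picked commit below; the real key and refcount
      handling is more involved):

          #include <linux/rhashtable.h>
          #include <linux/refcount.h>

          struct frag_queue {
                  struct rhash_head node;  /* linkage into the per-netns table */
                  u32 key;                 /* real key hashes addrs/id/protocol */
                  refcount_t refcnt;
          };

          static const struct rhashtable_params frags_params = {
                  .head_offset         = offsetof(struct frag_queue, node),
                  .key_offset          = offsetof(struct frag_queue, key),
                  .key_len             = sizeof(u32),
                  .automatic_shrinking = true,
          };

          /* One table per netns (rhashtable_init() at netns setup).  Lookup
           * runs under RCU: no hashed spinlocks, no 25 ms hash rebuilds.
           */
          static struct frag_queue *frag_find(struct rhashtable *ht, u32 key)
          {
                  struct frag_queue *fq;

                  rcu_read_lock();
                  fq = rhashtable_lookup(ht, &key, frags_params);
                  if (fq && !refcount_inc_not_zero(&fq->refcnt))
                          fq = NULL;
                  rcu_read_unlock();
                  return fq;
          }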
      
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      
      After the patch, I reach 9 Mpps without any tuning, and can use up to
      2GB of storage for the fragments (the exact number depends on frags
      being evicted after the timeout).
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
      
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 13 Jun, 2018 1 commit
  3. 24 Nov, 2016 1 commit
  4. 08 Nov, 2016 1 commit
  5. 17 Oct, 2016 1 commit
  6. 14 Oct, 2016 1 commit
  7. 10 Oct, 2016 1 commit
  8. 28 Sep, 2016 1 commit
  9. 20 Sep, 2016 1 commit
  10. 19 Sep, 2016 1 commit
    • ipvlan: Introduce l3s mode · 4fbae7d8
      Mahesh Bandewar authored
      In a typical IPvlan L3 setup, the master is in the default-ns and each
      slave is in a different (slave) ns. In this setup, egress packet
      processing for traffic originating from a slave-ns hits all NF_HOOKs in
      the slave-ns as well as in the default-ns. However, the same is not true
      for ingress processing: all these NF_HOOKs are hit only in the slave-ns,
      skipping them in the default-ns. IPvlan in L3 mode is restrictive, and
      if admins want to deploy iptables rules in the default-ns, this
      asymmetric data path makes it impossible to do so.
      
      This patch makes use of the l3_rcv() (added as part of l3mdev
      enhancements) to perform input route lookup on RX packets without
      changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
      to change the skb->dev just before handing over skb to L4.
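
      A hedged sketch of that hook, with the address-to-slave lookup left as
      a hypothetical helper (the real patch resolves it via the IPvlan
      address tables):

          #include <linux/netfilter.h>
          #include <linux/netfilter_ipv4.h>

          struct net_device *ipvlan_addr_lookup_dev(struct sk_buff *skb); /* hypothetical */

          static unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
                                              const struct nf_hook_state *state)
          {
                  struct net_device *slave_dev = ipvlan_addr_lookup_dev(skb);

                  if (slave_dev)
                          skb->dev = slave_dev;   /* retarget just before L4 */
                  return NF_ACCEPT;
          }

          static const struct nf_hook_ops ipvl_nfops[] = {
                  {
                          .hook     = ipvlan_nf_input,
                          .pf       = NFPROTO_IPV4,
                          .hooknum  = NF_INET_LOCAL_IN,
                          .priority = INT_MAX,
                  },
          };

      The input route lookup itself happens earlier via l3_rcv(), so by the
      time this hook runs the packet has already been routed as local traffic
      without skb->dev having been changed.
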
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 01 Sep, 2016 1 commit
    • rxrpc: Don't expose skbs to in-kernel users [ver #2] · d001648e
      David Howells authored
      Don't expose skbs to in-kernel users, such as the AFS filesystem, but
      instead provide a notification hook that indicates that a call needs
      attention and another that indicates that there's a new call to be
      collected.
      
      This makes the following possibilities more achievable:
      
       (1) Call refcounting can be made simpler if skbs don't hold refs to calls.
      
       (2) skbs referring to non-data events will be able to be freed much
           sooner, rather than being queued for AFS to pick up, as
           rxrpc_kernel_recv_data() will be able to consult the call state.
      
       (3) We can shortcut the receive phase when a call is remotely aborted
           because we don't have to go through all the packets to get to the one
           cancelling the operation.
      
       (4) It makes it easier to do encryption/decryption directly between AFS's
           buffers and sk_buffs.
      
       (5) Encryption/decryption can more easily be done in AFS's thread
           contexts - usually that of the userspace process that issued a syscall
           - rather than in one of rxrpc's background threads on a workqueue.
      
       (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.
      
      To make this work, the following interface function has been added:
      
           int rxrpc_kernel_recv_data(
      		struct socket *sock, struct rxrpc_call *call,
      		void *buffer, size_t bufsize, size_t *_offset,
      		bool want_more, u32 *_abort_code);
      
      This is the recvmsg equivalent.  It allows the caller to find out about the
      state of a specific call and to transfer received data into a buffer
      piecemeal.
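
      A hedged usage sketch (sock and call as set up by the kernel service;
      the buffer name is illustrative; negative returns mean the call failed,
      with any remote abort reported through the abort-code pointer):

          char buf[1024];
          size_t offset = 0;
          u32 abort_code = 0;
          int ret;

          /* Pull up to sizeof(buf) reply bytes; want_more == false says we
           * expect this to be the end of the data.
           */
          ret = rxrpc_kernel_recv_data(sock, call, buf, sizeof(buf),
                                       &offset, false, &abort_code);
          if (ret < 0)
                  pr_err("call failed, abort %u\n", abort_code);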
      
      afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
      logic between them.  They don't wait synchronously yet because the socket
      lock needs to be dealt with.
      
      Five interface functions have been removed:
      
      	rxrpc_kernel_is_data_last()
      	rxrpc_kernel_get_abort_code()
      	rxrpc_kernel_get_error_number()
      	rxrpc_kernel_free_skb()
      	rxrpc_kernel_data_consumed()
      
      As a temporary hack, sk_buffs going to an in-kernel call are queued on the
      rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
      in-kernel user.  To process the queue internally, a temporary function,
      temp_deliver_data() has been added.  This will be replaced with common code
      between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
      future patch.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 31 Aug, 2016 1 commit
  13. 30 Aug, 2016 2 commits
    • rxrpc: Pass struct socket * to more rxrpc kernel interface functions · 4de48af6
      David Howells authored
      Pass struct socket * to more rxrpc kernel interface functions.  They
      should start from this rather than from the socket pointer in the
      rxrpc_call struct if they need to access the socket.
      
      I have left:
      
      	rxrpc_kernel_is_data_last()
      	rxrpc_kernel_get_abort_code()
      	rxrpc_kernel_get_error_number()
      	rxrpc_kernel_free_skb()
      	rxrpc_kernel_data_consumed()
      
      unmodified as they're all about to be removed (and, in any case, don't
      touch the socket).
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Provide a way for AFS to ask for the peer address of a call · 8324f0bc
      David Howells authored
      Provide a function so that kernel users, such as AFS, can ask for the peer
      address of a call:
      
         void rxrpc_kernel_get_peer(struct rxrpc_call *call,
      			      struct sockaddr_rxrpc *_srx);
      
      In the future the kernel service won't get sk_buffs to look inside.
      Further, this allows us to hide any canonicalisation inside AF_RXRPC for
      when IPv6 support is added.
      
      Also propagate this through to afs_find_server() and issue a warning if we
      can't handle the address family yet.
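
      For illustration, a kernel-side caller might use it like this (sketch
      only):

          struct sockaddr_rxrpc srx;

          rxrpc_kernel_get_peer(call, &srx);
          /* AFS can now switch on srx.transport.family instead of digging
           * the peer address out of an skb.
           */
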
      Signed-off-by: David Howells <dhowells@redhat.com>
  14. 29 Aug, 2016 1 commit
  15. 26 Aug, 2016 1 commit
    • bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      Ido Schimmel authored
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method. In addition, make rocker - the
      sole user of the mark - use the proposed method.
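
      The egress check then reduces to something like this (simplified from
      the patch; BR_INPUT_SKB_CB carries the mark assigned at ingress):

          static bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
                                                   const struct sk_buff *skb)
          {
                  /* Flood only if the device didn't already forward the
                   * packet, or if this port sits on a different switch.
                   */
                  return !skb->offload_fwd_mark ||
                         BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
          }
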
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 25 Aug, 2016 1 commit
    • net: dsa: rename switch operations structure · 9d490b4e
      Vivien Didelot authored
      Now that the dsa_switch_driver structure contains only function pointers
      as it is supposed to, rename it to the more appropriate dsa_switch_ops,
      in line with every other operations structure in the kernel.
      
      No functional changes here, basically just the result of something like:
      s/dsa_switch_driver *drv/dsa_switch_ops *ops/g
      
      However, keep the {un,}register_switch_driver functions and their
      dsa_switch_drivers list as is, since they represent the -- likely to be
      deprecated soon -- legacy DSA registration framework.
      
      In the meantime, also fix the following checks from checkpatch.pl to
      make it happy with this patch:
      
          CHECK: Comparison to NULL could be written "!ops"
          #403: FILE: net/dsa/dsa.c:470:
          +	if (ops == NULL) {
      
          CHECK: Comparison to NULL could be written "ds->ops->get_strings"
          #773: FILE: net/dsa/slave.c:697:
          +		if (ds->ops->get_strings != NULL)
      
          CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats"
          #824: FILE: net/dsa/slave.c:785:
          +	if (ds->ops->get_ethtool_stats != NULL)
      
          CHECK: Comparison to NULL could be written "ds->ops->get_sset_count"
          #835: FILE: net/dsa/slave.c:798:
          +		if (ds->ops->get_sset_count != NULL)
      
          total: 0 errors, 0 warnings, 4 checks, 784 lines checked
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 24 Aug, 2016 1 commit
  18. 17 Aug, 2016 1 commit
  19. 13 Aug, 2016 1 commit
  20. 09 Aug, 2016 1 commit
  21. 06 Aug, 2016 1 commit
    • rxrpc: Fix races between skb free, ACK generation and replying · 372ee163
      David Howells authored
      Inside the kafs filesystem it is possible to occasionally have a call
      processed and terminated before we've had a chance to check whether we need
      to clean up the rx queue for that call because afs_send_simple_reply() ends
      the call when it is done, but this is done in a workqueue item that might
      happen to run to completion before afs_deliver_to_call() completes.
      
      Further, it is possible for rxrpc_kernel_send_data() to be called to send a
      reply before the last request-phase data skb is released.  The rxrpc skb
      destructor is where the ACK processing is done and the call state is
      advanced upon release of the last skb.  ACK generation is also deferred to
      a work item because it's possible that the skb destructor is not called in
      a context where kernel_sendmsg() can be invoked.
      
      To this end, the following changes are made:
      
       (1) kernel_rxrpc_data_consumed() is added.  This should be called whenever
           an skb is emptied so as to crank the ACK and call states.  This does
           not release the skb, however.  kernel_rxrpc_free_skb() must now be
           called to achieve that.  These together replace
           rxrpc_kernel_data_delivered().
      
       (2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().
      
           This makes afs_deliver_to_call() easier to work as the skb can simply
           be discarded unconditionally here without trying to work out what the
           return value of the ->deliver() function means.
      
           The ->deliver() functions can, via afs_data_complete(),
           afs_transfer_reply() and afs_extract_data() mark that an skb has been
           consumed (thereby cranking the state) without the need to
           conditionally free the skb to make sure the state is correct on an
           incoming call for when the call processor tries to send the reply.
      
       (3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
           has finished with a packet and MSG_PEEK isn't set.
      
       (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().
      
           Because of this, we no longer need to clear the destructor and put the
           call before we free the skb in cases where we don't want the ACK/call
           state to be cranked.
      
       (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
           than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
           the delivery function already), and the caller is now responsible for
           producing an abort if that was the last packet.
      
       (6) There are many bits of unmarshalling code where:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		switch (ret) {
      		case 0:		break;
      		case -EAGAIN:	return 0;
      		default:	return ret;
      		}
      
           is to be found.  As -EAGAIN can now be passed back to the caller, we
           now just return if ret < 0:
      
       		ret = afs_extract_data(call, skb, last, ...);
      		if (ret < 0)
      			return ret;
      
       (7) Checks for trailing data and empty final data packets have been
           consolidated as afs_data_complete().  So:
      
      		if (skb->len > 0)
      			return -EBADMSG;
      		if (!last)
      			return 0;
      
           becomes:
      
      		ret = afs_data_complete(call, skb, last);
      		if (ret < 0)
      			return ret;
      
       (8) afs_transfer_reply() now checks the amount of data it has against the
           amount of data desired and the amount of data in the skb and returns
           an error to induce an abort if we don't get exactly what we want.
      
      Without these changes, the following oops can occasionally be observed,
      particularly if some printks are inserted into the delivery path:
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
      CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G            E   4.7.0-fsdevel+ #1303
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Workqueue: kafsd afs_async_workfn [kafs]
      task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
      RIP: 0010:[<ffffffff8108fd3c>]  [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
      RSP: 0018:ffff88040c073bc0  EFLAGS: 00010002
      RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
      RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
      FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
      Stack:
       0000000000000006 000000000be04930 0000000000000000 ffff880400000000
       ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
       ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
      Call Trace:
       [<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
       [<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
       [<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
       [<ffffffff810915f4>] lock_acquire+0x122/0x1b6
       [<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
       [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
       [<ffffffff814c928f>] skb_dequeue+0x18/0x61
       [<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
       [<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
       [<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
       [<ffffffff81063a3a>] process_one_work+0x29d/0x57c
       [<ffffffff81064ac2>] worker_thread+0x24a/0x385
       [<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
       [<ffffffff810696f5>] kthread+0xf3/0xfb
       [<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
       [<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 15 Jul, 2016 2 commits
  23. 14 Jul, 2016 1 commit
  24. 13 Jul, 2016 1 commit
  25. 28 Jun, 2016 1 commit
    • drivers: net: stmmac: reworking the PCS code. · 70523e63
      Giuseppe CAVALLARO authored
      The 3.xx and 4.xx Synopsys GMACs have a very similar embedded PCS
      module, and they share almost the same registers, for example:
        AN_Control, AN_Status, AN_Advertisement, AN_Link_Partner_Ability,
        AN_Expansion, TBI_Extended_Status.
      
      Just the RGMII/SMII Control/Status register differs.
      
      So this patch aims to reorganize and enhance the PCS support.
      It removes the existing support from dwmac1000/dwmac4_core.c,
      moving the basic PCS functions into a new file: stmmac_pcs.h.
      
      The patch also reworks the available APIs so they can be better shared
      among different hardware and easily enhanced to support new features.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 24 Jun, 2016 1 commit
    • netfilter: conntrack: allow increasing bucket size via sysctl too · 3183ab89
      Florian Westphal authored
      No need to restrict this to the module parameter.
      
      We export a copy of the real hash size -- when the user alters the
      value, we allocate the new table, copy entries, etc. before we update
      the real size to the requested one.
      
      This is also needed because the real size is used by concurrent readers
      and cannot be changed without synchronizing the conntrack generation
      seqcnt.
      
      We only allow changing this value from the initial net namespace.
      
      Tested using http-client-benchmark vs. httpterm with a concurrent
      
      while true;do
       echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
      done
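
      The resize itself follows a build-first, publish-last shape, with the
      generation seqcount bracketing the switch so readers retry rather than
      see a mismatched (table, size) pair. A heavily simplified sketch, all
      helpers hypothetical:

          new = alloc_hash_table(new_size);               /* hypothetical */
          if (!new)
                  return -ENOMEM;

          local_bh_disable();
          write_seqcount_begin(&nf_conntrack_generation);
          move_entries(old, old_size, new, new_size);     /* rehash everything */
          publish_table(new, new_size);                   /* update ptr + size */
          write_seqcount_end(&nf_conntrack_generation);
          local_bh_enable();
          free_hash_table(old);                           /* hypothetical */
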
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  27. 17 Jun, 2016 1 commit
  28. 07 Jun, 2016 1 commit
    • net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Eric Dumazet authored
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
      agent [1] are problematic at scale:
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only do the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10% to 20% in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats().
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed the rate estimators to use the same infrastructure so
      that they no longer need to take the root qdisc lock.
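
      The read side of such a seqcount scheme looks roughly like this
      (illustrative; the real helper is the gnet_stats basic-stats copy,
      which takes the qdisc's running seqcount):

          #include <linux/seqlock.h>
          #include <net/gen_stats.h>

          static void read_basic_stats(const seqcount_t *running,
                                       const struct gnet_stats_basic_packed *b,
                                       u64 *bytes, u32 *packets)
          {
                  unsigned int seq;

                  do {
                          seq = read_seqcount_begin(running);
                          *bytes   = b->bytes;
                          *packets = b->packets;
                  } while (read_seqcount_retry(running, seq));
          }

      The enqueue/dequeue path bumps the seqcount around its updates, so
      readers retry instead of grabbing the root qdisc lock.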
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 30 May, 2016 1 commit
  30. 25 May, 2016 3 commits
  31. 17 May, 2016 1 commit
  32. 09 May, 2016 1 commit
  33. 06 May, 2016 1 commit
  34. 04 May, 2016 1 commit
  35. 28 Apr, 2016 1 commit