1. 18 Oct, 2018 2 commits
    • Eric Dumazet's avatar
      inet: frags: break the 2GB limit for frags storage · 7f617068
      Eric Dumazet authored
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonnably well under pressure.
      Current memory tracking is using one atomic_t and integers.
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      Note that this patch avoids an overflow error, if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true :
      if (... || frag_mem_limit(nf) > nf->high_thresh)
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      <frag DDOS>
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric Dumazet's avatar
      inet: frags: use rhashtables for reassembly units · 23ce9c5c
      Eric Dumazet authored
      Some applications still rely on IP fragmentation, and to be fair linux
      reassembly unit is not working under any serious load.
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      A work queue is supposed to garbage collect items when host is under memory
      pressure, and doing a hash rebuild, changing seed used in hash computations.
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      Then there is the problem of sharing this hash table for all netns.
      It is time to switch to rhashtables, and allocate one of them per netns
      to speedup netns dismantle, since this is a critical metric these days.
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
      of storage for the fragments (exact number depends on frags being evicted
      after timeout)
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 24 Aug, 2016 1 commit
  3. 30 May, 2016 1 commit
  4. 26 Apr, 2016 1 commit
  5. 25 Apr, 2016 1 commit
  6. 11 Apr, 2016 1 commit
    • David Ahern's avatar
      net: ipv4: Consider failed nexthops in multipath routes · a6db4494
      David Ahern authored
      Multipath route lookups should consider knowledge about next hops and not
      select a hop that is known to be failed.
                           [h2]                   [h3]
                            |                      |
                           3|                     3|
                          [SP1]                  [SP2]--+
                           1  2                   1     2
                           |  |     /-------------+     |
                           |   \   /                    |
                           |     X                      |
                           |    / \                     |
                           |   /   \---------------\    |
                           1  2                     1   2
       [TOR1] 3-----------------3 [TOR2]
                           4                         4
                            \                       /
                              \                    /
                               \                  /
                                -------|   |-----/
                                       1   2
      host h1 with IP has 2 paths to host h3 at
          root@h1:~# ip ro ls
 dev swp1  proto kernel  scope link  src

                  nexthop via  dev swp1 weight 1
                  nexthop via  dev swp1 weight 1
      If the link between tor3 and tor1 is down and the link between tor1
      and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
      in h1 are alternating between the 2 routes: ping gets one and
      ssh gets the other. Connections that attempt to use the nexthop fail since that neighbor is not reachable:
          root@h1:~# ip neigh show
 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
 dev swp1  FAILED
      The failed path can be avoided by considering known neighbor information
      when selecting next hops. If the neighbor lookup fails we have no
      knowledge about the nexthop, so give it a shot. If there is an entry
      then only select the nexthop if the state is sane. This is similar to
      what fib_detect_death does.
      To maintain backward compatibility use of the neighbor information is
      based on a new sysctl, fib_multipath_use_neigh.
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 22 Mar, 2016 2 commits
  8. 26 Feb, 2016 1 commit
    • David Ahern's avatar
      net: ipv6: Make address flushing on ifdown optional · f1705ec1
      David Ahern authored
      Currently, all ipv6 addresses are flushed when the interface is configured
      down, including global, static addresses:
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          << nothing; all addresses have been flushed>>
      Add a new sysctl to make this behavior optional. The new setting defaults to
      flush all addresses to maintain backwards compatibility. When the set global
      addresses with no expire times are not flushed on an admin down. The sysctl
      is per-interface or system-wide for all interfaces
          $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
          $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
      Will keep addresses on eth1 on an admin down.
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
              inet6 2100:1::2/120 scope global tentative
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
                 valid_lft forever preferred_lft forever
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 11 Feb, 2016 4 commits
  10. 21 Jan, 2016 1 commit
  11. 18 Dec, 2015 1 commit
    • David Ahern's avatar
      net: Allow accepted sockets to be bound to l3mdev domain · 6dd9a14e
      David Ahern authored
      Allow accepted sockets to derive their sk_bound_dev_if setting from the
      l3mdev domain in which the packets originated. A sysctl setting is added
      to control the behavior which is similar to sk_mark and
      This effectively allow a process to have a "VRF-global" listen socket,
      with child sockets bound to the VRF device in which the packet originated.
      A similar behavior can be achieved using sk_mark, but a solution using marks
      is incomplete as it does not handle duplicate addresses in different L3
      domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
      domain provides a complete solution.
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 16 Dec, 2015 1 commit
  13. 09 Nov, 2015 1 commit
  14. 21 Oct, 2015 2 commits
    • Yuchung Cheng's avatar
      tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng authored
      This patch implements the second half of RACK that uses the the most
      recent transmit time among all delivered packets to detect losses.
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if an not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered.
      If so the packet is deemed lost.
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates well on tiny degree of reordering
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicated congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Yuchung Cheng's avatar
      tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng authored
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worse case error when that data is monotonically increasing
      over the window.
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrites the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  15. 14 Oct, 2015 1 commit
  16. 13 Oct, 2015 1 commit
    • Paolo Abeni's avatar
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni authored
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      The host will then ignore said redirect, due to RFC 1122 section,
      and will continue to use the wrong next-op.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 29 Sep, 2015 1 commit
    • Steffen Klassert's avatar
      xfrm: Let the flowcache handle its size by default. · c386578f
      Steffen Klassert authored
      The xfrm flowcache size is limited by the flowcache limit
      (4096 * number of online cpus) and the xfrm garbage collector
      threshold (2 * 32768), whatever is reached first. This means
      that we can hit the garbage collector limit only on systems
      with more than 16 cpus. On such systems we simply refuse
      new allocations if we reach the limit, so new flows are dropped.
      On syslems with 16 or less cpus, we hit the flowcache limit.
      In this case, we shrink the flow cache instead of refusing new
      We increase the xfrm garbage collector threshold to INT_MAX
      to get the same behaviour, independent of the number of cpus.
      The xfrm garbage collector threshold can still be set below
      the flowcache limit to reduce the memory usage of the flowcache.
      Tested-by: default avatarDan Streetman <dan.streetman@canonical.com>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
  18. 31 Aug, 2015 1 commit
    • Philip Downey's avatar
      IGMP: Document igmp_link_local_mcast_reports · 87583ebb
      Philip Downey authored
      Document the addition of a new sysctl variable which controls the
      generation of IGMP reports for link local multicast groups in the
      224.0.0.X range.
      IGMP reports for local multicast groups can now be optionally
      inhibited by setting the value to zero e.g.:
      echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
      To retain backwards compatibility the previous behaviour is retained
      by default on system boot or reverted by setting the value back to
      Signed-off-by: default avatarPhilip Downey <pdowney@brocade.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  19. 25 Aug, 2015 1 commit
    • Eric Dumazet's avatar
      tcp: refine pacing rate determination · 43e122b0
      Eric Dumazet authored
      When TCP pacing was added back in linux-3.12, we chose
      to apply a fixed ratio of 200 % against current rate,
      to allow probing for optimal throughput even during
      slow start phase, where cwnd can be doubled every other gRTT.
      At Google, we found it was better applying a different ratio
      while in Congestion Avoidance phase.
      This ratio was set to 120 %.
      We've used the normal tcp_in_slow_start() helper for a while,
      then tuned the condition to select the conservative ratio
      as soon as cwnd >= ssthresh/2 :
      - After cwnd reduction, it is safer to ramp up more slowly,
        as we approach optimal cwnd.
      - Initial ramp up (ssthresh == INFINITY) still allows doubling
        cwnd every other RTT.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 12 Aug, 2015 1 commit
  21. 01 Aug, 2015 2 commits
  22. 30 Jul, 2015 1 commit
    • Hangbin Liu's avatar
      net/ipv6: add sysctl option accept_ra_min_hop_limit · 8013d1d7
      Hangbin Liu authored
      Commit 6fd99094 ("ipv6: Don't reduce hop limit for an interface")
      disabled accept hop limit from RA if it is smaller than the current hop
      limit for security stuff. But this behavior kind of break the RFC definition.
      RFC 4861, 6.3.4.  Processing Received Router Advertisements
         A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
         and Retrans Timer) may contain a value denoting that it is
         unspecified.  In such cases, the parameter should be ignored and the
         host should continue using whatever value it is already using.
         If the received Cur Hop Limit value is non-zero, the host SHOULD set
         its CurHopLimit variable to the received value.
      So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
      hop limit value they can accept from RA. And set default to 1 to meet RFC
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Acked-by: default avatarYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  23. 22 Jul, 2015 1 commit
  24. 10 Jul, 2015 1 commit
    • Tom Herbert's avatar
      ipv6: Nonlocal bind · 35a256fe
      Tom Herbert authored
      Add support to allow non-local binds similar to how this was done for IPv4.
      Non-local binds are very useful in emulating the Internet in a box, etc.
      This add the ip_nonlocal_bind sysctl under ipv6.
      Set up nonlocal binding and receive routing on a host, e.g.:
      ip -6 rule add from ::/0 iif eth0 lookup 200
      ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
      sysctl -w net.ipv6.ip_nonlocal_bind=1
      Set up routing to 2001:0:0:1::/64 on peer to go to first host
      ping6 -I 2001:0:0:1::1 peer-address -- to verify
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 27 May, 2015 1 commit
    • Eric Dumazet's avatar
      tcp/dccp: try to not exhaust ip_local_port_range in connect() · 07f4c900
      Eric Dumazet authored
      A long standing problem on busy servers is the tiny available TCP port
      range (/proc/sys/net/ipv4/ip_local_port_range) and the default
      sequential allocation of source ports in connect() system call.
      If a host is having a lot of active TCP sessions, chances are
      very high that all ports are in use by at least one flow,
      and subsequent bind(0) attempts fail, or have to scan a big portion of
      space to find a slot.
      In this patch, I changed the starting point in __inet_hash_connect()
      so that we try to favor even [1] ports, leaving odd ports for bind()
      We still perform a sequential search, so there is no guarantee, but
      if connect() targets are very different, end result is we leave
      more ports available to bind(), and we spread them all over the range,
      lowering time for both connect() and bind() to find a slot.
      This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
      is even, ie if start/end values have different parity.
      Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
      32768 - 60999 (instead of 32768 - 61000)
      There is no change on security aspects here, only some poor hashing
      schemes could be eventually impacted by this change.
      [1] : The odd/even property depends on ip_local_port_range values parity
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  26. 19 May, 2015 1 commit
    • Daniel Borkmann's avatar
      tcp: add rfc3168, section fallback · 49213555
      Daniel Borkmann authored
      This work as a follow-up of commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds RFC3168 section fallback for outgoing
      ECN connections. In other words, this work adds a retry with a non-ECN
      setup SYN packet, as suggested from the RFC on the first timeout:
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
       1) Normal ECN-capable path:
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
       2) Path with broken middlebox, when client has fallback:
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      In any case, it's rather a smaller percentage of sites where there would
      occur such additional setup latency: it was found in end of 2014 that ~56%
      of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
      ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
      when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
      fallback would mitigate with a slight latency trade-off. Recent related
      paper on this topic:
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
      Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
      section fallback on timeout. For users explicitly not wanting this
      which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
      allows for disabling the fallback.
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdfSigned-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarMirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: default avatarBrian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave That <dave.taht@gmail.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  27. 04 May, 2015 1 commit
    • Tom Herbert's avatar
      ipv6: Flow label state ranges · 82a584b7
      Tom Herbert authored
      This patch divides the IPv6 flow label space into two ranges:
      0-7ffff is reserved for flow label manager, 80000-fffff will be
      used for creating auto flow labels (per RFC6438). This only affects how
      labels are set on transmit, it does not affect receive. This range split
      can be disbaled by systcl.
      IPv6 flow labels have been an unmitigated disappointment thus far
      in the lifetime of IPv6. Support in HW devices to use them for ECMP
      is lacking, and OSes don't turn them on by default. If we had these
      we could get much better hashing in IPv6 networks without resorting
      to DPI, possibly eliminating some of the motivations to to define new
      encaps in UDP just for getting ECMP.
      Unfortunately, the initial specfications of IPv6 did not clarify
      how they are to be used. There has always been a vague concept that
      these can be used for ECMP, flow hashing, etc. and we do now have a
      good standard how to this in RFC6438. The problem is that flow labels
      can be either stateful or stateless (as in RFC6438), and we are
      presented with the possibility that a stateless label may collide
      with a stateful one.  Attempts to split the flow label space were
      rejected in IETF. When we added support in Linux for RFC6438, we
      could not turn on flow labels by default due to this conflict.
      This patch splits the flow label space and should give us
      a path to enabling auto flow labels by default for all IPv6 packets.
      This is an API change so we need to consider compatibility with
      existing deployment. The stateful range is chosen to be the lower
      values in hopes that most uses would have chosen small numbers.
      Once we resolve the stateless/stateful issue, we can proceed to
      look at enabling RFC6438 flow labels by default (starting with
      scaled testing).
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  28. 24 Mar, 2015 1 commit
  29. 21 Mar, 2015 1 commit
  30. 06 Mar, 2015 1 commit
  31. 08 Feb, 2015 1 commit
    • Neal Cardwell's avatar
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell authored
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      This patch includes:
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: default avatarAvery Fay <avery@mixpanel.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  32. 25 Jan, 2015 1 commit
  33. 12 Jan, 2015 1 commit