1. 19 Jan, 2014 1 commit
  2. 06 Dec, 2013 1 commit
  3. 23 Nov, 2013 1 commit
  4. 18 Nov, 2013 1 commit
  5. 08 Oct, 2013 1 commit
    • Shawn Bohrer's avatar
      net: ipv4 only populate IP_PKTINFO when needed · fbf8866d
      Shawn Bohrer authored
      The since the removal of the routing cache computing
      fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
      packet received.  This has introduced a performance regression for some
      UDP workloads.
      This change skips populating the packet info for sockets that do not have
      IP_PKTINFO set.
      Benchmark results from a netperf UDP_RR test:
      Before 89789.68 transactions/s
      After  90587.62 transactions/s
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.63us RTT
      After  12.48us RTT
      Signed-off-by: default avatarShawn Bohrer <sbohrer@rgmadvisors.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 28 Sep, 2013 1 commit
    • Francesco Fusco's avatar
      ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco authored
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In case of TOS also the priority
      is changed accordingly.
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      Signed-off-by: default avatarFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 24 Sep, 2013 1 commit
  8. 19 Sep, 2013 1 commit
    • Ansis Atteka's avatar
      ip: generate unique IP identificator if local fragmentation is allowed · 703133de
      Ansis Atteka authored
      If local fragmentation is allowed, then ip_select_ident() and
      ip_select_ident_more() need to generate unique IDs to ensure
      correct defragmentation on the peer.
      For example, if IPsec (tunnel mode) has to encrypt large skbs
      that have local_df bit set, then all IP fragments that belonged
      to different ESP datagrams would have used the same identificator.
      If one of these IP fragments would get lost or reordered, then
      peer could possibly stitch together wrong IP fragments that did
      not belong to the same datagram. This would lead to a packet loss
      or data corruption.
      Signed-off-by: default avatarAnsis Atteka <aatteka@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 29 Aug, 2013 1 commit
  10. 15 Aug, 2013 1 commit
  11. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
              list_for_each_entry(pos, head, member)
      The hlist ones were greedy and wanted an extra parameter:
              hlist_for_each_entry(tpos, pos, head, member)
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      Besides the semantic patch, there was some manual work required:
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      -T b;
          <+... when != b
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      - b,
      c) S
      for_each_busy_worker(a, c,
      - b,
      d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      -(a, b)
      + sk_for_each_from(a) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d, e) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 18 Feb, 2013 2 commits
  13. 21 Jan, 2013 1 commit
  14. 25 Sep, 2012 1 commit
  15. 24 Sep, 2012 1 commit
    • Eric Dumazet's avatar
      net: use a per task frag allocator · 5640f768
      Eric Dumazet authored
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      This page is used to build fragments for skbs.
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  16. 22 Sep, 2012 1 commit
  17. 15 Aug, 2012 1 commit
  18. 12 Jul, 2012 1 commit
  19. 15 Jun, 2012 1 commit
    • David S. Miller's avatar
      ipv4: Handle PMTU in all ICMP error handlers. · 36393395
      David S. Miller authored
      With ip_rt_frag_needed() removed, we have to explicitly update PMTU
      information in every ICMP error handler.
      Create two helper functions to facilitate this.
      1) ipv4_sk_update_pmtu()
         This updates the PMTU when we have a socket context to
         work with.
      2) ipv4_update_pmtu()
         Raw version, used when no socket context is available.  For this
         interface, we essentially just pass in explicit arguments for
         the flow identity information we would have extracted from the
         And you'll notice that ipv4_sk_update_pmtu() is simply implemented
         in terms of ipv4_update_pmtu()
      Note that __ip_route_output_key() is used, rather than something like
      ip_route_output_flow() or ip_route_output_key().  This is because we
      absolutely do not want to end up with a route that does IPSEC
      encapsulation and the like.  Instead, we only want the route that
      would get us to the node described by the outermost IP header.
      Reported-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 15 Apr, 2012 1 commit
  21. 12 Mar, 2012 1 commit
    • Joe Perches's avatar
      net: Convert printks to pr_<level> · 058bd4d2
      Joe Perches authored
      Use a more current kernel messaging style.
      Convert a printk block to print_hex_dump.
      Coalesce formats, align arguments.
      Use %s, __func__ instead of embedding function names.
      Some messages that were prefixed with <foo>_close are
      now prefixed with <foo>_fini.  Some ah4 and esp messages
      are now not prefixed with "ip ".
      The intent of this patch is to later add something like
        #define pr_fmt(fmt) "IPv4: " fmt.
      to standardize the output messages.
      Text size is trivially reduced. (x86-32 allyesconfig)
      $ size net/ipv4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
       887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  22. 08 Feb, 2012 1 commit
    • Erich E. Hoover's avatar
      ipv4: Implement IP_UNICAST_IF socket option. · 76e21053
      Erich E. Hoover authored
      The IP_UNICAST_IF feature is needed by the Wine project.  This patch
      implements the feature by setting the outgoing interface in a similar
      fashion to that of IP_MULTICAST_IF.  A separate option is needed to
      handle this feature since the existing options do not provide all of
      the characteristics required by IP_UNICAST_IF, a summary is provided
      * SO_BINDTODEVICE requires administrative privileges, IP_UNICAST_IF
      does not.  From reading some old mailing list articles my
      understanding is that SO_BINDTODEVICE requires administrative
      privileges because it can override the administrator's routing
      * The SO_BINDTODEVICE option restricts both outbound and inbound
      traffic, IP_UNICAST_IF only impacts outbound traffic.
      * Since IP_PKTINFO and IP_UNICAST_IF are independent options,
      implementing IP_UNICAST_IF with IP_PKTINFO will likely break some
      * Implementing IP_UNICAST_IF on top of IP_PKTINFO significantly
      complicates the Wine codebase and reduces the socket performance
      (doing this requires a lot of extra communication between the
      "server" and "user" layers).
      * bind() does not work on broadcast packets, IP_UNICAST_IF is
      specifically intended to work with broadcast packets.
      * Like SO_BINDTODEVICE, bind() restricts both outbound and inbound
      Signed-off-by: default avatarErich E. Hoover <ehoover@mines.edu>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  23. 18 Nov, 2011 1 commit
    • Herbert Xu's avatar
      ipv4: Remove all uses of LL_ALLOCATED_SPACE · 66088243
      Herbert Xu authored
      ipv4: Remove all uses of LL_ALLOCATED_SPACE
      The macro LL_ALLOCATED_SPACE was ill-conceived.  It applies the
      alignment to the sum of needed_headroom and needed_tailroom.  As
      the amount that is then reserved for head room is needed_headroom
      with alignment, this means that the tail room left may be too small.
      This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv4
      with the macro LL_RESERVED_SPACE and direct reference to
      This also fixes the problem with needed_headroom changing between
      allocating the skb and reserving the head room.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  24. 09 Nov, 2011 1 commit
    • Eric Dumazet's avatar
      ipv4: PKTINFO doesnt need dst reference · d826eb14
      Eric Dumazet authored
      Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :
      > At least, in recent kernels we dont change dst->refcnt in forwarding
      > patch (usinf NOREF skb->dst)
      > One particular point is the atomic_inc(dst->refcnt) we have to perform
      > when queuing an UDP packet if socket asked PKTINFO stuff (for example a
      > typical DNS server has to setup this option)
      > I have one patch somewhere that stores the information in skb->cb[] and
      > avoid the atomic_{inc|dec}(dst->refcnt).
      OK I found it, I did some extra tests and believe its ready.
      [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
      When a socket uses IP_PKTINFO notifications, we currently force a dst
      reference for each received skb. Reader has to access dst to get needed
      information (rt_iif & rt_spec_dst) and must release dst reference.
      We also forced a dst reference if skb was put in socket backlog, even
      without IP_PKTINFO handling. This happens under stress/load.
      We can instead store the needed information in skb->cb[], so that only
      softirq handler really access dst, improving cache hit ratios.
      This removes two atomic operations per packet, and false sharing as
      On a benchmark using a mono threaded receiver (doing only recvmsg()
      calls), I can reach 720.000 pps instead of 570.000 pps.
      IP_PKTINFO is typically used by DNS servers, and any multihomed aware
      UDP application.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 31 Oct, 2011 1 commit
  26. 08 Aug, 2011 1 commit
  27. 26 Jul, 2011 1 commit
  28. 01 Jul, 2011 1 commit
  29. 24 May, 2011 1 commit
    • Dan Rosenberg's avatar
      net: convert %p usage to %pK · 71338aa7
      Dan Rosenberg authored
      The %pK format specifier is designed to hide exposed kernel pointers,
      specifically via /proc interfaces.  Exposing these pointers provides an
      easy target for kernel write vulnerabilities, since they reveal the
      locations of writable structures containing easily triggerable function
      pointers.  The behavior of %pK depends on the kptr_restrict sysctl.
      If kptr_restrict is set to 0, no deviation from the standard %p behavior
      occurs.  If kptr_restrict is set to 1, the default, if the current user
      (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
      (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
       If kptr_restrict is set to 2, kernel pointers using %pK are printed as
      0's regardless of privileges.  Replacing with 0's was chosen over the
      default "(null)", which cannot be parsed by userland %p, which expects
      The supporting code for kptr_restrict and %pK are currently in the -mm
      tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
      pointers to the syslog are not covered, since this would eliminate useful
      information for postmortem debugging and the reading of the syslog is
      already optionally protected by the dmesg_restrict sysctl.
      Signed-off-by: default avatarDan Rosenberg <drosenberg@vsecurity.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Thomas Graf <tgraf@infradead.org>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Kees Cook <kees.cook@canonical.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  30. 09 May, 2011 2 commits
  31. 28 Apr, 2011 1 commit
    • Eric Dumazet's avatar
      inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet authored
      We lack proper synchronization to manipulate inet->opt ip_options
      Problem is ip_make_skb() calls ip_setup_cork() and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
      without any protection against another thread manipulating inet->opt.
      Another thread can change inet->opt pointer and free old one under us.
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      We cant insert an rcu_head in struct ip_options since its included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  32. 22 Apr, 2011 1 commit
  33. 31 Mar, 2011 2 commits
  34. 28 Mar, 2011 1 commit
  35. 12 Mar, 2011 3 commits