• Eric Dumazet's avatar
    net: allow skb->head to be a page fragment · d3836f21
    Eric Dumazet authored
    skb->head is currently allocated from kmalloc(). This is convenient but
    has the drawback the data cannot be converted to a page fragment if
    needed.
    
    We have three spots were it hurts :
    
    1) GRO aggregation
    
     When a linear skb must be appended to another skb, GRO uses the
    frag_list fallback, very inefficient since we keep all struct sk_buff
    around. So drivers enabling GRO but delivering linear skbs to network
    stack aren't enabling full GRO power.
    
    2) splice(socket -> pipe).
    
     We must copy the linear part to a page fragment.
     This kind of defeats splice() purpose (zero copy claim)
    
    3) TCP coalescing.
    
     Recently introduced, this permits to group several contiguous segments
    into a single skb. This shortens queue lengths and save kernel memory,
    and greatly reduce probabilities of TCP collapses. This coalescing
    doesnt work on linear skbs (or we would need to copy data, this would be
    too slow)
    
    Given all these issues, the following patch introduces the possibility
    of having skb->head be a fragment in itself. We use a new skb flag,
    skb->head_frag to carry this information.
    
    build_skb() is changed to accept a frag_size argument. Drivers willing
    to provide a page fragment instead of kmalloc() data will set a non zero
    value, set to the fragment size.
    
    Then, on situations we need to convert the skb head to a frag in itself,
    we can check if skb->head_frag is set and avoid the copies or various
    fallbacks we have.
    
    This means drivers currently using frags could be updated to avoid the
    current skb->head allocation and reduce their memory footprint (aka skb
    truesize). (thats 512 or 1024 bytes saved per skb). This also makes
    bpf/netfilter faster since the 'first frag' will be part of skb linear
    part, no need to copy data.
    Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
    Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Maciej Żenczykowski <maze@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Cc: Tom Herbert <therbert@google.com>
    Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    Cc: Ben Hutchings <bhutchings@solarflare.com>
    Cc: Matt Carlson <mcarlson@broadcom.com>
    Cc: Michael Chan <mchan@broadcom.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    d3836f21
Name
Last commit
Last update
Documentation Loading commit data...
arch Loading commit data...
block Loading commit data...
crypto Loading commit data...
drivers Loading commit data...
firmware Loading commit data...
fs Loading commit data...
include Loading commit data...
init Loading commit data...
ipc Loading commit data...
kernel Loading commit data...
lib Loading commit data...
mm Loading commit data...
net Loading commit data...
samples Loading commit data...
scripts Loading commit data...
security Loading commit data...
sound Loading commit data...
tools Loading commit data...
usr Loading commit data...
virt/kvm Loading commit data...
.gitignore Loading commit data...
.mailmap Loading commit data...
COPYING Loading commit data...
CREDITS Loading commit data...
Kbuild Loading commit data...
Kconfig Loading commit data...
MAINTAINERS Loading commit data...
Makefile Loading commit data...
README Loading commit data...
REPORTING-BUGS Loading commit data...