    tcp: Try to restore large SKBs while SACK processing · 832d11c5
    Ilpo Järvinen authored
    During SACK processing, most of the benefits of TSO are eaten by
    the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
    Then we're in trouble when the cleanup work for them has to be
    done when a large cumulative ACK arrives. Try to return to the
    pre-split state already while more and more SACK info gets
    discovered, by combining newly discovered SACK areas with the
    previous skb if that's SACKed as well.
    This approach has a number of benefits:
    1) The processing overhead is spread more equally over the RTT
    2) Write queue has fewer skbs to process (affects everything
       which has to walk the queue past the sacked areas)
    3) Write queue is consistent the whole time, so no other parts
       of TCP have to be aware of this (this was not the case with
       some other approach that was, well, quite intrusive all
       around)
    4) Clean_rtx_queue can release most of the pages using a single
       put_page instead of the previous PAGE_SIZE/mss+1 calls
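To put a number on benefit 4, here is a small userspace sketch (not kernel code). The helper names, the 4096-byte page size, and the MSS values used with it are assumptions made for illustration; only the PAGE_SIZE/mss+1 expression comes from the text above.

```c
#include <assert.h>

/* Illustrative assumption: a common page size, not taken from any
 * kernel header. */
#define SKETCH_PAGE_SIZE 4096u

/* Number of put_page calls for one page when every MSS-sized skb holds
 * its own reference to it: PAGE_SIZE/mss + 1, as stated in the text. */
static unsigned int put_page_calls_split(unsigned int mss)
{
    assert(mss > 0);
    return SKETCH_PAGE_SIZE / mss + 1;
}

/* After merging, a single skb references the page once. */
static unsigned int put_page_calls_merged(void)
{
    return 1;
}
```

With an MSS of 1448 bytes this gives 3 calls per page in the split case against 1 in the merged case; smaller MSS values widen the gap.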
    In case a hole is fully filled by the new SACK block, we attempt
    to combine the next skb too, which allows construction of skbs
    that are even larger than what TSO split them to, and it handles
    the hole-on-every-nth-segment patterns that often occur during
    slow start overshoot pretty nicely. Though for this to be really
    useful, a retransmission would also have to get lost, since
    cumulative ACKs advance one hole at a time in the most typical
    case.
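The merging idea can be modelled in a few lines of userspace C. This is a toy sketch, not the patch itself: struct toy_seg and both helpers are hypothetical, the queue is a plain array, and partially SACKed skbs (where only part of the data is shifted) are not modelled. As in the description, a newly SACKed segment is folded into a SACKed predecessor, and only then is a SACKed successor folded in too; upwards-only merging is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb in the write queue: a byte range
 * [start, end) plus a flag saying whether it is entirely SACKed. */
struct toy_seg {
    unsigned int start;
    unsigned int end;
    bool sacked;
};

/* Remove entry idx by moving the tail of the array down one slot. */
static size_t toy_remove(struct toy_seg *q, size_t n, size_t idx)
{
    for (size_t i = idx; i + 1 < n; i++)
        q[i] = q[i + 1];
    return n - 1;
}

/* Mark q[idx] as SACKed. If the previous entry is SACKed and adjacent,
 * fold q[idx] into it; the hole is then filled, so also fold in a
 * SACKed, adjacent successor. Returns the new queue length. */
static size_t toy_sack_and_merge(struct toy_seg *q, size_t n, size_t idx)
{
    assert(idx < n);
    q[idx].sacked = true;

    /* No SACKed predecessor: nothing to merge into (upwards-only
     * merging is not handled in this sketch). */
    if (idx == 0 || !q[idx - 1].sacked || q[idx - 1].end != q[idx].start)
        return n;

    q[idx - 1].end = q[idx].end;
    n = toy_remove(q, n, idx);
    idx--;

    if (idx + 1 < n && q[idx + 1].sacked && q[idx].end == q[idx + 1].start) {
        q[idx].end = q[idx + 1].end;
        n = toy_remove(q, n, idx + 1);
    }
    return n;
}
```

For a queue of four 1000-byte segments where the first and third are already SACKed, SACKing the second leaves two entries: one covering bytes 0 to 3000 and the untouched fourth segment.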
    TODO: handle upwards-only merging. That should be rather easy
    when the segment is fully sacked, but I'm leaving that as a
    future work item (it won't make a very large difference anyway
    since the current approach already covers quite a lot of normal
    cases).
    I was earlier thinking of some sophisticated way of tracking
    timestamps of the first and the last segment, but later on
    realized that it isn't necessary at all to store the
    timestamp of the last segment. The cases that can occur are
    basically either:
      1) ambiguous => no sensible measurement can be taken anyway
      2) non-ambiguous due to reordering => having the timestamp
         of the last segment there just skews things more than it
         helps, since the ACK got triggered by one of the holes
         (besides some subtle issues that would make determining
         the right hole/skb an even harder problem). Anyway, it has
         nothing to do with this change then.
    I chose to route some abnormal-looking cases with goto noop;
    some could be handled differently (e.g., by stopping the
    walk at that skb), but again, in general they either
    shouldn't happen at all or are rare enough to make no difference
    in practice.
    In theory this change (as a whole) could cause some macroscale
    (global) regression because of cache misses that are taken over
    the round-trip time, but it very likely gets better because of
    far fewer (local) cache misses for other write queue walkers and
    for the big, recovery-clearing cumulative ACK.
    It is worth noting that these benefits would be very easy to get
    also without TSO/GSO being on, as long as the data is in pages so
    that we can merge them. Currently I won't let that happen because
    DSACK splitting at a fragment would mess up pcounts due to
    sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragments get
    avoided, we have some conditions that can be made less strict.
    TODO: I will probably have to convert the excessive pointer
    passing to struct sacktag_state... :-)
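As a rough illustration of what that conversion could look like, the userspace sketch below bundles three separately passed values into one struct. Everything here is hypothetical: the struct name, its fields, and the made-up update logic are assumptions for the example and are not taken from the patch.

```c
#include <assert.h>

/* Hypothetical bundle of per-pass SACK tagging state; the field names
 * are assumptions made for this sketch. */
struct toy_sacktag_state {
    int reord;      /* lowest reordered packet index seen so far */
    int fack_count; /* packets counted so far during the walk */
    int flag;       /* accumulated result flags */
};

/* Before: each piece of state travels as its own pointer.
 * The update rules are invented purely to have something to call. */
static void toy_tag_one_ptrs(int pcount, int *reord, int *fack_count,
                             int *flag)
{
    if (*fack_count < *reord)
        *reord = *fack_count;
    *fack_count += pcount;
    *flag |= 1;
}

/* After: a single pointer carries all of the state. */
static void toy_tag_one_state(int pcount, struct toy_sacktag_state *st)
{
    toy_tag_one_ptrs(pcount, &st->reord, &st->fack_count, &st->flag);
}
```

The call sites shrink from four arguments to two, and adding another piece of state later does not change any function signature.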
    My testing revealed that a considerable number of skbs couldn't
    be shifted because they were cloned (most likely still awaiting
    tx reclaim)...
    [The rest describes future work instead, since I reproducibly
    got EFAULT from tcpdump's recvfrom when I added
    pskb_expand_head to deal with clones, so I separated that
    into another, later patch]
    ...To counter that, I gave up on the fifth advantage:
    5) When growing previous SACK block, less allocs for new skbs
       are done, basically a new alloc is needed only when new hole
       is detected and when the previous skb runs out of frags space
    ...which now only happens if reclaim is fast enough to dispose
    of the clone before the SACK block comes in (the window is RTT
    long), otherwise we'll have to alloc some.
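The conditions named above (a new hole, a cloned previous skb, no frags space left) can be summed up as a small decision function. This is a userspace sketch under stated assumptions: struct toy_prev, the enum, and the TOY_MAX_FRAGS limit of 17 are invented for illustration and do not mirror the real checks in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative assumption: per-skb fragment limit for this sketch. */
#define TOY_MAX_FRAGS 17u

enum toy_action {
    TOY_SHIFT, /* grow the previous SACKed skb, no allocation */
    TOY_ALLOC  /* fall back to splitting off a newly allocated skb */
};

/* Hypothetical view of the skb preceding the newly SACKed data. */
struct toy_prev {
    bool sacked;
    bool cloned;
    unsigned int nr_frags;
};

static enum toy_action toy_decide(const struct toy_prev *prev,
                                  unsigned int frags_needed)
{
    /* A new hole: there is no SACKed skb to grow. */
    if (prev == NULL || !prev->sacked)
        return TOY_ALLOC;
    /* Clone still awaiting tx reclaim: cannot be modified in place. */
    if (prev->cloned)
        return TOY_ALLOC;
    /* The previous skb has run out of frags space. */
    if (prev->nr_frags + frags_needed > TOY_MAX_FRAGS)
        return TOY_ALLOC;
    return TOY_SHIFT;
}
```

In this model a clone that has already been reclaimed by the time the SACK block arrives takes the TOY_SHIFT path, which is the case the fifth advantage depends on.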
    With clones being handled I got these numbers (will be somewhat
    worse without that), taken with fine-grained mibs:
                      TCPSackShifted 398
                       TCPSackMerged 877
                TCPSackShiftFallback 320
                 TCPSACKCOLLAPSEHOLE 12
    Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
    Signed-off-by: David S. Miller <davem@davemloft.net>