1. 23 Apr, 2014 1 commit
  2. 14 Apr, 2014 1 commit
    • Daniel Borkmann's avatar
      net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Daniel Borkmann authored
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have much side-effect though, as such
      conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
      operation PR_GET_SECCOMP will only return the current seccomp mode, but
      not the filter itself. Since the transition to the new BPF infrastructure,
      it's also not used anymore, so we can simply remove this as it's
      unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8c482cdc
  3. 31 Mar, 2014 4 commits
    • Alexei Starovoitov's avatar
      net: filter: rework/optimize internal BPF interpreter's instruction set · bd4cf0ed
      Alexei Starovoitov authored
      This patch replaces/reworks the kernel-internal BPF interpreter with
      an optimized BPF instruction set format that is modelled closer to
      mimic native instruction sets and is designed to be JITed with one to
      one mapping. Thus, the new interpreter is noticeably faster than the
      current implementation of sk_run_filter(); mainly for two reasons:
      
      1. Fall-through jumps:
      
        BPF jump instructions are forced to go either 'true' or 'false'
        branch which causes branch-miss penalty. The new BPF jump
        instructions have only one branch and fall-through otherwise,
        which fits the CPU branch predictor logic better. `perf stat`
        shows drastic difference for branch-misses between the old and
        new code.
      
      2. Jump-threaded implementation of interpreter vs switch
         statement:
      
        Instead of single table-jump at the top of 'switch' statement,
        gcc will now generate multiple table-jump instructions, which
        helps CPU branch predictor logic.
      
      Note that the verification of filters is still being done through
      sk_chk_filter() in classical BPF format, so filters from user- or
      kernel space are verified in the same way as we do now, and same
      restrictions/constraints hold as well.
      
      We reuse current BPF JIT compilers in a way that this upgrade would
      even be fine as is, but nevertheless allows for a successive upgrade
      of BPF JIT compilers to the new format.
      
      The internal instruction set migration is being done after the
      probing for JIT compilation, so in case JIT compilers are able to
      create a native opcode image, we're going to use that, and in all
      other cases we're doing a follow-up migration of the BPF program's
      instruction set, so that it can be transparently run in the new
      interpreter.
      
      In short, the *internal* format extends BPF in the following way (more
      details can be taken from the appended documentation):
      
        - Number of registers increase from 2 to 10
        - Register width increases from 32-bit to 64-bit
        - Conditional jt/jf targets replaced with jt/fall-through
        - Adds signed > and >= insns
        - 16 4-byte stack slots for register spill-fill replaced
          with up to 512 bytes of multi-use stack space
        - Introduction of bpf_call insn and register passing convention
          for zero overhead calls from/to other kernel functions
        - Adds arithmetic right shift and endianness conversion insns
        - Adds atomic_add insn
        - Old tax/txa insns are replaced with 'mov dst,src' insn
      
      Performance of two BPF filters generated by libpcap resp. bpf_asm
      was measured on x86_64, i386 and arm32 (other libpcap programs
      have similar performance differences):
      
      fprog #1 is taken from Documentation/networking/filter.txt:
      tcpdump -i eth0 port 22 -dd
      
      fprog #2 is taken from 'man tcpdump':
      tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
         ((tcp[12]&0xf0)>>2)) != 0)' -dd
      
      Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
      same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
      smaller is better:
      
      --x86_64--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF      90       101        192       202
      new BPF      31        71         47        97
      old BPF jit  12        34         17        44
      new BPF jit TBD
      
      --i386--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     107       136        227       252
      new BPF      40       119         69       172
      
      --arm32--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     202       300        475       540
      new BPF     180       270        330       470
      old BPF jit  26       182         37       202
      new BPF jit TBD
      
      Thus, without changing any userland BPF filters, applications on
      top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
      classifier, netfilter's xt_bpf, team driver's load-balancing mode,
      and many more will have better interpreter filtering performance.
      
      While we are replacing the internal BPF interpreter, we also need
      to convert seccomp BPF in the same step to make use of the new
      internal structure since it makes use of lower-level API details
      without being further decoupled through higher-level calls like
      sk_unattached_filter_{create,destroy}(), for example.
      
      Just as for normal socket filtering, also seccomp BPF experiences
      a time-to-verdict speedup:
      
      05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
      
        seccomp_rule_add_exact(ctx,...
        seccomp_rule_add_exact(ctx,...
      
        rc = seccomp_load(ctx);
      
        for (i = 0; i < 10000000; i++)
           syscall(199, 100);
      
      'short filter' has 2 rules
      'large filter' has 200 rules
      
      'short filter' performance is slightly better on x86_64/i386/arm32
      'large filter' is much faster on x86_64 and i386 and shows no
                     difference on arm32
      
      --x86_64-- short filter
      old BPF: 2.7 sec
       39.12%  bench  libc-2.15.so       [.] syscall
        8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
        6.31%  bench  [kernel.kallsyms]  [k] system_call
        5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
        3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
        3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
      new BPF: 2.58 sec
       42.05%  bench  libc-2.15.so       [.] syscall
        6.91%  bench  [kernel.kallsyms]  [k] system_call
        6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
        5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
      
      --arm32-- short filter
      old BPF: 4.0 sec
       39.92%  bench  [kernel.kallsyms]  [k] vector_swi
       16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
       14.66%  bench  libc-2.17.so       [.] syscall
        5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
      new BPF: 3.7 sec
       35.93%  bench  [kernel.kallsyms]  [k] vector_swi
       21.89%  bench  libc-2.17.so       [.] syscall
       13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
        6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit
      
      --x86_64-- large filter
      old BPF: 8.6 seconds
          73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
          10.70%    bench  libc-2.15.so       [.] syscall
           5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
           1.97%    bench  [kernel.kallsyms]  [k] system_call
      new BPF: 5.7 seconds
          66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
          16.75%    bench  libc-2.15.so       [.] syscall
           3.31%    bench  [kernel.kallsyms]  [k] system_call
           2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
      
      --i386-- large filter
      old BPF: 5.4 sec
      new BPF: 3.8 sec
      
      --arm32-- large filter
      old BPF: 13.5 sec
       73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
       10.29%  bench  [kernel.kallsyms]  [k] vector_swi
        6.46%  bench  libc-2.17.so       [.] syscall
        2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
      new BPF: 13.5 sec
       76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
       10.98%  bench  [kernel.kallsyms]  [k] vector_swi
        5.87%  bench  libc-2.17.so       [.] syscall
        1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.93%  bench  [kernel.kallsyms]  [k] sys_getuid
      
      BPF filters generated by seccomp are very branchy, so the new
      internal BPF performance is better than the old one. Performance
      gains will be even higher when BPF JIT is committed for the
      new structure, which is planned in future work (as successive
      JIT migrations).
      
      BPF has also been stress-tested with trinity's BPF fuzzer.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: linux-kernel@vger.kernel.org
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bd4cf0ed
    • Daniel Borkmann's avatar
      net: filter: move filter accounting to filter core · fbc907f0
      Daniel Borkmann authored
      This patch basically does two things, i) removes the extern keyword
      from the include/linux/filter.h file to be more consistent with the
      rest of Joe's changes, and ii) moves filter accounting into the filter
      core framework.
      
      Filter accounting mainly done through sk_filter_{un,}charge() take
      care of the case when sockets are being cloned through sk_clone_lock()
      so that removal of the filter on one socket won't result in eviction
      as it's still referenced by the other.
      
      These functions actually belong to net/core/filter.c and not
      include/net/sock.h as we want to keep all that in a central place.
      It's also not in fast-path so uninlining them is fine and even allows
      us to get rd of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward
      declaration.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fbc907f0
    • Daniel Borkmann's avatar
      net: filter: keep original BPF program around · a3ea269b
      Daniel Borkmann authored
      In order to open up the possibility to internally transform a BPF program
      into an alternative and possibly non-trivial reversible representation, we
      need to keep the original BPF program around, so that it can be passed back
      to user space w/o the need of a complex decoder.
      
      The reason for that use case resides in commit a8fc9277 ("sk-filter:
      Add ability to get socket filter program (v2)"), that is, the ability
      to retrieve the currently attached BPF filter from a given socket used
      mainly by the checkpoint-restore project, for example.
      
      Therefore, we add two helpers sk_{store,release}_orig_filter for taking
      care of that. In the sk_unattached_filter_create() case, there's no such
      possibility/requirement to retrieve a loaded BPF program. Therefore, we
      can spare us the work in that case.
      
      This approach will simplify and slightly speed up both, sk_get_filter()
      and sock_diag_put_filterinfo() handlers as we won't need to successively
      decode filters anymore through sk_decode_filter(). As we still need
      sk_decode_filter() later on, we're keeping it around.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a3ea269b
    • Daniel Borkmann's avatar
      net: filter: add jited flag to indicate jit compiled filters · f8bbbfc3
      Daniel Borkmann authored
      This patch adds a jited flag into sk_filter struct in order to indicate
      whether a filter is currently jited or not. The size of sk_filter is
      not being expanded as the 32 bit 'len' member allows upper bits to be
      reused since a filter can currently only grow as large as BPF_MAXINSNS.
      
      Therefore, there's enough room also for other in future needed flags to
      reuse 'len' field if necessary. The jited flag also allows for having
      alternative interpreter functions running as currently, we can only
      detect jit compiled filters by testing fp->bpf_func to not equal the
      address of sk_run_filter().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8bbbfc3
  4. 22 Jan, 2014 1 commit
    • Daniel Borkmann's avatar
      net: filter: let bpf_tell_extensions return SKF_AD_MAX · 37692299
      Daniel Borkmann authored
      Michal Sekletar added in commit ea02f941 ("net: introduce
      SO_BPF_EXTENSIONS") a facility where user space can enquire
      the BPF ancillary instruction set, which is imho a step into
      the right direction for letting user space high-level to BPF
      optimizers make an informed decision for possibly using these
      extensions.
      
      The original rationale was to return through a getsockopt(2)
      a bitfield of which instructions are supported and which
      are not, as of right now, we just return 0 to indicate a
      base support for SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET.
      Limitations of this approach are that this API which we need
      to maintain for a long time can only support a maximum of 32
      extensions, and needs to be additionally maintained/updated
      when each new extension that comes in.
      
      I thought about this a bit more and what we can do here to
      overcome this is to just return SKF_AD_MAX. Since we never
      remove any extension since we cannot break user space and
      always linearly increase SKF_AD_MAX on each newly added
      extension, user space can make a decision on what extensions
      are supported in the whole set of extensions and which aren't,
      by just checking which of them from the whole set have an
      offset < SKF_AD_MAX of the underlying kernel.
      
      Since SKF_AD_MAX must be updated each time we add new ones,
      we don't need to introduce an additional enum and got
      maintenance for free. At some point in time when
      SO_BPF_EXTENSIONS becomes ubiquitous for most kernels, then
      an application can simply make use of this and easily be run
      on newer or older underlying kernels without needing to be
      recompiled, of course. Since that is for 3.14, it's not too
      late to do this change.
      
      Cc: Michal Sekletar <msekleta@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Acked-by: default avatarMichal Sekletar <msekleta@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      37692299
  5. 19 Jan, 2014 1 commit
    • Michal Sekletar's avatar
      net: introduce SO_BPF_EXTENSIONS · ea02f941
      Michal Sekletar authored
      For user space packet capturing libraries such as libpcap, there's
      currently only one way to check which BPF extensions are supported
      by the kernel, that is, commit aa1113d9 ("net: filter: return
      -EINVAL if BPF_S_ANC* operation is not supported"). For querying all
      extensions at once this might be rather inconvenient.
      
      Therefore, this patch introduces a new option which can be used as
      an argument for getsockopt(), and allows one to obtain information
      about which BPF extensions are supported by the current kernel.
      
      As David Miller suggests, we do not need to define any bits right
      now and status quo can just return 0 in order to state that this
      versions supports SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET. Later
      additions to BPF extensions need to add their bits to the
      bpf_tell_extensions() function, as documented in the comment.
      Signed-off-by: default avatarMichal Sekletar <msekleta@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Reviewed-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ea02f941
  6. 07 Oct, 2013 1 commit
    • Alexei Starovoitov's avatar
      net: fix unsafe set_memory_rw from softirq · d45ed4a4
      Alexei Starovoitov authored
      on x86 system with net.core.bpf_jit_enable = 1
      
      sudo tcpdump -i eth1 'tcp port 22'
      
      causes the warning:
      [   56.766097]  Possible unsafe locking scenario:
      [   56.766097]
      [   56.780146]        CPU0
      [   56.786807]        ----
      [   56.793188]   lock(&(&vb->lock)->rlock);
      [   56.799593]   <Interrupt>
      [   56.805889]     lock(&(&vb->lock)->rlock);
      [   56.812266]
      [   56.812266]  *** DEADLOCK ***
      [   56.812266]
      [   56.830670] 1 lock held by ksoftirqd/1/13:
      [   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
      [   56.849757]
      [   56.849757] stack backtrace:
      [   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
      [   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
      [   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
      [   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
      [   56.923006] Call Trace:
      [   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
      [   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
      [   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
      [   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
      [   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
      [   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
      [   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
      [   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
      [   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
      [   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
      [   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
      [   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
      [   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
      [   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
      [   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
      [   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
      [   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
      [   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
      [   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
      [   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
      [   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
      [   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
      
      cannot reuse jited filter memory, since it's readonly,
      so use original bpf insns memory to hold work_struct
      
      defer kfree of sk_filter until jit completed freeing
      
      tested on x86_64 and i386
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d45ed4a4
  7. 11 Jun, 2013 1 commit
  8. 20 May, 2013 1 commit
  9. 01 May, 2013 1 commit
    • Xi Wang's avatar
      filter: fix va_list build error · 20074f35
      Xi Wang authored
      This patch fixes the following build error.
      
      In file included from include/linux/filter.h:52:0,
                       from arch/arm/net/bpf_jit_32.c:14:
      include/linux/printk.h:54:2: error: unknown type name ‘va_list’
      include/linux/printk.h:105:21: error: unknown type name ‘va_list’
      include/linux/printk.h:108:30: error: unknown type name ‘va_list’
      Signed-off-by: default avatarXi Wang <xi.wang@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20074f35
  10. 30 Mar, 2013 1 commit
  11. 21 Mar, 2013 1 commit
    • Daniel Borkmann's avatar
      filter: bpf_jit_comp: refactor and unify BPF JIT image dump output · 79617801
      Daniel Borkmann authored
      If bpf_jit_enable > 1, then we dump the emitted JIT compiled image
      after creation. Currently, only SPARC and PowerPC has similar output
      as in the reference implementation on x86_64. Make a small helper
      function in order to reduce duplicated code and make the dump output
      uniform across architectures x86_64, SPARC, PPC, ARM (e.g. on ARM
      flen, pass and proglen are currently not shown, but would be
      interesting to know as well), also for future BPF JIT implementations
      on other archs.
      
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Matt Evans <matt@ozlabs.org>
      Cc: Eric Dumazet <eric.dumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      79617801
  12. 20 Mar, 2013 1 commit
  13. 01 Nov, 2012 1 commit
    • Pavel Emelyanov's avatar
      sk-filter: Add ability to get socket filter program (v2) · a8fc9277
      Pavel Emelyanov authored
      The SO_ATTACH_FILTER option is set only. I propose to add the get
      ability by using SO_ATTACH_FILTER in getsockopt. To be less
      irritating to eyes the SO_GET_FILTER alias to it is declared. This
      ability is required by checkpoint-restore project to be able to
      save full state of a socket.
      
      There are two issues with getting filter back.
      
      First, kernel modifies the sock_filter->code on filter load, thus in
      order to return the filter element back to user we have to decode it
      into user-visible constants. Fortunately the modification in question
      is interconvertible.
      
      Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
      speed up the run-time division by doing kernel_k = reciprocal(user_k).
      Bad news is that different user_k may result in same kernel_k, so we
      can't get the original user_k back. Good news is that we don't have
      to do it. What we need to is calculate a user2_k so, that
      
        reciprocal(user2_k) == reciprocal(user_k) == kernel_k
      
      i.e. if it's re-loaded back the compiled again value will be exactly
      the same as it was. That said, the user2_k can be calculated like this
      
        user2_k = reciprocal(kernel_k)
      
      with an exception, that if kernel_k == 0, then user2_k == 1.
      
      The optlen argument is treated like this -- when zero, kernel returns
      the amount of sock_fprog elements in filter, otherwise it should be
      large enough for the sock_fprog array.
      
      changes since v1:
      * Declared SO_GET_FILTER in all arch headers
      * Added decode of vlan-tag codes
      Signed-off-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a8fc9277
  14. 31 Oct, 2012 1 commit
  15. 13 Oct, 2012 1 commit
  16. 24 Sep, 2012 1 commit
  17. 10 Sep, 2012 1 commit
  18. 14 Apr, 2012 2 commits
  19. 03 Apr, 2012 2 commits
  20. 19 Oct, 2011 1 commit
  21. 26 Jul, 2011 1 commit
  22. 23 May, 2011 1 commit
  23. 28 Apr, 2011 1 commit
    • Eric Dumazet's avatar
      net: filter: Just In Time compiler for x86-64 · 0a14842f
      Eric Dumazet authored
      In order to speedup packet filtering, here is an implementation of a
      JIT compiler for x86_64
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range since we call helpers functions from the generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of generated code, use :
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code :
      
      # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      BPF program is 144 bytes long, so native program is almost same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a14842f
  24. 08 Dec, 2010 1 commit
  25. 06 Dec, 2010 1 commit
    • Eric Dumazet's avatar
      filter: add SKF_AD_RXHASH and SKF_AD_CPU · da2033c2
      Eric Dumazet authored
      Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism,
      to be able to build advanced filters.
      
      This can help spreading packets on several sockets with a fast
      selection, after RPS dispatch to N cpus for example, or to catch a
      percentage of flows in one queue.
      
      tcpdump -s 500 "cpu = 1" :
      
      [0] ld CPU
      [1] jeq #1  jt 2  jf 3
      [2] ret #500
      [3] ret #0
      
      # take 12.5 % of flows (average)
      tcpdump -s 1000 "rxhash & 7 = 2" :
      
      [0] ld RXHASH
      [1] and #7
      [2] jeq #2  jt 3  jf 4
      [3] ret #1000
      [4] ret #0
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Rui <wirelesser@gmail.com>
      Acked-by: default avatarChangli Gao <xiaosuo@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      da2033c2
  26. 19 Nov, 2010 1 commit
    • Eric Dumazet's avatar
      filter: optimize sk_run_filter · 93aaae2e
      Eric Dumazet authored
      Remove pc variable to avoid arithmetic to compute fentry at each filter
      instruction. Jumps directly manipulate fentry pointer.
      
      As the last instruction of filter[] is guaranteed to be a RETURN, and
      all jumps are before the last instruction, we dont need to check filter
      bounds (number of instructions in filter array) at each iteration, so we
      remove it from sk_run_filter() params.
      
      On x86_32 remove f_k var introduced in commit 57fe93b3
      (filter: make sure filters dont read uninitialized memory)
      
      Note : We could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS in order to
      avoid too many ifdefs in this code.
      
      This helps compiler to use cpu registers to hold fentry and A
      accumulator.
      
      On x86_32, this saves 401 bytes, and more important, sk_run_filter()
      runs much faster because less register pressure (One less conditional
      branch per BPF instruction)
      
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         2948       0       0    2948     b84 net/core/filter.o
         3349       0       0    3349     d15 net/core/filter_pre.o
      
      on x86_64 :
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         5173       0       0    5173    1435 net/core/filter.o
         5224       0       0    5224    1468 net/core/filter_pre.o
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarChangli Gao <xiaosuo@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      93aaae2e
  27. 18 Nov, 2010 1 commit
  28. 26 Jun, 2010 1 commit
    • Hagen Paul Pfeifer's avatar
      net: optimize Berkeley Packet Filter (BPF) processing · 01f2f3f6
      Hagen Paul Pfeifer authored
      Gcc is currenlty not in the ability to optimize the switch statement in
      sk_run_filter() because of dense case labels. This patch replace the
      OR'd labels with ordered sequenced case labels. The sk_chk_filter()
      function is modified to patch/replace the original OPCODES in a
      ordered but equivalent form. gcc is now in the ability to transform the
      switch statement in sk_run_filter into a jump table of complexity O(1).
      
      Until this patch gcc generates a sequence of conditional branches (O(n) of 567
      byte .text segment size (arch x86_64):
      
      7ff: 8b 06                 mov    (%rsi),%eax
      801: 66 83 f8 35           cmp    $0x35,%ax
      805: 0f 84 d0 02 00 00     je     adb <sk_run_filter+0x31d>
      80b: 0f 87 07 01 00 00     ja     918 <sk_run_filter+0x15a>
      811: 66 83 f8 15           cmp    $0x15,%ax
      815: 0f 84 c5 02 00 00     je     ae0 <sk_run_filter+0x322>
      81b: 77 73                 ja     890 <sk_run_filter+0xd2>
      81d: 66 83 f8 04           cmp    $0x4,%ax
      821: 0f 84 17 02 00 00     je     a3e <sk_run_filter+0x280>
      827: 77 29                 ja     852 <sk_run_filter+0x94>
      829: 66 83 f8 01           cmp    $0x1,%ax
      [...]
      
      With the modification the compiler translate the switch statement into
      the following jump table fragment:
      
      7ff: 66 83 3e 2c           cmpw   $0x2c,(%rsi)
      803: 0f 87 1f 02 00 00     ja     a28 <sk_run_filter+0x26a>
      809: 0f b7 06              movzwl (%rsi),%eax
      80c: ff 24 c5 00 00 00 00  jmpq   *0x0(,%rax,8)
      813: 44 89 e3              mov    %r12d,%ebx
      816: e9 43 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
      81b: 41 89 dc              mov    %ebx,%r12d
      81e: e9 3b 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
      
      Furthermore, I reordered the instructions to reduce cache line misses by
      order the most common instruction to the start.
      Signed-off-by: default avatarHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      01f2f3f6
  29. 22 Apr, 2010 1 commit
  30. 04 Nov, 2009 1 commit
  31. 20 Oct, 2009 2 commits
  32. 20 Nov, 2008 1 commit
    • Pablo Neira Ayuso's avatar
      filter: add SKF_AD_NLATTR_NEST to look for nested attributes · d214c753
      Pablo Neira Ayuso authored
      SKF_AD_NLATTR allows us to find the first matching attribute in a
      stream of netlink attributes from one offset to the end of the
      netlink message. This is not suitable to look for a specific
      matching inside a set of nested attributes.
      
      For example, in ctnetlink messages, if we look for the CTA_V6_SRC
      attribute in a message that talks about an IPv4 connection,
      SKF_AD_NLATTR returns the offset of CTA_STATUS which has the same
      value of CTA_V6_SRC but outside the nest. To differenciate
      CTA_STATUS and CTA_V6_SRC, we would have to make assumptions on the
      size of the attribute and the usual offset, resulting in horrible
      BSF code.
      
      This patch adds SKF_AD_NLATTR_NEST, which is a variant of
      SKF_AD_NLATTR, that looks for an attribute inside the limits of
      a nested attributes, but not further.
      
      This patch validates that we have enough room to look for the
      nested attributes - based on a suggestion from Patrick McHardy.
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d214c753
  33. 10 Apr, 2008 2 commits
    • Patrick McHardy's avatar
      [SKFILTER]: Add SKF_ADF_NLATTR instruction · 4738c1db
      Patrick McHardy authored
      SKF_ADF_NLATTR searches for a netlink attribute, which avoids manually
      parsing and walking attributes. It takes the offset at which to start
      searching in the 'A' register and the attribute type in the 'X' register
      and returns the offset in the 'A' register. When the attribute is not
      found it returns zero.
      
      A top-level attribute can be located using a filter like this
      (example for nfnetlink, using struct nfgenmsg):
      
      	...
      	{
      		/* A = offset of first attribute */
      		.code	= BPF_LD | BPF_IMM,
      		.k	= sizeof(struct nlmsghdr) + sizeof(struct nfgenmsg)
      	},
      	{
      		/* X = CTA_PROTOINFO */
      		.code	= BPF_LDX | BPF_IMM,
      		.k	= CTA_PROTOINFO,
      	},
      	{
      		/* A = netlink attribute offset */
      		.code	= BPF_LD | BPF_B | BPF_ABS,
      		.k	= SKF_AD_OFF + SKF_AD_NLATTR
      	},
      	{
      		/* Exit if not found */
      		.code   = BPF_JMP | BPF_JEQ | BPF_K,
      		.k	= 0,
      		.jt	= <error>
      	},
      	...
      
      A nested attribute below the CTA_PROTOINFO attribute would then
      be parsed like this:
      
      	...
      	{
      		/* A += sizeof(struct nlattr) */
      		.code	= BPF_ALU | BPF_ADD | BPF_K,
      		.k	= sizeof(struct nlattr),
      	},
      	{
      		/* X = CTA_PROTOINFO_TCP */
      		.code	= BPF_LDX | BPF_IMM,
      		.k	= CTA_PROTOINFO_TCP,
      	},
      	{
      		/* A = netlink attribute offset */
      		.code	= BPF_LD | BPF_B | BPF_ABS,
      		.k	= SKF_AD_OFF + SKF_AD_NLATTR
      	},
      	...
      
      The data of an attribute can be loaded into 'A' like this:
      
      	...
      	{
      		/* X = A (attribute offset) */
      		.code	= BPF_MISC | BPF_TAX,
      	},
      	{
      		/* A = skb->data[X + k] */
      		.code 	= BPF_LD | BPF_B | BPF_IND,
      		.k	= sizeof(struct nlattr),
      	},
      	...
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4738c1db
    • Stephen Hemminger's avatar
      socket: sk_filter deinline · 43db6d65
      Stephen Hemminger authored
      The sk_filter function is too big to be inlined. This saves 2296 bytes
      of text on allyesconfig.
      Signed-off-by: default avatarStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      43db6d65