1. 17 Dec, 2018 1 commit
  2. 30 Aug, 2017 1 commit
    • Daniel Borkmann's avatar
      bpf: fix bpf_trace_printk on 32 bit archs · d6a6b6b4
      Daniel Borkmann authored
      
      [ Upstream commit 88a5c690 ]
      
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand various combinations
      of argument types into 8 different combinations for 32 bit
      and 64 bit kernels. Fix tested by James on MIPS32 and MIPS64
      as well that it resolves the issue.
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Tested-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d6a6b6b4
  3. 10 Sep, 2016 2 commits
    • Daniel Borkmann's avatar
      bpf: add BPF_CALL_x macros for declaring helpers · f3694e00
      Daniel Borkmann authored
      This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
      to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
      that are used today. Motivation for this is to hide all the register handling
      and all necessary casts from the user, so that it is done automatically in the
      background when adding a BPF_CALL_<n>() call.
      
      This makes current helpers easier to review, eases to write future helpers,
      avoids getting the casting mess wrong, and allows for extending all helpers at
      once (f.e. build time checks, etc). It also helps detecting more easily in
      code reviews that unused registers are not instrumented in the code by accident,
      breaking compatibility with existing programs.
      
      BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
      fundamental differences, for example, for generating the actual helper function
      that carries all u64 regs, we need to fill unused regs, so that we always end up
      with 5 u64 regs as an argument.
      
      I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
      they look all as expected. No sparse issue spotted. We let this also sit for a
      few days with Fengguang's kbuild test robot, and there were no issues seen. On
      s390, it barked on the "uses dynamic stack allocation" notice, which is an old
      one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
      to the call wrapper, just telling that the perf raw record/frag sits on stack
      (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
      and they were fine as well. All eBPF helpers are now converted to use these
      macros, getting rid of a good chunk of all the raw castings.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f3694e00
    • Daniel Borkmann's avatar
      bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros · f035a515
      Daniel Borkmann authored
      Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit
      which otherwise often result in overly long bytes_to_bpf_size(sizeof())
      and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro
      helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
      check in convert_bpf_extensions(), but we should rather make that generic
      as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
      users to detect any rewriter size issues at compile time. Note, there are
      currently none, but we want to assert that it stays this way.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f035a515
  4. 02 Sep, 2016 1 commit
    • Alexei Starovoitov's avatar
      bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type · 0515e599
      Alexei Starovoitov authored
      Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
      HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
      correspondingly in uapi/linux/perf_event.h)
      
      The program visible context meta structure is
      struct bpf_perf_event_data {
          struct pt_regs regs;
           __u64 sample_period;
      };
      which is accessible directly from the program:
      int bpf_prog(struct bpf_perf_event_data *ctx)
      {
        ... ctx->sample_period ...
        ... ctx->regs.ip ...
      }
      
      The bpf verifier rewrites the accesses into kernel internal
      struct bpf_perf_event_data_kern which allows changing
      struct perf_sample_data without affecting bpf programs.
      New fields can be added to the end of struct bpf_perf_event_data
      in the future.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0515e599
  5. 13 Aug, 2016 2 commits
  6. 26 Jul, 2016 1 commit
    • Sargun Dhillon's avatar
      bpf: Add bpf_probe_write_user BPF helper to be called in tracers · 96ae5227
      Sargun Dhillon authored
      This allows user memory to be written to during the course of a kprobe.
      It shouldn't be used to implement any kind of security mechanism
      because of TOC-TOU attacks, but rather to debug, divert, and
      manipulate execution of semi-cooperative processes.
      
      Although it uses probe_kernel_write, we limit the address space
      the probe can write into by checking the space with access_ok.
      We do this as opposed to calling copy_to_user directly, in order
      to avoid sleeping. In addition we ensure the threads's current fs
      / segment is USER_DS and the thread isn't exiting nor a kernel thread.
      
      Given this feature is meant for experiments, and it has a risk of
      crashing the system, and running programs, we print a warning on
      when a proglet that attempts to use this helper is installed,
      along with the pid and process name.
      Signed-off-by: default avatarSargun Dhillon <sargun@sargun.me>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      96ae5227
  7. 20 Jul, 2016 1 commit
  8. 15 Jul, 2016 3 commits
    • Daniel Borkmann's avatar
      bpf: avoid stack copy and use skb ctx for event output · 555c8a86
      Daniel Borkmann authored
      This work addresses a couple of issues bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure BPF
      verifier can do sanity checks on it during verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
      this work improves the bpf_skb_event_output() helper to address
      all 3 at once.
      
      We can make use of the passed in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from perf side with bpf_skb_copy() as callback helper
      to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined meta data
      passed along with the appended sample.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      555c8a86
    • Daniel Borkmann's avatar
      bpf, perf: split bpf_perf_event_output · 8e7a3920
      Daniel Borkmann authored
      Split the bpf_perf_event_output() helper as a preparation into
      two parts. The new bpf_perf_event_output() will prepare the raw
      record itself and test for unknown flags from BPF trace context,
      where the __bpf_perf_event_output() does the core work. The
      latter will be reused later on from bpf_event_output() directly.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8e7a3920
    • Daniel Borkmann's avatar
      perf, events: add non-linear data support for raw records · 7e3f977e
      Daniel Borkmann authored
      This patch adds support for non-linear data on raw records. It
      extends raw records to have one or multiple fragments that will
      be written linearly into the ring slot, where each fragment can
      optionally have a custom callback handler to walk and extract
      complex, possibly non-linear data.
      
      If a callback handler is provided for a fragment, then the new
      __output_custom() will be used instead of __output_copy() for
      the perf_output_sample() part. perf_prepare_sample() does all
      the size calculation only once, so perf_output_sample() doesn't
      need to redo the same work anymore, meaning real_size and padding
      will be cached in the raw record. The raw record becomes 32 bytes
      in size without holes; to not increase it further and to avoid
      doing unnecessary recalculations in fast-path, we can reuse
      next pointer of the last fragment, idea here is borrowed from
      ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
      path for PERF_SAMPLE_RAW minimal.
      
      This facility is needed for BPF's event output helper as a first
      user that will, in a follow-up, add an additional perf_raw_frag
      to its perf_raw_record in order to be able to more efficiently
      dump skb context after a linear head meta data related to it.
      skbs can be non-linear and thus need a custom output function to
      dump buffers. Currently, the skb data needs to be copied twice;
      with the help of __output_custom() this work only needs to be
      done once. Future users could be things like XDP/BPF programs
      that work on different context though and would thus also have
      a different callback function.
      
      The few users of raw records are adapted to initialize their frag
      data from the raw record itself, no change in behavior for them.
      The code is based upon a PoC diff provided by Peter Zijlstra [1].
      
        [1] http://thread.gmane.org/gmane.linux.network/421294Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7e3f977e
  9. 09 Jul, 2016 1 commit
  10. 30 Jun, 2016 3 commits
  11. 16 Jun, 2016 3 commits
    • Daniel Borkmann's avatar
      bpf, maps: flush own entries on perf map release · 3b1efb19
      Daniel Borkmann authored
      The behavior of perf event arrays are quite different from all
      others as they are tightly coupled to perf event fds, f.e. shown
      recently by commit e03e7ee3 ("perf/bpf: Convert perf_event_array
      to use struct file") to make refcounting on perf event more robust.
      A remaining issue that the current code still has is that since
      additions to the perf event array take a reference on the struct
      file via perf_event_get() and are only released via fput() (that
      cleans up the perf event eventually via perf_event_release_kernel())
      when the element is either manually removed from the map from user
      space or automatically when the last reference on the perf event
      map is dropped. However, this leads us to dangling struct file's
      when the map gets pinned after the application owning the perf
      event descriptor exits, and since the struct file reference will
      in such case only be manually dropped or via pinned file removal,
      it leads to the perf event living longer than necessary, consuming
      needlessly resources for that time.
      
      Relations between perf event fds and bpf perf event map fds can be
      rather complex. F.e. maps can act as demuxers among different perf
      event fds that can possibly be owned by different threads and based
      on the index selection from the program, events get dispatched to
      one of the per-cpu fd endpoints. One perf event fd (or, rather a
      per-cpu set of them) can also live in multiple perf event maps at
      the same time, listening for events. Also, another requirement is
      that perf event fds can get closed from application side after they
      have been attached to the perf event map, so that on exit perf event
      map will take care of dropping their references eventually. Likewise,
      when such maps are pinned, the intended behavior is that a user
      application does bpf_obj_get(), puts its fds in there and on exit
      when fd is released, they are dropped from the map again, so the map
      acts rather as connector endpoint. This also makes perf event maps
      inherently different from program arrays as described in more detail
      in commit c9da161c ("bpf: fix clearing on persistent program
      array maps").
      
      To tackle this, map entries are marked by the map struct file that
      added the element to the map. And when the last reference to that map
      struct file is released from user space, then the tracked entries
      are purged from the map. This is okay, because new map struct files
      instances resp. frontends to the anon inode are provided via
      bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
      for retrieving a pinned map, but also when an initial instance is
      created via map_create(). The rest is resolved by the vfs layer
      automatically for us by keeping reference count on the map's struct
      file. Any concurrent updates on the map slot are fine as well, it
      just means that perf_event_fd_array_release() needs to delete less
      of its own entires.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b1efb19
    • Alexei Starovoitov's avatar
      bpf, trace: check event type in bpf_perf_event_read · ad572d17
      Alexei Starovoitov authored
      similar to bpf_perf_event_output() the bpf_perf_event_read() helper
      needs to check the type of the perf_event before reading the counter.
      
      Fixes: a43eec30 ("bpf: introduce bpf_perf_event_output() helper")
      Reported-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ad572d17
    • Alexei Starovoitov's avatar
      bpf: fix matching of data/data_end in verifier · 19de99f7
      Alexei Starovoitov authored
      The ctx structure passed into bpf programs is different depending on bpf
      program type. The verifier incorrectly marked ctx->data and ctx->data_end
      access based on ctx offset only. That caused loads in tracing programs
      int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
      to be incorrectly marked as PTR_TO_PACKET which later caused verifier
      to reject the program that was actually valid in tracing context.
      Fix this by doing program type specific matching of ctx offsets.
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Reported-by: default avatarSasha Goldshtein <goldshtn@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      19de99f7
  12. 07 Jun, 2016 1 commit
  13. 20 Apr, 2016 2 commits
    • Daniel Borkmann's avatar
      bpf: add event output helper for notifications/sampling/logging · bd570ff9
      Daniel Borkmann authored
      This patch adds a new helper for cls/act programs that can push events
      to user space applications. For networking, this can be f.e. for sampling,
      debugging, logging purposes or pushing of arbitrary wake-up events. The
      idea is similar to a43eec30 ("bpf: introduce bpf_perf_event_output()
      helper") and 39111695 ("samples: bpf: add bpf_perf_event_output example").
      
      The eBPF program utilizes a perf event array map that user space populates
      with fds from perf_event_open(), the eBPF program calls into the helper
      f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
      so that the raw data is pushed into the fd f.e. at the map index of the
      current CPU.
      
      User space can poll/mmap/etc on this and has a data channel for receiving
      events that can be post-processed. The nice thing is that since the eBPF
      program and user space application making use of it are tightly coupled,
      they can define their own arbitrary raw data format and what/when they
      want to push.
      
      While f.e. packet headers could be one part of the meta data that is being
      pushed, this is not a substitute for things like packet sockets as whole
      packet is not being pushed and push is only done in a single direction.
      Intention is more of a generically usable, efficient event pipe to applications.
      Workflow is that tc can pin the map and applications can attach themselves
      e.g. after cls/act setup to one or multiple map slots, demuxing is done by
      the eBPF program.
      
      Adding this facility is with minimal effort, it reuses the helper
      introduced in a43eec30 ("bpf: introduce bpf_perf_event_output() helper")
      and we get its functionality for free by overloading its BPF_FUNC_ identifier
      for cls/act programs, ctx is currently unused, but will be made use of in
      future. Example will be added to iproute2's BPF example files.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bd570ff9
    • Daniel Borkmann's avatar
      bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output · 1e33759c
      Daniel Borkmann authored
      Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has
      per-CPU ring buffers and the eBPF program pushes the data into the current
      CPU's ring buffer which saves us an extra helper function call in eBPF.
      Also, make sure to properly reserve the remaining flags which are not used.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e33759c
  14. 19 Apr, 2016 1 commit
    • Arnd Bergmann's avatar
      bpf: avoid warning for wrong pointer cast · 266a0a79
      Arnd Bergmann authored
      Two new functions in bpf contain a cast from a 'u64' to a
      pointer. This works on 64-bit architectures but causes a warning
      on all 32-bit architectures:
      
      kernel/trace/bpf_trace.c: In function 'bpf_perf_event_output_tp':
      kernel/trace/bpf_trace.c:350:13: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
        u64 ctx = *(long *)r1;
      
      This changes the cast to first convert the u64 argument into a uintptr_t,
      which is guaranteed to be the same size as a pointer.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Fixes: 9940d67c ("bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs")
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      266a0a79
  15. 15 Apr, 2016 1 commit
  16. 08 Apr, 2016 2 commits
  17. 08 Mar, 2016 1 commit
  18. 20 Feb, 2016 1 commit
    • Alexei Starovoitov's avatar
      bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
      Alexei Starovoitov authored
      add new map type to store stack traces and corresponding helper
      bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - numer of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error
      
      stackid is a 32-bit integer handle that can be further combined with
      other data (including other stackid) and used as a key into maps.
      
      Userspace will access stackmap using standard lookup/delete syscall commands to
      retrieve full stack trace for given stackid.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d5a3b1f6
  19. 29 Jan, 2016 1 commit
  20. 23 Dec, 2015 1 commit
  21. 27 Oct, 2015 2 commits
  22. 22 Oct, 2015 1 commit
    • Alexei Starovoitov's avatar
      bpf: introduce bpf_perf_event_output() helper · a43eec30
      Alexei Starovoitov authored
      This helper is used to send raw data from eBPF program into
      special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
      User space needs to perf_event_open() it (either for one or all cpus) and
      store FD into perf_event_array (similar to bpf_perf_event_read() helper)
      before eBPF program can send data into it.
      
      Today the programs triggered by kprobe collect the data and either store
      it into the maps or print it via bpf_trace_printk() where latter is the debug
      facility and not suitable to stream the data. This new helper replaces
      such bpf_trace_printk() usage and allows programs to have dedicated
      channel into user space for post-processing of the raw data collected.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a43eec30
  23. 28 Aug, 2015 1 commit
    • Alexei Starovoitov's avatar
      bpf: add support for %s specifier to bpf_trace_printk() · 8d3b7dce
      Alexei Starovoitov authored
      %s specifier makes bpf program and kernel debugging easier.
      To make sure that trace_printk won't crash the unsafe string
      is copied into stack and unsafe pointer is substituted.
      
      The following C program:
       #include <linux/fs.h>
      int foo(struct pt_regs *ctx, struct filename *filename)
      {
        void *name = 0;
      
        bpf_probe_read(&name, sizeof(name), &filename->name);
        bpf_trace_printk("executed %s\n", name);
        return 0;
      }
      
      when attached to kprobe do_execve()
      will produce output in /sys/kernel/debug/tracing/trace_pipe :
          make-13492 [002] d..1  3250.997277: : executed /bin/sh
            sh-13493 [004] d..1  3250.998716: : executed /usr/bin/gcc
           gcc-13494 [002] d..1  3250.999822: : executed /usr/lib/gcc/x86_64-linux-gnu/4.7/cc1
           gcc-13495 [002] d..1  3251.006731: : executed /usr/bin/as
           gcc-13496 [002] d..1  3251.011831: : executed /usr/lib/gcc/x86_64-linux-gnu/4.7/collect2
      collect2-13497 [000] d..1  3251.012941: : executed /usr/bin/ld
      Suggested-by: default avatarBrendan Gregg <brendan.d.gregg@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8d3b7dce
  24. 10 Aug, 2015 1 commit
  25. 15 Jun, 2015 3 commits
  26. 01 Jun, 2015 1 commit
  27. 21 May, 2015 1 commit
    • Alexei Starovoitov's avatar
      bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov authored
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are idenitified by file descriptor, user space
      need to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
      
      Use cases:
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for TC use case, bpf_tail_call() allows to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that prog_type of callee and caller
        are the same and JITed/non-JITed flag is the same, since calling JITed
        program from non-JITed is invalid, since stack frames are different.
        Similarly calling kprobe type program from socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      04fd61ab