
Running fast and slow: experiments with BPF programs performance

1. Introduction

My own personal blind spot regarding the BPF subsystem in the Linux kernel has always been program performance and overall introspection. To formulate it more specifically: is there any difference between how we reason about the performance of an abstract program and a BPF program? Could we use the same techniques and approaches?

You may wonder why even bother when BPF programs are so small and fast? Generally speaking you would be right, but there are cases when BPF programs are not small any more and sit on the hot execution path, e.g. a security system monitoring syscalls. In such situations even a small overhead is multiplied and accumulated, so it makes sense to fully understand the system’s performance to avoid nasty surprises.

It seems many other people would also like to know more about this topic, so I want to share the results of my investigation.

2. Current state of things

Whenever we analyse the performance of a system, it’s useful to understand which features and parameters could improve efficiency or make it worse. What is important to keep in mind when writing BPF programs? Looking around I found a couple of interesting examples.

2.1 BPF Instruction Set

It turns out that there are several versions of the BPF instruction set available, namely v1, v2 and v3. Unsurprisingly they feature different sets of supported instructions, which can affect how the final program performs. An example from the documentation:

Q: Why BPF_JLT and BPF_JLE instructions were not
   introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and BPF authors
felt that compiler workaround would be acceptable.
Turned out that programs lose performance due to lack of
these compare instructions and they were added.

This is quite an interesting example of the differences between two ISA versions. First, it shows the historical perspective and how the versions evolved over time. Second, it mentions two particular instructions, BPF_JLT (jump if less than) and BPF_JLE (jump if less than or equal), which were added in v2. Those are not something you would simply miss among dozens of other instructions: those are basic comparison instructions! For example, you write something like:

if (delta < ts)

In this case the compiler would ideally use BPF_JLT to produce an optimal program, but it’s not available in BPF ISA v1. Instead, the compiler has to go the extra mile and reverse the condition using only the “jump if greater than” instruction:

; if (delta < ts)
  cmp    %rsi,%rdi
; jump if above or equal
  jae    0x0000000000000068

As soon as we switch the BPF ISA to v2 the same code produces something more expected:

; if (delta < ts)
  cmp    %rdi,%rsi
; jump if below or equal
  jbe    0x0000000000000065

In the case above both versions are equivalent, but if we compare the value with a constant, the resulting implementation will require one more register load. Interestingly enough, I wasn’t able to reproduce exactly such an example, the compiler was stubbornly generating it only one way around for both versions, but you can find one in this blogpost. Probably the more dramatic changes can be noticed between v1 and v3, where the latter adds 32-bit variants of the existing conditional 64-bit jumps. Using v1 to compare a variable with a constant produces the following result, where the 32 most significant bits have to be cleared first:

if (delta < 107)
; if (delta < 107)
  15: (bf) r2 = r1
  16: (67) r2 <<= 32
  17: (c7) r2 s>>= 32
  18: (65) if r2 s> 0x6a goto pc+3
; if (delta < 107)
  31:   mov    %r13d,%edi
  34:   shl    $0x20,%rdi
  38:   sar    $0x20,%rdi
  3c:   cmp    $0x6a,%rdi
  40:   jg     0x0000000000000048

Whereas using v3 there is no need to do so:

; if (pid < 107)
  15: (66) if w1 s> 0x6a goto pc+3
; if (pid < 107)
  31:   cmp    $0x6a,%r13d
  35:   jg     0x000000000000003d

Those two flavours come with different performance characteristics: the former version of the program has to do a bit more work to include the workaround. It also makes the program bigger, using the instruction cache less efficiently. Again, you can find more numbers and the support matrix for Linux and LLVM versions in this great blogpost.

Unsurprisingly the default BPF ISA is the lowest one, v1, also known as “generic”, which means one has to ask for something higher explicitly:

$ llc probe.bc -mcpu=v2 -march=bpf -filetype=obj -o probe.o

With clang one can use -mllvm -mcpu=v2 to do the same.

A few closing notes for this section. Instead of pinning one particular version, one could use the probe value to tell the compiler to pick the highest ISA available on the machine you’re compiling the program on (see the corresponding part of the implementation). Keep in mind it could be a different machine than the one that is going to run the program. I got quite curious about this feature, but for whatever reason it doesn’t always work in the examples I was experimenting with, still defaulting to v1 where v2 was available. And last but not least: needless to say, the best practice is to always use the latest available instruction set, which is v3 at the moment.

2.2 Batching of map operations

Maps are one of the core parts of the BPF subsystem, and they can be used quite extensively. One way of making data processing a bit faster (almost independently of the context) is to batch operations. Sure enough there is something like that for BPF maps as well, namely generic support for batched lookups and modifications. One can use the corresponding subcommands:

BPF_MAP_LOOKUP_BATCH
BPF_MAP_LOOKUP_AND_DELETE_BATCH
BPF_MAP_UPDATE_BATCH
BPF_MAP_DELETE_BATCH

and parameters:

struct { /* struct used by BPF_MAP_*_BATCH commands */
    __aligned_u64   in_batch;       /* start batch,
                                     * NULL to start from beginning
                                     */
    __aligned_u64   out_batch;      /* output: next start batch */
    __aligned_u64   keys;
    __aligned_u64   values;
    __u32           count;          /* input/output:
                                     * input: # of key/value
                                     * elements
                                     * output: # of filled elements
                                     */
    __u32           map_fd;
    __u64           elem_flags;
    __u64           flags;
} batch;

An alternative way would be to use libbpf support.

By batching BPF map operations one can save on user/kernel space transitions. The ballpark numbers mentioned in the mailing list show a visible improvement for ~1M records with batch sizes of 10, 1000, etc. For more details check out the original patch.
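To illustrate, here is a minimal sketch of a batched lookup using the libbpf wrapper bpf_map_lookup_batch. This is not code from the patch: the function name, the key/value types and the batch size are made up, and map_fd is assumed to be an already opened hash map with u32 keys and u64 values.

#include <errno.h>
#include <stdbool.h>
#include <bpf/bpf.h>

/* Sketch: drain a hash map in batches of 128 elements */
int dump_map_batched(int map_fd)
{
    __u32 keys[128], count;
    __u64 values[128], batch;
    bool first = true;
    int err;

    do {
        count = 128;    /* in: requested batch size, out: returned elements */
        err = bpf_map_lookup_batch(map_fd,
                                   first ? NULL : &batch, /* resume token */
                                   &batch,                /* next resume token */
                                   keys, values, &count, NULL);
        first = false;
        /* ... process `count` key/value pairs from keys[]/values[] ... */
    } while (!err);

    /* ENOENT signals that the whole map has been traversed */
    return (err == -ENOENT || errno == ENOENT) ? 0 : err;
}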

2.3 Bloom filter map

Continuing the topic of diverse BPF map features, we find something very curious among the typical structures like hash, array and queue: a Bloom filter map. To remind you of the classical definition:

    A Bloom filter is a space-efficient probabilistic data structure
    that is used to test whether an element is a member of a set.
    False positive matches are possible, but false negatives are not
    – in other words, a query returns either "possibly in set" or
    "definitely not in set". Elements can be added to the set,
    but not removed (though this can be addressed with the counting
    Bloom filter variant); the more items added, the larger the
    probability of false positives.

Independently of the context this is a great instrument to have, which gives us the possibility to trade off the size of the data structure (we now have to maintain both the hash map and the filter) for more efficient lookups. In the case of BPF, checking the Bloom filter before accessing the hash map would normally improve the overall performance, since a costly hash map lookup is avoided if the element doesn’t exist. We can turn this on when needed by creating a map of the new type:

bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER,
    NULL, 0, sizeof(value), 100, NULL);
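The check-before-lookup pattern inside the BPF program itself could look like the following sketch (map names and value types are made up; from a BPF program, Bloom filter lookups are done with the bpf_map_peek_elem helper):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
    __uint(max_entries, 100);
    __type(value, __u64);
} seen_filter SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 100);
    __type(key, __u64);
    __type(value, __u64);
} seen_map SEC(".maps");

static __always_inline __u64 *lookup(__u64 key)
{
    /* "definitely not in the set": skip the costly hash lookup */
    if (bpf_map_peek_elem(&seen_filter, &key) != 0)
        return NULL;

    /* "possibly in the set": do the real lookup */
    return bpf_map_lookup_elem(&seen_map, &key);
}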

2.4 Task local storage

But there is more to it! Looking closely we also find something called task local storage, which gives us another trade-off to play with for one particular use case. Imagine you would like to write a BPF program that works with information about processes. To store anything you would normally create a hash map and use the PID as the key, and then realize that large hash maps can be slow. To avoid that (and some other inconveniences) you can use task local storage instead, which is, surprise surprise, storage local to the owning task. This means we get a performance boost when accessing such local data, at the cost of it being allocated in a different place. The API is quite straightforward:

ptr = bpf_task_storage_get(&start, t, 0, BPF_LOCAL_STORAGE_GET_F_CREATE);

Interestingly enough, this feature is not unique: bpf_local_storage also exists for sockets and inodes. For tasks, it was originally implemented only for BPF LSM (Security Audit and Enforcement using BPF), and then extended to tracing programs as well.
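A hedged sketch of how the map definition and usage fit together, loosely following what runqslower does (the usual vmlinux.h and bpf_helpers.h includes are assumed; the map and function names are illustrative):

/* task local storage requires BPF_F_NO_PREALLOC and an int key */
struct {
    __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, int);
    __type(value, __u64);
} start SEC(".maps");

static __always_inline void record_start(struct task_struct *t)
{
    /* get (or create) the per-task slot and store a timestamp in it */
    __u64 *ptr = bpf_task_storage_get(&start, t, 0,
                                      BPF_LOCAL_STORAGE_GET_F_CREATE);
    if (ptr)
        *ptr = bpf_ktime_get_ns();
}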

2.5 BPF program pack allocator

One more topic, somewhat related to memory allocation, is the program pack allocator. It turns out that originally every single BPF program consumed at least one memory page, no matter how small the program was. This means that if you end up having a lot of such small programs on your server you can observe high instruction TLB (iTLB) pressure in the system, degrading the overall performance. To solve this, the bpf_prog_pack allocator was introduced, with the idea of packing programs together on a page when allocating them. This happens transparently, nothing has to be done or configured, and the only thing you may need to think about is whether huge pages are available on your machines (yes, the packer will try to pack onto a huge page as well). For more details check out the LWN article.

2.6 BPF 2 BPF

Experimenting with the code layout is generally important for performance, and BPF is not an exception. The thing I didn’t know was that prior to Linux kernel 4.16 and LLVM 6.0 one always had to inline any reusable shared pieces of code in a BPF program:

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

static __inline int test_bpf2bpf(void) { return 0; }

If we look at this via bpftool this would obviously produce a single chunk of code:

# 0xffffffffc1513a68:
  nopl   0x0(%rax,%rax,1)
  xor    %eax,%eax
  push   %rbp
# [...]

What’s the problem here? The same as with regular code inlining: it’s not always the best choice, since it eliminates a call but increases the code size, using the instruction cache less efficiently. Fortunately, with modern versions of the kernel you have the option to decide whether to inline or not, where the latter will show up as two chunks of code in bpftool:

# 0xffffffffc15b810c:
  nopl   0x0(%rax,%rax,1)
  xor    %eax,%eax
# [...]
  callq  0x0000000000002370

# 0xffffffffc15ba47c:
  nopl   0x0(%rax,%rax,1)
  xchg   %ax,%ax
# [...]
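For comparison, a hedged sketch of the opposite choice: with a recent enough kernel and LLVM the helper can stay a real BPF-to-BPF call by marking it noinline (the macro mirrors the __inline one above):

#ifndef __noinline
# define __noinline                       \
   __attribute__((noinline))
#endif

static __noinline int test_bpf2bpf(void)
{
    return 0;
}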

3. How to analyze BPF performance?

In the previous sections we’ve learned that there are many interesting techniques on the table to improve the performance of your programs. But how do we evaluate them, how do we introspect the program and understand what those approaches have changed?

3.1 Talking to the compiler

Things get started even before the program is executed. What if we ask the compiler for an opinion about the performance of our program? It’s not as strange as it sounds: clang can generate an Optimization Report when compiling a program:

$ clang -O2 \
    -Rpass=.* \
    -Rpass-analysis=.* \
    -Rpass-missed=.* \
    # ...

The report will contain quite a few potentially missed optimizations, like this one from the Global Value Numbering pass:

remark: load of type i32
not eliminated [-Rpass-missed=gvn]

This record says that the compiler found a load operation that could potentially be eliminated. That is, the compiler thinks the load is unnecessary, but doesn’t have enough information to prove this for all possible use cases. In my experiments this type of record was the most frequent in the report, and I speculate that this happens because in BPF programs we often pass values like the execution context via an opaque pointer:

static __always_inline int bpf_example_fn(void *ctx)

In such situations the compiler has an opportunity to apply certain eliminations, but they are not allowed in the case of overlapping pointers (i.e. multiple pointers referring to the same memory). This is a hand-waving explanation, but one thing speaks in its favour: I’ve managed to make the compiler pick up a number of those missed optimizations by scattering the restrict keyword around:

static __always_inline int bpf_example_fn(void * restrict ctx)

This keyword is a way of letting the compiler know that no other overlapping pointers exist.

3.2 Aggregated counters

Well, we’re finally running our BPF program. What is the first thing that comes to mind for regular applications at this stage? Right, we check some global counters like uptime and load average. Can we do this for a BPF program?

Fortunately for us the Linux kernel collects BPF execution counters when instructed to do so:

$ sysctl -w kernel.bpf_stats_enabled=1

Those counters could be read via bpftool:

$ bpftool prog

379: raw_tracepoint [...] run_time_ns 35875602162 run_cnt 160512637
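Dividing run_time_ns by run_cnt gives a rough average cost per invocation; for the numbers above that is 35875602162 / 160512637 ≈ 224 ns per run.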

Another hacky way of finding out how many times the BPF program was invoked would be to use memory events with perf. Those events allow monitoring access to a certain address in memory, so after finding out at which address the BPF program was loaded we can use this feature to get the number of hits (of course you can use the record command instead of trace):

$ cat /proc/kallsyms | grep bpf_prog
[...]
ffffffffc0201990 t bpf_prog_31e86e7ee100ebfd_test [bpf]

$ perf trace -e mem:0xffffffffc0201990:x
[ trigger the BPF program in another session ]
18446744073790.551 ls/242 mem:0xffffffffc0201990:x()

Those numbers are overall counters for the program, which are useful for getting a ballpark estimate of the performance, doing a sanity check, or verifying results obtained using other methods.

3.3 Manual instrumentation

The simplest next step in understanding what’s going on inside the BPF program would be to instrument it manually. I mean, nobody will prevent you from modifying your own program, right?

You could write something like this inside the program:

bpf_trace_printk("Timestamp: %lld", ts);

Then fetch the timestamps either from trace_pipe or using bpftool:

$ cat /sys/kernel/debug/tracing/trace_pipe
$ bpftool prog tracelog

This is a rather flexible and easy approach, but unfortunately it introduces a lot of overhead due to constant user/kernel space communication, which makes it impractical in most situations. If something like this is really needed, instead of using trace_pipe one can introduce a map to buffer data on the kernel side and then read this map periodically.
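A minimal sketch of this idea (the map name is made up): reserve a slot in a ring buffer, write the timestamp there, and let the user space side consume the buffer at its own pace.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} timestamps SEC(".maps");

static __always_inline void record_ts(void)
{
    __u64 *slot = bpf_ringbuf_reserve(&timestamps, sizeof(*slot), 0);

    if (!slot)
        return;                     /* the buffer is full, drop the sample */

    *slot = bpf_ktime_get_ns();     /* nanosecond timestamp, as before */
    bpf_ringbuf_submit(slot, 0);
}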

3.4 Top-down approach

Now let’s step back and think. What are the cool kids using to analyse application performance? Take, for example, the BPF program pack allocator from the previous section: how did people identify that high iTLB pressure was the issue?

Normally there are two options:

The Top-Down analysis method (see Intel® 64 and IA-32 Architectures Optimization Reference Manual, Appendix B.1) helps to identify performance bottlenecks correlated with the major functional blocks of modern out-of-order microarchitectures. At the first level they fall into four categories: Retiring, Bad Speculation, Front End Bound and Back End Bound (see more details in the Performance Analysis and Tuning on Modern CPUs book):

Intel® 64 and IA-32 Architectures Optimization Reference Manual, Appendix B.1

Now, armed with this method, if we face a system with a multitude of BPF programs loaded, we can start categorizing bottlenecks and find out that the system is “Front End Bound”. Digging a few levels deeper we can discover that the reason is iTLB pressure, which contributes to the “Front End Bound” load. The only question is how to actually do that.

Well, theoretically perf does support the first level of the Top-Down approach and can collect stats for a specified BPF program:

$ perf stat -b <prog id> --topdown

This machinery works thanks to fentry/fexit attachment points, which allow attaching an instrumentation BPF program at the start/end of another BPF program. It’s worth talking a bit more about this approach, as it works in quite an interesting way. Here is what happens to a BPF program when an fentry program is attached to it:

# original program instructions
nopl   0x0(%rax,%rax,1)
xchg   %ax,%ax
push   %rbp
mov    %rsp,%rbp
sub    $0x20,%rsp
...

# instructions after attaching fentry
callq  0xffffffffffe0096c
xchg   %ax,%ax
push   %rbp
mov    %rsp,%rbp
sub    $0x20,%rsp
...

The kernel modifies the BPF program prologue to execute the fentry program, which is a remarkably flexible and powerful approach. There are of course certain limitations one has to keep in mind: the target BPF program has to be compiled with BTF, and one cannot attach fentry/fexit BPF programs to other programs of the same type (see the verifier checks here and here).
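For the curious, here is roughly how such an attachment can be expressed with libbpf. This is only a sketch under assumptions: the object is assumed to contain an fentry program named fentry_profiler, and "test" stands for the name of the function inside the target BPF program; all names are illustrative.

#include <errno.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Sketch: attach an fentry instrumentation program to an already
 * loaded BPF program identified by its id. */
int attach_to_bpf_prog(struct bpf_object *obj, __u32 target_prog_id)
{
    int target_fd = bpf_prog_get_fd_by_id(target_prog_id);
    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "fentry_profiler");
    struct bpf_link *link;

    if (target_fd < 0 || !prog)
        return -1;

    /* has to be done before loading the object */
    if (bpf_program__set_attach_target(prog, target_fd, "test"))
        return -1;

    if (bpf_object__load(obj))
        return -1;

    link = bpf_program__attach(prog);
    return link ? 0 : -errno;
}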

At the end of the day using this approach should give us something like this:

Performance counter stats for 'system wide':

             retiring  bad speculation  frontend bound  backend bound
S0-D0-C0  1  64.2%             8.3%           19.1%          8.5%

Unfortunately this didn’t work for me: perf was always returning nulls. Digging a bit deeper I noticed that reading the actual perf events returned an empty result, and I’m still not sure why (if you have any ideas, let me know).

Trying to figure out whether it’s an intrinsic limitation or simply a strange bug in my setup, I picked up the bpftool prog profile command. It’s a great instrument for our purposes, which also works using fentry/fexit, allowing us to collect a certain set of metrics:

bpftool prog profile <PROG> [duration <sec>] <METRICS>

The metrics include cycles, instructions and a few cache/TLB related counters. That is a lot of information, which we could collect by doing something like this:

$ bpftool prog profile id <prog id> duration 10 cycles instructions

                11 run_cnt
            258161 cycles
             50634 instructions        #     0.20 insns per cycle

Alas, it doesn’t help in the case of the Top-Down approach. As an experiment I’ve added a few perf counters from Top-Down to bpftool, and it was able to do the thing and read those counters (a full implementation would of course be much more involved). In the end it means one can successfully apply the Top-Down approach to a BPF program, which is great news.

3.5 Profiling of BPF programs

This brings us forward in our quest of understanding what’s going on inside a BPF program, but we’re still missing one piece. To get the most out of it we need not only to sample certain events between two points in the program (e.g. fentry/fexit), but also to be able to profile the program in general, i.e. sample the needed events and attribute them to the corresponding parts of the program. Can we do that?

The question has two parts: can we do stack sampling, and is it possible to correlate the results with the BPF program code? Surely we can do stack sampling, as our BPF program is sort of a “dynamic” extension of the kernel: sampling the kernel will sample the program as well. How to relate it to the code? Fortunately for us perf is smart enough to do that if the BPF program was built with BTF support. Here is an example of a perf report collecting cycles without actually retired uops:

$ perf report
  Percent | uops_retired.stall_cycles
          :
          : if (duration_ns < min_duration_ns)
     0.00 :    9f:movabs $0xffffc9000009e000,%rdi
     0.00 :    a9:mov    0x0(%rdi),%rsi
          :
          : e = bpf_ringbuf_reserve(...)
    21.74 :    ad:movabs $0xffff888103e70e00,%rdi
     0.00 :    b7:mov    $0xa8,%esi
     0.00 :    bc:xor    %edx,%edx
     0.00 :    be:callq  0xffffffffc0f9fbb8

From this report we see there is a hot spot of stall events around bpf_ringbuf_reserve, so we may want to look deeper into this helper to understand what’s going on.

But there is a small catch: I’ve mentioned before that we sample our BPF program by sampling the kernel. Now, if you have a relatively busy system that actually does something useful besides running some BPF stuff, you will get a lot of data this way, tons and tons of samples, only a fraction of which is actually of any interest to you. To make it manageable one can try to capture only the kernel part of the stacks, or maybe isolate the BPF program on a single core if the environment allows that, but it still would be quite a lot.

At some point I realized there is a --filter perf option that does efficient filtering of the data using Intel PT, and in theory one could do something like this:

# with the program name
$ perf record -e intel_pt// --filter 'filter bpf_prog_9baac7ecffdb457d'

# or with the raw address
$ perf record -e intel_pt// --filter 'start 0xffffffffc1544172'

Unfortunately it seems this is not supported for BPF:

failed to set filter "start 0xffffffffc1544172" on event intel_pt// with 95 (Operation not supported)

Nevertheless, having both the Top-Down approach and profiling for BPF opens up quite interesting opportunities. Let’s take a look at the previous example with the program pack allocator. The idea was to reduce overhead (in the form of iTLB pressure) by packing BPF programs together on a page, without knowing anything else about those programs. Now let’s take one more step forward and try to apply this in the real world, where quite often it’s not a single BPF program, but a chain of programs processing different parts of an event and connected together via tail calls. Packing such programs together may produce a situation where some more frequently invoked (hot) pieces are located on the same page as less demanded (cold) ones:

Such a placement would not be as optimal as having most of the hot programs together, increasing the number of pages that have to be frequently reached to execute the full chain of processing. But to define an “optimal” placement we need to know which programs are hot and which are cold, so we end up applying both approaches mentioned above: first identify the largest bottleneck (iTLB), then use profiling to find out how to pack everything more efficiently. Note that nothing like this is implemented, of course, although one can try to manipulate the order in which programs are loaded to influence the final placement.

4. Modeling of BPF programs

We have talked about a great many details that could affect the performance of a BPF program. There are even more factors (not necessarily BPF-specific) we haven’t talked about. In this complex picture of many actors interacting in different ways, how do we get a prediction about the performance of our program? Intuition may fail you, because sometimes improvements in one area reveal an unexpected bottleneck in another. One can benchmark the program under the required conditions, but it’s often resource intensive to cover many cases. And obviously it can be done only when the program is already written, when it’s even more problematic to fix a performance issue. The last option left on the table is to model the program behaviour and experiment with the model instead of the actual implementation. This is much faster, but the catch is that the results are only as good as the model allows.

"The Thrilling Adventures of Lovelace and Babbage", Sydney Padua, 2015

Nevertheless, even the simplest models can give very valuable feedback. Inspired by the article above from Marc Brooker, I wanted to see what it takes to simulate a simple BPF program execution. As an example I took the runqslower program from the Linux kernel repository, which uses a BPF program to trace scheduling delays. Its implementation was recently switched to use task local storage instead of a hash map, which makes it an interesting case study. To make it even simpler for my purposes I’ve tried to simulate not the full difference between the two implementations (task local storage is simply faster), but how much effect memory access in the hash map (slower with more elements in the map) has on the final performance.

In a few words the simulation goes like this:

The state machine from the task point of view looks like this:

Every event from the state machine needs to have an associated latency distribution, which tells us how much time the system has spent in this state. For simplicity, I’ve taken a normal distribution with the following mean/stddev:

Now a warning. No, not like this, a WARNING: it’s hard to come up with real numbers, because, pretty much like in quantum physics, if we try to measure something it will be skewed. It means that the values above should be considered only an approximation, roughly measured on my laptop under certain conditions (to be more precise, when runqslower measures latencies slower than 100ns and some processes are being spawned in a loop). This is the part where we make quality trade-offs in the model.

Another point is the distribution type. Of course using a normal distribution is unrealistic: most likely it has to be some long-tailed distribution if we’re dealing with a queue-like process, or an exponential distribution if events happen independently at a certain constant average rate. This is one of the parts that makes our model “simple”.

None of those points stops us from experimenting, or should prevent you from improving the model to be more realistic.

We encode a few more important factors in the model, together with the state machine itself:

Here is what my final implementation looks like: runqslower-simulation. Warning: it’s the lowest quality Haskell code you’ve ever seen, riddled with bugs and inconsistencies, but I hope it makes the point. To summarize:

Having everything in place, we’ve defined the model; now let’s talk about the input and the output. The initial conditions to start the simulation are completely up to us, so let’s set them to a bunch of processes starting together and doing something for a certain amount of time. The output of the model would be records of time spent doing the actual job (positive payload) versus time spent running the BPF parts (bookkeeping overhead). The results look like this:

As expected, on the graph we observe that the hash map performs slower than task local storage, and we can even derive at which point in time this difference becomes significant. The dynamics of this model can be directed via the number of processes and their lifetime, producing results in the phase space that are either converging to zero (when there is not enough work for BPF to do, and process runtime dominates), exploding into infinity (when there are too many short-lived processes and BPF overhead becomes more and more significant), or oscillating around a certain value (when BPF and process runtime balance each other, producing a stable system).

5. Summary

In this blog post we talked a great deal about the performance of BPF programs, the current state of things and various analysis approaches, from the simplest to the most intricate. I’m glad to see that the BPF subsystem has lately started to provide a flexible set of features to tune efficiency for different use cases, along with maturing introspection techniques. The experiments I’ve conducted along the way were driven mostly by curiosity, but I hope the results will help you reason about your BPF programs with more confidence.

Acknowledgements

Huge thanks to Yonghong Song, Joanne Koong and Artem Savkov for reviews and commentaries!
