Using eBPF to Measure CPU Mode Switch Overhead Incurred by Making a System Call

Calculate system time using rdtsc

Don't do that (that is, don't use the RDTSC machine instruction directly yourself), because your OS scheduler could reschedule other threads or processes at arbitrary moments, or slow down the clock. Use a function provided by your library or OS.

My main objective is to avoid the need to perform system call every time I want to know the system time

On Linux, read time(7) then use clock_gettime(2) which is really quick (and does not involve any slow system call) thanks to vdso(7).

On a C++11-compliant implementation, simply use the standard <chrono> header. Standard C also has clock(3), which reports processor time with microsecond precision. On Linux, both rely on good enough time measurement functions (so, indirectly, on the vDSO).

Last time I measured clock_gettime it often took less than 4 nanoseconds per call.
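
To check that figure on your own machine, a minimal sketch along these lines should do. The function name avg_clock_gettime_ns and the choice of CLOCK_MONOTONIC are my own assumptions for illustration; the result depends heavily on hardware, kernel and clock source.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime() under strict -std= modes */
#endif
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count. */
static uint64_t timespec_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/* Hypothetical helper: average cost, in nanoseconds, of one
 * clock_gettime(CLOCK_MONOTONIC) call over `iters` back-to-back calls.
 * Returns a negative value on error or if iters is 0. */
double avg_clock_gettime_ns(unsigned int iters)
{
    struct timespec start, end, scratch;

    if (iters == 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    for (unsigned int i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (double)(timespec_to_ns(&end) - timespec_to_ns(&start)) / iters;
}
```

Call it with a large iteration count (say, a few million) and print the result. If the number is in the hundreds of nanoseconds or more, the clock source probably forces a real system call instead of the vDSO path.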

Is it possible to tail call eBPF codes that use different modes?

No, it is not.

Have a look at kernel commit 04fd61ab36ec, which introduced tail calls. The first piece of code (in the internal kernel header bpf.h) defines struct bpf_array with an owner_prog_type member, and explains the following in a comment:

/* 'ownership' of prog_array is claimed by the first program that
* is going to use this map or by the first program which FD is stored
* in the map to make sure that all callers and callees have the same
* prog_type and JITed flag
*/

So once the program type associated with a BPF program array, used for tail calls, has been defined, it is not possible to use it with other program types. This makes sense, since different program types work with different contexts (packet data vs. traced function context vs. ...), can use different helpers, have return values with different meanings, require different checks from the verifier, and so on. So it's hard to see how jumping from one type to another would work. How could you start with processing a network packet, and all of a sudden jump to a piece of code that is supposed to trace some internals of the kernel? :)

Note that it is also impossible to mix JIT-ed and non-JIT-ed programs, as indicated by the owner_jited member of the struct.
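
To make the ownership rule concrete, here is a toy user-space model of the check described by that comment. Every name in it (toy_prog, toy_prog_array, toy_prog_array_compatible, the TOY_* values) is hypothetical and only mimics the idea; it is not the kernel's code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel types; NOT the kernel's definitions. */
enum toy_prog_type { TOY_UNSPEC = 0, TOY_SOCKET_FILTER, TOY_KPROBE };

struct toy_prog {
    enum toy_prog_type type;
    bool jited;
};

struct toy_prog_array {
    enum toy_prog_type owner_prog_type; /* TOY_UNSPEC until claimed */
    bool owner_jited;
};

/* Returns true if `prog` may use `array`. The first program to show up
 * claims ownership; every later one must match both the program type
 * and the JITed flag of the owner. */
bool toy_prog_array_compatible(struct toy_prog_array *array,
                               const struct toy_prog *prog)
{
    if (array->owner_prog_type == TOY_UNSPEC) {
        array->owner_prog_type = prog->type;
        array->owner_jited = prog->jited;
        return true;
    }
    return array->owner_prog_type == prog->type &&
           array->owner_jited == prog->jited;
}
```

In this model, once a JIT-ed socket filter has claimed the array, a kprobe program or a non-JIT-ed socket filter is rejected, which is the behaviour the answer describes.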

Dynamically Change eBPF map size

No, at the moment you cannot “resize” an eBPF map after it has been created.

However, the size of the map in the kernel may vary over time.

  • Some maps are pre-allocated, because their type requires so (e.g. arrays) or because this was required by the user at map creation time, by providing the relevant flag. These maps are allocated as soon as they are created, and occupy a space equal to (key_size + value_size) * max_entries.
  • Some other maps are not pre-allocated, and will grow over time. This is the case for hash maps for example: They will take more space in kernel space as new entries are added. However, they will only grow up to the maximum number of entries provided during their creation, and it is NOT possible to update this maximum number of entries after that.
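
As a worked example of the pre-allocation formula above, the following sketch computes the estimate. The function name prealloc_map_bytes is mine, and the figure is only a rough lower bound: the kernel adds per-map and per-entry bookkeeping on top.

```c
#include <stdint.h>

/* Rough size in bytes of a pre-allocated map, following the
 * (key_size + value_size) * max_entries rule quoted above.
 * Computed in 64 bits so large maps do not overflow. */
uint64_t prealloc_map_bytes(uint32_t key_size, uint32_t value_size,
                            uint32_t max_entries)
{
    return ((uint64_t)key_size + value_size) * max_entries;
}
```

For instance, a map with 4-byte keys, 8-byte values and 1024 entries comes out at (4 + 8) * 1024 = 12288 bytes by this estimate.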

Regarding the bpf_map__resize() function from libbpf, it is a user space function that can be used to update the number of entries for a map, before this map is created in the kernel:

int bpf_map__set_max_entries(struct bpf_map *map, __u32 max_entries)
{
    if (map->fd >= 0)
        return -EBUSY;
    map->def.max_entries = max_entries;
    return 0;
}

int bpf_map__resize(struct bpf_map *map, __u32 max_entries)
{
    if (!map || !max_entries)
        return -EINVAL;

    return bpf_map__set_max_entries(map, max_entries);
}

If we already created the map (if we have a file descriptor to that map), the operation fails.

ebpf: intercepting function calls

No. kprobe BPF programs have only read access to the syscall parameters and return value; they cannot modify registers and therefore cannot intercept function calls. This is a limitation imposed by the BPF verifier.

Kernel modules, however, can intercept function calls using kprobes.

Test that an integer is different from two other integers in eBPF without branch opcodes

You should be able to do this using bitwise OR, XOR, shifts and integer multiplication. I assume your variables are all __s32 or __u32. Cast them to __u64 before proceeding to avoid problems (or else cast every operand of the multiplications below to __u64).

Clearly a != b can become a ^ b. The && is a bit trickier, but can be translated into a multiplication (where if any operand is 0 the result is 0). The first part of your condition then becomes:

// (new_ruid != old_euid && new_ruid != old_ruid)
__u64 x = (new_ruid ^ old_euid) * (new_ruid ^ old_ruid);

However for the second part we have an overflow problem since there are 3 conditions. You can avoid it by "compressing" the result of the first two into the lower 32 bits, since you don't really care about the multiplication, just about its "truthiness":

// (new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid)

__u64 y = (new_euid ^ old_euid) * (new_euid ^ old_ruid);
y = (y >> 32) | (y & 0xffffffff);
y *= (new_euid ^ old_suid);

And finally just OR the two parts for the result. You can also "compress" again to the lower 32 bits if you want a __u32:

__u64 res = x | y;
// or
__u64 tmp = x | y;
__u32 res = (tmp >> 32) | (tmp & 0xffffffff);

All of the above combined compiles without any branch for me regardless of optimization level.
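
Put together, the pieces above could look like the sketch below. It uses <stdint.h> types instead of __u32/__u64 so that it can be tried in user space, and the names fold32, uid_changed_branchless and uid_changed_reference are mine. The reference version with ordinary && and || is only there to compare results against.

```c
#include <stdint.h>

/* Collapse a 64-bit value into 32 bits while preserving "is it non-zero?". */
static inline uint32_t fold32(uint64_t v)
{
    return (uint32_t)(v >> 32) | (uint32_t)(v & 0xffffffff);
}

/* Non-zero if and only if
 *   (new_ruid != old_euid && new_ruid != old_ruid) ||
 *   (new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid)
 * written with XOR, multiplication, shifts and OR only. */
uint32_t uid_changed_branchless(uint32_t new_ruid, uint32_t new_euid,
                                uint32_t old_ruid, uint32_t old_euid,
                                uint32_t old_suid)
{
    /* 32-bit x 32-bit products always fit in 64 bits. */
    uint64_t x = (uint64_t)(new_ruid ^ old_euid) * (new_ruid ^ old_ruid);
    uint64_t y = (uint64_t)(new_euid ^ old_euid) * (new_euid ^ old_ruid);

    /* Fold y to 32 bits so the third factor cannot overflow 64 bits. */
    y = (uint64_t)fold32(y) * (new_euid ^ old_suid);

    return fold32(x | y);
}

/* Straightforward version with branches, for comparison only. */
int uid_changed_reference(uint32_t new_ruid, uint32_t new_euid,
                          uint32_t old_ruid, uint32_t old_euid,
                          uint32_t old_suid)
{
    return (new_ruid != old_euid && new_ruid != old_ruid) ||
           (new_euid != old_euid && new_euid != old_ruid &&
            new_euid != old_suid);
}
```

Only the truthiness of the result matters: the branchless function returns some non-zero value exactly when the reference function returns 1.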

Share information between functions (BPF/XDP)

So, as you know eBPF programs can be loaded into the kernel at different locations. XDP programs are loaded just after the network driver and just before the network stack. At this point the kernel doesn't know for which process a packet might be since it will figure all of that out in the network stack.

The hello program you are showing is an example of a kprobe (kernel probe). It attaches to whatever kernel function you specify, but it is a tracing tool and can't make changes.

Also, some helper functions like bpf_get_current_pid_tgid are program-type dependent. bpf_get_current_pid_tgid only works in kprobe, uprobe and tracepoint programs (perf programs). It may actually also work in socket and cgroup programs; the issue is that there is no clear list or overview of which helpers work where. These are two good but non-comprehensive links:

  • https://blogs.oracle.com/linux/post/bpf-in-depth-bpf-helper-functions
  • https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md#program-types

In the end it comes down to logic. The kernel can only give you access to data and actions it has access to itself. So if you want to do network-related things based on process IDs, you might need to use an eBPF program attached at a location where such info is available (keep in mind that this is obviously also slower).

So depending on what exactly you want to do you have a few options:

  • Attach an eBPF program to a network socket(BPF_PROG_TYPE_SOCKET_FILTER) so you can filter packets on the socket level. This does require the program that creates the socket to attach the program to it.
  • Use a cGroup and BPF_PROG_TYPE_CGROUP_SKB program to block packets. Since you attach the program to the cGroup, this doesn't require cooperation from the program.
  • Use a TC program (BPF_PROG_TYPE_SCHED_CLS or BPF_PROG_TYPE_SCHED_ACT). At this level a packet is already parsed, but you still need to match it to a process.
  • Use an XDP program (BPF_PROG_TYPE_XDP). This requires you to parse all layers of the network packet (Ethernet, VLAN, IP, UDP/TCP), and then manually extract the protocol, destination IP and destination port. Just like in the TC program, you then need to match it to a PID using a lookup table.

When going the XDP or TC route you need to create this lookup table. As far as I know you can't access the table of the kernel via helper functions. A few approaches are:

  • parsing the output of netstat -lpn(protocol, destination ip, destination port and PID) and setting the data in a map to be used by a program
  • Getting the same data but directly from /sys or /proc(I don't know where the data is stored exactly)
  • Recording which PIDs have which sockets during creation(using a second program(kprobe/tracepoint)) and setting this data in a map shared by both the XDP/TC program and the trace program. (not quite sure how to share maps between programs in BCC, but it is certainly possible when using libbpf)
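
For the /proc approach, one place to look is /proc/net/tcp, which lists IPv4 TCP sockets with their local address, port and socket inode; the inode can then be matched against the socket:[inode] links under /proc/<pid>/fd to find the owning process. Below is a minimal sketch of parsing one line of that file. The struct and function names are hypothetical, the field layout is what current kernels print as far as I know, and the part that fills a BPF map is left out.

```c
#include <stdio.h>

/* One IPv4 TCP socket as listed in /proc/net/tcp (hypothetical struct). */
struct tcp_sock_entry {
    unsigned int  local_ip;    /* raw hex value exactly as printed by the kernel */
    unsigned int  local_port;  /* port number in host byte order */
    unsigned long inode;       /* socket inode, to match against /proc/<pid>/fd */
};

/* Parse one data line of /proc/net/tcp.
 * Returns 0 on success, -1 if the line does not look like a socket entry
 * (for example the header line). */
int parse_proc_net_tcp_line(const char *line, struct tcp_sock_entry *out)
{
    /* Fields: sl local rem st tx:rx tr:when retrnsmt uid timeout inode ... */
    int n = sscanf(line,
                   " %*d: %x:%x %*x:%*x %*x %*x:%*x %*x:%*x %*x %*u %*u %lu",
                   &out->local_ip, &out->local_port, &out->inode);

    return n == 3 ? 0 : -1;
}
```

A user-space loader would read the file line by line with fgets(), skip the lines this function rejects, resolve each inode to a PID, and write the (address, port) to PID pairs into the map shared with the XDP or TC program.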

I have a function call in one program and this function is deprecated. Is there any newer version that I can use in my code | perf_buffer__new in ebpf

1. You are explicitly using perf_buffer__new_deprecated in your code. Don't do this: use perf_buffer__new instead. You should never call a function that already has 'deprecated' in its name.

2. Take a look in the header:
libbpf/libbpf.h

perf_buffer__new is defined like this:

#define perf_buffer__new(...) ___libbpf_overload(___perf_buffer_new, __VA_ARGS__)

#define ___perf_buffer_new6(map_fd, page_cnt, sample_cb, lost_cb, ctx, opts) \
perf_buffer__new(map_fd, page_cnt, sample_cb, lost_cb, ctx, opts)

#define ___perf_buffer_new3(map_fd, page_cnt, opts) \
perf_buffer__new_deprecated(map_fd, page_cnt, opts)

So there are 2 functions:

  • Old: perf_buffer__new with 3 arguments
  • New: perf_buffer__new with 6 arguments.

With the macros, libbpf makes old code compile, too, while telling you to change your function call.
You are using the old version right now (with 3 arguments). Switch to the new version with 6 arguments, as the 3-arguments-variant will be removed.

The new function (see libbpf/libbpf.h):

/**
* @brief **perf_buffer__new()** creates BPF perfbuf manager for a specified
* BPF_PERF_EVENT_ARRAY map
* @param map_fd FD of BPF_PERF_EVENT_ARRAY BPF map that will be used by BPF
* code to send data over to user-space
* @param page_cnt number of memory pages allocated for each per-CPU buffer
* @param sample_cb function called on each received data record
* @param lost_cb function called when record loss has occurred
* @param ctx user-provided extra context passed into *sample_cb* and *lost_cb*
* @return a new instance of struct perf_buffer on success, NULL on error with
* *errno* containing an error code
*/
LIBBPF_API struct perf_buffer *
perf_buffer__new(int map_fd, size_t page_cnt,
perf_buffer_sample_fn sample_cb, perf_buffer_lost_fn lost_cb, void *ctx,
const struct perf_buffer_opts *opts);

You can find the definitions for sample_cb and lost_cb in the header as well:
From above, we know sample_cb has the type perf_buffer_sample_fn. For the other callback, it is similar.
Both are defined in libbpf.h:

typedef void (*perf_buffer_sample_fn)(void *ctx, int cpu,
void *data, __u32 size);
typedef void (*perf_buffer_lost_fn)(void *ctx, int cpu, __u64 cnt);

See libbpf/libbpf.h

So a valid callback function could be:
void myCallbackForNewData(void *ctx, int cpu, void *data, __u32 size) {}
Be aware that ctx has nothing to do with BPF: it is something you can freely define in perf_buffer__new. This is useful if you use the same handler for multiple perf_buffers. Otherwise, you can just pass NULL.
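
To illustrate what ctx can be used for, here is a small sketch of a pair of callbacks that keep per-buffer statistics. The struct event_stats and the names on_sample and on_lost are made up for this example. The code uses unsigned int and unsigned long long where libbpf writes __u32 and __u64 (that is what those typedefs expand to on common platforms), so that it compiles without the libbpf headers.

```c
#include <stddef.h>

/* Hypothetical user-defined context: one set of counters per perf buffer. */
struct event_stats {
    unsigned long long samples; /* number of records received */
    unsigned long long bytes;   /* total payload size of those records */
    unsigned long long lost;    /* records reported as lost */
};

/* Same shape as perf_buffer_sample_fn. */
void on_sample(void *ctx, int cpu, void *data, unsigned int size)
{
    struct event_stats *stats = ctx;

    (void)cpu;   /* unused in this sketch */
    (void)data;  /* a real handler would decode the record here */
    stats->samples++;
    stats->bytes += size;
}

/* Same shape as perf_buffer_lost_fn. */
void on_lost(void *ctx, int cpu, unsigned long long cnt)
{
    struct event_stats *stats = ctx;

    (void)cpu;
    stats->lost += cnt;
}

/* With libbpf, these would be passed to the 6-argument call, e.g.:
 *   struct event_stats stats = {0};
 *   pb = perf_buffer__new(map_fd, 8, on_sample, on_lost, &stats, NULL);
 * where map_fd and the page count 8 are placeholders. */
```

Because each perf buffer gets its own struct event_stats pointer as ctx, the same two handlers can serve several buffers without global state.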

ebpf: how to use BPF_FUNC_trace_printk in eBPF assembly program

Read the friendly manual :)

I don't believe you are calling the bpf_trace_printk() helper correctly (BPF_FUNC_trace_printk is just an integer, by the way). Its signature, documented in the kernel UAPI header bpf.h and in the bpf-helpers man page, is as follows:

long bpf_trace_printk(const char *fmt, u32 fmt_size, ...);

This means that the first argument must be a pointer to a constant, null-terminated format string, not an integer as in your code.

What does clang do?

I understand you are attaching your eBPF programs to sockets and cannot compile the whole program from C. However, why not compile that specific part as a generic networking eBPF program to see what the bytecode should look like? Let's write the C code:

#include <linux/bpf.h>

static long (*bpf_trace_printk)(const char *fmt, __u32 fmt_size, ...) = (void *) BPF_FUNC_trace_printk;

int printk_proto(struct __sk_buff *skb) {
    char fmt[] = "%d\n";

    bpf_trace_printk(fmt, sizeof(fmt), skb->protocol);

    return 0;
}

Compile to an object file. For the record, this would not load unless we provide both a valid licence string (because bpf_trace_printk() needs a GPL-compatible program) and a compatible program type at load time. But it does not matter in our case; we just want to look at the generated instructions.

$ clang -O2 -g -emit-llvm -c prink_protocol.c  -o - | \
llc -march=bpf -mcpu=probe -filetype=obj -o prink_protocol.o

Dump the bytecode:

$ llvm-objdump -d prink_protocol.o 

prink_protocol.o: file format elf64-bpf

Disassembly of section .text:

0000000000000000 <printk_proto>:
0: b4 02 00 00 25 64 0a 00 w2 = 680997
1: 63 2a fc ff 00 00 00 00 *(u32 *)(r10 - 4) = r2
2: 61 13 10 00 00 00 00 00 r3 = *(u32 *)(r1 + 16)
3: bf a1 00 00 00 00 00 00 r1 = r10
4: 07 01 00 00 fc ff ff ff r1 += -4
5: b4 02 00 00 04 00 00 00 w2 = 4
6: 85 00 00 00 06 00 00 00 call 6
7: b4 00 00 00 00 00 00 00 w0 = 0
8: 95 00 00 00 00 00 00 00 exit

We can see that in the first two instructions, the program writes the format string (in little endian) onto the stack: 680997 is 0x000a6425, i.e. \0\nd%. Before the call, r1 is set to point to that string on the stack (r10 - 4), r2 contains the length of the format string (4), and the protocol value is stored in r3, the third argument for the call to bpf_trace_printk().
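
If the immediate 680997 looks opaque, this small user-space sketch shows how a 32-bit little-endian store of that value lays out the bytes of "%d\n" plus the terminating NUL. The helper name store_le32 is mine.

```c
#include <stdint.h>

/* Write a 32-bit value into `out` least-significant byte first, which is
 * how a u32 store such as the one to r10 - 4 lands in memory on a
 * little-endian target. */
void store_le32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)(value & 0xff);          /* 0x25 '%'  */
    out[1] = (unsigned char)((value >> 8) & 0xff);   /* 0x64 'd'  */
    out[2] = (unsigned char)((value >> 16) & 0xff);  /* 0x0a '\n' */
    out[3] = (unsigned char)((value >> 24) & 0xff);  /* 0x00 NUL  */
}
```

Calling store_le32(680997, buf) leaves the four bytes '%', 'd', '\n', '\0' in buf, which is exactly the 4-byte array fmt[] from the C source and matches the length loaded into w2.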


