XDP program ip link error: Prog section rejected: Operation not permitted

eBPF: Operation not permitted

There are several possible causes for a permission error (-EPERM returned by bpf(), which you can observe with strace -e bpf <command>) when working with eBPF, but not that many. Usually, they fall under one of the following items:

User does not have the required capabilities (CAP_SYS_ADMIN, CAP_NET_ADMIN, ... typically depending on the types of the programs being used). This is usually solved by running as root, who has all necessary capabilities. In your case you run with sudo, so you are covered.
Creating the BPF object (new map, or loading a program) would exceed the limit for the amount of memory that can be locked in the kernel by the user. This is usually solved (for root) by using ulimit -l <something_big> in the terminal, or setrlimit() in a C program. This is very unlikely in your case: your program is very small, and you did not mention having a lot of BPF objects loaded on your system.
There are a few more possibilities, like trying to write to maps that are “frozen” or read-only, or trying to use function calls as a non-root user. These are usually for more advanced use cases and should not be hit with a program as simple as yours.
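
For the locked-memory case, the setrlimit() route mentioned above could look like the following minimal sketch. The helper name raise_memlock_limit() is made up for illustration; it asks for an unlimited RLIMIT_MEMLOCK and reports failure as a negative errno value.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: lift the locked-memory limit for this process.
 * Returns 0 on success, or a negative errno value on failure. */
int raise_memlock_limit(void)
{
    struct rlimit unlimited = {
        .rlim_cur = RLIM_INFINITY,
        .rlim_max = RLIM_INFINITY,
    };

    if (setrlimit(RLIMIT_MEMLOCK, &unlimited) != 0)
        return -errno; /* typically -EPERM when not privileged enough */
    return 0;
}
```

Call it once, before creating any map or loading any program. A return value of -EPERM means the process is not allowed to raise its hard limit, which usually points back to the first item of the list.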

Lockdown, Secure Boot, EFI and (unfortunate) backports for `bpf()` restrictions

But the problem that you seem to be hitting could be related to something else. “Lockdown” is a security module that was merged into the Linux 5.4 kernel. It aims at preventing users from modifying the running Linux image. It turns out that several distributions decided to backport Lockdown to their kernels, and sometimes they picked patches that predated the final version that was merged to mainline Linux.

Ubuntu and Fedora, for example, have a bunch of custom patches to backport that feature to the kernels used in Disco/19.04 and Eoan/19.10 (kernel 5.3 for the latter, I don't remember for Disco). It includes a patch that completely disables the bpf() system call when Lockdown is activated, meaning that creating maps or loading BPF programs is not possible. Also, they enabled Lockdown by default when Secure Boot is activated, which, I think, is the default for machines booting with EFI.

See also this blog post: a good way to check if Lockdown is affecting your BPF usage is to try and load minimal programs, or to run dmesg | grep Lockdown to see if it says something like:

Lockdown: systemd: BPF is restricted; see man kernel_lockdown.7

So for Ubuntu 19.04 and 19.10, for example, you have to disable Lockdown to work with eBPF. This may be done with a physical stroke of the SysRq key + x (I have not tested), but NOT by writing to /proc/sysrq-trigger (Ubuntu disabled it for this operation). Alternatively, you can disable Secure Boot (in the BIOS or with mokutil, search for the relevant options on the Internet, and do not forget to check the security implications).

Note that Linux kernel 5.4 and newer have the mainline restrictions for bpf(), which do not deactivate the system call, so Focal/20.04 and newer will not be affected. Upgrading to a new kernel might thus be another workaround. I filed a ticket a few days ago to ask for this change to be backported (instead of deactivating bpf()) and the work is in progress, so by the time new readers look at the answer, Lockdown's impact on eBPF might well be mitigated (Edit: Should be fixed on Ubuntu 19.10 with kernel 5.3.0-43). Not sure how other distros handle this. Lockdown will still have strong implications for tracing with eBPF, though.
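
On kernels that carry the mainline version of Lockdown, the active mode can also be read from securityfs, where the current mode is the token shown in square brackets. Below is a minimal sketch of such a check. The helper names are made up for illustration, and the path /sys/kernel/security/lockdown is an assumption that may not hold on kernels with the older backported patches.

```c
#include <stdio.h>
#include <string.h>

/* Extract the active mode, i.e. the token between '[' and ']', from a
 * line such as "none [integrity] confidentiality".
 * Returns 0 on success, -1 if no bracketed token fits in 'out'. */
int parse_lockdown_mode(const char *line, char *out, size_t out_len)
{
    const char *start = strchr(line, '[');
    const char *end = start ? strchr(start, ']') : NULL;
    size_t len;

    if (!start || !end)
        return -1;
    len = (size_t)(end - start - 1);
    if (len + 1 > out_len)
        return -1;
    memcpy(out, start + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the mode from securityfs. The path is an assumption that holds for
 * mainline Lockdown; distribution backports may not provide it. */
int read_lockdown_mode(char *out, size_t out_len)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/security/lockdown", "r");
    int rc = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        rc = parse_lockdown_mode(line, out, out_len);
    fclose(f);
    return rc;
}
```

If read_lockdown_mode() fails, that only means the file is missing or unreadable; the dmesg check described earlier remains the more reliable test on the affected Ubuntu releases.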

Need help in XDP program failing to load with error R7 offset is outside of the packet

TL;DR. You are hitting a corner-case limitation of the verifier. Changing the end of the for loop to the following may help.

#define MAX_PACKET_OFF 0xffff
...
nh->pos += size;
if (nh->pos > MAX_PACKET_OFF)
    return INV_RET_U32;
if (nh->pos >= data_end)
    return INV_RET_U32;

The full explanation is a bit long, see below.

Verifier error explanation

2945: (bf) r1 = r7
2946: (07) r1 += 4
2947: (2d) if r1 > r6 goto pc-2888
 R0=map_value(id=0,off=0,ks=4,vs=2,imm=0) R1=pkt(id=68,off=30,r=0,umin_value=20,umax_value=73851,var_off=(0x0; 0xffffffff)) R2=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R6=pkt_end(id=0,off=0,imm=0) R7=pkt(id=68,off=26,r=0,umin_value=20,umax_value=73851,var_off=(0x0; 0xffffffff)) R8=pkt(id=65,off=26,r=55,umin_value=20,umax_value=8316,var_off=(0x0; 0xffffffff)) R9=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R10=fp0 fp-8=mmmm???? fp-24=pkt fp-32=mmmmmmmm fp-40=inv fp-48=pkt
2948: (71) r1 = *(u8 *)(r7 +0)
invalid access to packet, off=26 size=1, R7(id=68,off=26,r=0)
R7 offset is outside of the packet

The verifier rejects the program because it thinks R7 is outside the packet's known bounds. It tells us you're trying to make an access of size 1 byte at offset 26 into the packet pointer, but the packet has a known size of 0 (r=0, for range=0).

Maximum packet size limitation

That's weird because you did check the packet bounds. On instruction 2947, the packet pointer R1 is compared to R6, the pointer to the end of the packet. So following that check, the known minimum size of R1 should be updated, but it remains 0 (r=0).

That is happening because you are hitting a corner-case limitation of the verifier:

if (dst_reg->umax_value > MAX_PACKET_OFF ||
    dst_reg->umax_value + dst_reg->off > MAX_PACKET_OFF)
    /* Risk of overflow.  For instance, ptr + (1<<63) may be less
     * than pkt_end, but that's because it's also less than pkt.
     */
    return;

As explained in the comment, this check is here to prevent overflows. Since R1's unsigned maximum value is 73851 (umax_value=73851), the condition is true and the packet's known size is not updated.
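
To make the arithmetic concrete, here is a simplified user-space model of that check. It is only a sketch: pkt_range_can_be_updated() is a made-up name, and the real verifier code operates on its register state structure rather than on two plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PACKET_OFF 0xffff

/* Simplified model of the quoted verifier check: returns true when the
 * verifier would go on to update the packet range after a comparison
 * against pkt_end, false when it bails out early. */
bool pkt_range_can_be_updated(uint64_t umax_value, int32_t off)
{
    if (umax_value > MAX_PACKET_OFF ||
        umax_value + (uint64_t)off > MAX_PACKET_OFF)
        return false; /* risk of overflow: the range is left unchanged */
    return true;
}
```

With the values from the log, pkt_range_can_be_updated(73851, 30) is false for R1, whereas a register bounded like R8 (umax_value=8316, off=26) would pass.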

A way to prevent this from happening might be to ensure there's an additional bounds check on R1. For example:

#define MAX_PACKET_OFF 0xffff
...
if (nh->pos + size > MAX_PACKET_OFF)
    return INV_RET_U32;

Why is R1's unsigned maximum value so high?

R1 comes from R7, which is initialized on those instructions:

2934: (79) r1 = *(u64 *)(r10 -32)
2935: (79) r2 = *(u64 *)(r10 -40)
2936: (0f) r2 += r1
; if (nh->pos + size < data_end)
2937: (57) r2 &= 65535
2938: (bf) r7 = r8
2939: (0f) r7 += r2
; if (nh->pos + size < data_end)
2940: (3d) if r7 >= r6 goto pc-2881
 R0=map_value(id=0,off=0,ks=4,vs=2,imm=0) R1_w=inv(id=0) R2_w=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R6=pkt_end(id=0,off=0,imm=0) R7_w=pkt(id=68,off=26,r=0,umin_value=20,umax_value=73851,var_off=(0x0; 0xffffffff)) R8=pkt(id=65,off=26,r=55,umin_value=20,umax_value=8316,var_off=(0x0; 0xffffffff)) R9=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R10=fp0 fp-8=mmmm???? fp-24=pkt fp-32=mmmmmmmm fp-40=inv fp-48=pkt

Two values are retrieved from the stack, at offsets -32 and -40. Added together, these two values hold the variable size. Since size is a __u16, the sum is ANDed with 65535 (the maximum __u16 value). So the verifier identifies R2 as having a maximum value of 65535.

R7 starts as a copy of R8, whose maximum value is 8316. When R2 is added to it, R7's maximum value becomes 8316 + 65535 = 73851, which is of course larger than MAX_PACKET_OFF = 65535.

Shouldn't the verifier understand that size <= 516?

The following code ensures size will never be larger than 516 (512 + 4 in the worst case):

__u16 size = parse_sctp_chunk_size(nh->pos, data_end);
if (size > 512)
  return INV_RET_U32;

//Adjust for padding
size += (size % 4) == 0 ? 0 : 4 - size % 4;
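
The 516 figure is a conservative bound: it adds the largest accepted size (512) and the largest possible padding (4) independently. As a quick user-space sanity check (a sketch with made-up helper names, not part of the XDP program), the padding expression can be evaluated for every accepted input:

```c
#include <stdint.h>

/* Same padding expression as in the snippet above: round up to a
 * multiple of 4. */
uint16_t pad_to_4(uint16_t size)
{
    size += (size % 4) == 0 ? 0 : 4 - size % 4;
    return size;
}

/* Largest padded value over all inputs accepted by the size <= 512 check. */
uint16_t max_padded_size(void)
{
    uint16_t max = 0;

    for (uint16_t s = 0; s <= 512; s++) {
        uint16_t p = pad_to_4(s);
        if (p > max)
            max = p;
    }
    return max;
}
```

The concrete values stay within the 516 bound either way. An analysis that bounds size and size % 4 separately cannot see that the two are correlated, which is presumably why a looser bound shows up in the verifier output.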

So why is the verifier losing track of that?

Part of variable size is saved on the stack, at offset -32, here:

2782: (69) r2 = *(u16 *)(r8 +2)
2783: (dc) r2 = be16 r2
2784: (7b) *(u64 *)(r10 -32) = r2
; if (size > 512)
2785: (25) if r2 > 0x200 goto pc-2726
 R0=map_value(id=0,off=0,ks=4,vs=2,imm=0) R1_w=inv(id=66,umax_value=255,var_off=(0x0; 0xff)) R2_w=inv(id=0,umax_value=512,var_off=(0x0; 0xffffffff)) R6=pkt_end(id=0,off=0,imm=0) R7=pkt(id=65,off=27,r=30,umin_value=20,umax_value=8316,var_off=(0x0; 0xffffffff)) R8=pkt(id=65,off=26,r=30,umin_value=20,umax_value=8316,var_off=(0x0; 0xffffffff)) R9=invP(id=0,umax_value=516,var_off=(0x0; 0xffff),s32_max_value=65535,u32_max_value=65535) R10=fp0 fp-8=mmmm???? fp-24=pkt fp-32_w=mmmmmmmm fp-40=inv fp-48=pkt

Unfortunately, the value is saved on the stack before the comparison with 512 happens. Therefore, the verifier doesn't know that the value saved on the stack is at most 512. We can see that because of the fp-32_w=mmmmmmmm. The m's stand for MISC; that is, the value could be anything from the verifier's point of view.

I believe this limitation of the verifier was removed in recent Linux versions.

Why does the issue only appear with 32 iterations?

I suspect that the variable size is only saved on the stack if the program becomes really large. As long as the variable is not saved on the stack, the verifier doesn't lose track of its maximum value 516.

XDP program not capturing all ingress packets

Here's the updated function, as per Andrew's comments. The main issue was with if (data_end >= (void *) (eth + sizeof(struct ethhdr))), which overshoots the packet: eth is a struct ethhdr *, so the addition advances by sizeof(struct ethhdr) elements rather than bytes. I should have been casting to char *. Using data by itself is not standard C, but it works in clang because arithmetic on a void * is done in bytes.

SEC("collect_ips")
int xdp_ip_stats_prog(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;
    struct iphdr *iph = (struct iphdr *)((char *)data + sizeof(struct ethhdr));

    // Without adding sizeof(struct iphdr), xdp-loader complains about "invalid access to packet"
    if (data_end >= (void *) ((char *) data + sizeof(struct ethhdr) + sizeof(struct iphdr))) {
        if (eth->h_proto == htons(ETH_P_IP)) {
            struct addr_desc_struct addr_desc = {.src_ip = iph->saddr};
            long init_val = 1;
            long *value = bpf_map_lookup_elem(&addr_map, &addr_desc);

            if (value) {
                __sync_fetch_and_add(value, 1);
            } else {
                bpf_map_update_elem(&addr_map, &addr_desc, &init_val, BPF_ANY);
            }
        }
    }

    return XDP_PASS;
}
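
The overshoot caused by the original comparison can be reproduced in user space. The sketch below uses a local 14-byte stand-in for struct ethhdr (an assumption made so that it builds without kernel headers) and compares how far each form of the addition moves the pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ethhdr (14 bytes); a local definition so the sketch
 * builds without kernel headers. */
struct demo_ethhdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t proto;
};

/* Large enough that every pointer computed below stays inside the array. */
static struct demo_ethhdr frames[16];

/* Bytes skipped by "eth + sizeof(struct ethhdr)": the count is scaled by
 * the size of the pointed-to type. */
ptrdiff_t typed_advance(void)
{
    struct demo_ethhdr *eth = frames;

    return (char *)(eth + sizeof(struct demo_ethhdr)) - (char *)eth;
}

/* Bytes skipped by "(char *)eth + sizeof(struct ethhdr)". */
ptrdiff_t byte_advance(void)
{
    struct demo_ethhdr *eth = frames;

    return ((char *)eth + sizeof(struct demo_ethhdr)) - (char *)eth;
}
```

Here typed_advance() yields 196 (14 elements of 14 bytes) while byte_advance() yields 14. With the typed form, the bounds check required 196 bytes after the start of the packet, so shorter frames never reached the counting code.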

BPF / XDP: 'bpftool batch file' returns 'Error: reading batch file failed: Operation not permitted'

TL;DR: Your map update works fine. The message is a bug in bpftool.

Bpftool updates the maps just as you would expect; then, after processing the whole batch file, it checks errno. If errno is 0, it assumes that everything went fine. If not, it prints strerror(errno) so you can see what went wrong when processing the file.

errno being set is not due to your map updates. I'm not entirely sure of what's happening to it. The bug was seemingly introduced with commit cf9bf714523d ("tools: bpftool: Allow unprivileged users to probe features"), where we manipulate process capabilities with libcap. Having a call to cap_get_proc() in feature.c is apparently enough for the executable to pick it up and to run some checks on capabilities that are supported, or not, on the system even if we're not doing any probing. I'm observing the following calls with strace:

# strace -e prctl ./bpftool batch file /tmp/batch
prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE) = 1
prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, CAP_CHECKPOINT_RESTORE) = 1
prctl(PR_CAPBSET_READ, 0x2c /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x2a /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x29 /* CAP_??? */) = -1 EINVAL (Invalid argument)
Error: reading batch file failed: Operation not permitted
+++ exited with 255 +++

This seems to be coming from cap_get_bound() in libcap, where the -1 returned is negated and passed to errno, thus becoming 1, Operation not permitted. I'm not sure what the capability numbers passed to prctl() correspond to.

I'm not sure what the cleanest way to fix this is. A simple workaround consists in resetting errno at the beginning of the main() function; we can submit that and see if the reviewers have a better idea. Let me know if you would like to send a patch yourself, otherwise I'll do it when I have a moment.
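
The failure mode and the workaround can be modelled in a few lines of user-space C. This is only an illustration with made-up function names: noisy_probe() stands in for the library code that leaves errno set, and run_commands() for a batch in which every command succeeds.

```c
#include <errno.h>
#include <unistd.h>

/* Stand-in for library start-up code that probes the system and leaves
 * errno set even though nothing the caller cares about has failed. */
static void noisy_probe(void)
{
    close(-1); /* fails with EBADF; the failure itself is harmless */
}

/* Stand-in for processing a batch file where every command succeeds. */
static int run_commands(void)
{
    return 0;
}

/* Buggy pattern: errno is inspected although no call reported a failure. */
int process_batch_buggy(void)
{
    noisy_probe();
    run_commands();
    return errno ? -1 : 0;
}

/* Workaround: clear errno before the work whose errors we want to see. */
int process_batch_fixed(void)
{
    noisy_probe();
    errno = 0;
    run_commands();
    return errno ? -1 : 0;
}
```

process_batch_buggy() reports an error although run_commands() succeeded, and process_batch_fixed() does not. More generally, errno is only meaningful right after a call has signalled failure through its return value, so checking return values is the more robust pattern.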

[EDIT August 2022] Fixed in libcap 2.63 and also in bpf-next for bpftool.

Unable to unload BPF program

An eBPF program is only unloaded when there are no more references to it (file descriptors, pins), but network links also hold their own references. So to unload the program, you first have to detach it from your network link.

You can do so by setting the program fd to -1:

err = netlink.LinkSetXdpFd(link, -1)
if err != nil {
    log.Fatalln("netlink.LinkSetXdpFd:", err)
}

bpf_trace_printk causes program not loaded in kernel -- libbpf: Program 'xdp' contains unrecognized relo data pointing to section 6

TL;DR

Do not call your helper like this:

    // BAD
    bpf_trace_printk("hello\n", sizeof("hello\n"));

or like this:

    // BAD
    const char *msg = "hello\n";
    bpf_trace_printk(msg, sizeof("hello\n"));

But instead, declare your string as a dynamic array of characters:

    // GOOD
    char msg[] = "hello\n";
    bpf_trace_printk(msg, sizeof(msg));

This will prevent clang from creating a relocation that libbpf cannot handle.

Explanations

Let's have a look at the object file, when passing the string directly:

#include <linux/bpf.h>
#include "bpf_helper_defs.h"

int foo(void)
{
    bpf_trace_printk("hello\n", sizeof("hello\n"));
    return 0;
}

When doing this, clang puts the string into a read-only section, and requests a relocation. We can observe this with llvm-objdump. Let's inspect the relocations and disassemble the program:

$ clang -O2 -emit-llvm -c foo.c -o - | llc -march=bpf -filetype=obj -o foo.o
$ llvm-objdump -r foo.o

foo.o:  file format elf64-bpf

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE                     VALUE
0000000000000000 R_BPF_64_64              .rodata.str1.1

RELOCATION RECORDS FOR [.eh_frame]:
OFFSET           TYPE                     VALUE
000000000000001c R_BPF_64_ABS64           .text
$ llvm-objdump --section=.text -D foo.o

foo.o:  file format elf64-bpf

Disassembly of section .text:

0000000000000000 <foo>:
       0:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
       2:       b7 02 00 00 07 00 00 00 r2 = 7
       3:       85 00 00 00 06 00 00 00 call 6
       4:       b7 00 00 00 00 00 00 00 r0 = 0
       5:       95 00 00 00 00 00 00 00 exit

We note that the .text section, containing the program, starts with a single load r1 = 0: the register r1, containing the first argument to pass to the call to bpf_trace_printk() (call 6), is not set until this relocation happens.

But libbpf does not support this kind of relocation, which is why you get your error message:

Program 'xdp' contains unrecognized relo data pointing to section 6

The same can be observed with:

#include <linux/bpf.h>
#include "bpf_helper_defs.h"

int foo(void)
{
    const char* msg = "hello\n";
    bpf_trace_printk(msg, sizeof("hello\n"));
    return 0;
}

This is equivalent: clang creates a relocation too.

However, we can instead declare the string as a dynamic array of characters:

#include <linux/bpf.h>
#include "bpf_helper_defs.h"

int foo(void)
{
    char msg[] = "hello\n";
    bpf_trace_printk(msg, sizeof(msg));
    return 0;
}

In that case, the array goes to the stack. No relocation happens. The .rodata.str1.1 section is not present in the file. We can check what llvm-objdump says:

$ clang -O2 -emit-llvm -c foo.c -o - | llc -march=bpf -filetype=obj -o foo.o
$ llvm-objdump -r foo.o                                         

foo.o:  file format elf64-bpf

RELOCATION RECORDS FOR [.eh_frame]:
OFFSET           TYPE                     VALUE
000000000000001c R_BPF_64_ABS64           .text
$ llvm-objdump --section=.text -D foo.o

foo.o:  file format elf64-bpf

Disassembly of section .text:

0000000000000000 <foo>:
       0:       b7 01 00 00 6f 0a 00 00 r1 = 2671
       1:       6b 1a fc ff 00 00 00 00 *(u16 *)(r10 - 4) = r1
       2:       b7 01 00 00 68 65 6c 6c r1 = 1819043176
       3:       63 1a f8 ff 00 00 00 00 *(u32 *)(r10 - 8) = r1
       4:       b7 01 00 00 00 00 00 00 r1 = 0
       5:       73 1a fe ff 00 00 00 00 *(u8 *)(r10 - 2) = r1
       6:       bf a1 00 00 00 00 00 00 r1 = r10
       7:       07 01 00 00 f8 ff ff ff r1 += -8
       8:       b7 02 00 00 07 00 00 00 r2 = 7
       9:       85 00 00 00 06 00 00 00 call 6
      10:       b7 00 00 00 00 00 00 00 r0 = 0
      11:       95 00 00 00 00 00 00 00 exit

Here, we fill the stack (r10 is the frame pointer, from which the stack is addressed) with the characters of the string (68 65 6c 6c 6f 0a 00 00 is hello\n\0\0). Everything is processed in BPF; there is no relocation involved. And this works just fine.
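
To double-check that reading of the disassembly, the three stores can be replayed in user space. This is a sketch with made-up helper names, assuming the little-endian BPF target that the byte dumps above suggest.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the 'width' low-order bytes of 'imm' into 'buf' at 'pos', least
 * significant byte first, as a little-endian store would lay them out.
 * Returns the new position. */
size_t store_le(char *buf, size_t pos, uint32_t imm, size_t width)
{
    for (size_t i = 0; i < width; i++)
        buf[pos + i] = (char)((imm >> (8 * i)) & 0xff);
    return pos + width;
}

/* Rebuild the bytes starting at r10-8 from the three stores in the
 * disassembly. */
void rebuild_stack_string(char out[8])
{
    size_t pos = 0;

    pos = store_le(out, pos, 1819043176u, 4); /* *(u32 *)(r10 - 8) */
    pos = store_le(out, pos, 2671u, 2);       /* *(u16 *)(r10 - 4) */
    pos = store_le(out, pos, 0u, 1);          /* *(u8  *)(r10 - 2) */
    out[pos] = '\0'; /* r10-1 is not written by the program; zeroed here */
}
```

After the call, the buffer holds the seven bytes of "hello\n" including the terminating NUL, which matches the length of 7 passed in r2.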

Could we do better? Yes, with Linux 5.2 and newer, we can avoid having the array on the stack by declaring the string as:

    static const char msg[] = "hello\n";

This time, this results in a relocation to section .rodata, but one that libbpf does handle, through the support of static variables. More details are available here.

Generally speaking, "BPF tips & tricks: the guide to bpf_trace_printk() and bpf_printk()" is an excellent reference on the bpf_trace_printk() helper.
