Bpftrace: One Tool to Rule Them All

August 3, 2019

Categories: Technical Tags: Linux

Worse is Better

Unix and C are the ultimate computer viruses.

You may not feel like reading after that. But I was skimming through the classic UNIX-HATERS Handbook, so that ship has sailed already.

Actually, this quote is originally from Richard Gabriel's classic The Rise of Worse is Better, in which he dubbed Unix and C's philosophy of prioritising simplicity and practicality of implementation "worse" than, say, the contemporary Lisp Machines' focus on perfection and the "right thing" in terms of interface, even when that is too complex and impractical to implement. But then he noted that worse-is-better might actually be better, as it has superior survival characteristics. So the use of the term "virus" was literal in the sense that, as long as the program works, it doesn't matter if it's merely good enough: it will spread by virtue of being first on the scene, ahead of a product that's inwardly seeking perfection before the big reveal.

This long tangent is no novel insight, but it is interesting to look back now and note how Linux took the server world by storm, leaving the slightly late BSDs in the dust even though in many ways the latter were more "sound". The overall gap may be too big for this to be a worthwhile comparison now, but many cool features Linux is getting these days are ones the BSDs did better. Containers, SELinux? FreeBSD and OpenBSD were always ahead of the curve there, and I'm not sure Linux will ever be as secure as OpenBSD. Then you have epoll, which is broken compared to kqueue. DTrace was an awe-inspiring feature of Solaris with no sane counterpart in Linux. Until now, that is: Linux might finally be pulling ahead with eBPF.

What is eBPF

Well, initially there was just BPF (Berkeley Packet Filter), which was actually a BSD thing. It is basically a VM for a small machine language right inside the kernel, and many implementations even use a JIT compiler. Post Spectre/Meltdown this might seem like a scary concept from a security point of view, but being able to run user-defined programs in kernel space makes a lot of sense for performance, since it avoids lots of expensive context switching.

However, its scope was initially confined to packet filtering (e.g. tcpdump). Even so, the Netfilter mechanism was always way more popular in Linux (with firewall frontends such as iptables and nftables). But lately, people have realised its potential and begun to expand it beyond the networking subsystem, e.g. to secure computing and dynamic tracing. The latter is very appealing because it means one can trace with very little performance penalty, possibly while code runs in production! It can do that because it unifies the new BPF system with already existing tracing capabilities such as kprobes, uprobes and tracepoints to great effect.

Front-ends: bcc, bpftrace

But raw eBPF is very low-level. Like assembly, it's not meant to be hand-written. So it wasn't practical for people to use it straight away; we had to wait for higher-level frontends to be ready. First in line was BCC (BPF Compiler Collection), with Python and Lua bindings. It's okay, and lots of amazing PoC tools have been written in it already, but it's still way too verbose and nowhere near as easy as the fabled DTrace.

But that wait is close to being over, thanks to bpftrace (a small AWK-inspired DSL, much like DTrace's). As for the advantage of this new approach: there are many existing userspace tools like strace, ltrace, extrace, blktrace etc., but these are very specialised in their scope, not to mention that the ptrace-based approach is really slow.

For example, let's talk about a tool called forkstat that has helped me out a lot over the years. What it does is really simple: it shows you by whom and when certain syscalls of interest (fork, exec, exit) are triggered, and it has helped me find lots of misbehaving or misconfigured programs. After a while, I felt like understanding how it works, so I looked into the source code, and it was blinding. I realised that when kernel developers aren't busy bit bashing, they are busy developing one-off, arcane and incomprehensible protocols.

Besides, there are a number of problems with it. The kernel doesn't really care to differentiate between a thread and a process. One shares its address space and resources where the other doesn't; what's the big deal? In fact both are spawned using the same syscall, clone; the difference is in the flags passed to it. These days, most programs are multi-threaded, so you get loads of irrelevant information when all you want to see is new process creation (with userspace PID and TID conflated, which is really not good for my OCD). Sure, you can turn off clone() with a flag, but the threads will show up all the same in the exit() log. How do I fix that? I can't, since I don't even understand the code.

Turns out, what I want to do can be trivially achieved. Bpftrace comes with a very short script (called exectrace) that looks like:

#!/usr/bin/env bpftrace

BEGIN {
	printf("%-5s %s\n", "PID", "ARGS");
}

tracepoint:syscalls:sys_enter_execve {
	printf("%-5d ", pid);
	join(args->argv);
}

As you can see, the DSL is very simple and AWK-inspired. First you specify the probe, then an optional predicate/filter, and finally an action block. You also have built-in variables (pid, tid, comm (process name), func, probe etc.) that take on appropriate values. And a script can contain any number of these probe blocks.
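To make that probe / predicate / action anatomy concrete, here is a minimal sketch of my own (not one of the bundled tools). It assumes your kernel exposes the sys_enter_openat syscall tracepoint and that the shell you care about runs with the comm "bash"; both are purely illustrative choices.

```bpftrace
#!/usr/bin/env bpftrace

// Probe: fire on every entry to the openat() syscall.
// Predicate: only continue when the calling process is named "bash"
// (an assumption; substitute your own shell or program name).
tracepoint:syscalls:sys_enter_openat
/comm == "bash"/
{
  // Action: print the PID and the path being opened.
  // args->filename is a user-space pointer, so str() copies it out.
  printf("%-6d %s\n", pid, str(args->filename));
}
```

Drop the predicate line and the same action runs for every process on the system, which is exactly how AWK behaves when you omit the pattern.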

But this program is not complete: exec may fail, and what about exit? This is my improvement:

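#!/usr/bin/env bpftrace

// Remember the command name at the moment exec is attempted.
tracepoint:syscalls:sys_enter_execve {
  @start[pid] = comm;
}

// Probe name is an assumption. comm only changes if the exec succeeded.
tracepoint:syscalls:sys_exit_execve
/@start[pid] != comm/
{
  time("%H:%M:%S ");
  printf("%d (exec) %s -> %s\n", pid, @start[pid], comm);
}

// Probe name is an assumption. Only report processes that called exec.
tracepoint:syscalls:sys_enter_exit_group
/@start[pid] != ""/
{
  time("%H:%M:%S ");
  printf("%d (exit) %s\n", pid, comm);
}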

See, you can have a terse map-like data structure to save key-value pairs, which is another AWK-ism. Let's see how it works: I hooked it up and pressed the hotkey that automatically takes a screenshot and uploads the image.

Attaching 3 probes...
16:13:36 1488 (exec) sxhkd -> bash
16:13:36 1488 (exec) bash -> tekaim
16:13:36 1489 (exec) tekaim -> sh
16:13:36 1490 (exec) sh -> maim
16:13:36 1490 (exit) maim
16:13:36 1489 (exit) sh
16:13:36 1491 (exec) tekaim -> sh
16:13:36 1492 (exec) sh -> curl
16:13:38 1492 (exit) curl
16:13:38 1494 (exec) sh -> rm
16:13:38 1494 (exit) rm
16:13:38 1491 (exit) sh
16:13:38 1495 (exec) tekaim -> sh
16:13:38 1497 (exec) sh -> xclip
16:13:38 1497 (exit) xclip
16:13:38 1499 (exec) sh -> notify-send
16:13:38 1499 (exit) notify-send
16:13:38 1495 (exit) sh
16:13:38 1488 (exit) tekaim

Yup, that looks exactly how it happened.
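The @start map above only stores one value per key, but maps can also aggregate. As a rough sketch (again my own, assuming the raw_syscalls:sys_enter tracepoint is available on your kernel), this counts syscalls per process name and dumps the table every five seconds; the map name @calls and the interval are arbitrary.

```bpftrace
#!/usr/bin/env bpftrace

// Count every syscall, keyed by the name of the calling process.
tracepoint:raw_syscalls:sys_enter
{
  @calls[comm] = count();
}

// Every 5 seconds, print the per-process counts and start over.
interval:s:5
{
  print(@calls);
  clear(@calls);
}
```

Any maps still populated when you hit Ctrl-C are printed automatically on exit, so the interval probe is optional if you only want a final tally.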

Anyway, the power of Bpftrace is that, with this same unified interface for many different probes, you now have a tool that can replace many other specialised ones. For example, gethostlatency.bt looks like:

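#!/usr/bin/env bpftrace
BEGIN {
	printf("Tracing getaddr/gethost calls... Hit Ctrl-C to end.\n");
	printf("%-9s %-6s %-16s %6s %s\n", "TIME", "PID", "COMM", "LATms", "HOST");
}

// The libc path below is an assumption and varies by distribution.
uprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname {
	@start[tid] = nsecs;
	@name[tid] = arg0;
}

uretprobe:/lib/x86_64-linux-gnu/libc.so.6:getaddrinfo,
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:gethostbyname {
	$latms = (nsecs - @start[tid]) / 1000000;
	time("%H:%M:%S  ");
	printf("%-6d %-16s %6d %s\n", pid, comm, $latms, str(@name[tid]));
}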

And this essentially implements dig! (well, a teensy bit of it). It's quite fun to play with the various other programs it already comes with. I realised that my cron wakes up far more often than it needs to. The browser is always polling for fonts and fontconfig rule changes. And when you run a command in your shell, it doesn't exactly go through the entries in $PATH looking for a match; rather, it appends the command name to every directory in $PATH and then stats the result to see if it's valid!
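That last observation is easy to check for yourself. The sketch below is mine and rests on a couple of assumptions: that your shell is bash, and that your libc implements stat() through the newfstatat syscall (older x86_64 setups may go through sys_enter_newstat instead, in which case swap the probe name).

```bpftrace
#!/usr/bin/env bpftrace

// Assumption: bash resolves commands with stat(), which reaches the
// kernel as newfstatat() on this system.
tracepoint:syscalls:sys_enter_newfstatat
/comm == "bash"/
{
  // Print each path the shell asks the kernel about.
  printf("%-6d %s\n", pid, str(args->filename));
}
```

With this running, type a command bash has not seen before in another terminal and watch which candidate paths it probes. Other shells may search differently, so treat whatever you see as specific to your setup.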