Security/Sandbox/Seccomp

What is Seccomp

Intro to seccomp and seccomp-bpf

Seccomp stands for secure computing mode. It's a simple sandboxing tool in the Linux kernel, available since Linux version 2.6.12. When seccomp is enabled, the process enters a "secure mode" in which only a very small number of system calls are available (exit(), read(), write(), sigreturn()). Writing code to work in this environment is difficult; for example, dynamic memory allocation (using brk() or mmap(), either directly or to implement malloc()) is not possible.
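
To make the restriction concrete, here is a minimal sketch of strict mode in C. The helper name strict_mode_child_status is hypothetical, and the sketch assumes a kernel that still offers SECCOMP_MODE_STRICT through prctl(). It forks a child, puts it in strict mode, and returns the child's wait status so the parent can see what happened.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fork a child, put it in strict mode, and return the
 * child's wait status (or -1 on error). If call_forbidden is nonzero, the
 * child makes a syscall that strict mode does not permit. */
int strict_mode_child_status(int call_forbidden)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
            _exit(2);                /* strict mode is not available here */
        if (call_forbidden)
            syscall(SYS_getpid);     /* not permitted: the kernel sends SIGKILL */
        /* Use the raw exit syscall: glibc's _exit() calls exit_group(),
         * which strict mode does not permit either. */
        syscall(SYS_exit, 0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

With call_forbidden set to 0 the child should exit normally with status 0; with it set to 1 the parent should see the child terminated by SIGKILL.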

Seccomp-bpf is a more recent extension to seccomp that allows filtering system calls with BPF (Berkeley Packet Filter) programs. These filters can allow or deny an arbitrary set of system calls and can also filter on system call arguments (numeric values only; pointer arguments can't be dereferenced). Additionally, instead of simply terminating the process, the filter can raise a signal, which allows the signal handler to simulate the effect of a disallowed system call (or simply gather more information on the failure for debugging purposes). Seccomp-bpf has been available since Linux version 3.5 and usable on the ARM architecture since Linux version 3.10. Several backports are available for earlier kernel versions.

We have backports for 3.0.x kernels, 3.4 kernels, and 2.6.29 kernels (see bug 790923 and its children). No backport is necessary for kernels 3.10 and above. The kernel must be built with these configuration options:

 CONFIG_SECCOMP=y
 CONFIG_SECCOMP_FILTER=y
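
The options above are fixed when the kernel is built, but a program can also probe for seccomp-bpf support at run time. The sketch below relies on the error codes documented in the prctl(2) manual page; the function name kernel_has_seccomp_bpf is hypothetical. Passing a NULL filter pointer means no filter is ever installed.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>

/* Returns 1 if the running kernel appears to support seccomp-bpf, 0 otherwise.
 * The NULL program pointer guarantees that the call fails, so the calling
 * process is never actually sandboxed by this probe. */
int kernel_has_seccomp_bpf(void)
{
    errno = 0;
    int rv = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, NULL, 0, 0);
    /* EFAULT: filter mode exists and the kernel rejected the NULL pointer.
     * EINVAL: the kernel was built without CONFIG_SECCOMP_FILTER. */
    return rv == -1 && errno == EFAULT;
}
```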

How do I call seccomp-bpf?

Seccomp-bpf is turned on through the prctl() system call (process control), like this:

 #include <sys/prctl.h>
 #include <linux/seccomp.h>
 [...]
 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_prog);

bpf_prog is a struct sock_fprog describing the BPF program that holds the rules used by seccomp-bpf, i.e., which system calls are allowed. Filters are cumulative: an installed filter can't be removed, and every filter installed later is evaluated in addition to it, so the policy can only be tightened, never loosened. For example, you could first remove access to one system call, then remove access to more system calls later in the process lifetime. An unprivileged process must also make an additional call, "no new privileges", before installing a filter. It guarantees that execve() can never grant the process or its children more privileges (for instance through setuid binaries); without it, the PR_SET_SECCOMP call fails with EACCES. Here's the same code, with the no new privileges call:

 #include <sys/prctl.h>
 #include <linux/seccomp.h>
 [...]
 prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_prog);
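
Both calls can fail, so real code should check their return values. The following is a minimal sketch, not Gecko's implementation: install_filter and child_mode_after_install are hypothetical names, and the filter used is a trivial allow-everything program chosen only so the example is safe to run. The filter is installed in a forked child, which then reports its seccomp mode through its exit status.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical helper: make both prctl() calls and report failures. */
int install_filter(struct sock_fprog *prog)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog, 0, 0) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        return -1;
    }
    return 0;
}

/* Installs an allow-everything filter in a child process and returns the
 * seccomp mode the child then reports (2 means filter mode), or -1. */
int child_mode_after_install(void)
{
    struct sock_filter allow_all[] = {
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(allow_all) / sizeof(allow_all[0]),
        .filter = allow_all,
    };
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_filter(&prog) != 0)
            _exit(255);
        /* PR_GET_SECCOMP returns the current mode of the calling thread. */
        _exit(prctl(PR_GET_SECCOMP, 0, 0, 0, 0));
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```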

Construct a basic filter

The filter program can be constructed using the BPF filter macros (BPF_STMT and BPF_JUMP) defined in Linux's <linux/filter.h>. Here are some commonly used helper macros for seccomp-bpf, built on top of them:

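 #include <stddef.h>          /* offsetof */
 #include <sys/syscall.h>     /* __NR_* syscall numbers */
 #include <linux/audit.h>     /* AUDIT_ARCH_* values */
 #include <linux/filter.h>    /* BPF_STMT, BPF_JUMP */
 #include <linux/seccomp.h>   /* struct seccomp_data, SECCOMP_RET_* */
 [...]
 /* Offsets of the fields the filter reads from struct seccomp_data. */
 #define syscall_nr (offsetof(struct seccomp_data, nr))
 #define arch_nr (offsetof(struct seccomp_data, arch))
 
 /* ARCH_NR must be defined as the AUDIT_ARCH_* value of the build target
    (e.g. AUDIT_ARCH_X86_64). Any other architecture kills the process. */
 #define VALIDATE_ARCHITECTURE \
     BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr), \
     BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0), \
     BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
 
 /* Load the syscall number into the BPF accumulator. */
 #define EXAMINE_SYSCALL \
     BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr)
 
 /* Allow the call if the accumulator matches; otherwise skip to the next rule. */
 #define ALLOW_SYSCALL(name) \
     BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
     BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
 
 /* Default action, placed at the end of the filter. */
 #define KILL_PROCESS \
     BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)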

A filter built from these macros first validates that the syscall was made for the expected architecture (rather than through an OS-emulation feature or 32/64-bit translation, where the syscall numbers could have different meanings), then checks the syscall number against each entry on a whitelist and allows the call if it matches, and finally kills the process if no whitelist entry matched.
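
As an illustration, the macros can be assembled into a complete filter as in the sketch below. The whitelist (write, exit, exit_group), the helper name filtered_child_status, and the list of handled architectures are assumptions made for this example, not Gecko's policy. The filter is installed in a forked child so that the calling process stays unsandboxed.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumption: only these four architectures are handled in this sketch. */
#if defined(__x86_64__)
#define ARCH_NR AUDIT_ARCH_X86_64
#elif defined(__i386__)
#define ARCH_NR AUDIT_ARCH_I386
#elif defined(__aarch64__)
#define ARCH_NR AUDIT_ARCH_AARCH64
#elif defined(__arm__)
#define ARCH_NR AUDIT_ARCH_ARM
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

#define syscall_nr (offsetof(struct seccomp_data, nr))
#define arch_nr (offsetof(struct seccomp_data, arch))
#define VALIDATE_ARCHITECTURE \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr), \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
#define EXAMINE_SYSCALL \
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr)
#define ALLOW_SYSCALL(name) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
#define KILL_PROCESS \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)

/* Architecture check, then the whitelist, then the default action. */
static struct sock_filter filter[] = {
    VALIDATE_ARCHITECTURE,
    EXAMINE_SYSCALL,
    ALLOW_SYSCALL(write),
    ALLOW_SYSCALL(exit),
    ALLOW_SYSCALL(exit_group),
    KILL_PROCESS,
};

/* Hypothetical helper: run a child under the filter and return its wait
 * status (or -1 on error). */
int filtered_child_status(int call_forbidden)
{
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0, 0) != 0)
            _exit(2);
        static const char msg[] = "inside the sandbox\n";
        ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1); /* whitelisted */
        (void)n;
        if (call_forbidden)
            syscall(SYS_getpid);     /* not whitelisted: killed with SIGSYS */
        _exit(0);                    /* exit_group is whitelisted */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

A child that stays within the whitelist should exit with status 0, while one that calls getpid() should be reported by waitpid() as terminated by SIGSYS.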

Advanced use cases

The BPF language doesn't allow loops (branches can only skip instructions, not jump back), but more complex computations than a simple whitelist check are possible. For example, the Chromium codebase contains modules that translate a more abstract representation of filter predicates (ErrorCode) into a basic block graph and then into a linear sequence of BPF instructions (CodeGen); this includes constructing a binary search tree to dispatch on the system call number in O(log n) time.

It is also possible to implement a "warn-only" mode by having the filter program check the instruction pointer (program counter) and allow all syscalls where the machine instruction that performed the call is at a specific address. This way, if a syscall would have been rejected, it instead raises SIGSYS and the signal handler logs the syscall before re-issuing it by jumping to the always-allowed syscall gate. This offers basically no security, as an attacker who compromises the process could issue arbitrary syscalls by jumping to that location, but it can simplify testing and developing a sandbox policy.

Use in Gecko

The code is in mozilla-central at security/sandbox/linux. Files of interest:

  • Sandbox.h: the public interface; used when a child process is ready to enter sandboxed mode
  • SandboxFilter.cpp: the sandbox policy definitions, and trap handlers for intercepted syscalls
  • Sandbox.cpp: the code that starts the sandbox and handles violations

The policy is compiled into a seccomp-bpf program using the Chromium code imported in security/sandbox/chromium/sandbox/linux. Files of interest in that subtree:

  • bpf_dsl/bpf_dsl.h: defines the interface used to specify the policy
  • bpf_dsl/policy_compiler.cc: converts the intermediate form into BPF instructions
  • services/arm_linux_syscalls.h and other *_linux_syscalls.h: syscall number definitions; grep these to translate syscall numbers seen in error messages (use the file corresponding to the architecture in question)

Seccomp reporter

When seccomp denies a system call, it sends a signal (SIGSYS) which is handled by the reporter. The reporter logs information about the syscall, invokes the crash reporter, tries to log the current JS stack if any, and finally terminates the process.

The log message looks like this:

 seccomp sandbox violation: pid %u, syscall %lu, args %lu %lu %lu %lu %lu. Killing Process.

Note that the SIGSYS handler is also used for syscalls that we want to intercept and “polyfill” with some other action; in that case it modifies the signal context and returns, instead of crashing.
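
The mechanism behind the reporter can be sketched in a few lines of C. This is not the Gecko reporter: it is a simplified stand-in in which the filter traps a single syscall (getppid, chosen arbitrarily) with SECCOMP_RET_TRAP, and the handler "reports" the offending syscall number through the child's exit status instead of logging it. The names report_and_exit and trapped_syscall_seen_by_child are hypothetical, and the architecture check is omitted for brevity.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* SIGSYS handler: siginfo carries the number of the trapped syscall.
 * A real reporter would log it; this sketch exits with it as the status. */
static void report_and_exit(int sig, siginfo_t *info, void *context)
{
    (void)sig;
    (void)context;
    _exit(info->si_syscall & 0xff);
}

/* Returns the syscall number the child's handler saw, or -1 on error. */
int trapped_syscall_seen_by_child(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),   /* raise SIGSYS */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),  /* everything else */
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = report_and_exit;
        sa.sa_flags = SA_SIGINFO;
        if (sigaction(SIGSYS, &sa, NULL) != 0 ||
            prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0, 0) != 0)
            _exit(255);
        syscall(SYS_getppid);   /* trapped: the handler runs and exits */
        _exit(254);             /* only reached if no SIGSYS was delivered */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A handler that polyfills a call would instead write a return value into the saved register state in the context argument and return normally; that part is architecture-specific and is left out of this sketch.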

Crash reports

In a sample crash report for a sandbox violation, the crash reason is given as 'SIGSYS', as described above. To figure out which system call caused the crash, look at the 'Crash Address' line, which in that sample report reads 0x57.

Given the architecture named in the 'Build Architecture' line, look up that number in the corresponding system call table. In the Linux kernel source tree, the table for x86_64 is in a file called syscall_64.tbl and the table for x86 is in syscall_32.tbl.

0x57 in decimal is 87, and since it is x86_64, syscall 87 is sys_unlink.
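
The lookup can be sketched in C as follows. The table is only a small hand-copied excerpt of the x86_64 syscall numbers, included for illustration; the function names are hypothetical, and a real tool would consult the full table for the architecture in question.

```c
#include <stdlib.h>
#include <string.h>

struct syscall_entry {
    unsigned long nr;
    const char *name;
};

/* Excerpt of the x86_64 syscall table (illustrative, not complete). */
static const struct syscall_entry x86_64_excerpt[] = {
    { 0, "read" }, { 1, "write" }, { 2, "open" }, { 3, "close" },
    { 59, "execve" }, { 87, "unlink" },
};

/* Converts a 'Crash Address' string such as "0x57" to a syscall number.
 * strtoul() with base 16 accepts the optional "0x" prefix. */
unsigned long crash_address_to_syscall(const char *crash_address)
{
    return strtoul(crash_address, NULL, 16);
}

/* Returns the syscall name, or NULL if the number is not in the excerpt. */
const char *syscall_name_x86_64(unsigned long nr)
{
    size_t count = sizeof(x86_64_excerpt) / sizeof(x86_64_excerpt[0]);
    for (size_t i = 0; i < count; i++) {
        if (x86_64_excerpt[i].nr == nr)
            return x86_64_excerpt[i].name;
    }
    return NULL;
}
```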

How do I check my processes are sandboxed by seccomp?

There is a seccomp flag in the process status. Use this command, replacing <pid> with the process's PID:

 grep Seccomp /proc/<pid>/status
  • 0: Seccomp is not enabled (bad!)
  • 1: Seccomp "strict mode" is enabled (shouldn't happen)
  • 2: Seccomp-bpf is enabled (correct)
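
The same check can be done programmatically. The sketch below (function names are hypothetical) parses the Seccomp line from the text of a status file; it matches the exact "Seccomp:" key, so the separate "Seccomp_filters:" line that newer kernels print is ignored.

```c
#include <stdio.h>
#include <string.h>

/* Scans status text line by line for "Seccomp:" and returns its value
 * (0, 1 or 2), or -1 if the line is absent. */
int parse_seccomp_mode(const char *status_text)
{
    const char *line = status_text;
    while (line != NULL && *line != '\0') {
        int mode;
        /* The space in the format matches the tab used in /proc. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            return mode;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;                 /* move past the newline */
    }
    return -1;
}

/* Reads a status file such as /proc/<pid>/status or /proc/self/status. */
int read_seccomp_mode(const char *status_path)
{
    char buf[8192];
    FILE *f = fopen(status_path, "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

For example, read_seccomp_mode("/proc/self/status") reports the mode of the calling process itself.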

Alternatively, on recent (1.4+) b2g versions:

  • b2g-ps, look at the SEC field (same meanings as above, 2 means sandboxed)

On B2G, you can find your PIDs with the b2g-ps command.

How do I disable the sandbox temporarily?

For content process (e10s, B2G) sandboxing:

 export MOZ_DISABLE_CONTENT_SANDBOX=1

and restart Firefox. In B2G's case, this looks like:

 adb shell
 stop b2g
 export MOZ_DISABLE_CONTENT_SANDBOX=1
 /system/bin/b2g.sh

For Gecko Media Plugin sandboxing on desktop (OpenH264, EME, etc.):

 export MOZ_DISABLE_GMP_SANDBOX=1

Also, to simulate the effect of having no sandbox support in the kernel:

 export MOZ_FAKE_NO_SANDBOX=1

In particular, this will disable media plugin support on desktop and cause B2G based on Android KitKat or later to refuse to start at all, unless the corresponding DISABLE option is set. That is the intended behavior when sandboxing isn't possible in those cases, so this setting is probably only useful for testing that kind of mandatory-sandboxing feature.

More information