Isolation and Sandboxing

Goal: run untrusted code without compromising systems

  • Programs from untrusted Internet sites
    • Mobile apps, JS, browser extensions
  • Exposed applications: Browser, PDF viewer, email client
  • Legacy daemons: sendmail, bind
  • Honeypots
  • If application misbehaves, want to kill it

Approach: confinement #

  • Idea: ensure misbehaving app cannot harm rest of system
  • Can be implemented at many levels
    • Hardware: run application on isolated hardware (airgap) - difficult to manage
    • Virtual machines: isolate OSes on a single machine
    • Process level: system call interposition; isolate a process in a single OS
    • Threads: software fault isolation (SFI)
      • Isolating threads sharing same address space
    • Application level confinement
      • e.g. browser sandbox for JavaScript and WebAssembly

Implementation #

Key component: reference monitor

  • Mediates requests from applications
    • Enforces confinement
    • Implements a specified protection policy
  • Must always be invoked; every application request must be mediated
  • Tamperproof: reference monitor must not be killable; if it does die, the monitored process must be killed too
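
The mediation idea above can be sketched as a single check function that every request passes through. This is only an illustration: the `op` and `verdict` enums, the `policy` table, and `monitor_check` are hypothetical names, and a real monitor sits in the kernel or a tracing process rather than in the application itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum op { OP_READ, OP_WRITE, OP_CONNECT };
enum verdict { DENY = 0, ALLOW = 1 };

struct rule {
    enum op op;           /* operation the rule applies to */
    const char *prefix;   /* resource-name prefix the rule matches */
    enum verdict verdict; /* decision when the rule matches */
};

/* Hypothetical protection policy: the first matching rule wins. */
static const struct rule policy[] = {
    { OP_READ,  "/etc/passwd", DENY  },
    { OP_READ,  "/tmp/",       ALLOW },
    { OP_WRITE, "/tmp/",       ALLOW },
};

/* Every application request is mediated here; unmatched requests are denied. */
enum verdict monitor_check(enum op op, const char *resource) {
    assert(resource != NULL);
    for (size_t i = 0; i < sizeof policy / sizeof policy[0]; i++) {
        const struct rule *r = &policy[i];
        if (r->op == op &&
            strncmp(resource, r->prefix, strlen(r->prefix)) == 0)
            return r->verdict;
    }
    return DENY; /* default deny */
}
```

Defaulting to `DENY` means a request the policy author forgot about is blocked rather than silently allowed.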

Example: chroot jail #

To use (must be root):

chroot /tmp/guest
su guest
  • Root dir / is now /tmp/guest
  • EUID set to guest
  • Now /tmp/guest is prepended to every file system path the application accesses
    fopen('/etc/passwd', 'r'); // becomes the below:
    fopen('/tmp/guest/etc/passwd', 'r');
  • chroot should only be executable by root; otherwise a jailed app can become root:
    • Create dummy file /aaa/etc/passwd with a root password the attacker knows
    • Run chroot /aaa
    • Run su root to become root
    • Many ways to escape jail as root:
      • Create device that lets you access raw disk
      • Send signals to non-chrooted process
      • Reboot system
      • Bind to privileged ports

FreeBSD jail #

  • Stronger mechanism than simple chroot
  • To run: jail [jail-path] [hostname] [IP-addr] [cmd]
    • Calls hardened chroot (no ../../ escape)
    • Can only bind to sockets with specified IP addresses and authorized ports
    • Can only communicate with processes inside jail
    • Root is limited, e.g. cannot load kernel modules

Problems with chroot jails #

  • Coarse policies: all or nothing access to parts of a file system
    • Inappropriate for apps like web browsers, which need access to files outside jail (e.g. for sending attachments over email)
  • Does not prevent malicious apps from
    • Accessing network and messing with other machines
    • Trying to crash host OS

System call interposition #

  • Observation: to damage host system (e.g. persistent changes), app must make system calls
    • To delete/overwrite files: unlink, open, write
    • To do network attacks: socket, bind, connect, send
  • Idea: monitor app’s system calls and block unauthorized calls
  • Implementation options
    • Completely kernel-space (e.g., Linux seccomp)
    • Completely user-space (e.g., program shepherding)
    • Hybrid (e.g. systrace)

Early implementation: Janus (1996) #

  • Linux ptrace: process tracing
    • Process calls ptrace(..., pid_t pid, ...) and wakes up when pid makes system call
  • If monitored process calls fopen, monitor decides whether the call is allowed; if not, monitor kills the application
  • Example policy (e.g., for PDF reader)
    path allow /tmp/*
    path deny /etc/passwd
    network deny all
  • Manually specifying policy for an app can be difficult
    • Recommended default policies are available, can be made more restrictive as needed
  • Complications
    • If app forks, monitor must also fork; forked monitor monitors forked app
    • If monitor crashes, app must be killed
    • Monitor must maintain all OS state associated with the app
      • Current working directory, UID, EUID, GID
      • When app does cd path, monitor must update its record of the app's CWD
    • Problems with ptrace:
      • Traces all system calls, which is inefficient: e.g. no need to trace the close system call
      • Monitor cannot abort a syscall without killing the app
    • Race conditions (time of check/time of use - TOCTOU bug) since checking/opening is not atomic
      1. Proc 1: open("me")
      2. Monitor checks and authorizes
      3. Proc 2: symlink me to /etc/passwd
      4. OS executes open("me")
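
The four-step race above can be replayed deterministically in a single process. This is a sketch under assumptions: `monitor_authorizes` stands in for the monitor's check (here, "the path is a regular file, not a symlink"), a scratch file named `secret` stands in for /etc/passwd, and the helper names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a small text file; returns 0 on success. */
static int write_file(const char *path, const char *text) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fputs(text, f);
    return fclose(f);
}

/* Stand-in for the monitor's check: allow only a regular, non-symlink file. */
static int monitor_authorizes(const char *path) {
    struct stat st;
    return lstat(path, &st) == 0 && S_ISREG(st.st_mode);
}

/* Returns 1 if the approved open() ended up reading the protected file,
 * 0 if it did not, -1 on setup failure. */
int toctou_demo(void) {
    char dir[] = "/tmp/toctou-XXXXXX";
    char me[64], secret[64], buf[32] = {0};
    assert(sizeof dir + sizeof "/secret" <= sizeof me);
    if (!mkdtemp(dir)) return -1;
    snprintf(me, sizeof me, "%s/me", dir);
    snprintf(secret, sizeof secret, "%s/secret", dir);
    if (write_file(me, "harmless") || write_file(secret, "SECRET")) return -1;

    int ok = monitor_authorizes(me);                  /* steps 1-2: check */
    if (unlink(me) || symlink(secret, me)) return -1; /* step 3: swap in symlink */
    int fd = open(me, O_RDONLY);                      /* step 4: use */
    if (fd < 0) return -1;
    ssize_t n = read(fd, buf, sizeof buf - 1);
    close(fd);

    unlink(me); unlink(secret); rmdir(dir);           /* clean up scratch files */
    return ok && n > 0 && strcmp(buf, "SECRET") == 0;
}
```

The check and the open name the same path but see different objects, which is why a monitor that validates path strings in user space cannot close this gap on its own.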

SCI in Linux: seccomp-bpf #

  • seccomp-bpf: Linux kernel facility used to filter process syscalls
    • Syscall filters written in BPF language (use BPFC compiler)
    • Used in Chromium, Docker containers, etc.
  • How this works:
    1. Chrome renderer process starts
      prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy)
    2. Renderer process renders site
    3. If due to exploit the process tries to do
      fopen("/etc/passwd", "r");
      then seccomp-bpf will kill process

BPF filters (policy programs): #

  • Process can install multiple BPF filters
  • Once installed, filter cannot be removed - all run on every syscall
  • If process forks, child inherits all filters
  • If program calls execve, all filters are preserved
  • BPF filter input: syscall number, syscall args., system architecture
  • Filter returns one of
    • SECCOMP_RET_KILL: kill process
    • SECCOMP_RET_ERRNO: return specified error to caller
    • SECCOMP_RET_ALLOW: allow syscall
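
A minimal filter using two of the return values above might look like the following sketch. Assumptions are flagged in the comments: the blocked syscall (getppid) is an arbitrary choice for illustration, `seccomp_demo` is a hypothetical name, and the filter skips the architecture check that a real policy should perform first. The filter is installed in a forked child so the calling process stays unfiltered.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the filtered syscall failed with EPERM, 0 if it went through,
 * -1 if the filter could not be installed or the child died abnormally. */
int seccomp_demo(void) {
    /* Simplification: no check of seccomp_data.arch. */
    struct sock_filter insns[] = {
        /* A = syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if A == getppid, run the next instruction; otherwise skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof insns / sizeof insns[0]),
        .filter = insns,
    };

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: drop the ability to gain privileges, then install filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) _exit(2);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit(r == -1 && errno == EPERM ? 1 : 0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    int code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

Replacing the `SECCOMP_RET_ERRNO` return with `SECCOMP_RET_KILL` would terminate the child at the getppid call instead of handing it an error.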

Installing a BPF filter #

  • prctl(PR_SET_NO_NEW_PRIVS, 1) must be called before setting BPF filter
  • Ensures setuid, setgid bits are ignored on subsequent execve
    • Attacker cannot elevate privilege
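// needs <assert.h>, <stdbool.h>, <stdio.h>, <sys/prctl.h>, <linux/filter.h>, <linux/seccomp.h>
// bpf_policy: a struct sock_fprog holding the filter, defined elsewhere
int main (int argc, char **argv) {
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    // policy: kill the process if it calls open() for writing
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy);
    fopen("file.txt", "w");
    assert(false); // should never reach here
}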

Docker: isolating containers using seccomp-bpf #

Container: process-level isolation

  • Container prevented from making syscalls filtered by seccomp-bpf
  • Whoever starts container can specify BPF policy
    • default policy blocks many syscalls, including ptrace
  • e.g. Nginx: docker run --security-opt="seccomp=filter.json" nginx
    {
        "defaultAction": "SCMP_ACT_ERRNO", // deny by default
        "syscalls": [
            {
                "names": ["accept"], // syscall name
                "action": "SCMP_ACT_ALLOW", // allow (whitelist)
                "args": [] // what args to allow
            },
            // ...
        ]
    }
  • More Docker confinement flags:
    • Specify an unprivileged user:
      • docker run --user www nginx
    • Limit Linux capabilities: drop all capabilities, allow bind to privileged ports:
      • docker run --cap-drop all --cap-add NET_BIND_SERVICE nginx
    • Prevent process from becoming privileged (e.g. by a setuid binary)
      • docker run --security-opt=no-new-privileges:true nginx
    • Limit number of restarts and resources
      • docker run --restart=on-failure:$MAX_RETRIES --ulimit nofile=$MAX_FD --ulimit nproc=$MAX_NPROC nginx

Confinement via virtual machines #

About VMs: see my notes from CS 111

Hypervisor security assumption #

  • Malware can infect guest OS and guest apps
  • But malware cannot escape from the infected VM
    • Cannot infect host OS or other VMs on the same hardware
  • Requires that hypervisor protects itself and is not buggy
    • Some hypervisors are much simpler than a full OS

Problem: covert channels #

  • Unintended communication channel between isolated components
  • Can leak classified data from secure component to public component
  • Example communication between cooperating malware and listener
    • To send a bit, malware does
      • b = 1: at 1:00am, do CPU intensive calculation
      • b = 0: at 1:00am, do nothing
    • At 1:00am, listener does CPU intensive calculation and measures completion time
      • b = 1: completion time > threshold
      • b = 0: completion time ≤ threshold
  • Many covert channels exist in running system
    • File lock status, cache contents, interrupts, etc.
    • Difficult to eliminate all
  • VMs from different customers may run on same machine in the cloud
    • Hypervisor isolates them, but some data may still leak through covert channels
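
The CPU-contention channel above can be sketched as three small functions. This is an illustration under assumptions: `timed_work`, `send_bit`, and `decode_bit` are hypothetical names, the two sides are assumed to share a CPU and agree on the time slot, and the threshold would have to be calibrated on the actual machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Run a fixed CPU-bound workload and return the elapsed time in seconds. */
double timed_work(unsigned long iters) {
    struct timespec t0, t1;
    volatile unsigned long acc = 0; /* volatile keeps the loop from being removed */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < iters; i++) acc += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Malware side: at the agreed time, load the CPU to send 1, idle to send 0. */
void send_bit(int b, unsigned long iters) {
    if (b) (void)timed_work(iters);
}

/* Listener side: a slow run means the sender was competing for the CPU. */
int decode_bit(double elapsed, double threshold) {
    assert(threshold > 0.0);
    return elapsed > threshold;
}
```

The listener would call `timed_work` in the agreed slot and pass the result to `decode_bit`; no file, socket, or shared memory is involved, which is what makes the channel hard to block.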

VMs in end-user environments #

  • Qubes OS: a desktop/laptop OS where everything is a VM
    • Runs on Xen hypervisor
    • Access to peripherals (mic, camera, USBs, …) controlled by VMs
    • Every window frame identifies VM source

Hypervisor detection #

  • Malware can detect hypervisor and refuse to run to avoid reverse-engineering
  • Software that binds to hardware can refuse to run in VM
  • DRM systems may refuse to run on top of hypervisor
  • How to detect?
    • VM platforms often emulate simple hardware (e.g. i440bx chipset) but report 8GB RAM, dual CPUs, etc
    • Hypervisor introduces timing variations
      • Memory cache behavior differs in presence of hypervisor
      • Changes the relative timing of any two operations
    • Hypervisor shares TLB with guest OS
      • Guest OS can detect reduced TLB size
    • Can do detection in the browser
      • Timing variations in writing to the screen
      • Hinders identification of malware websites using VMs: the site can detect the VM and hide its malicious behavior
  • The perfect hypervisor does not exist
    • Hypervisors today focus on compatibility (ensuring off-the-shelf software works) and performance (minimizing virtualization overhead)
    • VMMs do not provide transparency - anomalies reveal existence of hypervisor