Isolation and Sandboxing #
Goal: run untrusted code without compromising systems
- Programs from untrusted Internet sites
- Mobile apps, JS, browser extensions
- Exposed applications: Browser, PDF viewer, email client
- Legacy daemons: sendmail, bind
- Honeypots
- If application misbehaves, want to kill it
Approach: confinement #
- Idea: ensure misbehaving app cannot harm rest of system
- Can be implemented at many levels
- Hardware: run application on isolated hardware (airgap) - difficult to manage
- Virtual machines: isolate OSes on a single machine
- Process level: system call interposition; isolate a process in a single OS
- Threads: software fault isolation (SFI)
- Isolating threads sharing same address space
- Application level confinement
- e.g. browser sandbox for JavaScript and WebAssembly
Implementation #
Key component: reference monitor
- Mediates requests from applications
- Enforces confinement
- Implements a specified protection policy
- Must always be invoked; every application request must be mediated
- Tamperproof: reference monitor cannot be killed; if it is, then monitored process is also killed
Example: chroot jail #
To use (must be root):
chroot /tmp/guest
su guest
- Root dir / is now /tmp/guest
- EUID set to guest
- Now /tmp/guest is prepended to every file system access:
    fopen("/etc/passwd", "r");
    // becomes
    fopen("/tmp/guest/etc/passwd", "r");
- chroot should only be executable by root, otherwise a jailed app can become root:
  - Create dummy file /aaa/etc/passwd (with a root password the app knows)
  - Run chroot /aaa
  - Run su root to become root
- Many ways to escape jail as root:
  - Create device that lets you access raw disk
  - Send signals to non-chrooted process
  - Reboot system
  - Bind to privileged ports
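The path rewriting described above can be modeled in a few lines. This is only a toy sketch: the kernel changes the process's root directory rather than editing strings, symlinks are ignored, and the helper name `jailed_path` is made up for illustration. It also shows why `..` components have to be resolved against the jail's own root before the prefix is applied.

```python
import posixpath


def jailed_path(jail_root: str, requested: str) -> str:
    """Toy model of where a path requested inside the jail lands on the host."""
    # Resolve "." and ".." against the jail's own "/" first, so that
    # "/../../etc/passwd" cannot climb above the jail root.
    inside = posixpath.normpath(posixpath.join("/", requested))
    # Prepend the jail root to the normalized path.
    return posixpath.normpath(posixpath.join(jail_root, inside.lstrip("/")))


if __name__ == "__main__":
    print(jailed_path("/tmp/guest", "/etc/passwd"))
    print(jailed_path("/tmp/guest", "/../../etc/passwd"))
```

Both calls resolve to a path under /tmp/guest; relative paths are treated as relative to the jail root, which is a simplification.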
FreeBSD jail #
- Stronger mechanism than simple chroot
- To run: jail [jail-path] [hostname] [IP-addr] [cmd]
  - Calls hardened chroot (no ../../ escape)
  - Can only bind to sockets with specified IP addresses and authorized ports
  - Can only communicate with processes inside jail
  - Root is limited, e.g. cannot load kernel modules
Problems with chroot jails #
- Coarse policies: all or nothing access to parts of a file system
- Inappropriate for apps like web browsers, which need access to files outside jail (e.g. for sending attachments over email)
- Does not prevent malicious apps from
- Accessing network and messing with other machines
- Trying to crash host OS
System call interposition #
- Observation: to damage host system (e.g. persistent changes), app must make system calls
  - To delete/overwrite files: unlink, open, write
  - To do network attacks: socket, bind, connect, send
- Idea: monitor app’s system calls and block unauthorized calls
- Implementation options
  - Completely kernel-space (e.g., Linux seccomp)
  - Completely user-space (e.g., program shepherding)
  - Hybrid (e.g. systrace)
Early implementation: Janus (1996) #
- Linux ptrace: process tracing
  - Process calls ptrace(..., pid_t pid, ...) and wakes up when pid makes system call
- If monitored process calls fopen, monitor decides whether the call is allowed; if not, monitor kills application
- Example policy (e.g., for PDF reader)
    path allow /tmp/*
    path deny /etc/passwd
    network deny all
- Manually specifying policy for an app can be difficult
  - Recommended default policies are available, can be made more restrictive as needed
- Complications
  - If app forks, monitor must also fork; forked monitor monitors forked app
  - If monitor crashes, app must be killed
  - Monitor must maintain all OS state associated with the app
    - Current working directory, UID, EUID, GID
    - When app does cd path, monitor must update its CWD
- Problems with ptrace:
  - Must trace all system calls, which is inefficient: no need to trace close system call
  - Monitor cannot abort syscall without killing app
  - Race conditions (time of check/time of use - TOCTOU bug) since checking/opening is not atomic
    - Proc 1: open("me")
    - Monitor checks and authorizes
    - Proc 2: symlink me to /etc/passwd
    - OS executes open("me")
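The policy check and the race can both be illustrated with a toy model. This is a sketch under stated assumptions, not how Janus is implemented: the file system is a plain dict of symlinks, a matching deny rule always wins, paths that match no rule are denied, the network rule is left out, and all function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Path rules from the example policy above. Assumptions of this sketch:
# a deny match wins, and a path that matches no rule is denied.
POLICY = [
    ("allow", "/tmp/*"),
    ("deny", "/etc/passwd"),
]


def path_allowed(path: str) -> bool:
    """Return True if POLICY permits opening `path`."""
    if any(act == "deny" and fnmatchcase(path, pat) for act, pat in POLICY):
        return False
    return any(act == "allow" and fnmatchcase(path, pat) for act, pat in POLICY)


def resolve(symlinks: dict, path: str) -> str:
    """Follow symlinks in a toy file system (a dict of link -> target)."""
    seen = set()
    while path in symlinks and path not in seen:
        seen.add(path)
        path = symlinks[path]
    return path


def monitored_open(symlinks: dict, path: str, race=None) -> str:
    """Check the path (time of check), then let the 'OS' open it (time of use)."""
    if not path_allowed(resolve(symlinks, path)):
        raise PermissionError(path)
    if race is not None:
        race(symlinks)  # another process runs between check and use
    return resolve(symlinks, path)  # the file the OS actually opens


if __name__ == "__main__":
    def swap_link(symlinks):
        symlinks["/tmp/me"] = "/etc/passwd"

    fs = {"/tmp/me": "/tmp/notes.txt"}
    print(monitored_open(fs, "/tmp/me"))                  # no race
    print(monitored_open(fs, "/tmp/me", race=swap_link))  # link swapped after the check
```

In the second call the monitor approves a harmless target, yet the open lands on /etc/passwd, because the check and the use resolve the path at different times.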
SCI in Linux: seccomp-bpf #
- seccomp-bpf: Linux kernel facility used to filter process syscalls
  - Syscall filters written in BPF language (use BPFC compiler)
  - Used in Chromium, Docker containers, etc.
- How this works:
  - Chrome renderer process starts and calls
      prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy)
  - Renderer process renders site
  - If due to an exploit the process tries to do
      fopen("/etc/passwd", "r");
    then seccomp-bpf will kill the process
BPF filters (policy programs): #
- Process can install multiple BPF filters
  - Once installed, filter cannot be removed - all run on every syscall
- If process forks, child inherits all filters
- If program calls execve, all filters are preserved
- BPF filter input: syscall number, syscall args., system architecture
- Filter returns one of
  - SECCOMP_RET_KILL: kill process
  - SECCOMP_RET_ERRNO: return specified error to caller
  - SECCOMP_RET_ALLOW: allow syscall
Installing a BPF filter #
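- prctl(PR_SET_NO_NEW_PRIVS, 1) must be called before setting BPF filter
  - Ensures setuid, setgid ignored on subsequent execve
  - Attacker cannot elevate privilege
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    extern struct sock_fprog bpf_policy; // compiled filter, assumed defined elsewhere
    int main (int argc, char **argv) {
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        // kill if call open() for write
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy);
        fopen("file.txt", "w");
        assert(false); // should never reach here
    }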
Docker: isolating containers using seccomp-bpf #
Container: process-level isolation
- Container prevented from making syscalls filtered by seccomp-bpf
- Whoever starts container can specify BPF policy
  - Default policy blocks many syscalls, including ptrace
- e.g. Nginx:
    docker run --security-opt="seccomp=filter.json" nginx
  where filter.json contains:
    {
      "defaultAction": "SCMP_ACT_ERRNO",  // deny by default
      "syscalls": [
        {
          "names": ["accept"],            // syscall name
          "action": "SCMP_ACT_ALLOW",     // allow (whitelist)
          "args": []                      // what args to allow
        },
        // ...
      ]
    }
- More Docker confinement flags:
  - Specify an unprivileged user:
      docker run --user www nginx
  - Limit Linux capabilities: drop all capabilities, allow bind to privileged ports:
      docker run --cap-drop all --cap-add NET_BIND_SERVICE nginx
  - Prevent process from becoming privileged (e.g. by a setuid binary):
      docker run --security-opt=no-new-privileges:true nginx
  - Limit number of restarts and resources:
      docker run --restart=on-failure:$MAX_RETRIES --ulimit nofile=$MAX_FD --ulimit nproc=$MAX_NPROC nginx
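A deny-by-default profile like the one above can be generated rather than written by hand. The sketch below builds the same structure and looks up the action for a syscall; the helper names are hypothetical, argument matching is ignored, and the three-syscall allowlist is far too small for a real Nginx container. The file passed to Docker must be strict JSON, so the // comments shown in the notes would have to be dropped.

```python
import json


def allowlist_profile(syscall_names):
    """Build a deny-by-default profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(syscall_names), "action": "SCMP_ACT_ALLOW", "args": []}
        ],
    }


def action_for(profile, syscall_name):
    """Return the action the profile assigns to a syscall (args ignored)."""
    for rule in profile["syscalls"]:
        if syscall_name in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    profile = allowlist_profile({"accept", "read", "write"})
    print(json.dumps(profile, indent=2))  # could be saved as filter.json
    print(action_for(profile, "ptrace"))
```

Generating the profile keeps the allowlist in one place, which makes it easier to review which syscalls a container is actually granted.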
Confinement via virtual machines #
About VMs: see my notes from CS 111
Hypervisor security assumption #
- Malware can infect guest OS and guest apps
- But malware cannot escape from the infected VM
- Cannot infect host OS or other VMs on the same hardware
- Requires that hypervisor protects itself and is not buggy
- Some hypervisors are much simpler than a full OS
Problem: covert channels #
- Unintended communication channel between isolated components
  - Can leak classified data from secure component to public component
- Example communication between cooperating malware and listener
  - To send a bit, malware does
    - b = 1: at 1:00am, do CPU intensive calculation
    - b = 0: at 1:00am, do nothing
  - At 1:00am, listener does CPU intensive calculation and measures completion time
    - b = 1: completion time > threshold
- Many covert channels exist in running system
  - File lock status, cache contents, interrupts, etc.
  - Difficult to eliminate all
- VMs from different customers may run on same machine in the cloud
  - However, some data may leak
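The CPU-contention channel in the example can be simulated deterministically. This is a sketch with made-up numbers: the base time, contention penalty, and threshold are arbitrary assumptions, and a real channel would need noise handling and clock synchronization between sender and listener.

```python
BASE_TIME = 1.0    # listener's job running alone (arbitrary units, assumed)
CONTENTION = 0.8   # extra time when the sender is also busy (assumed)
THRESHOLD = 1.4    # decision threshold between the two cases (assumed)


def listener_completion_time(sender_busy: bool) -> float:
    """Time the listener's calculation takes in one time slot."""
    return BASE_TIME + (CONTENTION if sender_busy else 0.0)


def send(bits):
    """Sender: one time slot per bit; busy in the slot iff the bit is 1."""
    return [bit == 1 for bit in bits]


def receive(slots):
    """Listener: decode a bit per slot by comparing its time to the threshold."""
    return [1 if listener_completion_time(busy) > THRESHOLD else 0 for busy in slots]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0]
    print(receive(send(message)))
```

No data ever passes through an explicit interface here; the only shared resource is CPU time, which is why such channels are hard to eliminate.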
VMs in end-user environments #
- Qubes OS: a desktop/laptop OS where everything is a VM
- Runs on Xen hypervisor
- Access to peripherals (mic, camera, USBs, …) controlled by VMs
- Every window frame identifies VM source
Hypervisor detection #
- Malware can detect hypervisor and refuse to run to avoid reverse-engineering
- Software that binds to hardware can refuse to run in VM
- DRM systems may refuse to run on top of hypervisor
- How to detect?
  - VM platforms often emulate simple hardware (e.g. i440bx chipset) but report 8GB RAM, dual CPUs, etc.
  - Hypervisor introduces time latency variances
    - Memory cache behavior differs in presence of hypervisor
    - Results in relative time variations for any two operations
  - Hypervisor shares TLB with guest OS
    - Guest OS can detect reduced TLB size
- Can do detection in the browser
  - Timing variations in writing to the screen
  - Hinders identification of malware websites using VMs (a site can behave benignly once it detects a VM)
- The perfect hypervisor does not exist
- Hypervisors today focus on compatibility (ensuring off-the-shelf software works) and performance (minimizing virtualization overhead)
- VMMs do not provide transparency - anomalies reveal existence of hypervisor