Heterogeneous Processing, Hardware Specialization, and Domain-Specific Languages #

Motivation #

A CPU is inefficient, even when using it as efficiently as possible
- Lots of CPU time is spent on control-related tasks, rather than FP math - can this be amortized?
Instead - think of a computer as a “swiss army knife” - some tasks may be faster on a CPU or a GPU, even within the same overall program
Heterogeneous (specialized) hardware designs improve efficiency, but difficult to map applications to hardware

Intel CPU: CPU cores + GPU (multiple cores itself), shared memory system and L3 cache enabling low-latency, high-bandwidth communication between CPU and GPU
Workstation: add high-end discrete GPU across PCIe bus
- Also itself heterogeneous! Some units are purely for graphical processing, not just for SIMD processing
Mobile: System-on-a-Chip (SoC), dedicated cores for GPU, neural network accel., video compression/decompression, etc.
Digital signal processors (DSPs): programmable processors, but with simpler instruction stream control paths
- Complex instructions: perform many ops per instruction (amortize cost of control)
  - SIMD
  - Very Large Instruction Word (VLIW) - single instruction specifies multiple operations
Anton supercomputer for molecular dynamics (D.E. Shaw Research)
- Simulates protein evolution over time
- ASIC for computing particle-particle interactions (512 in a machine)
Google TPU pods for ML
Google Pixel Visual Core
- Programmable image-processing unit (IPU)
- Each core: 16x16 grid of 16-bit mul-add ALUs
- 10-20x more efficient than GPU at image processing tasks
FPGAs (Field-Programmable Gate Arrays)
- Middle ground between ASIC and processor: provides array of logic blocks, connected by interconnect
- Programmer-defined logic implemented directly by FPGA

Note: performance is not the only goal, power consumption is important too!
- Not just battery life for mobile devices, but heat!
Power = (ops/sec) * (joules/op), where ops/sec is performance and joules/op is energy efficiency
- The amount of power usable is fixed
  - Either by regulation (phone case temperature)
  - Or b/c the processor itself can physically degrade if too hot
- Specialization (fixed function) leads to better energy efficiency
Data movement has high energy cost
- Reading 64 bits from small local SRAM is ~26x energy cost as an integer op
- Reading 64 bits from LPDDR is ~1200x energy cost as an integer op
  - Suggests that recomputing values, rather than storing and reloading them, is better when optimizing code for energy efficiency

Rules of thumb: compared to high-quality C code on a CPU
Throughput-maximized processor architecture e.g. GPU cores:
- Approximately 10x improvement in perf/watt
- Assuming code maps well to wide data-parallel execution and is compute bound
Fixed-function ASIC (application-specific integrated circuit)
- Can approach 100-1000x or greater improvement in perf/watt
- Assuming code is compute bound and is not floating-point math

Challenges for developers: mapping applications onto heterogeneous collection of resources
- Esp. scheduling
Challenges for hardware designers: what is the right mixture of resources?

The ideal programming language balances performance, productivity, and generality - but this is impossible!
- Performance, generality: C/C++, Java, Rust, CUDA
- Productivity, generality: Python, Ruby, JS, PHP
- Performance, productivity: SQL, Matlab, PyTorch
DSLs: Programming languages with restricted expressiveness for a given domain
- Main idea: raise level of abstraction for expressing programs
  - Goal: write one program, run it efficiently on different machines
- Introduce high-level programming primitives specific to application domain
  - Productive: intuitive to use, portable across systems, primitives correspond to domain-specific behaviors
  - Performant: system uses domain knowledge to provide efficient, optimized implementation(s)
    - System can optimize code for the given task behind the scenes
- Cost: loss of generality/completeness
e.g. Halide language: simple domain-specific language embedded in C++ for describing secuences of image processing operations
- Used in Android camera app