A Modern Multi-core Processor #
Part 1: Speeding up an example sine program #
Computes `sin` across an array of `float` values using a Taylor series expansion:
```c
void sinx(int N, int terms, float *x, float *y) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }
        y[i] = value;
    }
}
```
This compiles to some assembly:
```asm
ld  r0, addr[r1]  // x[i]
mul r1, r0, r0
mul r1, r1, r0
// ...
st  addr[r2], r0  // y[i]
```
Pre-multicore era processor #
- Majority of chip transistors used to perform ops that help make single instruction streams run fast
- Out-of-order control logic
- Fancy branch predictor
- Memory prefetcher
- More transistors = larger cache
Multicore-era processor #
- Idea: use transistor count to add more cores to the processor
- Simple cores: each core may be slower at running a single instruction stream than the original fancy core (e.g., 25% slower)
- However, with two cores: 2 × 0.75 = 1.5, i.e., a 50% speedup!
- However: C program will compile to single instruction stream that runs on only one thread on one core
Transforming a program to multicore #
```cpp
#include <thread>

typedef struct {
    int N;
    int terms;
    float *x;
    float *y;
} my_args;

void my_thread_func(my_args *args) {
    sinx(args->N, args->terms, args->x, args->y); // do work
}

void parallel_sinx(int N, int terms, float *x, float *y) {
    std::thread my_thread;
    my_args args;
    args.N = N / 2;
    args.terms = terms;
    args.x = x;
    args.y = y;
    my_thread = std::thread(my_thread_func, &args); // launch thread
    sinx(N - args.N, terms, x + args.N, y + args.N); // do work on main thread
    /* NOTE: with a fixed two-way decomposition, we can only ever get 2x speedup!! */
    my_thread.join(); // wait for thread to complete
}
```
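To go past 2x, spawn one thread per chunk. A minimal sketch of this generalization (not part of the original example; `parallel_sinx_n` and `num_threads` are illustrative names, and the `sinx` above is assumed):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical generalization of parallel_sinx: split the N elements into
// num_threads chunks and give each chunk to its own std::thread.
void parallel_sinx_n(int N, int terms, float *x, float *y, int num_threads) {
    std::vector<std::thread> workers;
    int chunk = (N + num_threads - 1) / num_threads; // ceiling division
    for (int t = 0; t < num_threads; t++) {
        int start = t * chunk;
        int count = std::min(chunk, N - start);
        if (count <= 0) break; // no elements left for remaining threads
        workers.emplace_back(sinx, count, terms, x + start, y + start);
    }
    for (auto &w : workers)
        w.join(); // wait for all workers to complete
}
```

With this decomposition the speedup bound becomes `num_threads`x rather than 2x, up to the number of hardware cores.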
Data-parallel expressions #
```c
void sinx(int N, int terms, float *x, float *y) {
    // original:
    // for (int i = 0; i < N; i++) {
    // new: fictitious forall construct that tells the compiler
    // that each of these iterations is independent
    forall (int i from 0 to N) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }
        y[i] = value;
    }
}
```
With such a construct, a compiler might be able to autogenerate threaded code.
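As an aside (not in the original notes), mainstream C/C++ compilers already ship a real construct with essentially these semantics: OpenMP's parallel for. A minimal sketch, relying on the same independence guarantee:

```cpp
// OpenMP version of sinx: the pragma asserts that iterations are
// independent, letting the compiler generate threaded code.
// Compile with e.g. `g++ -fopenmp`.
void sinx_omp(int N, int terms, float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }
        y[i] = value;
    }
}
```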
SIMD processing #
- Single instruction, multiple data
- Additional execution units (ALUs) increase compute capability
- Idea: Amortize cost/complexity of managing an instruction stream across many ALUs
- Same instruction broadcast to all ALUs
- Operation executed in parallel on all ALUs
- e.g. Original scalar program processes one array element using scalar instructions on scalar registers
- Instead, use vector operations (i.e., `__m256` types and `_mm256_` arithmetic and load operations); a sketch follows this list
  - These are intrinsic datatypes and functions available to C programmers
  - These operate on vectors of eight 32-bit values (e.g., a vector of 8 floats)
- Compiled program processes eight array elements simultaneously using vector instructions on 256-bit vector registers
- Note: as in the data-parallel expression example code, the `forall` abstraction can also facilitate autogeneration of SIMD instructions
  - In addition to multi-core parallel code!
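A minimal sketch of what these intrinsics look like in practice (a trivial squaring loop rather than the full sinx; `square_avx2` is an illustrative name, and N is assumed to be a multiple of 8):

```cpp
#include <immintrin.h>

// Square each element of x, eight floats at a time, using AVX2 intrinsics.
void square_avx2(int N, float *x, float *y) {
    for (int i = 0; i < N; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i); // load 8 floats
        __m256 r = _mm256_mul_ps(v, v);    // 8 multiplies in one instruction
        _mm256_storeu_ps(y + i, r);        // store 8 floats
    }
}
```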
Conditional execution with SIMD #
Consider the following example:

```c
forall (int i from 0 to N) {
    float t = x[i];
    // unconditional code
    if (t > 0.0) {
        t = t * t;
        t = t * 50.0;
        t = t + 100.0;
    } else {
        t = t + 30.0;
        t = t / 10.0;
    }
    // resume unconditional code
    y[i] = t;
}
```
- How to process this on a vector-only CPU?
  - Mask (discard) output of ALUs
  - Run all lanes through the `true` branch, discarding the output of lanes where the condition was `false`
  - Then run all lanes through the `false` branch, discarding the output of lanes where the condition was `true`
  - (A sketch of this mask-and-select pattern written with intrinsics follows this list)
- Worst case: 1/8 of peak performance
  - Can be provoked by making one branch extremely costly; the processor is still forced to send the whole vector down that path: `if (i == 0) { do_something_super_costly(); } else { do_something_normal(); }`
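A sketch of the mask-and-select pattern written by hand with AVX intrinsics (`conditional_avx2` is an illustrative name; N is assumed to be a multiple of 8): both branches are evaluated for all eight lanes, then a comparison mask picks the per-lane result.

```cpp
#include <immintrin.h>

// Branch-free SIMD version of the conditional example above.
void conditional_avx2(int N, float *x, float *y) {
    for (int i = 0; i < N; i += 8) {
        __m256 t = _mm256_loadu_ps(x + i);
        // per-lane mask: all bits set where t > 0.0
        __m256 mask = _mm256_cmp_ps(t, _mm256_setzero_ps(), _CMP_GT_OS);
        // "true" branch for all lanes: t*t*50 + 100
        __m256 a = _mm256_add_ps(
            _mm256_mul_ps(_mm256_mul_ps(t, t), _mm256_set1_ps(50.0f)),
            _mm256_set1_ps(100.0f));
        // "false" branch for all lanes: (t + 30) / 10
        __m256 b = _mm256_div_ps(_mm256_add_ps(t, _mm256_set1_ps(30.0f)),
                                 _mm256_set1_ps(10.0f));
        // keep lanes of `a` where the mask is set, lanes of `b` elsewhere
        _mm256_storeu_ps(y + i, _mm256_blendv_ps(b, a, mask));
    }
}
```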
SIMD jargon #
- Instruction stream coherence (“coherent execution”)
- Property of a program where the same instruction stream applies to many data elements
- NECESSARY for SIMD processing to be used efficiently
- NOT NECESSARY for efficient parallelization across different cores
- Divergent execution
- Lack of instruction stream coherence
Modern CPU examples of SIMD #
- AVX2 (Intel): 256-bit operations: 8×32-bit or 4×64-bit
- AVX512 (Intel): 512-bit: 16×32-bit
- ARM Neon: 128-bit: 4×32-bit
- Instructions generated by compiler
  - Requested by programmer using intrinsics
  - Or by parallel language semantics (e.g., `forall`)
  - Or inferred by dependency analysis of loops by an auto-vectorizing compiler (a sketch of such a loop follows this list)
- Explicit SIMD: SIMD parallelization performed at compile time
  - SIMD instructions appear in the binary itself (e.g., `vstoreps`, `vmulps`, etc.)
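For example, a loop like the following is simple enough for most auto-vectorizers (a sketch; `scale` is an illustrative name, and exact flags vary by compiler, e.g. `g++ -O3 -mavx2`):

```cpp
// No cross-iteration dependencies and unit-stride access: an
// auto-vectorizing compiler can emit SIMD instructions for this loop
// (typically guarded by a runtime check that a and b don't alias;
// in C, the `restrict` qualifier lets it skip that check).
void scale(int n, const float *a, float *b) {
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}
```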
Example: Intel i7-7700K (Kaby Lake, 2017) #
- 4 core CPU
- Three 8-wide SIMD ALUs per core for AVX2
- 4 cores × 3 SIMD ALUs × 8-wide × 4.2 GHz ≈ 400 GFLOPs (403.2 exactly)
SIMD on GPUs #
- “Implicit SIMD”
- Compiler generates binary with scalar instructions
- But N instances of the program always run together on the processor
- Hardware, not compiler, responsible for executing same instruction on multiple program instances on different data on SIMD ALUs
- SIMD width of most modern GPUs ranges from 8 to 32
- Divergent execution can result in 1/32 of peak capacity
Part 2: Accessing memory #
Load/store: instructions that access memory #
```asm
// load 4-byte value from the address stored in register r2 into register r0
ld r0, mem[r2]

// store 4-byte value in r0 to the address stored in register r2
st r0, mem[r2]
```
Memory terminology #
- Memory address space
- Memory organized as sequence of bytes
- Each byte identified by address in memory
- Address space: total addressable memory of a program
- Memory access latency
- Amount of time it takes memory system to provide data to the processor
- e.g. 100 clock cycles, 100 nsec
- Stalls
- A processor “stalls” when it cannot run the next instruction in an instruction stream due to dependency on a previous incomplete instruction
- e.g. Accessing memory can cause stalls
```asm
ld r0, mem[r2]
ld r1, mem[r3]
// Dependency: cannot execute `add` instruction until data
// from mem[r2] and mem[r3] have been loaded from memory
add r0, r0, r1
```
- Memory access times: ~hundreds of cycles
- Measure of latency
Caches #
- On-chip storage that maintains copy of subset of values in memory
- Address is “in the cache”: processor can load and store address more quickly than if data resided in memory
- Hardware implementation detail that does not impact program output; only performance
- For this class, abstraction: assume a cache of size N keeps the last N addresses accessed
  - LRU policy: on each memory access, evict the least recently used data in the cache to make room for newly accessed data
- In the real world, other organizations and policies:
  - Direct-mapped cache
  - Set-associative cache
  - Data moved between memory and cache at cache-line granularity
- Caches reduce the length of stalls and effective memory access latency
  - Specifically: when accessing data that has recently been accessed
- e.g. Data access times, Intel Kaby Lake (2017)
- L1 cache: 4 cycles
- L2 cache: 12 cycles
- L3 cache: 38 cycles
- DRAM: best case ~248 cycles
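Numbers like these can be approximated with a pointer-chasing microbenchmark; a sketch follows (sizes, seed, and iteration count are illustrative). Each load depends on the previous one, so neither the prefetcher nor out-of-order execution can hide the latency.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Working set of 2^24 ints (~64 MB): far larger than any cache level.
    // Shrink n toward L1/L2/L3 sizes to see the latency steps.
    const size_t n = 1u << 24;
    std::vector<int> next(n);
    std::iota(next.begin(), next.end(), 0);
    // Random permutation; for a guaranteed single long cycle, use
    // Sattolo's algorithm instead of a plain shuffle.
    std::shuffle(next.begin(), next.end(), std::mt19937(42));

    int idx = 0;
    const size_t steps = 10000000;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; s++)
        idx = next[idx]; // each load depends on the previous: latency-bound
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Printing idx keeps the loop from being optimized away.
    std::printf("%.1f ns per access (idx=%d)\n", ns / steps, idx);
    return 0;
}
```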
Data prefetching #
- Logic on modern CPUs for guessing what data will be accessed in the future
- Prefetch this data into caches
- Dynamically analyze program memory access patterns to make predictions
- Reduces stalls since data is already resident in cache when accessed
- However: can reduce perf. if guess is wrong (consumes bandwidth, pollutes caches)
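As an aside, the same idea is also exposed to software: x86 compilers provide the `_mm_prefetch` intrinsic for manual hints. A minimal sketch (the lookahead distance of 64 elements is an arbitrary illustration; real tuning is workload-dependent):

```cpp
#include <xmmintrin.h> // _mm_prefetch

// Manually hint the cache: request x[i + 64] while processing x[i].
// _MM_HINT_T0 asks for the line to be brought into all cache levels.
void sum_with_prefetch(int n, const float *x, float *out) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)
            _mm_prefetch((const char *)(x + i + 64), _MM_HINT_T0);
        s += x[i];
    }
    *out = s;
}
```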
Multithreading to reduce stalls #
- Interleave processing of multiple threads on same core to hide stalls
- Can’t make progress on one thread? Work on another one
- Upon hitting a stall on one thread, immediately switch to another one
- Idea: potentially increase the time to complete a single thread’s work in order to increase overall system throughput when running multiple threads
- However: this requires storing execution contexts
- Tradeoff between memory caching and storing these execution contexts
- Takeaway: a processor with multiple HW threads has the ability to avoid stalls by interleaving executions
- This doesn’t affect mem. access latency; it just improves processor utilization
Hardware-supported multithreading #
- Core manages execution contexts for multiple threads
- Interleaved multi-threading
- Simultaneous multi-threading (SMT)
- e.g. Intel Hyper-Threading
Broader takeaway: can combine each of these parallelism constructs!
- This is what modern chips (e.g., Intel Skylake/Kaby Lake, Nvidia GPUs) do
Latency and Bandwidth #
- Memory bandwidth: rate at which memory system can provide data to a processor (e.g., 20 GB/s)
- Execution speed is often limited by available bandwidth
- e.g., suppose the memory system transfers 8 bytes per clock, but simple math adds (1 add per clock) each generate up to 3 memory requests (two 4-byte loads and a 4-byte store): the processor demands data faster than memory can supply it (see the saxpy sketch after this list)
- This is called bandwidth-bound execution: instruction throughput is limited by memory bandwidth, not by arithmetic capability or access latency
- Influences chip design: modern GPUs place memory on or near chip
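A standard illustration is saxpy, whose low arithmetic intensity makes it bandwidth-bound on most machines (a sketch; the flop/byte counts in the comments assume 4-byte floats):

```cpp
// saxpy: y[i] = a*x[i] + y[i]
// Per element: 2 flops vs. 12 bytes of traffic (load x, load y, store y).
// That ratio means the ALUs spend most of their time waiting on memory,
// no matter how many cores or how wide the SIMD.
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```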