Modern Multicore Processors

A Modern Multi-core Processor #

Part 1: Speeding up an example sine program #

Computes sin across an array of float values using a Taylor series expansion

void sinx(int N, int terms, float *x, float *y) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }

        y[i] = value;
    }
}

This compiles to some assembly:

ld  r0, addr[r1] // x[i]
mul r1, r0, r0
mul r1, r1, r0
// ...
st  addr[r2], r0 // y[i]

Pre-multicore era processor #

  • Majority of chip transistors used to perform ops that help make single instruction streams run fast
    • Out of order control logic
    • Fancy branch predictor
    • Memory prefetcher
    • More transistors = larger cache

Multicore-era processor #

  • Idea: use transistor count to add more cores to the processor
    • Simple cores: each core may be slower at running a single instruction stream than the original fancy core (e.g. 25% slower)
    • However, with two cores: 2 * 0.75 = 1.5 = 50% speedup!
  • However: a C program compiles to a single instruction stream that runs on only one thread, on one core

Transforming a program to multicore #

#include <thread>

typedef struct {
    int N;
    int terms;
    float *x;
    float *y;
} my_args;

void my_thread_func(my_args *args) {
    sinx(args->N, args->terms, args->x, args->y); // do work
}

void parallel_sinx(int N, int terms, float *x, float *y) {
    std::thread my_thread;
    my_args args;
    args.N = N/2;
    args.terms = terms;
    args.x = x;
    args.y = y;

    my_thread = std::thread(my_thread_func, &args); // launch thread
    sinx(N - args.N, terms, x + args.N, y + args.N); // do work on main thread

    /* NOTE: with the work split across only two threads, we can get at most 2x speedup */
    my_thread.join(); // wait for thread to complete
}
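
If the machine has more than two cores, the same pattern extends to T worker threads. A minimal sketch (not from the lecture; parallel_sinx_nthreads and the thread count T are illustrative names), reusing the sinx routine defined above:

#include <algorithm>
#include <thread>
#include <vector>

void parallel_sinx_nthreads(int N, int terms, float *x, float *y, int T) {
    std::vector<std::thread> workers;
    int chunk = (N + T - 1) / T;                  // elements per thread, rounded up
    for (int t = 0; t < T; t++) {
        int start = t * chunk;
        int count = std::min(chunk, N - start);
        if (count <= 0) break;
        workers.emplace_back(sinx, count, terms, x + start, y + start); // launch worker
    }
    for (auto &w : workers) w.join();             // wait for all workers to complete
}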

Data-parallel expressions #

void sinx(int N, int terms, float *x, float *y) {
    // original:
    // for (int i = 0; i < N; i++) {

    // new: fictitious forall construct that tells the compiler
    // that each of these iterations is independent
    forall (int i from 0 to N) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }

        y[i] = value;
    }
}

With such a construct, a compiler might be able to autogenerate threaded code.
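
A real-world analogue of this fictitious construct is OpenMP's parallel-for pragma, which likewise asserts to the compiler that the iterations are independent. A sketch (sinx_openmp is an illustrative name; assumes compilation with OpenMP enabled, e.g. -fopenmp):

// Same loop body as sinx above; the pragma asserts the iterations are
// independent, so the compiler/runtime may split them across threads.
void sinx_openmp(int N, int terms, float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;

        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }

        y[i] = value;
    }
}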

SIMD processing #

  • Single instruction, multiple data
  • Additional execution units (ALUs) increase compute capability
  • Idea: Amortize cost/complexity of managing an instruction stream across many ALUs
    • Same instruction broadcast to all ALUs
    • Operation executed in parallel on all ALUs
  • e.g. Original scalar program processes one array element using scalar instructions on scalar registers
  • Instead, use vector operations (e.g. the __m256 type and _mm256_ arithmetic/load intrinsics for AVX); see the sketch after this list
    • These are intrinsic datatypes and functions available to C programmers
    • These operate on vectors of eight 32-bit values (e.g. vector of 8 floats)
    • Compiled program processes eight array elements simultaneously using vector instructions on 256-bit vector register
  • Note: in the data-parallel expression example above, the forall abstraction can also facilitate autogeneration of SIMD instructions
    • In addition to multi-core parallel code!
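
For concreteness, here is a sketch of the sinx loop written with AVX intrinsics (sinx_avx is an illustrative name; assumes N is a multiple of 8 and x, y are 32-byte aligned):

#include <immintrin.h>

void sinx_avx(int N, int terms, float *x, float *y) {
    for (int i = 0; i < N; i += 8) {
        __m256 origx = _mm256_load_ps(&x[i]);             // load 8 floats
        __m256 value = origx;
        __m256 x2    = _mm256_mul_ps(origx, origx);
        __m256 numer = _mm256_mul_ps(x2, origx);          // x^3
        __m256 denom = _mm256_set1_ps(6.0f);              // 3!
        float  sign  = -1.0f;

        for (int j = 1; j <= terms; j++) {
            // value += sign * numer / denom (8 elements at once)
            __m256 term = _mm256_div_ps(
                _mm256_mul_ps(_mm256_set1_ps(sign), numer), denom);
            value = _mm256_add_ps(value, term);

            numer = _mm256_mul_ps(numer, x2);
            denom = _mm256_mul_ps(denom,
                _mm256_set1_ps((float)((2 * j + 2) * (2 * j + 3))));
            sign  = -sign;
        }

        _mm256_store_ps(&y[i], value);                    // store 8 results
    }
}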

Conditional execution with SIMD #

Consider the following example:

forall (int i from 0 to N) {
    float t = x[i];
    // unconditional code
    if (t > 0.0) {
        t = t * t;
        t = t * 50.0;
        t = t + 100.0;
    } else {
        t = t + 30.0;
        t = t / 10.0;
    }

    // resume unconditional code
    y[i] = t;
}
  • How to process this on a vector-only CPU?
    • Mask (discard) the output of ALU lanes that did not take the branch (see the sketch below this list)
    • Run the true branch on all lanes, discarding results for lanes where the condition is false
    • Then run the false branch on all lanes, discarding results for lanes where the condition is true
    • Worst case: 1/8 of peak performance (for 8-wide SIMD)
      • e.g. force this by making one branch extremely costly while only one lane takes it; the processor must still execute that branch for the whole vector
        if (i == 0) {
            do_something_super_costly();
        } else {
            do_something_normal();
        }
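
One way a compiler (or hardware) might realize this masking, sketched with AVX intrinsics (conditional_avx is an illustrative name; same assumptions as before: N a multiple of 8, aligned arrays):

#include <immintrin.h>

void conditional_avx(int N, float *x, float *y) {
    for (int i = 0; i < N; i += 8) {
        __m256 t = _mm256_load_ps(&x[i]);
        // Per-lane mask for (t > 0.0)
        __m256 mask = _mm256_cmp_ps(t, _mm256_setzero_ps(), _CMP_GT_OQ);

        // Compute BOTH branches for all 8 lanes...
        __m256 t_true = _mm256_mul_ps(t, t);
        t_true = _mm256_mul_ps(t_true, _mm256_set1_ps(50.0f));
        t_true = _mm256_add_ps(t_true, _mm256_set1_ps(100.0f));

        __m256 t_false = _mm256_add_ps(t, _mm256_set1_ps(30.0f));
        t_false = _mm256_div_ps(t_false, _mm256_set1_ps(10.0f));

        // ...then keep the true-branch result only where the mask is set
        __m256 result = _mm256_blendv_ps(t_false, t_true, mask);
        _mm256_store_ps(&y[i], result);
    }
}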
        

SIMD jargon #

  • Instruction stream coherence (“coherent execution”)
    • Property of a program where same instruction stream applies to many data elements
    • NECESSARY for SIMD processing to be used efficiently
    • NOT NECESSARY for efficient parallelization across different cores
  • Divergent execution
    • Lack of instruction stream coherence

Modern CPU examples of SIMD #

  • AVX2 (Intel): 256b operations: 8x32bit or 4x64bit
  • AVX512 (Intel): 512b: 16x32bit
  • ARM Neon: 128b: 4x32bit
  • Instructions generated by compiler
    • Requested by programmer using intrinsics
    • Or parallel language semantics (e.g. forall)
    • Inferred by dependency analysis of loops by auto-vectorizing compiler
  • Explicit SIMD: SIMD parallelization performed at compile time
    • SIMD instructions in binary itself (e.g., vstoreps, vmulps, etc.)

Example: Intel i7-7700K (Kaby Lake, 2017) #

  • 4-core CPU
  • Three 8-wide SIMD ALUs per core for AVX2
  • 4 cores x 8-wide SIMD x 3 ALUs x 4.2 GHz ≈ 400 GFLOP/s

SIMD on GPUs #

  • “Implicit SIMD”
    • Compiler generates binary with scalar instructions
    • But N instances of program always run together on processor
    • Hardware, not compiler, responsible for executing same instruction on multiple program instances on different data on SIMD ALUs
  • SIMD width of most modern GPUs ranges from 8 to 32
    • Divergent execution can result in 1/32 of peak capacity

Part 2: accessing memory #

Load/store: instructions that access memory #

// load a 4-byte value from memory, starting at the address stored in register r2,
// and put it into register r0
ld r0 mem[r2]

// store the 4-byte value in r0 to the address stored in register r2
st r0 mem[r2]

Memory terminology #

  • Memory address space
    • Memory organized as sequence of bytes
    • Each byte identified by address in memory
    • Address space: total addressable memory of a program
  • Memory access latency
    • Amount of time it takes memory system to provide data to the processor
    • e.g. 100 clock cycles, 100 nsec
  • Stalls
    • A processor “stalls” when it cannot run the next instruction in an instruction stream due to dependency on a previous incomplete instruction
    • e.g. Accessing memory can cause stalls
      ld r0 mem[r2]
      ld r1 mem[r3]
      // Dependency: cannot execute `add` instruction until data 
      // from mem[r2] and mem[r3] have been loaded from memory
      add r0, r0, r1
      
    • Memory access times: ~hundreds of cycles
      • Measure of latency

Caches #

  • On-chip storage that maintains copy of subset of values in memory
    • If an address is “in the cache”, the processor can load from and store to it more quickly than if the data resided only in memory
  • Hardware implementation detail that does not impact program output; only performance
  • For this class, abstraction: assume cache of size N keeps last N addresses accessed
    • “LRU” policy: on each memory access, evict the least recently used data in the cache to make room for the newly accessed data
    • In the real world, caches use other organizations and policies:
      • Direct-mapped cache
      • Set-associative cache
      • Cache lines (data moved in fixed-size blocks)
  • Caches reduce the length of stalls (i.e. the effective memory access latency)
    • Specifically: when accessing data that has recently been accessed (see the toy locality example after this list)
  • e.g. Data access times, Intel Kaby Lake (2017)
    • L1 cache: 4 cycles
    • L2 cache: 12 cycles
    • L3 cache: 38 cycles
    • DRAM: best case ~248 cycles
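
A toy illustration of why caches and cache lines matter (a sketch, not from the lecture; sum_sequential and sum_strided are illustrative names): a sequential sum reuses each fetched cache line for many consecutive floats, while a large stride touches a new line on nearly every access, so it typically runs much slower even though both sum the same elements.

#include <cstddef>
#include <vector>

float sum_sequential(const std::vector<float> &a) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); i++)
        s += a[i];                       // consecutive accesses: good locality
    return s;
}

float sum_strided(const std::vector<float> &a, std::size_t stride) {
    float s = 0.0f;
    for (std::size_t start = 0; start < stride; start++)
        for (std::size_t i = start; i < a.size(); i += stride)
            s += a[i];                   // jumps across cache lines: poor locality
    return s;
}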

Data prefetching #

  • Logic on modern CPUs for guessing what data will be accessed in the future
    • Prefetch this data into caches
    • Dynamically analyze program memory access patterns to make predictions
  • Reduces stalls since data is already resident in cache when accessed
    • However: can reduce perf. if guess is wrong (consumes bandwidth, pollutes caches)

Multithreading to reduce stalls #

  • Interleave processing of multiple threads on same core to hide stalls
    • Can’t make progress on one thread? Work on another one
    • On hitting a stall in one thread, immediately switch over to another one
  • Idea: potentially increase the time to complete a single thread’s work in order to increase overall system throughput when running multiple threads
  • However: this requires storing an execution context per thread
    • Tradeoff: on-chip storage is shared between caches and these execution contexts
  • Takeaway: a processor with multiple HW threads has the ability to avoid stalls by interleaving executions
    • This doesn’t affect mem. access latency; it just improves processor utilization

Hardware-supported multithreading #

  • Core manages execution contexts for multiple threads
  • Interleaved multi-threading (temporal): each clock, the core chooses one thread and runs an instruction from it
  • Simultaneous multi-threading (SMT): each clock, the core runs instructions from multiple threads on its ALUs
    • e.g. Intel Hyper-Threading: 2 threads per core (see the snippet after this list)
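
To see how many hardware threads (cores x SMT ways) a machine exposes, C++ provides std::thread::hardware_concurrency(); e.g. a 4-core CPU with 2-way Hyper-Threading typically reports 8. A quick sketch:

#include <iostream>
#include <thread>

int main() {
    // May return 0 if the value is not computable on this platform
    unsigned n = std::thread::hardware_concurrency();
    std::cout << n << " hardware threads\n";
}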

Broader takeaway: can combine each of these parallelism constructs!

  • This is what modern chips (e.g., Intel Skylake/Kaby Lake, Nvidia GPUs) do

Latency and Bandwidth #

  • Memory bandwidth: rate at which memory system can provide data to a processor (e.g., 20 GB/s)
    • Execution speed is often limited by available bandwidth
    • e.g. suppose the memory system can transfer 8 bytes per clock, but a simple element-wise add (1 add per clock) requires two loads and one store of 4-byte floats, i.e. 3 memory operations (12 bytes) per clock; the processor requests data faster than memory can deliver it (a back-of-the-envelope check follows this list)
    • This is called bandwidth-bound execution: instruction throughput is limited by memory bandwidth, not by memory latency or the number of outstanding requests
    • Influences chip design: e.g. modern GPUs place memory on or near the chip to provide high bandwidth
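
A back-of-the-envelope check of the example above (a sketch with illustrative numbers, not measurements; the clock rate and bandwidth values are assumptions):

#include <cstdio>

int main() {
    double clock_ghz     = 4.2;   // assumed core clock
    double bytes_per_add = 12.0;  // 2 loads + 1 store of 4-byte floats per add
    double needed_gbs    = clock_ghz * bytes_per_add;  // GB/s needed at 1 add/clock
    double available_gbs = 20.0;  // example memory bandwidth from above

    std::printf("needed: %.1f GB/s, available: %.1f GB/s\n",
                needed_gbs, available_gbs);
    // needed (~50 GB/s) >> available (20 GB/s) => execution is bandwidth-bound
}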