Parallel Models

Models of Parallel Programming #

Three types of models:

  • Shared address space: very little structure to communication
  • Message passing: communication is structured in the form of messages
    • Communication explicit in source code
  • Data parallel: more rigid structure to computation
    • Same function applied to each element of a large collection

Shared address space model #

  • All threads access the same memory
    • Requires coordinated access to shared data using locks, e.g.
      // thread 1
      int x = 0;
      Lock my_lock;
      
      spawn_thread(foo, &x, &my_lock);
      
      my_lock.lock();
      x++;
      my_lock.unlock();
      
      // thread 2
      void foo (int *x, Lock *my_lock) {
          my_lock->lock();
          (*x)++;  // increment the shared value, not the pointer
          my_lock->unlock();
      
          std::cout << *x << std::endl;
      }
      
    • Locks are needed because the instructions implementing the adds and prints may be interleaved across threads; locks make the programmer-intended operations atomic
      • Note: can sometimes specify atomicity in other ways:
        • Language-supported atomicity: atomic { ... }
        • Hardware-supported atomic read-modify-write intrinsics: atomicAdd(x, 10) (see the std::atomic sketch after this list)
  • Any processor can directly reference any memory location
    • On some processors, ring or grid bus connects cores to enable this access
    • Non-uniform memory access (NUMA): the latency of accessing a given memory location differs depending on which physical core performs the access
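
A minimal sketch of the hardware-supported route using C++'s std::atomic (an illustration; the notes don't prescribe a specific API): fetch_add performs the read-modify-write atomically, so no lock is needed.

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> x{0};

    // each increment is a single atomic read-modify-write,
    // analogous to the atomicAdd intrinsic mentioned above
    std::thread t([&] { x.fetch_add(1); });
    x.fetch_add(1);
    t.join();

    std::cout << x << std::endl;  // always prints 2
}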

Message-passing model #

  • Threads operate within own private address space
  • Communication via sending/receiving messages (a minimal MPI sketch follows this list)
    • send: specifies recipient, buffer to be transmitted, and an identifier (tag)
    • receive: specifies sender, buffer in which to store the data, and an identifier (tag)
  • This lets hardware operate in parallel without having to implement system-wide loads and stores
    • Can create a large parallel machine using connected message-passing commodity machines
      • e.g. Clusters and supercomputers do this
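
The send/receive pairing above maps directly onto MPI, the standard message-passing library (a minimal sketch; the notes don't name a specific library):

// compile with mpicxx; run with: mpirun -np 2 ./a.out
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;
    float value = 0.0f;

    if (rank == 0) {
        value = 42.0f;
        // send: recipient (rank 1), buffer, count/type, tag
        MPI_Send(&value, 1, MPI_FLOAT, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receive: sender (rank 0), buffer to store data, tag
        MPI_Recv(&value, 1, MPI_FLOAT, 0, TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}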

Data-parallel model #

  • Organize computation as operations on sequences of elements
    • e.g. same function on all elements of a sequence
  • Common example: numpy C = A + B, where C, A, B are vectors of equal length (a C++ analogue follows this list)
  • Sequence: key data type of the model
    • C++-like: Sequence<T>
    • Scala: List[T]
    • Haskell: Seq a
    • numpy: ndarray
  • Program can only access elements of a sequence through sequence operators, e.g. map, reduce, scan, shift, etc.
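
A C++ analogue of the numpy example, using the C++17 parallel algorithms (a sketch; the notes use numpy, this just mirrors it). The element-wise add is expressed as a sequence operator, so the library is free to process elements in any order.

#include <algorithm>
#include <execution>
#include <functional>
#include <vector>

int main() {
    std::vector<float> A = {1, 2, 3, 4};
    std::vector<float> B = {10, 20, 30, 40};
    std::vector<float> C(A.size());

    // C = A + B as a data-parallel sequence operation; the par
    // policy permits the library to parallelize across elements
    std::transform(std::execution::par, A.begin(), A.end(),
                   B.begin(), C.begin(), std::plus<float>{});
}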

Map #

  • Higher-order function: takes a function as an argument and operates on sequences
  • Applies function f :: a -> b to elements of input sequence, to produce output sequence of same length
    • Function should be side-effect-free
  • Parallelization: due to side-effect-freeness, applying f to all elements of the sequence can be done in any order without changing program correctness
    • Map implementation can reorder or parallelize element processing when convenient

Data parallelism in ISPC #

export void absolute_value(
    uniform int N,
    uniform float *x,
    uniform float *y
) {
    // loop body is the function that gets mapped over x
    // by the foreach loop
    foreach (i = 0...N) {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}

Parallel programming basics #

Creating a parallel program #

  • Process:
    1. Identify work that can be performed in parallel
    2. Partition work and associated data
    3. Manage data access, communication, synchronization
  • Goal: maximize speedup
    • Speedup(P processors) = Time(1 processor)/Time(P processors)

Problem decomposition #

  • Break up problem into tasks that can be performed in parallel
    • Need to create enough tasks to keep all execution units on a machine busy
    • Challenge: identifying dependencies
      • Amdahl’s law: dependencies limit maximum speedup due to parallelism
        • If S is the fraction of work that must be performed serially, the speedup is at most 1/S
  • e.g. two-step computation on an NxN image (see the sketch after this list)
    • Steps: multiply brightness of all pixels by two, then compute the average of all pixel values
    • Sequential implementation: time is ~2N²
    • Parallelization:
      • Step 1: N²/P
      • Step 2: compute partial sums in parallel, combine the P partial sums serially
        • N²/P + P
      • Maximum speedup: 2N²/((2N²/P) + P)
  • Usually: programmer responsible for program decomposition
    • Automatic decomposition: open research problem
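
A sketch of the two-step decomposition with P threads; the parallel_for helper is a hypothetical name, and std::thread stands in for whatever threading facility is available.

#include <numeric>
#include <thread>
#include <vector>

int main() {
    const size_t N = 1024, P = 4;
    std::vector<float> image(N * N, 1.0f);

    // run body(begin, end, t) on P threads over equal chunks of the image
    auto parallel_for = [&](auto body) {
        std::vector<std::thread> workers;
        for (size_t t = 0; t < P; t++)
            workers.emplace_back([&, t] {
                body(t * image.size() / P, (t + 1) * image.size() / P, t);
            });
        for (auto &w : workers) w.join();
    };

    // step 1 (~N²/P per thread): double the brightness of every pixel
    parallel_for([&](size_t begin, size_t end, size_t) {
        for (size_t i = begin; i < end; i++) image[i] *= 2.0f;
    });

    // step 2 (~N²/P + P): per-thread partial sums, combined serially
    std::vector<float> partial(P, 0.0f);
    parallel_for([&](size_t begin, size_t end, size_t t) {
        for (size_t i = begin; i < end; i++) partial[t] += image[i];
    });
    float average = std::accumulate(partial.begin(), partial.end(), 0.0f)
                    / image.size();  // the serial combine is the O(P) term
    (void)average;
}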

Partitioning and assigning work #

  • Assigning tasks (things to do) to threads (workers)
  • Goals: achieve good workload balance and low comm. costs
  • Can be done statically (pre-execution) or dynamically
  • Assignment is often the responsibility of the language or runtime rather than the programmer

Assignment in ISPC #

void foo(
    uniform float *input,
    uniform float *output,
    uniform int N
) {
    // create 100 tasks; my_ispc_task is a 'task'-qualified ISPC
    // function defined elsewhere
    launch[100] my_ispc_task(input, output, N);
}
  • ISPC task abstraction: manages assignment of tasks to worker threads (a work-queue sketch follows this list)
    • After completing its current task, a worker thread inspects the task list and assigns itself the next remaining task
    • Thread pool abstraction: no need to keep respawning threads
      • Note: ISPC tasks are an abstraction for work, not threads!!
      • ISPC tasks are units of work placed on a queue, then pulled off the queue by worker threads
      • The number of worker threads typically matches the number of hardware execution contexts
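
A minimal sketch of that worker-pool pattern in C++ (illustrative, not ISPC's actual runtime): the "queue" is an atomic task counter, and each worker self-assigns the next remaining task.

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

int main() {
    const int NUM_TASKS = 100;
    const unsigned NUM_WORKERS =
        std::max(1u, std::thread::hardware_concurrency());

    std::atomic<int> next_task{0};
    std::vector<std::thread> pool;

    for (unsigned w = 0; w < NUM_WORKERS; w++) {
        pool.emplace_back([&] {
            while (true) {
                int task = next_task.fetch_add(1);  // self-assign next task
                if (task >= NUM_TASKS) break;       // queue drained
                // ... perform the work for 'task' here ...
            }
        });
    }
    for (auto &w : pool) w.join();
}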

Thread orchestration #

  • Involves structuring communication, adding synchronization to preserve dependencies, organizing data structures in memory, and scheduling tasks
  • Goals: reduce communication/synchronization costs, preserve locality of data reference, reduce overhead
  • May be impacted by machine details

Mapping threads to hardware #

  • Can be done at various points in the stack
    • OS: map a thread to a HW execution context on a CPU core
    • Compiler: map ISPC program instances to SIMD vector lanes
    • Hardware: map CUDA thread blocks to GPU cores
  • Mapping decisions:
    • Place related threads on the same processor to maximize locality and reduce communication costs
    • Place unrelated threads on the same processor (e.g., one bandwidth-limited, one compute-limited) to use the machine more efficiently