Models of Parallel Programming #
Three types of models:
- Shared address space: very little structure to communication
- Message passing: communication is structured in the form of messages
  - Communication is explicit in the source code
- Data parallel: more rigid structure to computation
  - Same function applied to all elements of a large collection
Shared address space model #
- All threads access the same memory
- Requires coordinated access to shared data using locks, e.g.
  // thread 1
  int x = 0;
  Lock my_lock;
  spawn_thread(foo, &x, &my_lock);
  my_lock.lock();
  x++;
  my_lock.unlock();

  // thread 2
  void foo(int *x, Lock *my_lock) {
      my_lock->lock();
      (*x)++;
      my_lock->unlock();
      std::cout << *x << std::endl;
  }
  - Locks are needed because the instructions for the adds and prints in the two threads may be interleaved; locks make the programmer-intended sequence of operations atomic
  - Note: can sometimes specify atomicity in other ways (a sketch follows at the end of this section):
    - Language-supported atomicity:
      atomic { ... }
    - Hardware-supported atomic read-modify-write intrinsics:
      atomicAdd(x, 10)
- Any processor can directly reference any memory location
- On some processors, a ring or mesh interconnect connects the cores to enable this access
- Non-uniform memory access (NUMA): latency of accessing a memory location may be different from different physical cores
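As a concrete illustration of the hardware-supported path, here is a minimal C++ sketch using std::atomic, which compiles down to an atomic read-modify-write; the worker function and the value 10 are illustrative, not part of the original notes.

#include <atomic>
#include <cstdio>
#include <thread>

// Shared counter; fetch_add performs an atomic read-modify-write,
// so no explicit lock is needed for the increment itself.
std::atomic<int> x{0};

void worker() {          // illustrative worker function
    x.fetch_add(10);     // analogous to atomicAdd(x, 10)
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::printf("%d\n", x.load());   // always prints 20
}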
Message-passing model #
- Threads operate within their own private address spaces
- Communication via sending/receiving messages (a sketch follows at the end of this section)
  - send: specifies recipient, buffer to be transmitted, identifier (tag)
  - receive: specifies sender, buffer in which to store the data, identifier (tag)
- This allows hardware to operate in parallel without performing systemwide loads and stores
- Can create a large parallel machine using connected message-passing commodity machines
  - e.g. clusters and supercomputers do this
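A minimal sketch of this send/receive pattern using MPI (not from the original notes; assumes an MPI installation): rank 0 sends a buffer to rank 1, and each call names the peer, the buffer, and a tag.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 42;                 // message identifier (tag)
    float buf[4] = {1, 2, 3, 4};

    if (rank == 0) {
        // send: recipient (rank 1), buffer to be transmitted, tag
        MPI_Send(buf, 4, MPI_FLOAT, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receive: sender (rank 0), buffer to store the data, tag
        MPI_Recv(buf, 4, MPI_FLOAT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f\n", buf[0]);
    }

    MPI_Finalize();
    return 0;
}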
Data-parallel model #
- Organize computation as operations on sequences of elements
- e.g. same function on all elements of a sequence
- Common example (numpy): C = A + B, where C, A, B are vectors of equal length
- Sequence: key data type of the model
  - C++-like: Sequence<T>
  - Scala: List[T]
  - Haskell: seq T
  - numpy: ndarray
- Program can only access elements of a sequence through sequence operators, e.g. map, reduce, scan, shift, etc.
Map #
- Higher-order function: takes a function as an argument and operates on sequences
- Applies function f :: a -> b to each element of the input sequence, producing an output sequence of the same length
  - The function should be side-effect-free
- Parallelization: because f is side-effect-free, it can be applied to the elements of the sequence in any order without changing program correctness
  - The map implementation can reorder or parallelize element processing when convenient (see the sketch below)
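For comparison with the ISPC version below, here is a rough C++ sketch of the same absolute-value map, assuming a C++17 toolchain with parallel algorithm support; std::execution::par tells the library it may process elements in any order or in parallel, which is exactly what side-effect-freedom permits.

#include <algorithm>
#include <execution>
#include <vector>

// map: apply a side-effect-free function to every element of the input,
// producing an output sequence of the same length.
std::vector<float> absolute_value(const std::vector<float> &x) {
    std::vector<float> y(x.size());
    std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                   [](float v) { return v < 0 ? -v : v; });
    return y;
}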
Data parallelism in ISPC #
export void absolute_value(
    uniform int N,
    uniform float *x,
    uniform float *y
) {
    // loop body is the function that gets mapped over x
    // by the foreach loop
    foreach (i = 0 ... N) {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}
Parallel programming basics #
Creating a parallel program #
- Process:
- Identify work that can be performed in parallel
- Partition work and associated data
- Manage data access, communication, synchronization
- Goal: maximize speedup
- Speedup(P processors) = Time(1 processor)/Time(P processors)
Problem decomposition #
- Break up problem into tasks that can be performed in parallel
- Need to create enough tasks to keep all execution units on a machine busy
- Challenge: identifying dependencies
- Amdahl’s law: dependencies limit the maximum speedup due to parallelism
  - If S is the fraction of work that must be performed serially, the maximum speedup is at most 1/S
- e.g. two-step computation on an N×N image (a sketch follows at the end of this section)
  - Steps: multiply the brightness of all pixels by two, then compute the average of all pixel values
  - Sequential implementation: time is ~2N²
  - Parallelization with P processors:
    - Step 1: N²/P
    - Step 2: compute partial sums in parallel, combine serially: N²/P + P
  - Maximum speedup: 2N² / (2N²/P + P)
- Usually, the programmer is responsible for program decomposition
- Automatic decomposition: an open research problem
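A rough sketch of the two-step example above, assuming OpenMP is available (the pragmas and function name are illustrative, not from the original notes): step 1 is a parallel map over the pixels, and step 2 computes partial sums in parallel, with the combine handled by the reduction clause.

#include <vector>

// Two-step computation on an NxN image stored as a flat array.
float two_step(std::vector<float> &image, int N) {
    // Step 1: multiply the brightness of every pixel by two (~N^2/P with P threads)
    #pragma omp parallel for
    for (int i = 0; i < N * N; i++)
        image[i] *= 2.0f;

    // Step 2: partial sums in parallel; the reduction clause combines the
    // per-thread partials (a serial step proportional to P)
    float sum = 0.0f;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N * N; i++)
        sum += image[i];

    return sum / (N * N);   // average pixel value
}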
Partitioning and assigning work #
- Assigning tasks (things to do) to threads (workers)
- Goals: achieve good workload balance and low comm. costs
- Can be done statically (before execution) or dynamically (at run time); see the sketch below
- Assignment is often handled by the language or runtime system
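A minimal sketch of static assignment using plain std::thread (the function name and the doubling operation are illustrative): the iteration space is partitioned into contiguous blocks before execution, one block per worker.

#include <algorithm>
#include <thread>
#include <vector>

void static_map(std::vector<float> &data, int num_threads) {
    std::vector<std::thread> workers;
    int n = static_cast<int>(data.size());
    int block = (n + num_threads - 1) / num_threads;   // ceiling division

    for (int t = 0; t < num_threads; t++) {
        int begin = t * block;
        int end = std::min(n, begin + block);
        workers.emplace_back([&data, begin, end] {
            for (int i = begin; i < end; i++)
                data[i] *= 2.0f;            // each worker owns one fixed block
        });
    }
    for (auto &w : workers) w.join();
}

A dynamic scheme instead hands out work at run time, which is the pattern the ISPC task system below uses.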
Assignment in ISPC #
void foo(
    uniform float *input,
    uniform float *output,
    uniform int N
) {
    // create 100 tasks; the ISPC runtime assigns them to its worker threads
    launch[100] my_ispc_task(input, output, N);
}
- ISPC task abstraction: managed assignment of tasks to threads
- After completing current task, worker thread inspects list and assigns itself next remaining task
- Thread pool abstraction: no need to keep respawning threads
- Note: ISPC tasks are an abstraction for work, not threads!!
- ISPC tasks are units of work placed on a queue and then pulled off the queue by worker threads (see the sketch below)
- The number of worker threads is the number of execution units on the hardware
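A rough sketch of the work-queue/thread-pool idea described above, in standard C++ (the types and names are illustrative, not ISPC's actual implementation): tasks sit in a shared queue, and a fixed pool of worker threads pulls the next remaining task until none are left.

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

void run_tasks(std::queue<std::function<void()>> tasks) {
    std::mutex m;                                        // protects the task queue
    unsigned num_workers = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;                       // thread pool: spawned once

    for (unsigned w = 0; w < num_workers; w++) {
        pool.emplace_back([&] {
            while (true) {
                std::function<void()> task;
                {
                    std::lock_guard<std::mutex> lock(m);
                    if (tasks.empty()) return;           // no work left for this worker
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task();                                  // worker assigns itself the next task
            }
        });
    }
    for (auto &t : pool) t.join();
}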
Thread orchestration #
- Involves structuring communication, adding synchronization to preserve dependencies, organizing data structures in memory, and scheduling tasks
- Goal: reduce comm/sync costs, preserve data reference locality, reduce overhead
- May be impacted by machine details
Mapping threads to hardware #
- Can be done at various points in the stack
  - OS: maps threads to hardware execution contexts on CPU cores
  - Compiler: maps ISPC program instances to SIMD vector lanes
  - Hardware: maps CUDA thread blocks to GPU cores
- Mapping decisions:
  - Place related threads on the same processor to improve data reference locality
  - Place unrelated threads on the same processor (e.g., one bandwidth-limited, one compute-limited) to use the machine more efficiently