Parallel Computing #

Course themes #

Designing parallel programs that scale #

Parallel thinking
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication/synchronization between the processors to not limit speedup

Parallel computer hardware implementation #

How do parallel computers work?
Mechanisms used to implement different abstractions efficiently
- Performance characteristics of implementations
- Design tradeoffs: performance vs. convenience vs. cost
Why know about hardware?
- Characteristics of machine matter!
- We care about efficiency and performance

Thinking about efficiency #

Fast != efficient!
Just because program runs faster on a parallel computer, it does not mean it is using hardware efficiently
- Is 2x speedup on a computer with 10 processors a good result?
Hardware designer’s perspective: choosing the right capabilities to put in a processor
- Performance/cost
- Metric for cost: silicon area, power, etc

Course logistics #

Grading #

56%: programming assignments (4)
10%: written assignments (5)
16%: midterm exam (evening of Nov. 15)
16%: final exam (Thu, Dec. 15, 3:30pm)
2%: asynchronous participation (website comments)

Why parallelism? #

Defining programs and processors #

Program: a list of processor instructions
- e.g. write it in C, <compile>, results in assembly that maps to machine code for a certain instruction set
What does a processor do?
- Fetch/decode: determine what instruction to run next
- Execution unit: performs operation described by instruction, which may modify values in processor registers or computer memory
- Registers: maintain program state, store value of variables used as inputs and outputs to operation
- e.g. add two numbers
  1. Figure out next program instruction from memory
    - e.g. add r0 <- r0, r1
  2. Get operation inputs from registers e.g. r0 = 32, r1 = 64
  3. Perform addition operation e.g. execution unit performs arithmetic, result is 96

Key terminology:

Computer program: list of instructions to execute
Instruction: describes operation for processor to perform; typically modifies program state
Program state: values of program data, stored in registers and memory

Instruction-level parallelism (ILP) #

If certain instructions in a certain sequence are independent, it is possible to parallelize them without affecting program order

e.g.

// this does a = x*x + y*y + z*z
mul r0, r0, r0 // 1: parallel
mul r1, r1, r1 // 2: parallel
mul r2, r2, r2 // 3: parallel
add r0, r0, r1 // 4: sequential post-1 and 2
add r3, r0, r2 // 5: sequential post-3 and 4

Superscalar processor execution
- Idea: processor automatically finds independent instructions in an execution environment and executes them in parallel on multiple execution units
- Or: compiler finds such instructions and encodes such dependencies in the binary
- Issue: diminishing returns
  - Most available ILP is exploited by a processor capable of issuing four instructions per clock; little benefit from building a processor with more

Historically: avoiding parallel processing #

Single-threaded CPU performance doubled every ~18 months (Moore’s Law)
- Implication: working to parallelize code was often not worth the time
- Do no work, wait 18 months, speed doubles!
This was because of:
- Instruction-level parallelism scaling
- Frequency scaling: increase in CPU processing frequency (GHz)
However:
- ILP scaling hit dimishing returns
- Frequency scaling limited by power and heat: cannot effectively cool much more than a few hundred watts in a standard desktop computer
  - Intel i9-10900K (2020): 95W
  - Nvidia RTX 3080 (2020): 320W

Therefore: shift to “replicating processors”: parallelism! - 32-40x performance increase