Parallel Computing #
Course themes #
Designing parallel programs that scale #
- Parallel thinking
- Decomposing work into pieces that can safely be performed in parallel
- Assigning work to processors
- Managing communication/synchronization between the processors so that it does not limit speedup (a minimal sketch of these steps follows this list)
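A minimal sketch of these three steps, assuming POSIX threads and a made-up array-sum task (neither is prescribed by the notes): the array is decomposed into chunks, each chunk is assigned to one thread, and `pthread_join` provides the synchronization before the partial results are combined.

```c
/* Sketch: parallel array sum with pthreads (illustrative only). */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static float data[N];

typedef struct {
    int   start, end;   /* this thread's chunk: [start, end) */
    float partial;      /* this thread's partial sum         */
} task_t;

static void *sum_chunk(void *arg) {
    task_t *t = (task_t *)arg;
    float s = 0.0f;
    for (int i = t->start; i < t->end; i++)
        s += data[i];
    t->partial = s;     /* each thread writes only its own task struct */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0f;

    pthread_t threads[NTHREADS];
    task_t    tasks[NTHREADS];
    int chunk = N / NTHREADS;

    /* Decomposition + assignment: one contiguous chunk per thread. */
    for (int t = 0; t < NTHREADS; t++) {
        tasks[t].start = t * chunk;
        tasks[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&threads[t], NULL, sum_chunk, &tasks[t]);
    }

    /* Synchronization: wait for all threads, then combine partial results. */
    float total = 0.0f;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += tasks[t].partial;
    }
    printf("sum = %f\n", total);
    return 0;
}
```

Because each thread accumulates into its own `partial` field, nothing is shared during the loop and no locks are needed for this particular decomposition.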
Parallel computer hardware implementation #
- How do parallel computers work?
- Mechanisms used to implement different abstractions efficiently
- Performance characteristics of implementations
- Design tradeoffs: performance vs. convenience vs. cost
- Why know about hardware?
- Characteristics of the machine matter!
- We care about efficiency and performance
Thinking about efficiency #
- Fast != efficient!
- Just because a program runs faster on a parallel computer, it does not mean it is using the hardware efficiently
- Is 2x speedup on a computer with 10 processors a good result? (see the worked numbers after this list)
- Hardware designer’s perspective: choosing the right capabilities to put in a processor
- Performance/cost
- Metric for cost: silicon area, power, etc
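One rough way to quantify the speedup question above, using the standard definitions of speedup and parallel efficiency (the symbols $T_1$, $T_P$, $S$, $E$ are not from the notes):

```latex
\text{speedup } S(P) = \frac{T_1}{T_P}, \qquad
\text{efficiency } E(P) = \frac{S(P)}{P}
% Plugging in the numbers from the question: S = 2 on P = 10 processors,
% so E = 2/10 = 0.2, i.e. only 20% of ideal linear speedup.
```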
Course logistics #
Grading #
- 56%: programming assignments (4)
- 10%: written assignments (5)
- 16%: midterm exam (evening of Nov. 15)
- 16%: final exam (Thu, Dec. 15, 3:30pm)
- 2%: asynchronous participation (website comments)
Why parallelism? #
Defining programs and processors #
- Program: a list of processor instructions
- e.g. write it in C, compile it, and the result is assembly that maps to machine code for a particular instruction set (a small example appears at the end of this section)
- What does a processor do?
- Fetch/decode: determine what instruction to run next
- Execution unit: performs operation described by instruction, which may modify values in processor registers or computer memory
- Registers: maintain program state, store value of variables used as inputs and outputs to operation
- e.g. add two numbers:
    - Figure out the next program instruction from memory
        - e.g. add r0 <- r0, r1
    - Get operation inputs from registers
        - e.g. r0 = 32, r1 = 64
    - Perform the addition operation
        - e.g. the execution unit performs the arithmetic; the result is 96
    - Figure out the next program instruction from memory
Key terminology:
- Computer program: list of instructions to execute
- Instruction: describes operation for processor to perform; typically modifies program state
- Program state: values of program data, stored in registers and memory
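To tie this terminology back to the compile-and-run example at the top of this section, here is a hypothetical one-statement C program and the kind of instruction it might compile to; the register names follow the pseudo-assembly above, not any real instruction set.

```c
/* Program: a list of instructions produced by the compiler.
 * Program state: the values of a and b, held in registers or memory. */
int add(int a, int b) {
    /* A compiler might turn this single statement into one instruction,
     * e.g.  add r0 <- r0, r1   (assuming a is in r0 and b is in r1). */
    return a + b;
}
```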
Instruction-level parallelism (ILP) #
- If instructions in a sequence are independent, they can be executed in parallel without changing the program's result
- e.g.
```
// this does a = x*x + y*y + z*z
mul r0, r0, r0   // 1: parallel
mul r1, r1, r1   // 2: parallel
mul r2, r2, r2   // 3: parallel
add r0, r0, r1   // 4: sequential post-1 and 2
add r3, r0, r2   // 5: sequential post-3 and 4
```
- Superscalar processor execution
- Idea: the processor automatically finds independent instructions in an instruction stream and executes them in parallel on multiple execution units
- Or: the compiler finds such independent instructions and encodes the dependencies in the binary
- Issue: diminishing returns
- Most available ILP is already exploited by a processor capable of issuing four instructions per clock; there is little benefit from building a processor that can issue more (a small illustration of why follows)
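One reason ILP runs out, sketched with a made-up C loop (an illustration, not from the lecture): a loop-carried dependence forms a serial chain of instructions, so extra issue slots sit idle no matter how wide the processor is.

```c
/* Hypothetical illustration: the dependence through `sum` serializes the adds,
 * so a 4-wide (or wider) superscalar core still executes them one after another. */
float sum_array(const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i];   /* each add needs the result of the previous add */
    return sum;
}
```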
Historically: avoiding parallel processing #
- Single-threaded CPU performance doubled every ~18 months (Moore’s Law)
- Implication: working to parallelize code was often not worth the time
- Do no work, wait 18 months, speed doubles!
- This was because of:
- Instruction-level parallelism scaling
- Frequency scaling: increase in CPU processing frequency (GHz)
- However:
- ILP scaling hit diminishing returns
- Frequency scaling limited by power and heat: cannot effectively cool much more than a few hundred watts in a standard desktop computer
- Intel i9-10900K (2020): 125W
- Nvidia RTX 3080 (2020): 320W
Therefore: shift to “replicating processors”: parallelism!
- 32-40x performance increase