Intro

Parallel Computing #

Course themes #

Designing parallel programs that scale #

  • Parallel thinking
    1. Decomposing work into pieces that can safely be performed in parallel
    2. Assigning work to processors
    3. Managing communication/synchronization between the processors so that it does not limit speedup (see the sketch below)
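
A minimal sketch of these three steps using OpenMP in C (illustrative only, not a course framework): each loop iteration is an independently computable piece of work, the OpenMP runtime assigns iterations to threads, and the reduction clause supplies the synchronization needed to safely combine per-thread partial sums.

    /* Minimal sketch (illustrative): sum an array in parallel with OpenMP.
     * Decompose: each loop iteration is an independent piece of work.
     * Assign: the OpenMP runtime distributes iterations across threads.
     * Communicate/synchronize: the reduction clause combines per-thread sums. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            sum += a[i];
        }

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

Compile with gcc -fopenmp (or the equivalent flag for your compiler); the key point is that the programmer decides the decomposition, while assignment and synchronization are handled by the runtime here.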

Parallel computer hardware implementation #

  • How do parallel computers work?
  • Mechanisms used to implement different abstractions efficiently
    • Performance characteristics of implementations
    • Design tradeoffs: performance vs. convenience vs. cost
  • Why know about hardware?
    • Characteristics of the machine matter!
    • We care about efficiency and performance

Thinking about efficiency #

  • Fast != efficient!
  • Just because a program runs faster on a parallel computer does not mean it is using the hardware efficiently
    • Is a 2x speedup on a computer with 10 processors a good result? (see the speedup/efficiency definitions after this list)
  • Hardware designer’s perspective: choosing the right capabilities to put in a processor
    • Performance/cost
    • Metrics for cost: silicon area, power, etc.
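
A standard way to make this concrete (standard definitions, not specific to this course): speedup measures how much faster the parallel run is than a run on one processor, and efficiency normalizes speedup by the processor count.

    % T_1 = runtime on one processor, T_P = runtime on P processors
    \[
      \mathrm{speedup}(P) = \frac{T_1}{T_P},
      \qquad
      \mathrm{efficiency}(P) = \frac{\mathrm{speedup}(P)}{P}
    \]
    % For the question above: a 2x speedup on 10 processors gives
    \[
      \mathrm{efficiency}(10) = \frac{2}{10} = 0.2 \quad (20\%)
    \]

By this measure, a 2x speedup on 10 processors leaves most of the machine's potential unused.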

Course logistics #

Grading #

  • 56%: programming assignments (4)
  • 10%: written assignments (5)
  • 16%: midterm exam (evening of Nov. 15)
  • 16%: final exam (Thu, Dec. 15, 3:30pm)
  • 2%: asynchronous participation (website comments)

Why parallelism? #

Defining programs and processors #

  • Program: a list of processor instructions
    • e.g. write it in C, compile it, and the result is assembly that maps to machine code for a particular instruction set
  • What does a processor do?
    • Fetch/decode: determine what instruction to run next
    • Execution unit: performs operation described by instruction, which may modify values in processor registers or computer memory
    • Registers: maintain program state; store values of variables used as inputs and outputs to operations
    • e.g. add two numbers
      1. Figure out next program instruction from memory
        • e.g. add r0 <- r0, r1
      2. Get operation inputs from registers e.g. r0 = 32, r1 = 64
      3. Perform addition operation e.g. execution unit performs arithmetic, result is 96
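
As a hypothetical C-level view of the same add (illustrative; actual compiled code depends on the compiler and instruction set), the function below is annotated with the fetch/decode, register-read, and execute steps described above, reusing the example values r0 = 32 and r1 = 64.

    /* Hypothetical example: a single add, annotated with the three steps a
     * simple processor performs for the corresponding instruction,
     * e.g. "add r0 <- r0, r1". */
    int add(int x, int y) {
        /* 1. Fetch/decode: read the next instruction from memory and
         *    determine that it adds two registers.                        */
        /* 2. Read operands from registers, e.g. r0 = 32 (x), r1 = 64 (y). */
        /* 3. Execute: the execution unit computes 32 + 64 = 96 and writes
         *    the result back to r0 (here, the return value).              */
        return x + y;
    }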

Key terminology:

  • Computer program: list of instructions to execute
  • Instruction: describes operation for processor to perform; typically modifies program state
  • Program state: values of program data, stored in registers and memory

Instruction-level parallelism (ILP) #

  • If certain instructions in a sequence are independent, they can be executed in parallel without changing the program’s result
    • e.g.
      // this does a = x*x + y*y + z*z (see the C sketch after this list)
      mul r0, r0, r0 // 1: parallel
      mul r1, r1, r1 // 2: parallel
      mul r2, r2, r2 // 3: parallel
      add r0, r0, r1 // 4: depends on 1 and 2
      add r3, r0, r2 // 5: depends on 3 and 4
      
  • Superscalar processor execution
    • Idea: the processor automatically finds independent instructions in an instruction sequence and executes them in parallel on multiple execution units
    • Or: the compiler finds such instructions and encodes the dependency information in the binary
    • Issue: diminishing returns
      • Most available ILP is exploited by a processor capable of issuing four instructions per clock; little benefit from building a processor with more
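
For reference, here is a hypothetical C version of the assembly sketch above (a real compiler may emit different instructions). The three multiplies have no dependences on one another, so a superscalar core can issue them in parallel, while each add must wait for its inputs to be ready.

    /* Hypothetical C source for the assembly above (a = x*x + y*y + z*z).
     * The three multiplies are mutually independent; the first add depends
     * on multiplies 1 and 2, and the final add depends on the third
     * multiply and the first add. */
    float sum_of_squares(float x, float y, float z) {
        float xx = x * x;   /* mul r0, r0, r0  (independent)      */
        float yy = y * y;   /* mul r1, r1, r1  (independent)      */
        float zz = z * z;   /* mul r2, r2, r2  (independent)      */
        float t  = xx + yy; /* add r0, r0, r1  (waits on 1 and 2) */
        return t + zz;      /* add r3, r0, r2  (waits on 3 and 4) */
    }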

Historically: avoiding parallel processing #

  • Single-threaded CPU performance doubled every ~18 months (Moore’s Law)
    • Implication: working to parallelize code was often not worth the time
    • Do no work, wait 18 months, speed doubles!
  • This was because of:
    • Instruction-level parallelism scaling
    • Frequency scaling: increase in CPU processing frequency (GHz)
  • However:
    • ILP scaling hit diminishing returns
    • Frequency scaling limited by power and heat: cannot effectively cool much more than a few hundred watts in a standard desktop computer
      • Intel i9-10900K (2020): 95W
      • Nvidia RTX 3080 (2020): 320W

Therefore: the shift to “replicating processors”: parallelism! (roughly a 32-40x performance increase)