

PDC Summer School August 20, 2019

#### David Broman

Associate Professor, KTH Royal Institute of Technology





### **Different Kinds of Computer Systems**







#### Moore's law:

- Integrated circuit resources (transistors) double every 18-24 months.
- By Gordon E. Moore, Intel's co-founder, 1960s.
- Possible because of refined manufacturing processes. E.g., Intel Core i7-6800 processors uses 14nm manufacturing.
- Sometimes considered a *self-fulfilling prophecy*. Served as a goal for the semiconductor industry.

| David Broman | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
|              | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



#### Have we reached the limit?

| Why?                                                                 |                                               | -            | the last de<br>sed dramati                    |                                        | lock rate has                              |       |
|----------------------------------------------------------------------|-----------------------------------------------|--------------|-----------------------------------------------|----------------------------------------|--------------------------------------------|-------|
| The Powe                                                             | r Wall                                        | 1993<br>1997 | 9: 80486,<br>3: Pentium,<br>7: Pentium P      | 66I<br>ro, 200                         | MHz<br>Mhz<br>)MHz                         |       |
|                                                                      |                                               | 2004         | 1: Pentium 4<br>1: Pentium 4<br>0: Intel Xeor | , 3.6<br>n W, 3.2 GH                   | GHz<br>GHz<br>Iz, 8 Cores                  |       |
| http://www.publicdomainpictures.net/<br>image=1281&picture=tegelvagg | ck rate                                       |              |                                               | d since 20                             | <b>06: Multicore</b><br>ds (but will end s | soon) |
|                                                                      | the system enoug                              |              | •                                             |                                        | a chip: multicor<br>rallel programm        |       |
| David Broman<br>dbro@kth.se                                          | <b>Part I</b><br>Assembly and<br>Machine Code |              | <b>t II</b><br>cessor<br>sign                 | <b>Part III</b><br>Memory<br>Hierarchy | Part IV<br>Parallel Proc<br>and Program    |       |



### **Abstractions in Computer Systems**



KTH VETENSKAP OCH KONST

### This Course Module in one Slide





## Part I

## **Assembly and Machine Code**



Acknowledgement: The structure and several of the good examples are derived from the book "Digital Design and Computer Architecture" (2013) by D. M. Harris and S. L. Harris.





#### The Instruction Set Architecture (ISA) and its Surrounding



#### The ISA is the interface between hardware and software.

- Instructions: Encoding and semantics Registers •
- Memory

#### The microarchitecture is the implementation.

For instance, both Intel and AMD implement the x86 ISA, but they have different implementations.

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



#### **Different ISAs**

MIPS is the focus in this course because i) it is relatively easy to understand ii) most text books focus on MIPS.

We will only briefly compare with ARM and x86, but they are complex...



#### Instructions (1/2) CISC vs. RISC

Each ISA has a set of instructions. Two main categories:

#### **Complex Instruction Set Computers (CISC)**

- Many special purpose instructions.
- Example: x86. Now almost 900 instructions.
- Typically various encoding lengths (x86, 1-15 bytes)
- Different number of clock cycles, depending on instruction.

#### **Reduced Instruction Set Computers (RISC)**

- Few, regular instructions. Minimize hardware complexity.
- MIPS is a good example (ARM mostly RISC)
- Typically fixed instruction lengths (e.g., 4 bytes for MIPS)
- Typically one clock cycle per instruction (excluding memory accesses and cache misses)

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



### Instructions (2/2) C code, Assembly Code, and Machine Code

| C Code<br>a = b + c;                       | The compiler maps (if possible) C variables<br>to <b>registers</b> (small fast memory locations)<br>For instance, <b>a</b> to <b>\$s0</b> , <b>b</b> to <b>\$s1,</b> and <b>c</b> to <b>\$s2</b>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                        |                                                       |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------|-------------------------------------------------------|
| MIPS Assembly Code<br>add \$s0, \$s1, \$s2 | The assembly or readable form of the second |                                        |                                                       |
| MIPS Machine Code                          | Each assembly instruction is mapped to one or<br>more machine code instructions.<br>In MIPS, each instruction is 32 bits.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                        |                                                       |
| David Broman<br>dbro@kth.se                | <b>Part II</b><br>Processor<br>Design                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | <b>Part III</b><br>Memory<br>Hierarchy | <b>Part IV</b><br>Parallel Processors<br>and Programs |
|                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                        | 12                                                    |

| KTH<br>VETENSKAP<br>OCH KONST |
|-------------------------------|
|-------------------------------|

## Registers

| Name<br>\$0                 | Number<br>0                            | <b>Use</b><br>constant value of 0     |                                        |                                                |
|-----------------------------|----------------------------------------|---------------------------------------|----------------------------------------|------------------------------------------------|
| \$at                        | 1                                      | assembler tempora                     | ary                                    |                                                |
| \$v0-\$v1                   | 2-3                                    | function return valu                  | le                                     |                                                |
| \$a0-\$a3                   | 4-7                                    | function arguments                    | 5                                      |                                                |
| \$t0-\$t7                   | 8-15                                   | temporary (caller-s                   | aved)                                  |                                                |
| \$s0-\$s7                   | 16-23                                  | saved variables (callee-saved)        |                                        |                                                |
| \$t8-\$t9                   | 24-25                                  | temporary (caller-saved)              |                                        |                                                |
| \$k0-\$k1                   | 26-27                                  | reserved for OS kernal                |                                        |                                                |
| \$gp                        | 28                                     | global pointer                        |                                        |                                                |
| \$sp                        | 29                                     | stack pointer                         |                                        |                                                |
| \$fp                        | 30                                     | frame pointer                         |                                        |                                                |
| \$ra                        | 31                                     | function return add                   | lress                                  |                                                |
| David Broman<br>dbro@kth.se | Part I<br>Assembly and<br>Machine Code | <b>Part II</b><br>Processor<br>Design | <b>Part III</b><br>Memory<br>Hierarchy | Part IV<br>Parallel Processors<br>and Programs |



#### Memory

Big problem if 32 registers set the limit of the number of variables in a program. Solution: memory.

| Word address | ;  |    |    |    |        |
|--------------|----|----|----|----|--------|
| •            |    |    |    |    | •      |
| •            |    |    |    |    | •      |
|              |    |    |    |    | •      |
| 0000 000C    | 0f | a0 | b0 | 12 | Word 3 |
| 8000 0008    | 44 | 93 | 4e | aa | Word 2 |
| 0000 0004    | 33 | fa | 01 | 23 | Word 1 |
| 0000 0000    | 21 | a0 | 1b | 33 | Word 0 |
| Byte address | 0  | 1  | 2  | 3  |        |

#### Memory

- Has many more data locations than registers.
- Accessing memory is slower than accessing registers.

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



### **MIPS Reference Sheet**

| <section-header></section-header>                           | the MIPS<br>coding.                   | tes an importan<br>instructions a      | nd their                                       |
|-------------------------------------------------------------|---------------------------------------|----------------------------------------|------------------------------------------------|
| David Broman<br>dbro@kth.se Part I<br>Assembly<br>Machine C | <b>Part II</b><br>Processor<br>Design | <b>Part III</b><br>Memory<br>Hierarchy | Part IV<br>Parallel Processors<br>and Programs |





| <b>Branch if equal (beq)</b> branches if two operands have equal values.                                                                                                                                                        |                                                                                                                                                          | <b>ot equal (bne)</b><br>o not have equa | branches if two<br>al values.                                                                          |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------|--------------------------------------------------------------------------------------------------------|
| addi \$\$0, \$0, 4<br>xori \$\$1, \$\$0, 1<br>\$11 \$t0, \$\$1, 1<br>beq \$t0, \$\$0, foo<br>add \$\$1, \$\$1, \$\$0<br>foo:<br>add \$\$5, \$\$1, \$0<br>What is the value of \$\$5?<br>Stand for 9, sleep for 10.<br>Answer: 9 | in \$s1=5. Shift<br>\$t0 is 10. Hence<br>equal, so the b<br>add is executed<br>\$s1 is 9.<br>There is n<br>but <b>add</b> ca<br>done here<br>Note: There | is a <b>pseudoin</b><br>MIPS assembl     | s in that<br>are not<br>en and<br>n that<br>tion in MIPS,<br>this (as it is<br><b>struction</b> called |
| David Broman<br>dbro@kth.se<br>Part I<br>Assembly and<br>Machine Code                                                                                                                                                           | <b>Part II</b><br>Processor<br>Design                                                                                                                    | <b>Part III</b><br>Memory<br>Hierarchy   | Part IV<br>Parallel Processors<br>and Programs                                                         |
|                                                                                                                                                                                                                                 |                                                                                                                                                          |                                          |                                                                                                        |



#### Stored Programs with Instruction Encoding Formats







## **I-Type (immediate-type)** instructions have two register operands and one immediate operand.



|              | ) Part I     | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



### **Summary Part I**

#### Some key take away points:

- Moore's law: Integrated circuit resources (transistors) double every 18-24 months.
- **The Power Wall**: Clock rates cannot be increased anymore. Too high power; the chip gets too hot.
- An Instruction Set Architecture (ISA) defines the software/hardware interface, whereas a microarchitecture implements an ISA.



• It is important to understand **the concept of assembly programming**, although very few programs are actually written in assembly today.

David Broman dbro@kth.se Part I Assembly and Machine Code

Part II Processor Design **Part III** Memory Hierarchy **Part IV** Parallel Processors and Programs



## Part II

## Processor Design



KTH VETENSKAP OCH KONST

### Arithmetic Logic Unit (ALU)

An **ALU** saves hardware by combining different arithmetic and logic operations in one single unit/element.





**Data Path and Control Unit** 

#### A processor is typically divided into two parts

|  |  | Anna Bealthite Branch - |  |
|--|--|-------------------------|--|
|--|--|-------------------------|--|

#### **Data Path**

- Operates on a word of data.
- Consists of elements such as registers, memory, ALUs etc.



#### **Control Unit**

• Gets the current instruction from the data path and tells the data path how to execute the instruction.

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



### Instructions

In this lecture, we construct a microarchitecture for a subset of a MIPS processor with the following instructions





### **Read Instruction from the Current PC**







### **1w instruction – Read Data Word**





### **1w instruction – Write Back**





dbro@kth.se

#### **1w** instruction – Increment PC



Hierarchy

and Programs

Machine Code Design



#### **1w** instruction – Timing





## Data Path for Instructions add, sub, and, or, slt, addi, lw, sw, beq, j





## Part II







### What to Control?





**Control Unit Structure** 



33

34



|              | Part I       | 🔵 Part II | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



interesting.

Example: SPEC CPU Benchmark

| David Broman | Part I       | Part II   | Part III  | <b>Part IV</b>      |
|--------------|--------------|-----------|-----------|---------------------|
|              | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



#### Critical Path Example: Load Word (1w) Instruction



36

### **Performance Analysis (Revisited)**





## Part II

## **Processor Design**





# Parallelism and Pipelining (1/6) Definitions

**Processing System**: A system that takes input and produces outputs.



**Token**: An input that is processed by the processing system and results in an output.

**Latency**: The time it takes for the system to process one token.

**Throughput**: The number of tokens that can be processed per time unit.

| David | Broman  |
|-------|---------|
| dbro@ | )kth.se |

**Part I** Assembly and Machine Code Part II Processor Design **Part III** Memory Hierarchy

**Part IV** Parallel Processors and Programs





**Example**: Assume we have a Christmas card factory with two machines (M1 and M2).

M1: Prints out the card (takes 6s) M2: Puts on a stamp (takes 4s)

**Approach 1.** Process tokens **sequentially**. In this case a token is a card.





### Parallelism and Pipelining (3/6) Parallel Processing (Spatial Parallelism)



M1: Prints out the card (takes 6s) Example: Assume we have a Christ-M2: Puts on a stamp (takes 4s) mas card factory with four machines. M3: Prints out the card (takes 6s) Approach 2. Process tokens in **M4:** Puts on a stamp (takes 4s) parallel using more machines. The throughput is 2 \* 1/10 = The **latency** is 6 + 4 = 10s0.2 tokens per second or 12 tokens per minute. M1 M1 M2 M2 Μ4 M4 **M**3 **M**3 S 0 2 4 6 8 10 12 14 16 18 20 22 24 26

| David Broman | <b>Part I</b><br>Assembly and<br>Machine Code | Part II<br>Processor | Part III<br>Memory | <b>Part IV</b><br>Parallel Processors<br>and Programs |
|--------------|-----------------------------------------------|----------------------|--------------------|-------------------------------------------------------|
| dbro@kth.se  |                                               | Design               | Hierarchy          |                                                       |



### Parallelism and Pipelining (4/6) Pipelining (Temporal Parallelism)







### Parallelism and Pipelining (6/6) Performance Analysis for Pipelining

**Idea:** We introduce a pipeline in the processor

How does this affect the execution time?





#### Towards a Pipelined Datapath (1/8)

Recall the single-cycle data path (the logic for the j and beg instructions is hidden) CLK CLK CLK Inst WE3 32 25:21 A1 RD WE 32 RD1 PC Instruction Memory 32 PC<sub>next</sub> 20:16 RD A2 RD2 32 32 Data Memory A3 WD3 32 WD **7**32 20:16 Õ 15:11 <<2 15:0 Sign Extend 32 Part I Part IV Part II Part III Assembly and **Parallel Processors** Processor Memory David Broman Machine Code Hierarchy and Programs Design dbro@kth.se



#### Towards a Pipelined Datapath (2/8) Fetch Stage





#### Towards a Pipelined Datapath (3/8) Decode Stage





#### Towards a Pipelined Datapath (4/8) Execute Stage





#### Towards a Pipelined Datapath (5/8) Memory Stage





## Part II

### **Processor Design**







### Data Hazards (1/4) Read after Write (RAW)



|                         | value \$s0<br>But \$s    | instruction writes<br>in cycle 5<br>0 is used in the<br>e phase in cycle | $\backslash$      | A <b>data hazard</b> occurs when an instruction reads a register that has not yet been written to.<br>This kind of data hazard is called |
|-------------------------|--------------------------|--------------------------------------------------------------------------|-------------------|------------------------------------------------------------------------------------------------------------------------------------------|
|                         | \$s1, \$s2<br>\$s0, \$t1 | 1 2 3<br>F D E<br>F D                                                    | 4 5<br>M W<br>E M | read after write (RAW) hazard.<br>W                                                                                                      |
| and \$t2,<br>xori \$t3, | \$t1, \$s0               | F                                                                        | D E               | M W                                                                                                                                      |

|              | Part I                    | Part II   | Part III  | Part IV             |
|--------------|---------------------------|-----------|-----------|---------------------|
| David Broman | Assembly and Machine Code | Processor | Memory    | Parallel Processors |
| dbro@kth.se  |                           | Design    | Hierarchy | and Programs        |





#### Data Hazards (3/4) Solution 1: Forwarding (partially)







#### Data Hazards (4/4) Solution 2: Stalling





### Control Hazards (1/5) Assume Branch Not Taken

| 20: beq \$s1,               | 1<br>\$s2, 40 F                                                                 | 2 3 4<br>D E                   | 5 addr<br>equa<br>stag                 | nputes the branch target<br>ress and compares for<br>ality in the execute (E)<br>e. If branch taken, |
|-----------------------------|---------------------------------------------------------------------------------|--------------------------------|----------------------------------------|------------------------------------------------------------------------------------------------------|
| 24: sub \$t0,               | \$s0, \$t1                                                                      | FD                             | M W                                    | ate the PC in the<br>nory (M) stage.                                                                 |
| 28: and \$t2,               | \$t1, \$s0                                                                      | F                              | EMW                                    |                                                                                                      |
| 2C: xori \$t3,              | \$s0, 12                                                                        | F                              | D E M                                  | W                                                                                                    |
| <br>64: addi \$t3,          | \$s0, 100                                                                       |                                | F D E                                  | MW                                                                                                   |
| to flush                    | ranch is taken, we r<br>the pipeline. We ha<br><b>misprediction pe</b><br>cles. | ave a                          |                                        |                                                                                                      |
| David Broman<br>dbro@kth.se | <b>Part I</b><br>Assembly and<br>Machine Code                                   | Part II<br>Processor<br>Design | <b>Part III</b><br>Memory<br>Hierarchy | <b>Part IV</b><br>Parallel Processors<br>and Programs                                                |







Design

and Programs

KTH vetenskap

#### Control Hazards (2/2) Deeper Pipelines

Machine Code

How can we handle deep pipelines, and minimize misprediction?



 Statically (at compile time) determine if a branch is taken or not. For instance, predict branch not taken.

Hierarchy



#### **Dynamic Branch Predictors**

- Dynamically (at runtime) predict if a branch will be taken or note.
- Operates in the fetch state.
- Maintains a table, called the branch target buffer, that contains hundreds or thousands of executed branch instructions, their destinations, and information if the branches were taken or not.

|              | Part I       | Part II   | Part III  | Part IV                          |
|--------------|--------------|-----------|-----------|----------------------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors and Programs |
| dbro@kth.se  | Machine Code | Design    | Hierarchy |                                  |



## Part II



## **Processor Design**

x86





### Summary Part II

#### Some key take away points:

- The data path consists of sequential logic that performs processing of words in the processor.
- The **control unit** decodes instructions and tells the data path what to do.
- Pipelining is a temporal way of achieving parallelism
- Pipelining introduces pipeline hazards. There are two main kind of hazards: **data hazards** and **control hazards**.



60

Part II Processor Design **Part III** Memory Hierarchy

**Part IV** Parallel Processors and Programs



## Part III

## **Memory Hierarchies**



62

## Abstractions in Computer Systems





#### Memory Technologies (1/2) SRAM and DRAM

#### SRAM (Static Random Access Memory)

- · Simple integrated circuit, usually with one access port.
- Today, on-chip memory (same as processor)
- Access time 0.5-2.5ns Cost per GiB in 2012: \$500-\$1000

#### **DRAM (Dynamic Random Access Memory)**

- Memory stored in capacitors need to be refreshed
- One transistor per bit much cheaper than SRAM
- SDRAM (synchronous DRAM). Uses clocks. Transfer data in bursts.
- DDR (Double Data Rate) SDRAM. Transfer data both on rising and falling clock edge.
- Douglas Whitaker, Wikipedia, CC BY-SA 2.5 Access time: 50-70ns, Cost per GiB in 2012: \$10-\$20

#### Source: Patterson and Hennessy, 2012

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



#### Memory Technologies (2/2) Flash and Magnetic Disk



Wikipedia, CC BY-SA 3.0



Wikipedia Evan Amos, CC BY-SA 3.0

#### Flash Memory

- Electrically erasable programmable read-only memory (EEPROM)
- · Can wear out
- Used in solid state drivers (SSD)
- Access time: 5,000-50,000 ns, Cost per GiB in 2012 \$0.75-\$1

#### **Magnetic Disk**

- Collection of platters that spin 5,400 to 15,000 revolutions per minutes (rpm).
- Access time: 5,000,000-50,000,000 ns Cost per GiB in 2012 \$0.05-\$0.10

Clearly, there is a tradeoff between cost, access time, and size

#### How can we utilize these differences? Source: Patterson and Hennessy, 2012

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |





#### **Memory Hierarchy**





66

## Part III

## **Memory Hierarchies**





#### **Reading From Memory**







### **Cache Terminology**





#### Direct Mapped Cache (1/3) The mapping







### Direct Mapped Cache (2/3) Cache Fields



| An ado<br>certain<br>compo |                             | 27-bit 3-bit 2-b<br>11111                                                                    | it A <b>byte offs</b><br>✓ specifies the<br>in the block | e byte                                                                                                                                                                            |                                                |
|----------------------------|-----------------------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|
| -                          | field<br>ifies which<br>ess | A <b>set</b> field (also<br>called the <b>index</b> )<br>specifies which<br>set in the cache | )                                                        | How do we know wh<br>blocks that are store<br><b>Answer:</b> We store the<br>field in the cache.<br>How do we know if the<br>stored in the cache of<br><b>Answer</b> : We add a w | ed in the cache?<br>he tag<br>he data          |
| vid Broman<br>o@kth.se     |                             | <b>t I</b><br>embly and<br>chine Code                                                        | <b>Part II</b><br>Processor<br>Design                    | Part III<br>Memory<br>Hierarchy                                                                                                                                                   | Part IV<br>Parallel Processors<br>and Programs |



Davi dbro

#### Direct Mapped Cache (3/3) Hardware Implementation



David Broman<br/>dbro@kth.sePart IPart IIPart IIIPart IVDavid Broman<br/>dbro@kth.seAssembly and<br/>Machine CodeProcessor<br/>DesignMemory<br/>HierarchyParallel Processors<br/>and Programs



#### Loop Example 1 Data Cache – Temporal Locality



| addi  | \$t0, | \$O, 5    |
|-------|-------|-----------|
| loop: |       |           |
| beq   | \$t0, | \$0, done |
| lw    | \$t1, | 0x4(\$0)  |
| lw    | \$t2, | 0xC(\$0)  |
| addi  | \$t0, | \$t0, -1  |
| j     | loop  |           |
| done: |       |           |

| Va  | lid | Tag     | Data        |                               |
|-----|-----|---------|-------------|-------------------------------|
|     | 0   |         |             | Set 7, 111 <sub>2</sub>       |
|     | 0   |         |             | Set 6, <b>110<sub>2</sub></b> |
|     | 0   |         |             | Set 5, <b>101<sub>2</sub></b> |
|     | 0   |         |             | Set 4, 100 <sub>2</sub>       |
|     | 1   | 0000    | mem[0x000C] | Set 3, 011 <sub>2</sub>       |
|     | 0   |         |             | Set 2, 010 <sub>2</sub>       |
|     | 1   | 0000    | mem[0x0004] | Set 1, 001 <sub>2</sub>       |
|     | 0   |         |             | Set 0, 000 <sub>2</sub>       |
| 1 k | oit | 27 bits |             |                               |

**Exercise**: Assume that the cache is empty when entering the program. What is the *data* cache miss rate and the cache contents when reaching program point **done**.

Answer: The missrate is 2/10 = 20%.

 $4_{16} = 00100_2$   $C_{16} = 12_{10} = 01100_2$ 

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



#### Loop Example Instruction Cache – Spatial Locality



| addi  | \$t0, | \$0, 5    |
|-------|-------|-----------|
| loop: |       |           |
| beq   | \$t0, | \$0, done |
| lw    | \$t1, | 0x4(\$0)  |
| lw    | \$t2, | 0xC(\$0)  |
| addi  | \$t0, | \$t0, -1  |
| j     | loop  |           |
| done: |       |           |

**Exercise**: Assume that the first addi instruction starts at address 0x0000 4000. The instruction cache has S = B, C = 4096 bytes and b = 16 bytes. How many instruction cache misses occurs when executing the program, presupposed that the cache was empty from the beginning.

| / | Note the <b>spatial locality</b> . Since we load 4 instruction each time, we do |
|---|---------------------------------------------------------------------------------|
|   | not get cache misses for each instruction.                                      |

Answer: Two cache misses

First, when loading the first 4 instructions, then when loading the two last instructions.

4 bits for representing 16 bytes block 0x0000 4**00**0 // Address to first addi 0x0000 4**01**0 // Address to second addi The mapping does not conflict.

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |



#### **N-way Set Associative Cache**

We can reduce conflicts by having N blocks in each *set*. This is called an **N-way Set Associative Cache**.





#### **Replacement Policy**

#### **Direct Mapped Cache**

Each address maps to a unique block and set. Hence, when a set is full, it must be replaced with the new data.

#### N-way Set Associative Cache where N > 1

- Least Recently Used (LRU) Policy. Simple with a 2-way set associative cache by using a *use bit U*. Commonly used.
- Pseudo-LRU Policy. For N-ways were N > 2. Indicate the least recently used *group* and upon replacement, randomly selects a way in the group.
- · First-in First-out (FIFO) replacement policy.
- Random replacement policy.

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |

## **Multi-Level Caches**





78

# Part III

# **Memory Hierarchies**







|              | Part I       | Part II 🛛 🥥 | Part III  | Part IV             |
|--------------|--------------|-------------|-----------|---------------------|
| David Broman | Assembly and | Processor   | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design      | Hierarchy | and Programs        |



#### Virtual Memory vs. Caches

Virtual memory has similarities to caches, but uses another terminology.



|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |

# Virtual Memory Overview





## **Summary Part III**

#### Some key take away points:

- **Memory hierarchies** are used because memories have different cost, size, and speed.
- There are two kinds of caches: **instruction caches** and **data caches**.
- Two important properties that make caches useful: temporal locality and spatial locality.
- Caches can be **direct mapped**, **N-way**, or **fully associative**.
- **Virtual memories** enable large virtual address spaces and enable memory protection between different concurrent programs.

**Part I** Assembly and Machine Code **Part II** Processor Design Part III Memory Hierarchy **Part IV** Parallel Processors and Programs





# Part IV

## **Parallel Processors and Programs**





## What is a multiprocessor?

A **multiprocessor** is a computer system with two or more processors.

By contrast, a computer with one processor is called a **uniprocessor**.



**Multicore** microprocessors are multiprocessors where all processors (cores) are located on a single integrated circuit.



A **cluster** is a set of computers that are connected over a local area network (LAN). May be viewed as one large multiprocessor.

David Broman dbro@kth.se **Part I** Assembly and Machine Code **Part II** Processor Design **Part III** Memory Hierarchy

Part IV Parallel Processors and Programs



# Parallelism and Concurrency – what is the difference?

**Concurrency** is about **handling** many things at the same time. Concurrency may be viewed from the **software** viewpoint.

|                                                                                           |                    |          | Software                                                               |                                                                                               |  |  |
|-------------------------------------------------------------------------------------------|--------------------|----------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|--|--|
| Parallelism is about<br>doing (executing)                                                 |                    |          | Sequential                                                             | Concurrent                                                                                    |  |  |
| many things at the<br>same time. Parallelism<br>may be viewed from<br>the <b>hardware</b> | Hardware           | Serial   | Example: matrix<br>multiplication on a<br>unicore processor.           | Example: A Linux<br>OS running on a<br>unicore processor .                                    |  |  |
| viewpoint.                                                                                |                    | Parallel | Example: matrix<br>multiplication on a<br>multicore processor.         | Example: A Linux OS<br>running on a multicore<br>processor.                                   |  |  |
|                                                                                           | bly and<br>ne Code | Pr       | n <b>rt II Part II</b><br>ocessor Memo<br>esign Hierar                 | ry Parallel Processors                                                                        |  |  |
| Speed                                                                                     | up                 |          |                                                                        | 86                                                                                            |  |  |
| How much can we im performance using pa                                                   | •                  | י?       | Speeaup =                                                              | Execution time of<br>one program before<br>improvement<br>Execution time after<br>improvement |  |  |
| Spoodup 🔺                                                                                 |                    |          |                                                                        | eedup. Either wrong,                                                                          |  |  |
| Speedup                                                                                   |                    |          | or due to e.g.                                                         | cache effects.                                                                                |  |  |
| 4<br>3                                                                                    | 1/4                |          | Linear speedu<br>(or ideal speed                                       | •                                                                                             |  |  |
| 2                                                                                         |                    | -        | Still increased sp<br>less efficient                                   | peedup, but                                                                                   |  |  |
| 1                                                                                         |                    |          | •                                                                      | elative speedup                                                                               |  |  |
| 1 2                                                                                       | 2 3 4              |          | measures only the same programNumber ofTrue speedup compares also with |                                                                                               |  |  |
| Part I                                                                                    |                    |          | cessors the best k                                                     | nown sequential program,                                                                      |  |  |
| David Broman Assem                                                                        | bly and<br>ne Code | Pr       | ocessor Memo<br>esign Hierar                                           | ry Parallel Processors                                                                        |  |  |



#### Amdahl's Law (1/4)





Amdahl's Law (2/4)



**Exercise**: Assume a program consists of an image analysis task, sequentially followed by a statistical computation task. Only the image analysis task can be parallelized. How much do we need to improve the image analysis task to be able to achieve 4 times speedup?

Assume that the program takes 80ms in total and that the image analysis task takes 60ms out of this time.

#### Solution:

4 = 80 / (60 / N + 80 - 60)

60/N + 20 = 20

60/N = 0

It is impossible to achieve this speedup!

|              | Part I       | Part II   | Part III  | Part IV             |
|--------------|--------------|-----------|-----------|---------------------|
| David Broman | Assembly and | Processor | Memory    | Parallel Processors |
| dbro@kth.se  | Machine Code | Design    | Hierarchy | and Programs        |





#### Amdahl's Law (3/4)



| Speedup =                                                                              | <i>T</i> <sub>before</sub> =                                                                                                                                                                                                                                                                        | <b>T</b> <sub>before</sub>    |                                |                                |                                                |
|----------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|--------------------------------|--------------------------------|------------------------------------------------|
|                                                                                        | $T_{after}$ $\frac{I_{affe}}{N}$                                                                                                                                                                                                                                                                    | ected<br>+                    | <b>T</b> <sub>unaffected</sub> |                                |                                                |
| integer addition<br>addition, when<br>Assume addit<br>of time and th<br>the matrix add | Assume that we perform 10 scalar<br>integer additions, followed by one matrix<br>addition, where matrices are 10x10.<br>Assume additions take the same amount<br>of time and that we can only parallelize<br>the matrix addition.<br><b>Exercise A</b> : What is the speedup with<br>10 processors? |                               |                                | ·                              | + 10) = 5.5<br>+ 10) = 8.8<br>⊦ 10) = 11 when  |
| 40 processors                                                                          | Vhat is the speedu<br>??<br>Vhat is the maxima                                                                                                                                                                                                                                                      |                               | · ·                            | 10) / (10*10<br>e that one a   | C<br>/100 + 10) = 10<br>add instruction        |
| David Broman<br>dbro@kth.se                                                            | <b>Part I</b><br>Assembly and<br>Machine Code                                                                                                                                                                                                                                                       | <b>Part</b><br>Proce<br>Desig | essor Me                       | rt III 🧼 🧼<br>emory<br>erarchy | Part IV<br>Parallel Processors<br>and Programs |



## Amdahl's Law (4/4)



Example continued. What if we change the size of the problem (make the matrices larger)?





# Part IV

## **Parallel Processors and Programs**





# What is Instruction-Level Parallelism?

Instruction-Level Parallelism (ILP) may increase

performance without involvement of the programmer.

#### Two main approaches:



#### 1. Deep pipelines with more pipeline stages

If the length of all pipeline stages are balanced, we may increase performance by increasing the clock speed.



#### 2. Multiple issue

A technique where multiple instructions are issued in each in cycle.

ILP may decrease the CPI to lower than 1, or using the inverse metric *instructions per clock cycle (IPC)* increase it above 1.

| David Broman<br>dbro@kth.se | <b>Part I</b><br>Assembly and<br>Machine Code | <b>Part II</b><br>Processor<br>Design | Part III<br>Memory<br>Hierarchy | Part IV<br>Parallel Processors<br>and Programs |
|-----------------------------|-----------------------------------------------|---------------------------------------|---------------------------------|------------------------------------------------|
| abroattinee                 |                                               | 3                                     | J                               | 9                                              |



### **Multiple Issue Approaches**



#### Many of the decisions of issuing instructions are made by the processor, dynamically, during execution.

| David Broman<br>dbro@kth.se | Part I<br>Assembly and<br>Machine Code | <b>Part II</b><br>Processor<br>Design | <b>Part III</b><br>Memory<br>Hierarchy | Part IV<br>Parallel Processors<br>and Programs |
|-----------------------------|----------------------------------------|---------------------------------------|----------------------------------------|------------------------------------------------|
| abro@ktn.se                 |                                        | Design                                | пенасну                                | and rograms                                    |



#### Static Multiple Issue (1/2) VLIW





#### Static Multiple Issue (2/2) Changing the hardware





#### Dynamic Multiple Issue Processors (1/2) Superscalar Processors

96

In many modern processors (e.g., Intel Core i7), instruction issuing is performed dynamically by the processor while executing the program. Such processor is called **superscalar**.





## Dynamic Multiple Issue Processors (2/2) Out-of-Order Execution, RAW, WAR

If the superscalar processor can reorder the instruction execution order, it is an **out-of-order execution** processor.

| lw \$t0, 0(\$<br>addi \$t1, \$t0  |                                             | dependency<br>superscalar                                                                                                                                         | a <b>Read After</b><br>y (dependency<br>pipeline must<br>available before                                                                                                                                                                                            | on \$t0). The<br>make sure that                |  |
|-----------------------------------|---------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|--|
| sub \$t0, \$t1<br>addi \$t1, \$s0 | , 10 V                                      | dependend<br>order is flip<br>execution,<br>VAR dependencie                                                                                                       | Example of a <b>Write After Read (WAR)</b><br>dependency (dependency on \$t1). If the<br>order is flipped due to out-of-order<br>execution, we have a hazard.<br>dependencies can be resolved using <b>register</b><br><b>hing</b> , where the processor writes to a |                                                |  |
| addi \$r1, \$s0<br>sub \$t0, \$t1 | , 10 a                                      | nonarchitectural renaming register (here in the p<br>asm code called \$r1, not accessible to the progr<br>Note that out-of-order processor is not rewriting the c |                                                                                                                                                                                                                                                                      |                                                |  |
|                                   | keeps                                       | track of the renai                                                                                                                                                | med registers o                                                                                                                                                                                                                                                      | during execution.                              |  |
| David Broman A                    | <b>art I</b><br>ssembly and<br>lachine Code | <b>Part II</b><br>Processor<br>Design                                                                                                                             | Part III<br>Memory<br>Hierarchy                                                                                                                                                                                                                                      | Part IV<br>Parallel Processors<br>and Programs |  |



# Some observations on Multi-Issue Processors



| David Broman | Part I<br>Assembly and | <b>Part II</b><br>Processor | <b>Part III</b><br>Memory | Part IV Parallel Processors |
|--------------|------------------------|-----------------------------|---------------------------|-----------------------------|
| dbro@kth.se  | Machine Code           | Design                      | Hierarchy                 | and Programs                |



#### Intel Microprocessors, some examples

| Processor                           | Y                                             | Year                   | Clock Rate                                                                        | Pipeline<br>Stages                  | lssue<br>Width | Cores   | Power                         |
|-------------------------------------|-----------------------------------------------|------------------------|-----------------------------------------------------------------------------------|-------------------------------------|----------------|---------|-------------------------------|
| Intel 486                           | 1                                             | 1989 2                 | 25 MHz                                                                            | 5                                   | 1              | 1       | 5 W                           |
| Intel Pentium                       | 1                                             | 1993                   | 66 MHz                                                                            | 5                                   | 2              | 1       | 10W                           |
| Intel Pentium P                     | ro 1                                          | 1997 2                 | 200 MHz                                                                           | 10                                  | 3              | 1       | 29 W                          |
| Intel Pentium 4                     | Willamette 2                                  | 2001                   | 2000 MHz                                                                          | 22                                  | 3              | 1       | 75W                           |
| Intel Pentium 4                     | Prescott 2                                    | 2004 3                 | 3600 MHz                                                                          | 31                                  | 3              | 1       | 103W                          |
| Intel Core                          | 2                                             | 2006                   | 2930 MHz                                                                          | 14                                  | 4              | 2 1     | 75W                           |
| Intel Core i5 Ne                    | ehalem 🖊 🖊                                    | 2010                   | 3300 MHz                                                                          | 14                                  | 4              | 1 /     | 87W                           |
| Intel Core i5 Iv                    | y Bridge 2                                    | 2012                   | 3400 MHz 🔺                                                                        | 14                                  | 4              | 8       | 77W                           |
| Clock rate incre<br>(the power wall |                                               | and t<br>numb<br>after | ine stages first<br>then decreased<br>per of cores ind<br>2006.<br>Source: Patter | d, but the<br>creased               | peak           | ed with | Pentium 4<br>Page 344.        |
| David Broman<br>dbro@kth.se         | <b>Part I</b><br>Assembly and<br>Machine Code |                        | <b>Part II</b><br>Processor<br>Design                                             | <b>Part III</b><br>Memor<br>Hierard | ry 🚬           |         | V<br>el Processors<br>rograms |



## Part IV

## **Parallel Processors and Programs**





## **Main Classes of Parallelisms**





Data-Level Parallelism (DLP) Many data items can be processed at the same time.



**Example – Sheep shearing** Assume that sheep are data items and the task for the farmer is to do sheep shearing (remove the wool). Data-level parallelism would be the same as using several farm hands to do the shearing.



**Task-Level Parallelism (TLP)** Different tasks of work that can work in independently and in parallel



**Example – Many tasks at the farm** Assume that there are many different things that can be done on the farm (fix the barn, sheep shearing, feed the pigs etc.) Task-level parallelism would be to let the farm hands do the different tasks in parallel.

David Broman dbro@kth.se

Assembly and Machine Code

Part I

**Part II** Processor Design Part III Memory Hierarchy





D d

## SISD, SIMD, and MIMD



An old (from the 1960s) but still very useful classification of processors uses the notion of instruction and data streams.

|                         | Data Stream |                                        |                                       | Data-level parallelism. Examples<br>are multimedia extensions (e.g.,<br>SSE, streaming SIMD              |  |  |
|-------------------------|-------------|----------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------|--|--|
|                         |             | Single Multiple                        |                                       | extension), vector processors.                                                                           |  |  |
| am                      | Single      | SISD                                   | SIMD                                  | Graphical Unit Processors                                                                                |  |  |
| on Stream               | Sir         | E.g. Intel<br>Pentium 4                | E.g. SSE<br>Instruction in x86        | (GPUs) are both SIMD and<br>MIMD<br>Task-level parallelism.                                              |  |  |
| nstruction              | ple         | MISD                                   | MIMD 🖌                                | Examples are multicore and cluster computers                                                             |  |  |
| lus                     | Multiple    | No examples<br>today                   | E.g. Intel<br>Core i7                 | Physical Q/A<br>What is a modern Intel CPU,<br>such as Core i7? Stand for<br>MIMD, on the table for SIMD |  |  |
| David Broi<br>lbro@kth. |             | Part I<br>Assembly and<br>Machine Code | <b>Part II</b><br>Processor<br>Design | Part IIIPart IVMemoryParallel ProcessorsHierarchyand Programs                                            |  |  |



#### Subword Parallelism and Multimedia Extensions







# Recall the idea of a multi-issue uniprocesor





## Simultaneous multithreading (SMT)





# Shared Memory Multiprocessor (SMP)

A Shared Memory Multiprocessor (SMP) has a single physical address space across all processors.

#### An SMP is almost always the same as a **multicore processor**.





### **Cache Coherence**

Different cores' local caches could result in that different cores see different values for the same memory address.

This is called the **cache coherency** problem.



110



#### **Snooping Protocol**

Cache coherence can be enforced using a cache coherence protocol. For instance a *write invalidate protocol*, such as the **snooping protocol**.





## Processes, Threads, and Cores





## **General Matrix Multiplication (GEMM)**



| David Broman | <b>Part I</b> | <b>Part II</b> | Part III  | Part IV             |
|--------------|---------------|----------------|-----------|---------------------|
|              | Assembly and  | Processor      | Memory    | Parallel Processors |
|              | Machine Code  | Design         | Hierarchy | and Programs        |
| dbro@kth.se  |               | Design         | пеласпу   | and Frograms        |

114

# Para

## **Parallelizing GEMM**

| Unoptimized                                                                     | Unoptimzed C version (<br>one core.                                                                                                                    |                                       | OPS (32x32)<br>OPS (960x960) |                                                |  |
|---------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|------------------------------|------------------------------------------------|--|
| SIMD                                                                            | Use AVX instructions <b>va</b><br>to do 4 double precision<br>operations in parallel.                                                                  |                                       | 0.4 Gigai Li                 | OPS (32x32)<br>OPS (960x960)                   |  |
| LP                                                                              | AVX + unroll parts of the<br>multiple-issue, out-of-ord<br>more instructions to sche                                                                   | OPS (32x32)<br>OPS (960x960)          |                              |                                                |  |
| Cache                                                                           | AVX + unroll + blocking (dividing the problem into submatrices). This avoids cache misses. 13.6 GigaFLOPS (32x32) 12.0 GigaFLOPS (960x9)               |                                       |                              |                                                |  |
| Multi-                                                                          | AVX + unroll + blocking + multi core<br>(multithreading using OpenMP)23 GigaFLOPS (960x960, 2<br>44 GigaFLOPS (960x960, 4<br>174 GigaFLOPS (960x960, 1 |                                       |                              |                                                |  |
| Experim                                                                         | ent by P&H on a 2.6G                                                                                                                                   | Hz Intel Core i7                      | with Turbo mode              | turned off.                                    |  |
| For details see P&H, 5 <sup>th</sup> edtion, sections 3.8, 4.12, 5.14, and 6.12 |                                                                                                                                                        |                                       |                              |                                                |  |
| David Broman<br>dbro@kth.se                                                     | <b>Part I</b><br>Assembly and<br>Machine Code                                                                                                          | <b>Part II</b><br>Processor<br>Design | Memory                       | Part IV<br>Parallel Processors<br>and Programs |  |



# *"For x86 computers, we expect to see two additional cores per chip every two years and the SIMD width to double every four years."*





## **Summary Part IV**

#### Some key take away points:

- Amdahl's law can be used to estimate maximal speedup.
- Instruction-Level Parallelism (ILP) has been very important for performance improvements over the years.
- SIMD can efficiently parallelize problems that have data-level parallelism
- MIMD, Multicores, and Clusters can be used to parallelize problems that have task-level parallelism.
- In the future, we should try to combine and use both SIMD and MIMD!

#### Thanks for listening!

| David Broman<br>dbro@kth.se | <b>Part I</b><br>Assembly and<br>Machine Code | <b>Part II</b><br>Processor<br>Design | <b>Part III</b><br>Memory<br>Hierarchy | Part IV<br>Parallel Processors<br>and Programs |
|-----------------------------|-----------------------------------------------|---------------------------------------|----------------------------------------|------------------------------------------------|

