### **Multiprocessors**

#### Erik Hagersten Uppsala University




# **Outline of these lectures**

1. Processor implementations
2. Caches and memory system
3. Multiprocessors
4. HW optimizations
5. Multicore processors
6. SW optimizations

PDC Summer School 2016



# The era of the "supercomputer" multiprocessors in the 1990s

- The one with the most blinking lights wins
- The one with the strangest languages wins
- The niftier the better!




Dept of Information Technology www.it.uu.se




# **Coherent Shared Memory**





#### **Programming Model: Shared Memory**



#### **Thread-Level Parallelism (TLP)**


#### Adding Caches: Cuts latency and memory bandwidth




#### **Caches: Automatic Replication of Data**





#### **The Cache Coherent Memory System**




#### **The Cache Coherent \$2\$ (cache-to-cache transfer)**



© Erik Hagersten user.it.uu.se/~eh


#### Writeback





# **Summing up Coherence**

Sloppy: there can be many copies of a datum, but only one value

**Coherence:** There is a single global order of value changes to each datum

<u>Memory order/model:</u> Defines the order between accesses to many data
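The coherence property can be illustrated with a toy write-invalidate protocol for a single datum. This is a minimal sketch, not the protocol of any particular machine: the class name `ToyCoherentMemory`, the three states (`I`, `S`, `M`) and the "write back on downgrade" simplification are assumptions made for illustration.

```python
class ToyCoherentMemory:
    """Toy write-invalidate protocol for ONE datum (hypothetical model)."""

    def __init__(self, n_caches, initial=0):
        self.memory = initial
        # Per-cache state: 'I' invalid, 'S' shared (read-only), 'M' modified
        self.state = ['I'] * n_caches
        self.value = [None] * n_caches

    def _owner(self):
        # Index of the cache holding the datum in state M, or None
        for i, s in enumerate(self.state):
            if s == 'M':
                return i
        return None

    def read(self, cpu):
        if self.state[cpu] == 'I':
            owner = self._owner()
            if owner is not None:
                # The owner writes its value back and downgrades to shared
                self.memory = self.value[owner]
                self.state[owner] = 'S'
            self.value[cpu] = self.memory
            self.state[cpu] = 'S'
        return self.value[cpu]

    def write(self, cpu, new_value):
        # Invalidate every other copy before the write becomes visible
        for other in range(len(self.state)):
            if other != cpu:
                self.state[other] = 'I'
                self.value[other] = None
        self.state[cpu] = 'M'
        self.value[cpu] = new_value


mem = ToyCoherentMemory(n_caches=3)
mem.read(0)                      # cache 0 gets a read-only copy
mem.read(1)                      # cache 1 gets a read-only copy
mem.write(2, 42)                 # cache 2 writes: both copies are invalidated
values_seen = [mem.read(cpu) for cpu in range(3)]
print(values_seen)               # [42, 42, 42]
```

There may be several copies after the final reads, but every cache observes the same value, which is the "many copies, one value" property stated above.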




# **Implementing Coherence**




#### "Upgrade" in snoop-based




#### "Upgrade" in snoop-based






#### **Cache-to-cache in snoop-based**






#### "Upgrade" in dir-based






#### Cache-to-cache in dir-based






#### **Directory-based coherence: Per-cacheline info in the memory**






#### **Directory-based snooping: NUMA. Per-cacheline info in the home node**




#### **Multisocket**




#### AMD Multi-socket Architecture (same applies to Intel multi-sockets)

**Coherence = Non-Uniform** 




## More Cache Lingo

- **Capacity miss:** the cache is too small
- **Conflict miss:** limited associativity
- **Compulsory miss:** accessing the data for the first time
- **Coherence miss:** I would have had the data unless it had been invalidated by someone else
- **Upgrade miss** (only for writes): I would have had a writable copy, but gave away readable data and downgraded myself to read-only
- **False sharing:** a coherence miss or downgrade is caused by a shared cacheline, not by shared data
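False sharing can be checked with simple address arithmetic: two different variables collide when their addresses map to the same cacheline. The sketch below assumes a 64-byte cacheline, and the addresses and helper names (`cacheline_index`, `falsely_shared`) are made up for illustration.

```python
LINE_SIZE = 64  # bytes per cacheline; an assumed, common value


def cacheline_index(address, line_size=LINE_SIZE):
    """Cacheline number that a byte address falls into."""
    return address // line_size


def falsely_shared(addr_a, addr_b, line_size=LINE_SIZE):
    """True if two distinct addresses live in the same cacheline."""
    same_line = cacheline_index(addr_a, line_size) == cacheline_index(addr_b, line_size)
    return addr_a != addr_b and same_line


# Two 8-byte counters, each written by a different thread
base = 0x1000
packed = (base, base + 8)            # adjacent in memory
padded = (base, base + LINE_SIZE)    # one counter per cacheline

print(falsely_shared(*packed))   # True: writes to one counter invalidate the other
print(falsely_shared(*padded))   # False: each counter has its own cacheline
```

Padding the data so that each thread's counter sits in its own cacheline trades some memory for the removal of these coherence misses.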




### Memory Ordering (aka Memory Consistency) -- tricky but important stuff


Erik Hagersten Uppsala University Sweden



#### The Shared Memory Programming Model (Pthreads/OpenMP, ...)






# **Memory Ordering**

- Coherence defines a per-datum value-change order
- The memory model defines the value-change order across all data





#### **Dekker's Algorithm**



#### **Q:** Is it possible that both A and B win?




# **Memory Ordering**

- Defines the [observable] memory order: If a thread has seen that A happened before B, what order may other threads observe?
- Is a "contract" between the HW and SW guys
- Without it, you cannot say much about the result of a parallel execution


#### "The intuitive memory order" Sequential Consistency (Lamport)



- Global order achieved by *interleaving* <u>all</u> memory accesses from different threads
- "Programmer's intuition is maintained"
  - Flag synchronization? Yes
  - Store causality? Yes
  - Does Dekker work? Yes

- Unnecessarily restrictive ==> performance penalty
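The claim that Dekker works under sequential consistency can be checked by brute force. The sketch below assumes the two-thread program used in these slides (thread A does `A := 1` and then tests `B == 0`; thread B does `B := 1` and then tests `A == 0`; both variables start at 0) and enumerates every interleaving that keeps each thread's program order. The tuple encoding of operations is an assumption made for illustration.

```python
from itertools import combinations

# Each thread is a list of operations in program order
THREAD_A = [("store", "A", 1), ("load", "B")]
THREAD_B = [("store", "B", 1), ("load", "A")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for a_slots in combinations(range(n), len(a)):
        a_iter, b_iter = iter(a), iter(b)
        yield [("A", next(a_iter)) if i in a_slots else ("B", next(b_iter))
               for i in range(n)]


def run(schedule):
    """Execute one global order against a single shared memory."""
    memory = {"A": 0, "B": 0}
    loaded = {}
    for thread, op in schedule:
        if op[0] == "store":
            memory[op[1]] = op[2]
        else:
            loaded[thread] = memory[op[1]]
    # A thread "wins" if it read 0 from the other thread's variable
    return loaded["A"] == 0, loaded["B"] == 0


outcomes = {run(s) for s in interleavings(THREAD_A, THREAD_B)}
print(sorted(outcomes))  # (a_wins, b_wins) pairs; (True, True) never appears
```

Of the six interleavings, none lets both threads read 0, so at most one thread wins.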




#### **One implementation of SC** in dir-based coherence






#### "Almost intuitive memory model" Total Store Ordering [TSO] (P. Sindhu)



- Global *interleaving* [order] for <u>all</u> stores from different threads (own stores excepted)
- "Programmer's intuition is maintained"
  - Flag synchronization? Yes
  - Store causality? Yes
  - Does Dekker work? No
- Unnecessarily restrictive ==> performance penalty





#### Q: Is it possible that both A and B win?

- Left: The read (i.e., test if B==0) can bypass the store (A:=1)
- Right: The read (i.e., test if A==0) can bypass the store (B:=1)
- → Both loads can be performed before any of the stores
- → Yes, it is possible that both win
- → Dekker's algorithm breaks
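The bypassing described above can be reproduced with a toy TSO machine in which every thread owns a FIFO store buffer. This is a sketch under stated assumptions: the operation encoding, the function name `tso_outcomes` and the "drain one store at any time" scheduling are illustrative choices, not a description of real hardware. It explores all schedules and records what each thread's load returned.

```python
PROGRAMS = {
    "A": [("store", "A", 1), ("load", "B")],
    "B": [("store", "B", 1), ("load", "A")],
}


def tso_outcomes(programs):
    """Enumerate load results over every schedule of a toy TSO machine."""
    threads = sorted(programs)
    results = set()

    def step(pc, buffers, memory, loaded):
        progressed = False
        for t in threads:
            # Option 1: thread t executes its next instruction
            if pc[t] < len(programs[t]):
                progressed = True
                op = programs[t][pc[t]]
                new_pc = {**pc, t: pc[t] + 1}
                if op[0] == "store":
                    # Stores enter the private store buffer, not memory
                    new_buffers = {**buffers, t: buffers[t] + ((op[1], op[2]),)}
                    step(new_pc, new_buffers, memory, loaded)
                else:
                    # Loads see the youngest own buffered store, else memory
                    own = [v for (addr, v) in buffers[t] if addr == op[1]]
                    value = own[-1] if own else memory[op[1]]
                    step(new_pc, buffers, memory, {**loaded, t: value})
            # Option 2: the oldest buffered store of thread t reaches memory
            if buffers[t]:
                progressed = True
                (addr, v), rest = buffers[t][0], buffers[t][1:]
                step(pc, {**buffers, t: rest}, {**memory, addr: v}, loaded)
        if not progressed:
            results.add(tuple(loaded[t] for t in threads))

    step({t: 0 for t in threads}, {t: () for t in threads},
         {"A": 0, "B": 0}, {})
    return results


outcomes = tso_outcomes(PROGRAMS)
print(sorted(outcomes))
# (0, 0) is present: thread A read B == 0 and thread B read A == 0, so both "win"
```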


### **Dekker's Algorithm for TSO**



#### Q: Is it possible that both A and B win?

Membar: The read is started after all previous stores have been "globally ordered"

- → behaves like SC
- → Dekker's algorithm works!
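The effect of the membar can be checked with a toy TSO machine that has one FIFO store buffer per thread. In this sketch the barrier is modelled as "may only retire when the thread's own store buffer is empty"; that rule, the operation encoding and the function name `tso_outcomes` are assumptions made for illustration. The assumed program is `A := 1; membar; test B == 0` against `B := 1; membar; test A == 0`.

```python
PROGRAMS = {
    "A": [("store", "A", 1), ("membar",), ("load", "B")],
    "B": [("store", "B", 1), ("membar",), ("load", "A")],
}


def tso_outcomes(programs):
    """Enumerate load results over every schedule of a toy TSO machine."""
    threads = sorted(programs)
    results = set()

    def step(pc, buffers, memory, loaded):
        progressed = False
        for t in threads:
            op = programs[t][pc[t]] if pc[t] < len(programs[t]) else None
            new_pc = {**pc, t: pc[t] + 1}
            if op is not None and op[0] == "store":
                # Stores enter the private store buffer, not memory
                progressed = True
                step(new_pc, {**buffers, t: buffers[t] + ((op[1], op[2]),)},
                     memory, loaded)
            elif op is not None and op[0] == "load":
                # Loads see the youngest own buffered store, else memory
                progressed = True
                own = [v for (addr, v) in buffers[t] if addr == op[1]]
                value = own[-1] if own else memory[op[1]]
                step(new_pc, buffers, memory, {**loaded, t: value})
            elif op is not None and op[0] == "membar" and not buffers[t]:
                # The barrier retires only once all earlier stores are in memory
                progressed = True
                step(new_pc, buffers, memory, loaded)
            if buffers[t]:
                # The oldest buffered store of thread t reaches memory
                progressed = True
                (addr, v), rest = buffers[t][0], buffers[t][1:]
                step(pc, {**buffers, t: rest}, {**memory, addr: v}, loaded)
        if not progressed:
            results.add(tuple(loaded[t] for t in threads))

    step({t: 0 for t in threads}, {t: () for t in threads},
         {"A": 0, "B": 0}, {})
    return results


outcomes = tso_outcomes(PROGRAMS)
print(sorted(outcomes))  # (0, 0) is absent: at most one thread can win
```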




## Weak/release Consistency (M. Dubois, K. Gharachorloo)



Most accesses are unordered

#### "Programmer's intuition is not maintained"

- Flag synchronization? No
- Store causality? No
- Does Dekker work? No
- Global order <u>only</u> established when the programmer explicitly inserts memory barrier instructions

++ Better performance!!

--- Interesting bugs!!

## Weak/Release consistency

New flag synchronization needed

| Writer thread | Reader thread |
| --- | --- |
| `A := data;` | `while (flag != 1) {};` |
| `membar;` | `membar;` |
| `flag := 1;` | `X := A;` |

- Dekker's: same as TSO
- Causal correctness provided for this code
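Why the barriers are needed for flag synchronization can be shown with a deliberately crude model of weak ordering: without a membar, the two accesses of a thread may be performed in either order; with a membar between them, program order is kept. This is a simplification made for illustration, not the definition of any real memory model, and the value 42, the initial value 0 of `A`, and the helper names are assumptions. The spin loop is reduced to "consider only runs in which the reader saw `flag == 1`".

```python
from itertools import combinations, permutations

DATA = 42
WRITER = [("store", "A", DATA), ("store", "flag", 1)]
READER = [("load", "flag"), ("load", "A")]


def orders(ops, membar):
    """Program order with a membar between the accesses, any order without."""
    return [list(ops)] if membar else [list(p) for p in permutations(ops)]


def merges(w, r):
    """Yield every interleaving of the two per-thread orders."""
    n = len(w) + len(r)
    for w_slots in combinations(range(n), len(w)):
        w_iter, r_iter = iter(w), iter(r)
        yield [next(w_iter) if i in w_slots else next(r_iter) for i in range(n)]


def values_of_x(membar):
    """Values the reader can get for X in runs where it saw flag == 1."""
    seen = set()
    for w in orders(WRITER, membar):
        for r in orders(READER, membar):
            for schedule in merges(w, r):
                memory = {"A": 0, "flag": 0}
                loaded = {}
                for op in schedule:
                    if op[0] == "store":
                        memory[op[1]] = op[2]
                    else:
                        loaded[op[1]] = memory[op[1]]
                if loaded["flag"] == 1:
                    seen.add(loaded["A"])
    return seen


print(values_of_x(membar=False))  # stale 0 is possible next to 42
print(values_of_x(membar=True))   # {42}
```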





### Learning more about memory models

*Shared Memory Consistency Models: A Tutorial* by Sarita Adve and Kourosh Gharachorloo, IEEE Computer, 1996

RTFM: Read the manual of the system you are working on! (Different microprocessors and systems support different memory models.)

#### **Issue to think about:**

What code reordering may compilers really do? Sometimes you have to use "volatile" declarations in C!



### x86's current memory model

Common view in academia: TSO

### If you ask Intel:

- Processor consistency with causal correctness for non-atomic memory ops
- TSO for atomic memory ops

### Video presentation:

http://www.youtube.com/watch?v=WUfvvFD5tAA&hl=sv






## **Examples of vector instructions**

Vector Regs






## **x86 Vector instructions**

- MMX: 64-bit vectors (e.g., two 32-bit ops)
- SSE: 128-bit vectors (e.g., four 32-bit ops)
- AVX: 256-bit vectors (e.g., eight 32-bit ops) (in Sandy Bridge, ~2011)
- Xeon Phi: 512-bit vectors
- GPUs: Good at vector-ish instructions
  - A bit more general for "divergent code"
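The lane counts in the list follow directly from dividing the vector width by the element width. The sketch below does that arithmetic and then mimics a vector add in plain Python, counting one "instruction" per chunk of lanes; the helper names `lanes` and `vector_add` are hypothetical and no real SIMD instructions are issued.

```python
VECTOR_BITS = {"MMX": 64, "SSE": 128, "AVX": 256, "Xeon Phi": 512}


def lanes(vector_bits, element_bits=32):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def vector_add(a, b, n_lanes):
    """Add two equal-length lists, n_lanes elements per 'instruction'."""
    result, instructions = [], 0
    for i in range(0, len(a), n_lanes):
        result.extend(x + y for x, y in zip(a[i:i + n_lanes], b[i:i + n_lanes]))
        instructions += 1
    return result, instructions


for name, bits in VECTOR_BITS.items():
    print(name, lanes(bits))     # MMX 2, SSE 4, AVX 8, Xeon Phi 16

a = list(range(16))
b = [10] * 16
total, count = vector_add(a, b, lanes(VECTOR_BITS["AVX"]))
print(count)                     # 2 "instructions" for 16 elements with 8 lanes
```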






| Sender | Receiver |
| --- | --- |
| `X = vec[i];` | `MPI_receive(Y, from_source);` |
| `MPI_send(X, to_dest);` | `print(Y);` |
| `...` | |



## **MPI inside a multicore?**

- MPI can be implemented on top of coherent shared memory
- Coherent shared memory cannot [cheaply] be implemented on top of MPI
- Many options for parallelism within a "node":
  - OpenMP
  - MPI
  - POSIX threads
  - ...
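The first point, message passing layered on shared memory, can be sketched with two threads that share one address space and communicate only through a queue. This is an illustration, not MPI: the `Channel` class and its `send`/`receive` methods are hypothetical names, and Python's standard `queue.Queue` stands in for the shared-memory buffer that an MPI library might use inside a node.

```python
import queue
import threading


class Channel:
    """Hypothetical point-to-point channel built on memory shared by threads."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, message):
        self._q.put(message)      # place the message in the shared queue

    def receive(self):
        return self._q.get()      # block until a message has arrived


vec = [3, 1, 4, 1, 5]
channel = Channel()
received = []


def sender(i):
    x = vec[i]
    channel.send(x)


def receiver():
    y = channel.receive()
    received.append(y)


threads = [threading.Thread(target=receiver),
           threading.Thread(target=sender, args=(2,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # [4]
```

The receiver never touches `vec` directly; all data reaches it through the channel, which is the message-passing discipline even though the underlying memory is shared.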




# A 5-stage 2-way superscalar pipeline






## A 5-stage superscalar pipeline

## One sequential program:




### A 5-stage 2-way superscalar pipeline, Simultaneously Multithreaded 2-ways (SMT)




## **Choosing between different threads**

- Fixed interleaving (Xeon Phi, HEP 1982!!, ...)
  - Each of N threads executes one instruction every Nth cycle
  - If a thread is not ready to go during its slot $\rightarrow$ bubble

- Hardware-controlled thread scheduling
  - E.g., hardware keeps track of which threads are ready to go (Niagara-1)
  - E.g., picks the next thread to execute based on a hardware priority scheme (~Hyperthreading)
  - I-count: Choose the thread with the fewest instructions in flight
  - Coarse-grained: Run one thread until it "blocks"
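The fixed-interleaving policy is simple enough to simulate in a few lines. In this sketch each cycle belongs to exactly one thread, and a bubble (`None`) is issued when that thread is not ready. The function name `fixed_interleave` and the made-up readiness pattern (thread 1 stalled until cycle 5) are assumptions for illustration.

```python
def fixed_interleave(ready, n_cycles):
    """Per-cycle issue trace under fixed interleaving.

    ready[t] is the set of cycles in which thread t has an instruction ready.
    Each trace entry is a thread id, or None for a bubble.
    """
    n_threads = len(ready)
    trace = []
    for cycle in range(n_cycles):
        t = cycle % n_threads          # the slot belongs to exactly one thread
        trace.append(t if cycle in ready[t] else None)
    return trace


# Threads 0 and 2 are always ready; thread 1 is only ready from cycle 5 on
ready = [set(range(8)), {5, 6, 7}, set(range(8))]
trace = fixed_interleave(ready, n_cycles=8)
print(trace)  # [0, None, 2, 0, None, 2, 0, 1]
```

The two `None` entries are slots lost to the stalled thread even though the other threads had work, which is the cost that the hardware-controlled schemes try to avoid.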



### How are we doing?

- Create and explore locality:
  - ✓ a) Spatial locality
  - ✓ b) Temporal locality
- Create and explore parallelism:
  - ✓ a) Instruction level parallelism (ILP)
  - b) Thread level parallelism (TLP)
  - c) Memory level parallelism (MLP)
- Speculative execution:
  - a) Out-of-order execution
  - b) Branch prediction
  - c) Prefetching
