### Multiprocessors

#### Erik Hagersten Uppsala University



### **Outline of these lectures**

- 1. Uniprocessors
- 2. Multiprocessors
  - Coherence
  - Memory ordering
  - Vector instructions
- 3. Multicores & Manycores
- 4. Optimizing for speed



## The era of the "supercomputer" multiprocessors

The one with the most blinking lights wins
The one with the strangest languages wins
The niftier the better!



PDC Summer School 2011



### **Coherent Shared Memory**

#### Erik Hagersten Uppsala University



#### **Programming Model:**





#### Adding Caches: More Concurrency





Summer

School

2011

#### Caches: Automatic Replication of Data



Dept of Information Technology www.it.uu.se



#### **The Cache Coherent Memory System**





Summer

School

2011

#### The Cache Coherent \$2\$



© Erik Hagersten | user.it.uu.se/~eh



Summer

School

2011

#### Writeback



© Erik Hagersten | user.it.uu.se/~eh



## Summing up Coherence

Sloppy: here can be many copies of a datum, but only one value 100 strong

# <u>Coherence:</u> There is a single global order of value changes to each datum

# <u>Memory order/model:</u> Defines the order between accesses to many data





## **Implementing Coherence**



Summer

School

2011

#### "Upgrade" in snoop-based





School

2011

#### Cache-to-cache in snoop-based





#### "Upgrade" in dir-based



Dept of Information Technology www.it.uu.se



#### Cache-to-cache in dir-based



Summer School 2011

PDC

Dept of Information Technology www.it.uu.se

© Erik Hagersten | user.it.uu.se/~eh



#### Directory-based coherence: Per-cachline info in the memory





#### Directory-based snooping: NUMA. Per-cachline info in the home node



MP 20



School

2011

#### **Multisocket**





## AMD Multi-socket Architecture (same applies to Intel multi-sockets)

Coherence = Non-Uniform



## UNIVERSITET

#### **AMD Magny-Cours** NUMA & NUCA on a socket **Non-Uniform Memory Architecture Non-Uniform Communication Architecture**



PDC Summer School 2011

Dept of Information Technology www.it.uu.se

© Erik Hagersten | user.it.uu.se/~eh



Dept of Information Technology www.it.uu.se

PDC

2011

### More Cache Lingo

- Capacity miss too small cache
- Conflict miss limited associativity
- Compulsory miss accessing data the first time
- Coherence miss I would have had the data unless it had been invalidated by someone else
- Upgrade miss (only for writes) I would have had a writable copy, but gave away readable data and downgraded myself to read-only
- False sharing: Coherence/downgrade is caused by a shared cacheline, to by shared data:

|     | False sharing                               | Read A  |         | cacheline:       |
|-----|---------------------------------------------|---------|---------|------------------|
|     | example:                                    |         | Read D  | A, B, C, D       |
|     |                                             | Write A |         |                  |
| ner |                                             |         | Write D |                  |
|     |                                             | Read A  | MP 25   |                  |
|     | Dept of Information Technology www.it.uu.se |         |         | © Frik Hagersten |

### Memory Ordering (aka Memory Consistency) -- tricky but important stuff

Erik Hagersten Uppsala University Sweden



#### The Shared Memory Programming Model (Pthreads/OpenMP, ...)





## **Memory Ordering**

- Coherence defines a per-datum valuechange order
- Memory model defines the valuechange order for all the data.





#### **Dekker's Algorithm**



#### **Q:** Is it possible that both A and B win?

PDC Summer School 2011

Dept of Information Technology www.it.uu.se



## **Memory Ordering**

#### Defines the guaranteed memory ordering: If a thread has seen that A happens before B, what order can the other threads can observe?

Is a "contract" between the HW and SW guys





# Human intuition: There is one global order!

(A' denotes a modified value to the data at addr A)

**Thread 1** 

Thread 2





#### **One possible** Another possible observed order observed order Thread 1 Thread 2 Thread 1 Thread 2 LD A ST B' LD A LD C ST A' ST A' ST B' LD B' LD B' LD C ST C' ST C' LD D LD D ST D' ST D' LD E LD E ST E' ST E' . . . . . . . . . . . .

#### "The intuitive memory order" Sequential Consistency (Lamport)



- Global order achieved by *interleaving* <u>all</u> memory accesses from different threads
- "Programmer's intuition is maintained"
  - Store causality? Yes
  - Does Dekker work? Yes

Summer • Unnecessarily restrictive ==> performance penalty

PDC

School

2011

## UNIVERSITET

PDC

Summer

School

2011

#### **One implementation of SC in dir-based** (....without speculation)



© Erik Hagersten user.it.uu.se/~eh



School

2011

#### "Almost intuitive memory model" Total Store Ordering [TSO] (P. Sindhu)



- Global *interleaving* [order] for <u>all</u> stores from different threads (own stores excepted)
- "Programmer's intuition is maintained"
  - Store causality? Yes
  - Does Dekker work? No

Summer 
 Unnecessarily restrictive ==> performance penalty

Dept of Information Technology| www.it.uu.se

MP 35



## **TSO HW Model**



PDC Summer School 2011

→ Stores are moved off the critical path Coherence implementation can be the same as for SC MP 36

Dept of Information Technology www.it.uu.se

© Erik Hagersten user.it.uu.se/~eh



PDC

Summer

School

2011

## TSO

## Flag synchronization works

A := data while (flag != 1) {}; flag := 1 X := A

#### Provides causal correctness







#### Q: Is it possible that both A and B wins?

Left: The read (i.e., test if B==0) can bypass the store (A:=1) Right: The read (i.e., test if A==0) can bypass the store (B:=1) → both loads can be performed before any of the stores → yes, it is possible that both wins → → Dekker's algorithm breaks

Dept of Information Technology www.it.uu.se

PDC

Summer

School 2011

## **Dekker's Algorithm for TSO**



#### **Q:** Is it possible that both A and B win?

Membar: The read is stared after all previous stores have been "globaly ordered"

- → behaves like SC
- → Dekker's algorithm works!



PDC

Summer

School

2011

## Weak/release Consistency (M. Dubois, K. Gharachorloo)



- Most accesses are unordered
- "Programmer's intuition is not maintained"
  - Store causality? No
  - Does Dekker work? No
- Global order <u>only</u> established when the programmer explicitly inserts memory barrier instructions

++ Better performance!!

--- Interesting bugs!!

Dept of Information Technology www.it.uu.se

MP 40



PDC

Summer

School

2011

# Weak/Release consistency

New flag synchronization needed

A := data;while (flag != 1) {};membarrier;membarrier;flag := 1;X := A;

- Dekker's: same as TSO
- Causal correctness provided for this code





### Learning more about memory models

Shared Memory Consistency Models: A Tutorial by Sarita Adve, Kouroush Gharachorloo in IEEE Computer 1996

RFM: Read the F\*\*\*\*ng Manual of the system you are working on! (Different microprocessors and systems supports different memory models.)

#### Issue to think about:

PDC Summer School 2011

What code reordering may compilers really do? Have to use "volatile" declarations in C.

Dept of Information Technology www.it.uu.se



#### X86's current memory model Common view in academia: TSO

### If you ask Intel:

- Processor consistency with causual correctness for non-atomic memory ops
- TSO for atomic memory ops

#### Video presentation:

Dept of Information Technology www.it.uu.se

http://www.youtube.com/watch?v=WUfvvFD5tAA&hl=sv



## See section 8.2 in this manual:

http://developer.intel.com/Assets/PDF/manual/253668.pdf

MP 43





## **Examples of vector instructions**

Vector Regs





## **x86 Vector instructions**

- MMX: 64 bit vectors (e.g., two 32bit ops)
- SSEn: 128 bit vectors(e.g., four 32 bit ops)
- AVX: 256 bit vectors(e.g., eight 32 bit ops) (in Sandy Bridge, Q1 2011)
- Intel MIC: "16-way vectors" ??



Dept of Information Technology www.it.uu.se



PDC Summer School 2011

. . .

Dept of Information Technology www.it.uu.se

MP 48

MPI\_receive(Y, from\_dest);

print (Y);



## MPI inside a multicore?

- MPI can be implemented on top of coherent shared memory
- Coherent Shard memory cannot easily be implemented on top of MPI
- Many options for parallelism within a "node":
  - OpenMP
  - MPI
  - Unix "fork"

**\*** ...



#### A few words about simultaneously multithreading (SMP) or "Hyper-threading"



#### Several threads sharing a pipeline **TLP helps finding more independent instructions**



Dept of Information Technology www.it.uu.se

UNIVERSITE'

PDC

School

2011

© Erik Hagersten | user.it.uu.se/~eh