

## **Remote Store Programming**

### Henry Hoffmann David Wentzlaff Anant Agarwal

High Performance Embedded Computing Workshop September 2009

## Introduction/Motivation

- Key Features of RSP
  - Usability
  - Performance
- Evaluation
  - Methodology
  - Results
- Conclusion



### Multicore requires innovation to balance usability and performance



- Parallel programming is becoming ubiquitous
  - Parallel programming is no longer the domain of select experts
  - Balancing ease-of-use and performance is more important than ever



## Existing programming models do not combine usability and performance



#### Remote Store Programming (RSP) can combine usability with performance



#### RSP combines the usability of Threads with the performance of DMA












- Process Model
  - Each process has private memory by default
  - A process can grant write access to remote processes
- Communication
  - Processes communicate by storing to remotely writable memory
- Synchronization
  - Supports test-and-set, compare-and-swap, etc.
  - We assume higher-level primitives such as barriers
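The process model above can be illustrated with a small single-threaded simulation in C. This is a sketch under stated assumptions, not the RSP interface itself: the names `rsp_proc`, `rsp_grant`, and `rsp_store`, the fixed region size, and the software permission table are all hypothetical, and a real system would presumably rely on the memory allocator and hardware rather than a lookup like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits chosen only for this sketch. */
#define RSP_MAX_PROCS    4
#define RSP_REGION_WORDS 8

/* One simulated process. Its memory is private by default; `may_write`
   records which remote process IDs the owner has granted write access. */
typedef struct {
    int32_t region[RSP_REGION_WORDS];
    bool    may_write[RSP_MAX_PROCS];
} rsp_proc;

/* The owner grants write access to the remote process `writer`. */
static void rsp_grant(rsp_proc *owner, int writer)
{
    assert(writer >= 0 && writer < RSP_MAX_PROCS);
    owner->may_write[writer] = true;
}

/* A remote store from `writer` into the owner's region.
   Returns false (and stores nothing) if access was never granted
   or the index is out of range. */
static bool rsp_store(rsp_proc *owner, int writer, size_t idx, int32_t value)
{
    if (writer < 0 || writer >= RSP_MAX_PROCS) return false;
    if (idx >= RSP_REGION_WORDS || !owner->may_write[writer]) return false;
    owner->region[idx] = value;
    return true;
}
```

In this sketch the owning process reads `region` with ordinary local loads once synchronization has completed; only the writer touches another process's memory, and only by storing.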



### RSP emphasizes locality for performance on large scale multicores






### 2D FFT example illustrates performance and usability







[Figure from the 2D FFT example slide. Surviving labels: Core 1, RF, Local Memory, data, load, (DMA), RSP, Usability, Performance.]





col\_fft(C,B,pid);
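The communication pattern behind a 2D FFT written in this style can be sketched as follows. The source shows only the call `col_fft(C,B,pid)`; everything else here is an assumption made for illustration: the `fft_proc` layout, the sizes, the function names, and the use of a plain copy as a stand-in for the row FFT. The sketch runs the processes one after another in a single thread, so the barrier appears only as a comment.

```c
#include <assert.h>

/* Hypothetical sizes: a 4x4 matrix split across 2 processes. */
enum { FFT_N = 4, FFT_P = 2, FFT_ROWS = FFT_N / FFT_P };
_Static_assert(FFT_N % FFT_P == 0, "rows must divide evenly among processes");

typedef struct {
    double a[FFT_ROWS][FFT_N]; /* private input rows */
    double b[FFT_ROWS][FFT_N]; /* remotely writable rows of the transposed data */
} fft_proc;

/* Row phase for process `pid`: handle each local row, then store every
   result element directly into the memory of the process that will need
   it for the column phase. */
static void row_phase_with_remote_stores(fft_proc procs[FFT_P], int pid)
{
    assert(pid >= 0 && pid < FFT_P);
    for (int lr = 0; lr < FFT_ROWS; lr++) {
        int r = pid * FFT_ROWS + lr;            /* global row index */
        for (int c = 0; c < FFT_N; c++) {
            double v = procs[pid].a[lr][c];     /* stand-in for row FFT output */
            int owner = c / FFT_ROWS;           /* process that owns column c */
            procs[owner].b[c % FFT_ROWS][r] = v; /* the remote store */
        }
    }
}

/* Sequential driver for the sketch. */
static void transpose_by_remote_store(fft_proc procs[FFT_P])
{
    for (int pid = 0; pid < FFT_P; pid++)
        row_phase_with_remote_stores(procs, pid);
    /* A barrier would go here in a parallel version; afterwards each
       process would run its column phase on purely local data in `b`. */
}
```

The point of the design is that the transpose happens as a side effect of the stores, so the column phase needs no remote loads at all.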



For more detail, see: Hoffmann, Wentzlaff, and Agarwal. Remote Store Programming: Mechanisms and Performance. Technical Report MIT-CSAIL-TR-2009-017, May 2009.



#### **RSP** requires incremental hardware support



- RSP requires incremental additional hardware
  - In a processor that supports cache-coherent shared memory
    - An additional memory allocation mode for remotely writable memory
    - The local cache is not updated when writing to remote memory
  - In a processor without cache-coherent shared memory
    - Memory allocation for remotely writable memory
    - Write misses on remote cores are forwarded to the allocating core
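The forwarding rule for the non-coherent case can be modeled in software as a lookup from address to allocating core. This is only an illustrative sketch: the slides do not describe the actual hardware structure, and the `rw_region` table, the function names, and the fallback of treating an unmatched address as an ordinary local miss are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limits chosen only for this sketch. */
#define NUM_CORES   4
#define MAX_REGIONS 8

/* A remotely writable region and the core that allocated it. */
typedef struct {
    uintptr_t base;
    uintptr_t size;
    int       home_core;
} rw_region;

typedef struct {
    rw_region regions[MAX_REGIONS];
    int       count;
} region_table;

/* Record a remotely writable allocation made by `home_core`.
   Returns the region's index, or -1 if the table is full or size is 0. */
static int register_region(region_table *t, uintptr_t base,
                           uintptr_t size, int home_core)
{
    assert(home_core >= 0 && home_core < NUM_CORES);
    if (t->count == MAX_REGIONS || size == 0) return -1;
    t->regions[t->count] = (rw_region){ base, size, home_core };
    return t->count++;
}

/* On a write miss at core `writer`, pick the core that should receive it:
   the allocating core if the address lies in a remotely writable region,
   otherwise `writer` itself (an ordinary local miss in this sketch). */
static int route_write_miss(const region_table *t, int writer, uintptr_t addr)
{
    for (int i = 0; i < t->count; i++) {
        const rw_region *r = &t->regions[i];
        if (addr >= r->base && addr - r->base < r->size)
            return r->home_core;
    }
    return writer;
}
```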






#### RSP, Threads, and DMA are compared on the TILEPro64 processor





#### Speedups of RSP and cache-coherent benchmarks





## RSP outperforms threading with cache coherence for large numbers of cores

## Speedup of RSP versus Cache Coherence for selected benchmarks (higher is better)




#### Load latency of selected benchmarks on RSP versus Cache Coherence (lower is better)

[Figure: average load latency of RSP relative to cache coherence (RSP/CC, y-axis, ticks from 0.5 to 3.5) versus number of cores (x-axis, ticks up to 64). Benchmarks shown: Transpose, Matrix Multiply, Convolution, FFT, Error Diffusion, Bitonic Sort, Histogram, H.264.]



#### Comparison of RSP, Threading, and DMA for two applications



- RSP is faster than Threading and DMA due to its emphasis on locality and fine-grain communication








## Conclusion

- This talk presents Remote Store Programming
  - A programming model for multicore
  - Uses familiar communication mechanisms
  - Achieves high performance through locality and fine-grain communication
  - Requires incremental hardware support
- Conclusions:
  - Threads and shared memory offer good performance and are easy to use
  - For large numbers of cores, RSP can outperform threading because of greater locality
  - RSP is easier to use than DMA and slightly harder to use than Threads
- Anticipated use for RSP:
  - Can supplement threads and cache-coherent shared memory on multicore
    - Most code uses standard threading and shared memory techniques
    - Performance critical sections of code or applications use RSP for additional performance
    - Provides a gentle slope to programming multicore

