

# High Performance Simulations of Electrochemical Models on the Cell Broadband Engine

James Geraci

**HPEC Workshop** 

September 20, 2007

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.



CBELU - 1 James 12/13/2007



# Outline

- Introduction
  - Modeling battery physics
  - LU decomposition
  - Cell Broadband Engine
- Out of Core Algorithm
- In Core Algorithm
- Summary

### Introduction: Real Time Battery State of Health Estimation



- The rate of self discharge is also a function of battery geometry.



# **Inside a Lead Acid Battery**





# Finite volume description of battery





- The center volume's physics can be influenced by the physics of the volumes to the right, the left, above, and below.
- The 5 point stencil gives rise to a banded matrix.
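
As a minimal sketch of how the stencil produces the band, the following Python builds the coupling matrix for a small rectangular grid of volumes and measures its half bandwidth. The row-major numbering, the function names, and the coefficients 4 and -1 are illustrative assumptions, not the actual battery physics.

```python
def five_point_matrix(nx, ny):
    """Dense matrix coupling each volume of an nx-by-ny grid to its four neighbours."""
    n = nx * ny
    A = [[0.0] * n for _ in range(n)]
    for j in range(ny):
        for i in range(nx):
            row = j * nx + i              # row-major numbering of the volumes
            A[row][row] = 4.0             # centre volume (illustrative coefficient)
            if i > 0:
                A[row][row - 1] = -1.0    # left neighbour
            if i < nx - 1:
                A[row][row + 1] = -1.0    # right neighbour
            if j > 0:
                A[row][row - nx] = -1.0   # neighbour below
            if j < ny - 1:
                A[row][row + nx] = -1.0   # neighbour above
    return A


def half_bandwidth(A):
    """Largest |row - col| over the nonzero entries of A."""
    return max(abs(r - c)
               for r, row in enumerate(A)
               for c, value in enumerate(row)
               if value != 0.0)


A = five_point_matrix(nx=4, ny=3)
print(len(A), half_bandwidth(A))
```

With this numbering the left and right neighbours sit next to the diagonal and the neighbours above and below sit `nx` columns away, so the half bandwidth equals the grid width.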



# Matlab spy plot of Banded Matrix





# LU decomposition



- LU decomposition factors a matrix J into a lower triangular matrix L and an upper triangular matrix U.
- L and U can be used to solve a system of linear equations Jx = r by forward elimination and back substitution.
  - Essentially Gaussian elimination.
- Often used on poorly conditioned systems where iterative solvers can't be used.
- Difficult to parallelize for small systems because of the fine-grained nature of the parallelism involved.
- Banded LU is a special case of LU.
  - The matrix J has a special 'banded' data pattern.
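
The steps above can be sketched in a few lines of Python. This is a serial, dense-storage illustration without pivoting, not the Cell implementation described in these slides. The function names and the parameter `b` (the half bandwidth) are assumptions made for the example.

```python
def banded_lu(J, b):
    """In-place LU of a square matrix with half bandwidth b, without pivoting.

    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(J)
    for k in range(n):
        last = min(k + b, n - 1)              # entries outside the band are zero
        for i in range(k + 1, last + 1):
            J[i][k] /= J[k][k]                # multiplier l_ik
            for j in range(k + 1, last + 1):
                J[i][j] -= J[i][k] * J[k][j]  # update stays inside the band
    return J


def lu_solve(LU, r):
    """Solve J x = r given the packed factors produced by banded_lu."""
    n = len(LU)
    x = list(r)
    for i in range(n):                        # forward elimination: L y = r
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in range(n - 1, -1, -1):            # back substitution: U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


# Tridiagonal example (half bandwidth 1) with known solution [1, 2, 3].
J = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
r = [2.0, 4.0, 10.0]
LU = banded_lu([row[:] for row in J], b=1)
x = lu_solve(LU, r)
print(x)
```

Restricting the two inner loops to the band is what separates banded LU from the general case: the work per pivot depends on `b` rather than on the full matrix dimension.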






# **Cell Broadband Engine**



- The Cell Broadband Engine is a new heterogeneous multicore processor that features high internal and off-chip bandwidth.

MIT Lincoln Laboratory



# **Cell Broadband Engine**







# Outline

- Introduction
- Out of Core Algorithm
  - Banded LU
  - Performance
  - Latency
  - Synchronization
- In Core Algorithm
- Summary



### Banded LU Out of Core Algorithm Explained







- Compute/Memory Ratio = $\frac{1}{2}$
- For a 16728x16728 matrix with band size of 420, almost 22 GB of data would have to be moved.
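
The slides do not show how the data volume was counted. As a rough sketch, one accounting that lands close to the quoted figure assumes single precision values, a 420 x 420 block updated for each pivot row, and one read plus one write per updated element. All three are assumptions made for this example.

```python
n = 16728            # matrix dimension (from the slide)
band = 420           # band size (from the slide)
bytes_per_value = 4  # ASSUMPTION: single precision
transfers = 2        # ASSUMPTION: one read and one write per updated element

updates = n * band * band                        # ASSUMPTION: band x band block per pivot
bytes_moved = updates * transfers * bytes_per_value
gib_moved = bytes_moved / 2**30

ratio = updates / (updates * transfers)          # multiply-subtracts per transfer
print(f"{gib_moved:.2f} GiB moved, compute/memory ratio {ratio}")
```

Under these assumptions the total comes out just under 22 GiB. Counting each multiply-subtract as one operation against two transfers is also consistent with the stated ratio of 1/2.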








# **Performance of Out of Core Algorithm**



Compute Time for matrix size 16728x16728 w/ band size of 420

Gflops for matrix size 16728x16728 w/ band size of 420

- The Out of Core Algorithm outperforms UMFPACK on an Opteron 246 based workstation.
- No appreciable gain in performance past 4 SPEs.









# Latency for DMA put



- Significant performance hit for memory accesses smaller than 16 bytes.
- The bandwidth limited region starts at 8 x 128 bytes.



# SPE to main memory bandwidth



- Theoretical maximum bandwidth can almost be achieved for larger message sizes.

# **OCA Memory Access Size Dependence**



- Out of Core performance is better when the memory access size is a multiple of 128 bytes.
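
One simple way to act on this observation is to pad each transfer up to the next multiple of 128 bytes. The sketch below shows the rounding arithmetic only. The helper names and the 4 byte (single precision) element size are assumptions made for this example, and the slides do not say whether the author padded the data this way.

```python
ALIGNMENT = 128  # bytes; the multiple the slide reports as favourable


def padded_size(nbytes, alignment=ALIGNMENT):
    """Smallest multiple of `alignment` that is >= nbytes."""
    return -(-nbytes // alignment) * alignment   # ceiling division, then scale back


def padded_row_length(n_values, bytes_per_value=4):
    """Values per row after padding the row to a multiple of 128 bytes.

    ASSUMPTION: 4-byte (single precision) values.
    """
    return padded_size(n_values * bytes_per_value) // bytes_per_value


print(padded_size(1680), padded_row_length(420))
```

For a band of 420 single precision values (1680 bytes), the padded row is 1792 bytes, or 448 values.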




### **PPE/SPE Synchronization Mailboxes**

- Mailboxes are one common method of synchronization.
- PPE side: `writeSPEinMboxes();` (restart SPEs)
- SPE side: `spu_writech` on `SPU_WrOutMbox`, then `spu_readch(SPU_RdInMbox);` (wait for PPE)

Figure labels: MFC (16B/cycle), PPE (16B/cycle).



### PPE/SPE Synchronization Mailboxes Round Trip Times

| Synchronization method | Round trip time |
|---|---|
| Mailboxes using IBM SDK 2.1 C-intrinsics | 6.24 μs |
| Mailboxes using IBM SDK 2.1 C-intrinsics and pointers to MMIO registers | 3.65 μs |
| Mailboxes by pointers alone | 0.35 μs (not reliable) |
| Standard round trip latency, 16 byte message, TCP (Gigabit) | 58.33 μs |
| Standard round trip latency, 16 byte message, InfiniBand | 8.08 μs |

- IBM SDK 2.1 C-intrinsics for mailboxes do not seem to have ideal performance.
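
To put the measurements side by side, the short sketch below expresses each round trip time relative to the plain C-intrinsics mailbox. The numbers are taken from the table above; the dictionary keys are shortened labels chosen for this example.

```python
round_trip_us = {
    "SDK 2.1 C-intrinsics": 6.24,
    "C-intrinsics + MMIO pointers": 3.65,
    "pointers alone (not reliable)": 0.35,
    "TCP (Gigabit)": 58.33,
    "InfiniBand": 8.08,
}

baseline = round_trip_us["SDK 2.1 C-intrinsics"]
relative = {name: t / baseline for name, t in round_trip_us.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x the C-intrinsics round trip")
```

On these figures the hybrid scheme takes a little under 60% of the time of the intrinsics alone, while the intrinsics alone are only modestly faster than the InfiniBand round trip.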



### Synchronization by hybrid of C-intrinsics and pointers to MMIO registers



- Synchronization with IBM SDK 2.1 C-intrinsics and pointers to MMIO registers yields fairly good performance for a moderate number of SPEs.



### Synchronization exclusively by IBM SDK 2.1 mailbox C intrinsics

#### **SPE synchronization by SDK C intrinsics**



- Synchronization by IBM SDK 2.1 mailbox C intrinsics alone yields little gain at low SPE counts and a performance loss after only 4 SPEs.




# Data from synchronization exclusively by pointers



- Synchronization by pointers alone yields a nice speed up for all SPEs.
- There seems to be a reliability issue with reading the mailbox status register via pointers.



# Outline

- Introduction
- Out of Core Algorithm
- In Core Algorithm
  - Hide memory accesses
  - Hide synchronizations
- Summary

### Banded LU In Core Algorithm





# In Core Performance








### In Core Performance w/ linear speed up

#### Performance improves as band size increases






# **IBM QS20 performance**






### **ICA Memory Access Size Dependence**



- In core performance does not show a large dependence on memory access size.




# Summary



- Parallel LU decomposition can benefit from the high bandwidth of the CBE
  - Benefit depends greatly on synchronization scheme
- In core LU offers performance advantages over out of core LU
  - There are limits on the matrix bandwidth that in core LU can handle
- Partial pivoting



- Out of Core Algorithm
  - Sudarshan Raghunathan
  - John Chu MIT
- In Core Algorithm
  - Jeremy Kepner MIT LL
  - Sharon Sacco MIT LL
  - Rodric Rabbah IBM