



# Partitioned FFTC: An Improved Fast Fourier Transform for the IBM Cell Broadband Engine

## Andrew Shaffer Bruce Einfalt, Padma Raghavan

Applied Research Lab

Department of Computer Science & Engineering

The Pennsylvania State University

aps148@psu.edu

High Performance Embedded Computing (HPEC) Workshop 23-25 September 2008



# Partitioned FFTC (PFFTC) Requirements







### PFFTC Approach





### **Optimizations:**

- Single-pass partitioning
- Register-level double buffering
- "Asynchronous" synchronization
- Communicationfree combination stage



### PFFTC Results







#### **PFFTC Features:**

- Lowest known latency on Cell BE
- Peak performance of 33.61 GFLOPS for 16K problem size
- Speedup of 31% 56% over best prior Cell FFT
- Further improvement to 40 GFLOPS possible by using Fused Multiply-Add (FMA)-based FFT in solution stage

\* FFT GFLOPS based on 5Nlog<sub>2</sub>N operations / runtime

See poster C.8 for more details