### Abstract Machines for RASCs and Signal/Image Processing

# L. Mullin\*, J. Raynolds\*, I. Grout, J. Ryan, and Q. Li\* University at Albany, SUNY\* & University of Limerick, Ireland HPEC 2006

University of Limerick, Ireland

**University at Albany, SUNY** 

Irm-1 Irm 11/27/2006

## **Levels of Processor/Memory Hierarchy**

#### **continued**



## **Abstract Machine Concept**

- Abstract Machine is a one-to-one mapping between the various levels of hardware and software cache, memory, disk, network, FPGA etc. onto multi-dimensional arrays.
- The Abstract Machine is an algebraic specification of the algorithm using Mullin's formalism of A Mathematics of Arrays (MoA) and its corresponding indexing scheme the psi-calculus.
- The Abstract Machine is an exact Engineering Specification derived mathematically that can be given to a software engineer for direct, step by step, implementation into any convenient hardware and/or software language.
- We use an Abstract Machine approach to the efficient implementation of our cache-optimized FFT algorithm on a RASC architecture in which various levels of the FFT are efficiently distributed over hardware (FPGA) and software.

University of Limerick, Ireland

**University at Albany, SUNY** 

## **Arbitrary RASC/FPGA**



University of Limerick, Ireland

University at Albany, SUNY

Irm-4 Irm 11/27/2006

## **Design Flow**



## Field Programmable Gate Arrays

For use in RASC systems:

- Large number of available logic gates and interconnect resources to implement complex DSP functions.
- Specific FPGAs come with dedicated hardware blocks such as embedded 2's complement multipliers (combinational logic designs that do not require multiple clock cycles to complete the operation) allowing for the support of high-speed operations.

• Specific FPGAs come with embedded SRAM blocks to allow for temporary storage of data.

University of Limerick, Ireland

**University at Albany, SUNY** 

Irm-6 Irm 11/27/2006

## **Machine Abstraction**

#### **FFT with Cache Algorithm**



Irm-7 Irm 11/27/2006

### Summary: FFT Cache Algorithm to FFT Machine with Cache

- Maximize in-cache operations through use of repeated transposereshape operations
  - **Re-materialize the array to achieve locality.**
  - Do as many operations in cache as possible and repeat process
  - **Operations carried out at the price of rearranging**
  - Cache loop is an iteration space
  - Could also be over processors, etc.
- Data blocks, *abstract cache sizes*, can be of any size (powers of the radix):
- Optimum performance: tradeoff between reduction of cache misses and cost of transpose-reshape operations, i. e. rearranging.
- Costs are determined by combining iteration space from *normal* from to abstract machine and actual costs of access and control.
- Uniform view and analysis using an array algebra and index calculus: Conformal Computing.
- Cost-intensive calculations carried out in hardware (FPGA) for extreme speed seemlessly integrated with faster software calculations by design.

University of Limerick, Ireland

University at Albany, SUNY