# intrinsity...

### An Innovative High-Performance Architecture for Vector and Matrix Math Algorithms

Presented by: Tim Olson, Architect

HPEC 2002 – September 24, 2002

Authors: Veeraraghavan Anantha, Ph.D.; Christophe Harlé, Ph.D.; Tim Olson; George Yost, Ph.D.

© 2002 Intrinsity, Inc.

Intrinsity, the Intrinsity logo, the Intrinsity dot logo, Advanced Signal Processor, and FastMATH are trademarks of Intrinsity, Inc. MIPS is among the registered trademarks and MIPS32 is among the trademarks of MIPS Technologies, Inc. RapidIO and the RapidIO logo are trademarks of the RapidIO Trade Association. All other trademarks are for reference purposes only and are the property of their respective owners.

# Intrinsity FastMATH™ Vector and Matrix Math Processor

#### **Optimized for real-time and adaptive signal processing needs:**

Innovative architecture:

- 2 GHz SIMD 4 × 4 matrix engine, with multiprocessor scalability through high-bandwidth RapidIO<sup>™</sup> interfaces
- Fixed-point math
- High-level (e.g., C) language programmable
  - Compiler built-in matrix intrinsics
  - Vector/matrix library

- On-chip matrix coprocessor and MIPS32<sup>™</sup> ISA RISC core
- 4 × 4 array of processors, each with sixteen 32-bit registers, two 40-bit MACs
- 64 GOPS (peak)
- Matrix and vector math native instructions: 1-, 8-, 16-, 32-bit support; convenient complex math
- Descriptor-based DMA controller
- 1 Mbyte on-chip cache-coherent L2 cache
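As a rough consistency check on the peak figure, and assuming each MAC counts as one operation per cycle (the slides do not state how operations are counted), the numbers in the list above multiply out as:

$$16\ \text{processors} \times 2\ \frac{\text{MACs}}{\text{processor}} \times 2 \times 10^{9}\ \frac{\text{cycles}}{\text{s}} = 64 \times 10^{9}\ \frac{\text{MAC operations}}{\text{s}}$$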

#### Speed *plus* an architecture designed for parallel computations





## Matrix Register Arithmetic: Element-by-Element

The matrix engine has 16 matrix registers, each holding sixteen 32-bit values. Halfword and word arithmetic are supported.

Single-instruction, element-wise addition of two $4 \times 4$ matrices

[Figure: matrix registers $M_1$ and $M_2$]
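What the single element-wise add computes can be sketched in plain Python. This is only an illustration of the arithmetic: the names `matrix_add`, `m1`, and `m2` are hypothetical, the registers are modeled as nested lists, and fixed-point overflow handling is not modeled because the slides do not describe it.

```python
# A matrix register is modeled as a 4 x 4 list of Python ints. This is an
# assumption for illustration; the real registers hold 32-bit fixed-point
# values, and overflow behavior is not modeled here.
N = 4

def matrix_add(m1, m2):
    """Element-wise sum of two 4 x 4 matrices: out[i][j] = m1[i][j] + m2[i][j]."""
    return [[m1[i][j] + m2[i][j] for j in range(N)] for i in range(N)]

m1 = [[4 * i + j for j in range(N)] for i in range(N)]  # values 0..15
m2 = [[100] * N for _ in range(N)]                      # constant 100
total = matrix_add(m1, m2)                              # all 16 sums
```

On the matrix engine the 16 sums are described as a single instruction; the nested loops here only spell out which elements are paired.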

## Matrix Register Arithmetic: Block Rearrangement for Parallelism



Load 4 or 16 data streams (users) and re-block for SIMD parallel processing

- Original register load instructions
- block4 (four cycles): matrix operations on four streams
- For SIMD operations on 16 parallel data streams: continue the rearrangement with block data movement instructions, for 70 cycles (35 ns at 2 GHz) in total
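The goal of the re-blocking step for 16 streams can be sketched as a data-layout transformation. The slides do not give the exact semantics of `block4` or of the block data movement instructions, so the sketch below only shows one plausible end result, under the assumption that each register initially holds 16 consecutive samples of one stream and should end up holding one sample from each of the 16 streams. The names `load_register`, `reblock`, and `as_matrix` are hypothetical.

```python
LANES = 16  # elements per matrix register (4 x 4)

def load_register(stream, start):
    """Model a register load: 16 consecutive samples of one stream."""
    return stream[start:start + LANES]

def reblock(regs):
    """Transpose 16 registers of 16 samples each.

    Input:  regs[s][k] is sample k of stream s.
    Output: out[k][s] is sample k of stream s, so register k holds one
            sample from every stream, with stream s in lane s.
    """
    return [[regs[s][k] for s in range(LANES)] for k in range(LANES)]

def as_matrix(reg):
    """View a flat 16-element register as a 4 x 4 matrix."""
    return [reg[4 * i:4 * i + 4] for i in range(4)]

# Synthetic data: sample k of stream s is encoded as 1000 * s + k.
streams = [[1000 * s + k for k in range(32)] for s in range(LANES)]
loaded = [load_register(st, 0) for st in streams]
simd = reblock(loaded)
```

After this rearrangement, one element-wise instruction on `simd[k]` would apply the same operation to sample `k` of all 16 streams at once.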


## FastMATH Performance Example: Fast Fourier Transform

The matrix architecture and the clock speed contribute approximately equally to the advantage on this key benchmark

1K-point radix-4 FFT, 16-bit complex data
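For readers unfamiliar with the benchmark, a minimal reference radix-4 decimation-in-time FFT is sketched below. It is not the FastMATH implementation: it uses Python floating-point complex numbers rather than 16-bit fixed-point data, is written recursively for clarity rather than speed, and the function name `fft_radix4` is hypothetical.

```python
import cmath

def fft_radix4(x):
    """Radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    # Four interleaved sub-sequences, each transformed recursively.
    sub = [fft_radix4(x[r::4]) for r in range(4)]
    q = n // 4
    out = [0j] * n
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        w2 = w1 * w1
        w3 = w2 * w1
        a = sub[0][k]
        b = w1 * sub[1][k]
        c = w2 * sub[2][k]
        d = w3 * sub[3][k]
        # Radix-4 butterfly: a 4-point DFT of (a, b, c, d).
        out[k] = a + b + c + d
        out[k + q] = a - 1j * b - c + 1j * d
        out[k + 2 * q] = a - b + c - d
        out[k + 3 * q] = a + 1j * b - c - 1j * d
    return out

# A 1024-point transform (1024 = 4**5) of a unit impulse.
impulse = [1 + 0j] + [0j] * 1023
spectrum = fft_radix4(impulse)
```

Each butterfly combines four values using only additions and multiplications by ±1 and ±j after the twiddle multiplies.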






For 8 antennas, 10 Msamples per second, and a 1024-point complex FFT: requires *14.4% of a FastMATH processor*




## FastMATH Performance Example: Smart Antennas

#### Background

- More users than antennas ⇒ orthogonal beams not possible
- No a priori information about signal directions ⇒ need real-time adaptation
- Input stream is 16-bit complex data

#### **FastMATH Implementation**

- Covariance matrix calculated by *complex matrix-matrix multiplications* on $4 \times 4$ submatrices, then reassembled into the full matrix
- Covariance matrix inverted by Cholesky decomposition; use *block matrix manipulation* instructions to rearrange input into blocks for SIMD parallelization
- Beamforming using matrix-matrix multiplications; more efficient than simple vector math
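The first two implementation steps can be sketched as follows: the covariance matrix is accumulated from products of $4 \times 4$ complex tiles, then factored by Cholesky decomposition and used to solve for one weight vector. This is a floating-point illustration under stated assumptions, not the FastMATH code: the function names are hypothetical, the snapshot data is random, and the all-ones steering vector is a placeholder.

```python
import random

B = 4  # tile size, matching the 4 x 4 matrix registers

def tile(m, r, c):
    """Extract the 4 x 4 tile whose top-left corner is (r, c)."""
    return [row[c:c + B] for row in m[r:r + B]]

def tile_mul_herm(a, b):
    """Return a @ b^H for two 4 x 4 complex tiles."""
    return [[sum(a[i][k] * b[j][k].conjugate() for k in range(B))
             for j in range(B)] for i in range(B)]

def blocked_covariance(x):
    """R = X X^H / T for an N x T snapshot matrix, built tile by tile."""
    n, t = len(x), len(x[0])
    assert n % B == 0 and t % B == 0
    r = [[0j] * n for _ in range(n)]
    for bi in range(0, n, B):
        for bj in range(0, n, B):
            for bk in range(0, t, B):
                p = tile_mul_herm(tile(x, bi, bk), tile(x, bj, bk))
                for i in range(B):
                    for j in range(B):
                        r[bi + i][bj + j] += p[i][j] / t
    return r

def cholesky(r):
    """Lower-triangular L with L L^H = R (R Hermitian positive definite)."""
    n = len(r)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        d = r[j][j].real - sum(abs(l[j][k]) ** 2 for k in range(j))
        l[j][j] = complex(d ** 0.5)
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = (r[i][j] - s) / l[j][j]
    return l

def solve_cholesky(l, s):
    """Solve (L L^H) w = s by forward then backward substitution."""
    n = len(l)
    y = [0j] * n
    for i in range(n):
        y[i] = (s[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    w = [0j] * n
    for i in reversed(range(n)):
        acc = sum(l[k][i].conjugate() * w[k] for k in range(i + 1, n))
        w[i] = (y[i] - acc) / l[i][i].conjugate()
    return w

# Synthetic example: 8 antennas, 32 snapshots of random complex data.
random.seed(1)
snapshots = [[complex(random.gauss(0, 1), random.gauss(0, 1))
              for _ in range(32)] for _ in range(8)]
r = blocked_covariance(snapshots)
steering = [1 + 0j] * 8  # placeholder steering vector (assumption)
w = solve_cholesky(cholesky(r), steering)
```

Solving with the Cholesky factors avoids forming the explicit inverse; whether the FastMATH implementation forms the inverse explicitly is not stated in the slides.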

#### **WCDMA Example Results**

- With 64 voice users, 16 antennas, and 4 rake fingers per user, with weights updated every slot: 0.73 FastMATH processors

## Scaled Multiprocessor Example: CDMA Multi-User Detection

#### **Algorithms**

- Mitigate interference between users in CDMA
- Solve for estimators for correct symbols, beginning with user-user correlation matrix *R* and user input vector *y*
- Difference equation for interference on symbol m of desired user from nearby symbols of all other users:

$$y_{m} = \sum_{k=-K}^{K} R_{m-k} \hat{b}_{m-k}$$

- $\hat{b}$ is the estimator vector to be found for symbol *m* of the *N* users

#### Implementation

- Jacobi iteration: solve for matrix *B* of *M* symbols for *N* users. Perform matrix-matrix multiplications distributed over processors
- Calculate correlation matrices *R* on chip; large capacity L2 cache reduces data transfer
- At each iteration exchange partial results over RapidIO port via DMA
- RapidIO interfaces work in the background in parallel with computations; data transfer time is efficiently hidden
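The Jacobi step can be sketched for a simplified case. The sketch assumes a single real correlation matrix *R* (the synchronous case, with only the $k = 0$ term of the sum), so the system is $R B = Y$ with *B* and *Y* of size $N \times M$. The row bands stand in for the work assigned to each processor, and replacing `b` with the new estimate stands in for the exchange of partial results. The function name `jacobi_mud` and the parameter `n_proc` are hypothetical.

```python
def jacobi_mud(r, y, iterations, n_proc=2):
    """Solve R B = Y for B by Jacobi iteration.

    r: N x N correlation matrix (assumed strictly diagonally dominant,
       which guarantees that the iteration converges).
    y: N x M matrix of inputs (N users, M symbols).
    """
    n, m = len(r), len(y[0])
    # Contiguous row bands, one per (simulated) processor.
    bands = [range(p * n // n_proc, (p + 1) * n // n_proc)
             for p in range(n_proc)]
    # Initial estimate: ignore interference from other users.
    b = [[y[i][j] / r[i][i] for j in range(m)] for i in range(n)]
    for _ in range(iterations):
        new_b = [None] * n
        for band in bands:  # each band would run on its own processor
            for i in band:
                new_b[i] = [
                    (y[i][j] - sum(r[i][k] * b[k][j]
                                   for k in range(n) if k != i)) / r[i][i]
                    for j in range(m)
                ]
        b = new_b  # stands in for exchanging partial results
    return b

# Synthetic example: 4 users, 3 symbols each, known symbols of +1 or -1.
r = [[1.0, 0.1, 0.05, 0.0],
     [0.1, 1.0, 0.1, 0.05],
     [0.05, 0.1, 1.0, 0.1],
     [0.0, 0.05, 0.1, 1.0]]
b_true = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [-1, 1, 1]]
y = [[sum(r[i][k] * b_true[k][j] for k in range(4)) for j in range(3)]
     for i in range(4)]
b_est = jacobi_mud(r, y, iterations=40)
```

Each iteration needs the full previous estimate `b` on every processor, which is why partial results are exchanged once per iteration.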

## Scaled Multiprocessor Example: WCDMA Short Code Multi-User Detection

- Data transfer in parallel with computation
- Scalable multiprocessor system that distributes tasks and results over the RapidIO interface via the coherent L2 cache
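Overlapping data transfer with computation is commonly done by double buffering: the next block is fetched while the current one is processed. The sketch below shows that general pattern with a worker thread standing in for the DMA engine. It does not model RapidIO or the FastMATH DMA controller, and the names `fetch_block`, `compute`, and `process_overlapped` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(blocks, i):
    """Stand-in for a DMA transfer: returns a copy of block i."""
    return list(blocks[i])

def compute(block):
    """Stand-in for the per-block computation: sum of squares."""
    return sum(v * v for v in block)

def process_overlapped(blocks):
    """Process blocks in order while prefetching the next one."""
    results = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_block, blocks, 0)
        for i in range(len(blocks)):
            current = pending.result()      # wait for this block's transfer
            if i + 1 < len(blocks):
                # Start the next transfer before computing on this block.
                pending = dma.submit(fetch_block, blocks, i + 1)
            results.append(compute(current))
    return results

results = process_overlapped([[1, 2], [3], [4, 5, 6]])
```

When the per-block compute time is at least as long as the transfer time, the transfer cost after the first block is hidden behind computation.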

