

LOCKHEED MARTIN



### **Authors**

| Author           | Affiliation         |
|------------------|---------------------|
| Stewart Reddaway | WorldScape Inc.     |
| Rick Pancoast    | Lockheed Martin MS2 |
| Brad Atwater     | Lockheed Martin MS2 |
| Pete Rogina      | WorldScape Inc.     |
| Paul Bruno       | WorldScape Inc.     |
| Leon Trevito     | Lockheed Martin MS2 |

September 29, 2004

**ClearSpeed** Technology plc

# Overview

### Work Objective

- Provide working hardware benchmark for Multi-Threaded Array Processing Technology
  - Enable embedded processing decisions to be accelerated for upcoming platforms (radar and others)
  - Validate Pulse Compression benchmark with hardware, and with data flowing from and to external DRAM
  - Support customers' strategic technology investment decisions

### Share results with industry

- New standard for performance AND performance per watt


# Architecture

### ClearSpeed's Multi Threaded Array Processor Architecture – MTAP



#### Architectural DSP Features:

- Multiple operations per cycle
  - Data-parallel array processing
  - Internal PE parallelism
  - Concurrent I/O and compute
  - Simultaneous mono and poly operations
- Specialized execution units in each PE
  - Integer MAC, Floating-Point Units
- On-chip memories
  - Instruction and data caches
  - High-bandwidth PE "poly" memories
  - Large scratchpad "mono" memory
- Zero-overhead looping
  - Concurrent mono and poly operations

- Fully programmable at high level with Cn (parallel variant of C)
- Hardware multi-threading
- Extensible instruction set
- Fast processor initialization and restart
- High performance, low power
  - 10 GFLOPS/Watt

- Scalable internal parallelism
  - Array of Processor Elements (PEs)
  - Compute and bandwidth scale together
  - From 10s to 1,000s of PEs
  - Multiple specialized execution units per PE
- Multiple high speed I/O channels


# Architecture

### Processor Element Structure



- ALU + accelerators: integer MAC, Dual FPU, DIV/SQRT
- High-bandwidth, multi-port register file
- Closely-coupled SRAM for data
- High-bandwidth per PE DMA: PIO, SIO

- High-bandwidth inter-PE communication
- Supports multiple data types:
  - 8, 16, 24, 32-bit, ... fixed point
  - 32-bit IEEE floating point


# Applications

### Power Comparison Results (Table presented at HPEC 2003)

| Processor                        | Clock   | Power          | FFT/sec/Watt | PC/sec/Watt |
|----------------------------------|---------|----------------|--------------|-------------|
| Mercury PowerPC 7410             | 400 MHz | 8.3 Watts      | 3052         | 782.2       |
| WorldScape/ClearSpeed 64 PE Chip | 200 MHz | 2.0 Watts\*\*  | 56870        | 24980       |
| Speedup                          |         |                | 18.6 X       | 31.9 X      |

\*\* 2.0 Watts was the worst-case result from Mentor Mach PA Tools.

Actual measured hardware results: < 1.85 Watts.

HPEC 2003 Cycle Accurate Simulations were validated on actual hardware. Results matched to within 1%.
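
The speedup row in the table is the ratio of the two processors' per-watt throughput figures. The short check below reproduces it from the table values; the variable names are illustrative only and are not taken from the original slides.

```python
# Per-watt throughput figures copied from the power comparison table.
ppc_fft_per_w, ppc_pc_per_w = 3052.0, 782.2        # Mercury PowerPC 7410
mtap_fft_per_w, mtap_pc_per_w = 56870.0, 24980.0   # WorldScape/ClearSpeed 64 PE chip

# Speedup is the ratio of per-watt throughput (illustrative variable names).
fft_speedup = mtap_fft_per_w / ppc_fft_per_w
pc_speedup = mtap_pc_per_w / ppc_pc_per_w

print(f"FFT/sec/Watt speedup: {fft_speedup:.1f} X")
print(f"PC/sec/Watt speedup:  {pc_speedup:.1f} X")
```

Rounded to one decimal place, the ratios agree with the 18.6 X and 31.9 X entries in the table.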


WorldScape and Lockheed Martin collaborated to provide demonstration using realistic Pulse Compression data on actual hardware



- 1K FFT and IFFT implemented on 8 PEs with 128 complex points per PE (8 FFTs performed in parallel over 64 PEs)
- Pulse Compression based upon optimized instructions: FFT, complex multiply by a realistic reference FFT, IFFT
- 32-bit IEEE standard floating point
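
The FFT, complex multiply, IFFT chain listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the optimized MTAP implementation: it uses a simple recursive radix-2 FFT on natural-order data, an unweighted matched-filter reference, an assumed 5 MHz chirp bandwidth, and a hypothetical echo delay of 300 samples. All function and variable names are invented for the example.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT with the usual 1/N scaling."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]

def pulse_compress(samples, ref_freq):
    """Zero-pad to the FFT size, FFT, multiply by the frequency-domain reference, IFFT."""
    n = len(ref_freq)
    padded = list(samples) + [0j] * (n - len(samples))
    product = [s * r for s, r in zip(fft(padded), ref_freq)]
    return ifft(product)

# Illustrative parameters: sizes follow the slides, bandwidth and delay are assumptions.
fs_hz, pulse_s, fft_size, input_samples = 20e6, 10e-6, 1024, 870
bandwidth_hz = 5e6          # assumed; not stated in the source
delay = 300                 # hypothetical echo position
pulse_samples = round(pulse_s * fs_hz)

# LFM up-chirp sampled at fs_hz.
chirp = [cmath.exp(1j * cmath.pi * (bandwidth_hz / pulse_s) * (i / fs_hz) ** 2)
         for i in range(pulse_samples)]

# Received data: a single echo of the chirp placed at the assumed delay.
received = [0j] * input_samples
received[delay:delay + pulse_samples] = chirp

# Matched-filter reference: conjugate spectrum of the zero-padded chirp (no weighting).
reference = [v.conjugate() for v in fft(chirp + [0j] * (fft_size - pulse_samples))]

compressed = pulse_compress(received, reference)
valid = compressed[:input_samples - pulse_samples + 1]
peak_index = max(range(len(valid)), key=lambda i: abs(valid[i]))
print(peak_index, round(abs(valid[peak_index]), 3))
```

In this sketch the compressed peak lands at the sample where the echo was placed, which is the behaviour the matched-filter chain is meant to produce.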


### **Benchmark Measurements:**

Validate Pulse Compression performance with hardware and with data flowing from and to external DRAM (1 MTAP processor)

|                   | Per Second (/s) | Per Second Per Watt (/s/W) |
|-------------------|-----------------|----------------------------|
| FFTs (within PC)  | 68800\*         | 37200                      |
| Pulse Compression | 34680           | 18744                      |
| GFLOP             | 3.73            | 2.02                       |

\* Adjusted for the complex multiply (CM): 73000 FFT/s, 39400 FFT/s/W
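
Dividing each throughput figure by its per-watt counterpart recovers the power level implied by the table. The check below is a small arithmetic sketch using only the table values; the dictionary and variable names are illustrative.

```python
# (per second, per second per watt) pairs copied from the single-MTAP table.
rates = {
    "FFTs (within PC)": (68800.0, 37200.0),
    "Pulse Compression": (34680.0, 18744.0),
    "GFLOP": (3.73, 2.02),
}

# Implied power in watts = throughput / (throughput per watt).
implied_watts = {name: per_s / per_s_w for name, (per_s, per_s_w) in rates.items()}

for name, watts in implied_watts.items():
    print(f"{name}: {watts:.2f} W")
```

All three rows imply roughly 1.85 W, in line with the measured hardware power quoted with the HPEC 2003 comparison table.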





1. Input data and reference function loaded from Host onto DRAM
2. Data input from DRAM to MTAP #1, processed, and output into DRAM
3. Results returned to Host for display


#### Pulse Compression Input (MATLAB)

- 1 kHz PRF (1 ms PRI)
- 20 MHz sampling rate
- 870 samples
- Echo
  - 10 µs pulse
  - LFM chirp up
  - 200 samples

#### Pulse Compression Reference (MATLAB)

- Frequency-domain reference
- 10 µs
- LFM chirp up
- 1024 samples
- Hamming weighting
- Bit-reversed to match optimized implementation
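
A bit-reversed ordering of a 1024-point reference can be sketched as an index permutation. The slides do not say exactly how the optimized implementation orders its data, so this is only the standard radix-2 bit-reversal permutation, shown on placeholder values; the function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reverse_permute(values):
    """Return `values` reordered so that position i holds values[bit_reverse(i)]."""
    n = len(values)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    return [values[bit_reverse(i, bits)] for i in range(n)]

# Placeholder reference: the natural-order indices 0..1023 stand in for the real data.
reference = list(range(1024))
permuted = bit_reverse_permute(reference)
print(permuted[:4])
```

Storing the reference in the same order as the FFT output is a common way to avoid a separate reordering pass before the complex multiply; whether that is the motivation here is an assumption.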

#### Pulse Compression Output (MATLAB)

- 671 samples out of PC
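
The sample counts quoted for the input, echo, and output are mutually consistent, as the arithmetic sketch below shows. It assumes the output count is the usual number of full-overlap correlation lags (input length minus pulse length plus one); the variable names are illustrative.

```python
prf_hz = 1_000            # 1 kHz PRF
fs_hz = 20_000_000        # 20 MHz sampling rate
pulse_s = 10e-6           # 10 µs pulse
input_samples = 870       # samples into pulse compression

pri_s = 1 / prf_hz                                   # pulse repetition interval
pulse_samples = round(pulse_s * fs_hz)               # samples spanned by the pulse
output_samples = input_samples - pulse_samples + 1   # full-overlap lags (assumed rule)

print(pri_s, pulse_samples, output_samples)
```

Under that assumption the result matches the 200-sample echo and the 671 output samples stated on the slides.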




#### Pulse Compression Input/Output (Actual)



#### Pulse Compression Reference (Actual)\*



### Benchmark Measurements:

Validate Pulse Compression performance with hardware and with data flowing from and to external DRAM

(Average Performance across 2 MTAP processors)

|                   | Per Second (/s) | Per Second Per Watt (/s/W) |
|-------------------|-----------------|----------------------------|
| FFTs (within PC)  | 56800\*         | 30700                      |
| Pulse Compression | 28610           | 15465                      |
| GFLOP             | 3.08            | 1.67                       |

\* Adjusted for the complex multiply (CM): 60200 FFT/s, 32510 FFT/s/W
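
The two benchmark tables can be compared row by row to see how the per-processor average in the two-MTAP configuration relates to the single-MTAP figures. The sketch below only takes ratios of the published numbers; it makes no claim about aggregate throughput, and the names are illustrative.

```python
# Per-second figures copied from the single-MTAP and two-MTAP (average) tables.
single_mtap = {"FFTs (within PC)": 68800.0, "Pulse Compression": 34680.0, "GFLOP": 3.73}
dual_mtap_avg = {"FFTs (within PC)": 56800.0, "Pulse Compression": 28610.0, "GFLOP": 3.08}

# Ratio of the two-MTAP per-processor average to the single-MTAP rate.
ratios = {name: dual_mtap_avg[name] / single_mtap[name] for name in single_mtap}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

Each row comes out near 0.825, so the per-processor average in the chained configuration is roughly 82 to 83 percent of the single-processor rate.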





1. Input data and reference function loaded from Host onto DRAM
2. Data input to MTAP #1 and (via MTAP #1) to MTAP #2, processed, and output (via MTAP #1) into DRAM
3. Results returned to Host for display




- Hardware validation of HPEC 2003 results to within 1%
- World-class radar processing benchmark results
- Optimized Pulse Compression functions modified using COTS SDK and integrated onto Host platform
- Wide-ranging applicability to DoD/Commercial processing requirements
- VSIPL Core Lite Libraries under development

#### **Application Areas**

- Image Processing
- Signal Processing
- Compression/Decompression
- Encryption/Decryption
- Network Processing
- Search Engine
- Supercomputing Applications
