



# High Performance Embedded Computing Software Initiative (HPEC-SI)

#### **Dr. Jeremy Kepner / Lincoln Laboratory**

This work is sponsored by the Department of Defense under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.











- Introduction
  - Goals
  - Program Structure
- Demonstration
- Development
- Applied Research
- Future Challenges
- Summary

## Overview - High Performance Embedded Computing (HPEC) Initiative



**Challenge:** Transition advanced software technology and practices into major defense acquisition programs

#### **Common Imagery Processor (CIP)**





### Why Is DoD Concerned with Embedded Software?





- COTS acquisition practices have shifted the burden from "point design" hardware to "point design" software
- Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards

**MITRE** 

MIT Lincoln Laboratory



### **Issues with Current HPEC Development**

#### **Inadequacy of Software Practices & Standards**







- High Performance Embedded Computing is pervasive throughout DoD applications
  - Airborne Radar Insertion program
    - 85% software rewrite for each hardware platform
  - Missile common processor
    - Processor board costs < \$100k
    - Software development costs > \$100M
  - Torpedo upgrade
    - Two software rewrites required after changes in hardware design



- Not portable
- Not scalable
- Difficult to develop
- Expensive to maintain




### Evolution of Software Support Towards "Write Once, Run Anywhere/Anysize"





- Application software has traditionally been tied to the hardware
- Many acquisition programs are developing stove-piped middleware "standards"
- Open software standards can provide portability, performance, and productivity benefits
- Support "Write Once, Run Anywhere/Anysize"





#### **Program Goals**

- Develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance
- Engage acquisition community to promote technology insertion
- Deliver quantifiable benefits

| Benefit      | Measure                                                            |
|--------------|--------------------------------------------------------------------|
| Portability  | reduction in lines of code changed to port/scale to a new system   |
| Productivity | reduction in overall lines of code                                 |
| Performance  | computation and communication benchmarks                           |




Slide-7 www.hpec-si.org



### Organization





- Partnership with ODUSD(S&T), Government Labs, FFRDCs, Universities, Contractors, Vendors and DoD programs
- Over 100 participants from over 20 organizations













- Introduction
- Demonstration
  - Common Imagery Processor
  - AEGIS BMD (planned)
- Development
- Applied Research
- Future Challenges
- Summary




### **Common Imagery Processor**

- Demonstration Overview -





\* CIP picture courtesy of Northrop Grumman Corporation



### **Common Imagery Processor** - Demonstration Overview -





#### **Demonstrate standards-based platform-independent CIP processing (ASARS-2)**

- Assess performance of current COTS portability standards (MPI, VSIPL)
- Validate SW development productivity of emerging Data Reorganization Interface
- **MITRE and Northrop Grumman**

Single code base optimized for all high performance architectures provides future flexibility:

- Embedded Multicomputers
- Shared-Memory Servers
- Commodity Clusters
- Massively Parallel Processors







### **Embedded Multicomputers**

- CSPI 500MHz PPC7410 (vendor loan)
- Mercury 500MHz PPC7410 (vendor loan)
- Sky 333MHz PPC7400 (vendor loan)
- Sky 500MHz PPC7410 (vendor loan)

### **Mainstream Servers**

- HP/COMPAQ ES40LP 833MHz Alpha ev6 (CIP hardware)
- HP/COMPAQ ES40 500MHz Alpha ev6 (CIP hardware)
- SGI Origin 2000 250MHz R10k (CIP hardware)
- SGI Origin 3800 400MHz R12k (ARL MSRC)
- IBM 1.3GHz Power 4 (ARL MSRC)
- Generic LINUX Cluster




### Shared Memory / CIP Server versus Distributed Memory / Embedded Vendor








## **Form Factor Improvements**





- IOP: 6U VME chassis (9 slots potentially available)
- IFP: HP/COMPAQ ES40LP



- IOP could support 2 G4 IFPs
  - form factor reduction (x2)
- 6U VME can support 5 G4 IFPs
  - processing capability increase (x2.5)








- Introduction
- Demonstration
- Development
  - Parallel (||VSIPL++)
- Applied Research
- Future Challenges
- Summary



# **Emergence of Component Standards**









#### **BLAS** zherk Routine

- **BLAS = Basic Linear Algebra Subprograms**
- Hermitian matrix M: $\mathrm{conjug}(M) = M^t$
- zherk performs a rank-k update of Hermitian matrix C:

$C \leftarrow \alpha \cdot A \cdot \mathrm{conjug}(A)^t + \beta \cdot C$

**VSIPL** code

```
/* alpha and beta are real scalars (vsip_scalar_d) defined elsewhere */
vsip_cmview_d *A   = vsip_cmcreate_d(10, 15, VSIP_ROW, VSIP_MEM_NONE);
vsip_cmview_d *C   = vsip_cmcreate_d(10, 10, VSIP_ROW, VSIP_MEM_NONE);
vsip_cmview_d *tmp = vsip_cmcreate_d(10, 10, VSIP_ROW, VSIP_MEM_NONE);
vsip_cmprodh_d(A, A, tmp);         /* A*conjug(A)^t */
vsip_rscmmul_d(alpha, tmp, tmp);   /* alpha*A*conjug(A)^t */
vsip_rscmmul_d(beta, C, C);        /* beta*C */
vsip_cmadd_d(tmp, C, C);           /* alpha*A*conjug(A)^t + beta*C */
vsip_cblockdestroy_d(vsip_cmdestroy_d(tmp));
vsip_cblockdestroy_d(vsip_cmdestroy_d(C));
vsip_cblockdestroy_d(vsip_cmdestroy_d(A));
```

VSIPL++ code (also parallel)

```
Matrix<complex<double> > A(10,15);
Matrix<complex<double> > C(10,10);
C = alpha * prodh(A,A) + beta * C;
```
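To make the arithmetic behind both versions concrete, here is a minimal sketch of the same update in plain C++ using only the standard library. The `CMatrix` type and the `herk` function are hypothetical names introduced for illustration; they are not part of VSIPL, VSIPL++, or BLAS. Unlike the real zherk, which updates only one triangle of C, this sketch fills the whole matrix.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical dense row-major complex matrix (stand-in for a library type).
struct CMatrix {
    std::size_t rows, cols;
    std::vector<cplx> data;
    CMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    cplx& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const cplx& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// C <- alpha * A * conjug(A)^t + beta * C, with real alpha and beta.
void herk(double alpha, const CMatrix& A, double beta, CMatrix& C) {
    assert(C.rows == A.rows && C.cols == A.rows);
    for (std::size_t i = 0; i < A.rows; ++i) {
        for (std::size_t j = 0; j < A.rows; ++j) {
            cplx sum(0.0, 0.0);
            // Row i of A times the conjugate of row j of A.
            for (std::size_t k = 0; k < A.cols; ++k) {
                sum += A(i, k) * std::conj(A(j, k));
            }
            C(i, j) = alpha * sum + beta * C(i, j);
        }
    }
}
```

For a 10x15 matrix A and a 10x10 matrix C, as in the slide, the call would be `herk(alpha, A, beta, C)`. The result is Hermitian whenever the incoming C is Hermitian.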

- Sonar Example: K-W Beamformer
- Converted C VSIPL to VSIPL++
- 2.5x fewer SLOCs








- Hand-coded loop achieves good performance, but is problem specific and low level
- Optimized VSIPL performs well for simple expressions, worse for more complex expressions
- PETE-style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high-level
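The reason composed array operators can approach hand-coded loop speed is the expression-template technique: operators build a lightweight expression tree, and the assignment evaluates it in one fused loop with no temporary vectors. The following is a hand-rolled sketch of that idea, not PETE's actual API; `Expr`, `BinExpr`, `Vec`, `Add`, and `Mul` are hypothetical names, and no AltiVec intrinsics are used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base so the operators below only match expression types.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node for an element-wise binary operation; holds references, owns no data.
template <typename L, typename R, typename Op>
struct BinExpr : Expr<BinExpr<L, R, Op>> {
    const L& lhs;
    const R& rhs;
    BinExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return Op::apply(lhs[i], rhs[i]); }
    std::size_t size() const { return lhs.size(); }
};

struct Add { static double apply(double a, double b) { return a + b; } };
struct Mul { static double apply(double a, double b) { return a * b; } };

class Vec : public Expr<Vec> {
    std::vector<double> data_;
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assignment walks the expression tree once: a single fused loop.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = expr[i];
        return *this;
    }
};

template <typename L, typename R>
BinExpr<L, R, Add> operator+(const Expr<L>& l, const Expr<R>& r) {
    return BinExpr<L, R, Add>(l.self(), r.self());
}

template <typename L, typename R>
BinExpr<L, R, Mul> operator*(const Expr<L>& l, const Expr<R>& r) {
    return BinExpr<L, R, Mul>(l.self(), r.self());
}
```

With vectors `a`, `b`, `c`, and `d` of equal length, `a = b + c * d;` compiles down to one loop computing `b[i] + c[i] * d[i]` per element. Because the nodes hold references, an expression should be assigned in the same statement that builds it rather than stored for later use.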



#### Software Technology

|                    | AltiVec loop | VSIPL (vendor optimized)                               | PETE with AltiVec |
|--------------------|--------------|--------------------------------------------------------|-------------------|
| Language           | C            | C                                                      | C++               |
| Implementation     | For loop     | AltiVec aware VSIPro Core Lite (www.mpi-softtech.com)  | PETE operators    |
| AltiVec extensions | Direct use   | No multiply-add                                        | Indirect use      |
| Unit stride        | Assumed      | Cannot be assumed                                      | Assumed           |
| Vector alignment   | Assumed      | Cannot be assumed                                      | Assumed           |



# **Parallel Pipeline Mapping**







# Scalable Approach





#### Lincoln Parallel Vector Library (PVL)

- Single processor and multi-processor code are the same
- Maps can be changed without changing software
- High level code is compact
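The separation of maps from application code can be illustrated with a small sketch. This is not PVL's API: `BlockMap` and `sum_of_squares` are hypothetical names, and the processors are simulated serially in one loop, assuming a simple block distribution. The point is that the application function never changes when the map does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical map: block distribution of n elements over `procs` processors.
struct BlockMap {
    std::size_t procs;
    // Half-open index range [begin, end) owned by processor `rank`.
    std::size_t begin(std::size_t rank, std::size_t n) const { return rank * n / procs; }
    std::size_t end(std::size_t rank, std::size_t n) const { return (rank + 1) * n / procs; }
};

// Application code: written once, independent of how many processors the map names.
double sum_of_squares(const std::vector<double>& x, const BlockMap& map) {
    assert(map.procs > 0);
    double total = 0.0;
    for (std::size_t rank = 0; rank < map.procs; ++rank) {
        double partial = 0.0;  // what one processor would compute locally
        for (std::size_t i = map.begin(rank, x.size()); i < map.end(rank, x.size()); ++i) {
            partial += x[i] * x[i];
        }
        total += partial;  // stands in for a reduction across processors
    }
    return total;
}
```

Calling `sum_of_squares(x, BlockMap{1})` and `sum_of_squares(x, BlockMap{3})` runs the same function under a single-processor map and a three-processor map; only the map argument differs.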







- Introduction
- Demonstration
- Development
- Applied Research
  - Fault Tolerance
  - Parallel Specification
  - Hybrid Architectures (see SBR)
- Future Challenges
- Summary



# **Dynamic Mapping for Fault Tolerance**






**AFRL** 



# **Parallel Specification**





- Matlab is the main specification language for signal processing
- pMatlab allows parallel specifications using the same mapping constructs being developed for ||VSIPL++






- Introduction
- Demonstration
- Development
- Applied Research
- Future Challenges
- Summary



# **Optimal Mapping of Complex Algorithms**








## **HPEC-SI Future Challenges**







# Summary



- HPEC-SI Program on track toward changing software practice in DoD HPEC Signal and Image Processing
  - Outside funding obtained for DoD program specific activities (on top of core HPEC-SI effort)
  - 1st Demo completed; 2nd selected
  - World's first parallel, object-oriented standard
  - Applied research into task/pipeline parallelism; fault tolerance; parallel specification
- Keys to success
  - Program Office Support: 5-year time horizon is a better match to DoD program development
  - Quantitative goals for portability, productivity and performance
  - Engineering community support




# Web Links



- High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
- High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
- Vector, Signal, and Image Processing Library: http://www.vsipl.org/
- MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
- Data Reorganization Initiative: http://www.data-re.org/
- CodeSourcery, LLC: http://www.codesourcery.com/
- MatlabMPI: http://www.ll.mit.edu/MatlabMPI