# DarkHorse: A Proposed PetaScale Architecture (+)

Steve Poole, Los Alamos National Laboratory / Oak Ridge National Laboratory; HPEC06, September 19-22, 2006; LA-UR-06





UNCLASSIFIED LA-UR-06

HP C-DO

## Advanced Architecture Team (LANL)

- LANL
  - Dave DuBois
  - Andy DuBois
  - Steve Poole
  - Chris Kemper



## Some History of DH & 3D

- First basic ideas in 1997/1998
- HMM/GA Application (Kestrel, Sequence Alignment Modeling)
- Switch Application (SanNetworks, memory technology)
- 3D FPGA
- Potential Seismic Application (FD, RTM, A/E Modeling, XON)
- Specialized Search/Sort Problem (DB Problem)
- Started @ LANL 2001
  - 3D FPGA
  - 3D CAM
- Early processor disclosures in 2002





#### Advanced Architectures Project (finished)

Dark Horse

Processor & Memory Subsystems

#### Computer industry collaborations

 Understand and influence product roadmaps

#### Semiconductor industry collaborations

3D semiconductor stacking

#### Co-processor technologies

- FPGA accelerators
- Graphics/Network processor accelerators



Determine the feasibility of developing a PF system in the ~FY08 time frame that is:

- based potentially on a variety of microprocessors,
- computationally efficient for LANL algorithms,
- straightforward to program,
- balanced, and
- first-principles based.

Applications & Algorithms

#### Minimizing time to solution for LANL computational workloads

- Adapt algorithms to different architectures
- Develop new algorithms that take maximum advantage of computer architectures
- Programming model(s)




## P(FL)OPS Project Phases, 1-5 Years

- Initial Studies, finished (ARM, BCom, SPARC-8)
- HW Phases:
  - FPGA/CAM Proof of Concept
  - SOC design in FPGA
  - Optical Switch Development
  - Integrated Optics on Processor Stack
  - Processor based SOC, ~1 TFLOPS, 128/256 GByte
  - Initial Prototype
  - Not just PIM, true SOC
- SW Phases: System Software, Middleware, SGPFS, Data Migration, Communication, Scientific Libraries, VIZ, Applications, Hybrid Programming Model, OS
- Level of Effort: 5 years development (not free, now 3 years) for:
  - 3-D PE = 1-10 TF + 128/256/1024 GByte + Communications
  - Optical/Copper Switch Fabric
  - Software (lots of work): Hybrid + Languages + OS





#### PFLOPS Advanced Architecture Design, Hardware and Software

| Hardware | Elements | Technology | Software |
|-------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Computing:<br>1K PEs,<br>1 PF = 10<sup>15</sup> FLOPS,<br>100 TB = 10<sup>14</sup> Byte | PE = System On Chip, "X" GHz,<br>~200 Functional Units,<br>MultiCore, 128/256GB,<br>Self healing,<br>4-8TB/s Vertical Internal BW,<br>1TF<br>(128-256TB M) | 3-D Stacking with N Layers/Split | <u>OS:</u><br>Linux (64 bit),<br>K42 (multi-processor 64 bit all the way),<br>Plan9 (64 bit, secure)<br><u>Programming Models:</u><br>asymmetric MP model,<br>function off-load models: 1) multi-stage and 2) parallel-stage,<br>computational acceleration models: 1) in core and 2) out of core |
| Communication:<br>Compute Ratio B/F = 0.2,<br>Disk I/O Ratio B/F = 0.1 | PE-PE BW = 20 x 10GB/s<br>PE-Blade BW = 10 x 10GB/s<br>Blade/PE = 100 | Optical/Copper Switch Fabric | (illegible in source) |
| Data Storage:<br>Blade=10 <sup>5</sup> , Disk=10 <sup>17</sup> B,<br>Tape=10 <sup>18</sup> B<br>100PB | Blade = 1 TB capacity, 1GB<br>cache, 100 MB/s sustained<br>BW                                                                                                 | NASD<br>OBSDs                                 | Scalable Global Parallel File System,<br>OBSD Disk to Tape Migration SW                                                                                                                                                                                                                                                                                                                 |
| Visualization:                                                                                        | GPU = General Purpose Unit                                                                                                                                    | 3-D on Chip                                   | OpenGL (or successor) based Surface<br>and Volumetric Rendering Software                                                                                                                                                                                                                                                                                                                |
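
The figures in the table can be cross-checked against each other. The sketch below is only a sanity check, not part of the original design material: it takes the quoted per-PE and per-blade numbers and recomputes the system totals and byte/FLOP ratios. The constant and function names are illustrative, "1K PEs" is assumed to mean 1,000, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Cross-check of the sizing figures quoted in the table.
# Inputs come from the table; decimal units and "1K = 1,000" are assumptions.

PES = 1_000                      # "1K PEs"
PE_FLOPS = 1e12                  # 1 TF per PE
PE_MEM_BYTES = (128e9, 256e9)    # 128/256 GB per PE
LINK_BYTES_PER_S = 10e9          # one 10 GB/s link
PE_PE_LINKS = 20                 # PE-PE BW = 20 x 10 GB/s
PE_BLADE_LINKS = 10              # PE-Blade BW = 10 x 10 GB/s
BLADES = 10**5                   # storage blades
BLADE_BYTES = 1e12               # 1 TB per blade


def system_summary():
    """Recompute system totals and balance ratios from per-unit figures."""
    pe_pe_bw = PE_PE_LINKS * LINK_BYTES_PER_S
    pe_blade_bw = PE_BLADE_LINKS * LINK_BYTES_PER_S
    return {
        "peak_flops": PES * PE_FLOPS,
        "memory_bytes": tuple(PES * m for m in PE_MEM_BYTES),
        "comm_bytes_per_flop": pe_pe_bw / PE_FLOPS,
        "disk_bytes_per_flop": pe_blade_bw / PE_FLOPS,
        "disk_bytes": BLADES * BLADE_BYTES,
        "blades_per_pe": BLADES / PES,
    }


if __name__ == "__main__":
    for key, value in system_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions the per-unit figures reproduce the headline numbers: 1 PF peak, 128-256 TB of memory, B/F ratios of 0.2 and 0.1, 100 PB of disk, and 100 blades per PE.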

#### Supercomputer Interconnect Example








## **Building 3-D Chips**






## Stacking Multiple Thin Layers







Repeat - One Wafer at a Time




#### Stacking Process

Three wafers successfully aligned and stacked





"Super Via" 4 µm in diameter and 12 µm in height



## "FaStack" Cross-Sectional Diagram



## Stacked Chip Interconnect/Thermal Paths








### Interlayer Interconnect








#### Interlayer Interconnect

- Minimum interlayer delay
- Flexible width and routing
- One-to-many or many-to-one
- Complex topologies feasible
- Designed and delivered to LANL 2003





# Today ~\$50K/port → ~\$1.5K/port for commercial



- Modular chassis
  - Octal switch port blade
- Integration achieves
  - 200:1 Parts count reduction
  - 10X Power reduction
- Provisioning in 16 port increments
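The per-port cost and 16-port provisioning figures above can be combined into a rough sizing helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, the cost figures are the approximate ones from the slide title, and chassis, optics, and cabling costs are ignored.

```python
import math

PORT_INCREMENT = 16      # "Provisioning in 16 port increments"
COST_TODAY = 50_000      # ~$50K per port today (approximate, from the slide)
COST_TARGET = 1_500      # ~$1.5K per port commercial target (approximate)


def provisioned_ports(ports_needed: int) -> int:
    """Round a port requirement up to the next 16-port increment."""
    if ports_needed <= 0:
        return 0
    return math.ceil(ports_needed / PORT_INCREMENT) * PORT_INCREMENT


def fabric_cost(ports_needed: int, cost_per_port: float) -> float:
    """Port cost only; assumes cost scales linearly with provisioned ports."""
    return provisioned_ports(ports_needed) * cost_per_port


if __name__ == "__main__":
    needed = 100
    print(provisioned_ports(needed))             # ports actually provisioned
    print(fabric_cost(needed, COST_TODAY))       # at today's per-port cost
    print(fabric_cost(needed, COST_TARGET))      # at the commercial target
```

A requirement of 100 ports rounds up to 112 provisioned ports, and the quoted per-port figures imply a cost reduction of roughly 33x.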




#### **10 Terabit/sec form factor**



- 1.28 Gigapackets/sec in 64 port switch module
- Cell-oriented error correction supports 10<sup>-21</sup> BER
- Goal: 10 Tbit/sec in a single stage module @ first commercial release
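To put the bit error rate and packet rate in perspective, the sketch below converts them into a mean time between bit errors and a per-port packet rate. It assumes the 10<sup>-21</sup> BER applies at the full 10 Tbit/sec goal rate and that load is spread evenly over the 64 ports; neither assumption is stated in the slides, and the function names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def mean_seconds_between_bit_errors(bit_rate: float, ber: float) -> float:
    """Expected time between bit errors: 1 / (bits per second * error rate)."""
    return 1.0 / (bit_rate * ber)


def per_port_packet_rate(total_packets_per_s: float, ports: int) -> float:
    """Packet rate per port, assuming an even spread across all ports."""
    return total_packets_per_s / ports


if __name__ == "__main__":
    seconds = mean_seconds_between_bit_errors(10e12, 1e-21)
    print(f"{seconds / SECONDS_PER_YEAR:.2f} years between bit errors")
    print(f"{per_port_packet_rate(1.28e9, 64):.0f} packets/s per port")
```

Under these assumptions a fully loaded 10 Tbit/sec module would see one bit error roughly every three years, and each of the 64 ports would carry 20 million packets per second.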

## InfiniBand Roadmap



## CPU/Memory Stack

- R8051 CPU
  - XXX MHz operation; 140MHz Lab test (VDD High)
  - 220MHz Memory interface
- IEEE 754 Floating point coprocessor
- 32 bit Integer coprocessor
- 2 UARTs, Int. Cont., 3 Timers, ...
- Crypto functions
- 128KBytes/layer main memory
- Codes Running...
- Completely synthesized, placed and routed in 3D with standard Cadence tools. Runs slightly better than predicted by models and tools. We are working with this currently.



## 3D FPGAs

- Why 3D?
- Why FPGAs?
- Technology Elements
- Project Structure
- Project Objectives
- Cost and Schedule





# Why 3D?

- 2D chip performance is limited by chip size
  - Large chip size means defects, wire delay
- 2D function is limited by chip size
  - Even with better lithography, there's never enough room for memory
  - I/O is a perpetual problem
- 2D systems require high package count
  - Board real estate is precious for ALL applications





## Why FPGAs?

- ASICs and Processors are too expensive to prototype
- Present FPGAs are too slow
- Better potential for rad hardening
- Properly designed 3D FPGA can implement complete SoC.





# Technology Elements

- 3D <u>wafer-scale</u> bonding technology is perfected and <u>DEMONSTRATED</u>
- FPGA design methodology is well known, readily adaptable to 3D
- FPGA software is available
  - This is the most difficult problem



### Interlayer Interconnect








#### **Conceptual Device Structure**



### **Interconnect Limitations**

- The interconnect is the Achilles' heel of present FPGAs
- Major source of performance restriction, both on-chip and off-chip
- Resource limited, delay inducing
  - Wire length
  - Capacitance
  - Limited I/O resources





## Interconnect Strategy

- Examine Supercomputer Interconnect Architectures
- Review academic approaches to FPGA interconnects
- Examine new IC interconnect concepts
- Evaluate interconnect performance in FPGA use
- Adopt elements that work, as demonstrated in simulation





## **Project Structure**

- Phase I: Architecture/Proof of Principle
  - Simulated interconnect strategy
  - Test cells for each layer
- Phase II: Small Working Prototype
  - Constructed at final line width
  - Assembled with S-O-A chip stacking
- Phase III: Scaled-up Limited Volume Production
  - Assembled with full wafer bonding





# Project Objectives: I

- USABLE 1.25-2.0 GHz FPGA
- Targeted to processor architectures
- Provision for future SRAM/DRAM layers
- Minimum 4X I/O Resources of conventional parts





# Project Objectives: II

- Produce a USABLE prototype 3D FPGA
- Result in 24-36 months
- Phased with defined milestones
- Fully US project; no foreign design participation



## Conclusions

- DarkHorse pushed many design envelopes
  - It is the I/O, NOT FLOPS (ITIOS)
  - 3D Memories
    - Self Healing
    - 4-8TB/S Memory BW
  - 3D Stacking (S/MOC)
  - 3D FPGA/CAM/OC-768 Device Designs
  - Optical Interconnects
    - Networking (Optical & Copper)
    - Total Optics off-chip
    - Optics on board (Chip to Chip)
    - OSMOSIS (see refs)
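The "I/O, not FLOPS" point can be made concrete with the per-PE figures quoted earlier in the deck (1 TF, 128/256 GB, 4-8 TB/s vertical internal bandwidth). The sketch below computes the resulting memory bytes-per-FLOP balance and the time to stream a PE's entire memory once. It assumes decimal units and that the quoted bandwidth is sustained; the names are illustrative.

```python
PE_FLOPS = 1e12                  # 1 TF per PE
MEM_BW_BYTES_PER_S = (4e12, 8e12)  # 4-8 TB/s memory bandwidth
PE_MEM_BYTES = (128e9, 256e9)    # 128/256 GB per PE


def bytes_per_flop(bandwidth: float, flops: float) -> float:
    """Machine balance: bytes deliverable per floating-point operation."""
    return bandwidth / flops


def sweep_time(mem_bytes: float, bandwidth: float) -> float:
    """Seconds to read the whole memory once at the given bandwidth."""
    return mem_bytes / bandwidth


if __name__ == "__main__":
    for bw in MEM_BW_BYTES_PER_S:
        print(f"{bw / 1e12:.0f} TB/s -> {bytes_per_flop(bw, PE_FLOPS):.0f} B/F")
    # Best and worst cases for one full pass over PE memory.
    print(sweep_time(PE_MEM_BYTES[0], MEM_BW_BYTES_PER_S[1]))
    print(sweep_time(PE_MEM_BYTES[1], MEM_BW_BYTES_PER_S[0]))
```

Under these assumptions each PE has 4 to 8 bytes of memory bandwidth per FLOP, and a full pass over its memory takes between 16 ms and 64 ms.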



## Conclusions (cont)

- Interconnects
  - 12X-QDR InfiniBand
  - 32X-ODR InfiniBand (Future)
- 3D Memories will improve Power/Performance
  - Non-DRAM (Hybrid)
  - Some new memory technologies (around the corner)
  - S/G
- Currently modeling codes against DH design
  - Some new algorithms (Sparse)
  - Libraries
  - Potentially new language approaches
    - PGAS/DGAS
  - Future HW/SW designs





## Conclusions (cont)

- The design is feasible
  - Most of the sub-components exist or have been proven
  - Cooling technologies exist for >1KW socket
    - Liquid Metal
    - Microchannel Cooling
    - Liquid Immersion
- Work has started on the design of the follow-on
  - Pegasus (~800PF 1EF)
- Design expandable
  - ~10TF/Socket
  - ~10TF/4sq in.
- DOE/ASC does not like to fund disruptive technologies

