

#### The Impact of Multicore on Mathematical Software

#### Jack Dongarra INNOVATIVE COMPORING LABORATORY

University of Tennessee Oak Ridge National Laboratory



- Top500 Results
- Four Important Concepts that Will Effect Math Software
  - Effective Use of Many-Core
  - Exploiting Mixed Precision in Our Numerical Computations
  - Self Adapting / Auto Tuning of Software
  - Fault Tolerant Algorithms





#### H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful Computers in the World
- Yardstick: Rmax from LINPACK MPP

Ax=b, dense problem

- Updated twice a year SC'xy in the States in November Meeting in Germany in June
  - All data available from www.top500.org



















GigE + Infiniband + Myrinet = 76%

# Increasing CPU Performance:

Increasing the number of gates into a tight knot and decreasing the cycle time of the processor





The New york Times

#### Technology

| WORLD | U.S.  | N.Y. / REGIO | BUSINESS   | TECHNOL   | OGY | SCIENCE  | HEALTH    | SPORTS  | OPINION    |
|-------|-------|--------------|------------|-----------|-----|----------|-----------|---------|------------|
| CAMCO | RDERS | CAMERAS      | CELLPHONES | COMPUTERS | HAN | DHELDS H | OME VIDEO | MUSIC P | ERIPHERALS |

- Intel's 80
  Core chip
  - Tflop/s
  - 62 Watts
  - 1.2 TB/s internal BW

To Future Stacked Memory

Router

South neighbor

West neighbo North neighbor

Compute

East

eighbor

Element

One tile

#### Intel Prototype May Herald a New Age of Processing

By JOHN MARKOFF Published: February 12, 2007

SAN FRANCISCO, Feb. 11 — Intel will demonstrate on Monday an experimental computer chip with 80 separate processing engines, or cores, that company executives say provides a model for commercial chips that will be used widely in standard desktop, laptop and server computers within five years.





The Teraflop Chip has 80 separate processing engines and takes advantage of manufacturing technology that Intel introduced last month.

The new processor, which the company first described as a Teraflop Chip at a conference last year, will be detailed in a technical paper to be presented on the opening day of the International Solid States Circuits Conference, beginning here on Monday.

While the chip is not compatible with Intel's current chips, the company said it had already begun design work on a commercial version that would essentially have dozens or even hundreds of Intel-compatible microprocessors laid out in a tiled pattern on a single chip.



All Large Core



## Major Changes to Software

- Must rethink the design of our software
  - Another disruptive technology
    - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
- Numerical libraries for example will change
  - For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this

### A New Generation of Software:

| Algorithm                                      | s follow hardware evolution | on in time                              |
|------------------------------------------------|-----------------------------|-----------------------------------------|
| LINPACK (80's)<br>(Vector operations)          |                             | Rely on<br>- Level-1 BLAS<br>operations |
| LAPACK (90's)<br>(Blocking, cache<br>friendly) |                             | Rely on<br>- Level-3 BLAS<br>operations |

#### A New Generation of Software:

Parallel Linear Algebra Software for Multicore Architectures (PLASMA)



Those new algorithms

- have a very low granularity, they scale very well (multicore, petascale computing, ... )
- removes a lots of dependencies among the tasks, (multicore, distributed computing)
- avoid latency (distributed computing, out-of-core)
- rely on fast kernels

Those new algorithms need new kernels and rely on efficient scheduling algorithms.









#### Fork-Join vs. Dynamic Execution



Fork-Join – parallel BLAS





Experiments on Intel's Quad Core Clovertown with 2 Sockets w/ 8 Treads

#### Fork-Join vs. Dynamic Execution





Fork-Join – parallel BLAS



DAG-based – dynamic scheduling





Experiments on Intel's Quad Core Clovertown with 2 Sockets w/ 8 Treads



#### With the Hype on Cell & PS3 We Became Interested



- The PlayStation 3's CPU based on a "<u>Cell</u>" processor
- Each Cell contains a Power PC processor and 8 SPEs. (SPE is processing unit, SPE: SPU + DMA engine)
  - An SPE is a self contained vector processor which acts independently from the others.
    - 4 way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHZ
  - 204.8 Gflop/s peak!
  - The catch is that this is for 32 bit floating point; (Single Precision SP)
  - And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
    - Divide SP peak by 14; factor of 2 because of DP and 7 because of latency issues





#### Performance of Single Precision on Conventional Processors

- Realized have the similar situation on our commodity processors.
  - That is, SP is 2X as fast as DP on many systems
- The Intel Pentium and AMD Opteron have SSE2
  - 2 flops/cycle DP
  - 4 flops/cycle SP
- IBM PowerPC has AltiVec
  - 8 flops/cycle SP
  - 4 flops/cycle DP
    - No DP on AltiVec

|                          | Size | SGEMM/<br>DGEMM | Size | SGEMV/<br>DGEMV |
|--------------------------|------|-----------------|------|-----------------|
| AMD Opteron<br>246       | 3000 | 2.00            | 5000 | 1.70            |
| UltraSparc-lle           | 3000 | 1.64            | 5000 | 1.66            |
| Intel PIII<br>Coppermine | 3000 | 2.03            | 5000 | 2.09            |
| PowerPC 970              | 3000 | 2.04            | 5000 | 1.44            |
| Intel<br>Woodcrest       | 3000 | 1.81            | 5000 | 2.18            |
| Intel XEON               | 3000 | 2.04            | 5000 | 1.82            |
| Intel Centrino<br>Duo    | 3000 | 2.71            | 5000 | 2.21            |

Single precision is faster because:

- Higher parallelism in SSE/vector units
- Reduced data motion
- Higher locality in cache



#### 32 or 64 bit Floating Point Precision?

- A long time ago 32 bit floating point was used
  - Still used in scientific apps but limited
- Most apps use 64 bit floating point
  - Accumulation of round off error
    - A 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10<sup>18</sup>) ops.
  - Ill conditioned problems
  - IEEE SP exponent bits too few (8 bits, 10<sup>±38</sup>)
  - Critical sections need higher precision
    - Sometimes need extended precision (128 bit fl pt)
  - However some can get by with 32 bit fl pt in some parts
- Mixed precision a possibility
  - Approximate in lower precision and then refine or improve solution to high precision.



#### Idea Goes Something Like This...

- Exploit 32 bit floating point as much as possible.
  - Especially for the bulk of the computation
- Correct or update the solution with selective use of 64 bit floating point to provide a refined results
- Intuitively:
  - Compute a 32 bit result,
  - Calculate a correction to 32 bit result using selected higher precision and,
  - Perform the update of the 32 bit results with the correction using high precision.

#### Mixed-Precision Iterative Refinement

Iterative refinement for dense systems, Ax = b, can work this way.

| L U = Iu(A)               | SINGLE | <b>O(n</b> <sup>3</sup> ) |
|---------------------------|--------|---------------------------|
| x = L\(U\b)               | SINGLE | 0(n²)                     |
| r = b - Ax                | DOUBLE | 0(n²)                     |
| WHILE    r    not small e | nough  |                           |
| z = L\(U\r)               | SINGLE | 0(n²)                     |
| x = x + z                 | DOUBLE | <b>O(n</b> <sup>1</sup> ) |
| r = b - Ax                | DOUBLE | <b>O(n</b> <sup>2</sup> ) |
| END                       |        |                           |

#### Mixed-Precision Iterative Refinement

Iterative refinement for dense systems, Ax = b, can work this way.

| L U = lu(A)               | SINGLE | <b>O(</b> n <sup>3</sup> ) |
|---------------------------|--------|----------------------------|
| x = L\(U\b)               | SINGLE | <b>O(n²)</b>               |
| r = b - Ax                | DOUBLE | <b>O(n</b> ²)              |
| WHILE    r    not small e | nough  |                            |
| z = L\(U\r)               | SINGLE | <b>O(n<sup>2</sup>)</b>    |
| x = x + z                 | DOUBLE | <b>O(</b> n <sup>1</sup> ) |
| r = b - Ax                | DOUBLE | <b>O(n</b> ²)              |
| END                       |        |                            |

- Wilkinson, Moler, Stewart, & Higham provide error bound for SP fl pt results when using DP fl pt.
- It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
  - Requires extra storage, total is 1.5 times normal;
  - O(n<sup>3</sup>) work is done in lower precision
  - O(n<sup>2</sup>) work is done in high precision
  - Problems if the matrix is ill-conditioned in sp; O(10<sup>8</sup>)

## Results for Multiple Precision Iterative Refinement



|    | Architecture (BLAS)                 |
|----|-------------------------------------|
| 1  | Intel Pentium III Coppermine (Goto) |
| 2  | Intel Pentium III Katmai (Goto)     |
| 3  | Sun UltraSPARC IIe (Sunperf)        |
| 4  | Intel Pentium IV Prescott (Goto)    |
| 5  | Intel Pentium IV-M Northwood (Goto) |
| 6  | AMD Opteron (Goto)                  |
| 7  | Cray X1 (libsci)                    |
| 8  | IBM Power PC G5 (2.7 GHz) (VecLib)  |
| 9  | Compaq Alpha EV6 (CXML)             |
| 10 | IBM SP Power3 (ESSL)                |
| 11 | SGI Octane (ATLAS)                  |

New routines in LAPACK that do this for LU and  $\mathsf{LL}^{\mathsf{T}}$ 



- Power PC at 3.2 GHz
  - DGEMM at 5 Gflop/s
  - Altivec peak at 25.6 Gflop/s
    - Achieved 10 Gflop/s SGEMM
- 8 SPUs
  - 204.8 Gflop/s peak!
  - The catch is that this is for 32 bit floating point; (Single Precision SP)
  - And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
    - Divide SP peak by 14; factor of 2 because of DP and 7 because of latency issues

















## Cholesky on the Cell, Ax=b, $A=A^T$ , $x^TAx > 0$



# Cholesky - Using 2 Cell Chips





30

# Intriguing Potential

- Exploit lower precision as much as possible
  - Payoff in performance
    - Faster floating point
    - Less data to move
- Automatically switch between SP and DP to match the desired accuracy
  - Compute solution in SP and then a correction to the solution in DP
- Potential for GPU, FPGA, special purpose processors
  - What about 16 bit floating point?
    - Use as little you can get away with and improve the accuracy
- Applies to sparse direct and iterative linear systems and Eigenvalue, optimization problems, where Newton's method is used.  $f(x_i) = x_i - \frac{f(x_i)}{f'(x_i)}$

$$x_{i+1} - x_i = -\frac{f(x_i)}{f'(x_i)}$$

Correction = -A(b - Ax)

Conclusions

- For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
- This strategy needs to be rebalanced barriers to progress are increasingly on the software side.
- Moreover, the return on investment is more favorable to software.
  - Hardware has a half-life measured in years, while software has a half-life measured in decades.
- High Performance Ecosystem out of balance
  - Hardware, OS, Compilers, Software, Algorithms, Applications
    - No Moore's Law for software, algorithms and applications



Alfredo Buttari, UTK Julien Langou, UColorado Julie Langou, UTK Piotr Luszczek, MathWorks Jakub Kurzak, UTK Stan Tomov, UTK







Advertising Programs - Business Solutions - About Google

©2007 Google