



# Overview of applications performance on Marconi

Piero Lanucara (p.lanucara@cineca.it), SCAI User Support team









### We would like to:

- Try to summarize the technological trend via benchmarks...
- ...and use them to understand application performance issues, limitations and best practices on current (Broadwell) and future (KNL) architectures

#### **CAVEAT**

- ✓ All measurements were taken using HW at CINECA
- ✓ Sometimes the comparison is "unfair", e.g.:
  - Sandy Bridge HW used was very "powerful", HPC oriented
  - Ivy Bridge HW used was devoted to "data crunching", not HPC oriented





#### Intel CPU roadmap: two step evolution

- Tock phase:
  - ✓ New architecture
  - ✓ New instructions (ISA)
- Tick phase:
  - ✓ Keep previous architecture
  - ✓ New technological step (e.g. Broadwell  $\rightarrow$  14nm)
  - ✓ Core "optimization"
  - ✓ Usually increasing core number, keeping Thermal Design Power (TDP) constant







- Westmere (tick, a.k.a. plx.cineca.it)
  - Intel(R) Xeon(R) CPU E5645 @2.40GHz, 6 Core per CPU
  - Only serial performance figure
- Sandy Bridge (tock, a.k.a. eurora.cineca.it)
  - Intel(R) Xeon(R) CPU E5-2687W 0 @3.10GHz, 8 core per CPU
  - Serial/Node performance figure
- Ivy Bridge (tick, a.k.a. pico.cineca.it)
  - Intel(R) Xeon(R) CPU E5-2670 v2 @2.50GHz, 10 core per CPU
  - Serial/Node/Cluster performance
  - Infiniband FDR
- Haswell (tock, a.k.a. galileo.cineca.it)
  - Intel(R) Xeon(R) CPU E5-2630 v3 @2.40GHz, 8 core per CPU
  - Serial/Node/Cluster performance
  - Infiniband QDR
- Broadwell (tick)
  - Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 22 core per CPU
  - Serial/Node performance figure

#### Marconi: Intel E5-2697 v4 Broadwell, 18 cores @ 2.3GHz.







#### Benchmarks



#### Linpack Benchmark from Intel MKL

GFLOPS







# Performance

- Empirically tested on different HW at CINECA
  - LINPACK
    - Intel optimized benchmark, rel. 11.3
    - Stresses floating point performance, no bandwidth limitation
  - STREAM
    - Rel. 3.6, OMP version
    - Bandwidth, no Floating point limitation
  - HPCG
    - Intel optimized benchmark, rel. 11.3
    - CFD oriented benchmark with Bandwidth Limitation







#### Best result obtained, single core

✓ 5.6x increase in 6 years (Q1-2010, Q1-2016)
Clock annotations on the plot: 3.1 GHz, 2.5 GHz









- Best result obtained (using intel/gnu), single core
- Only a 2.6x speed-up in 6 years









# Roofline Model: Arithmetic Intensity



- Which is the typical application arithmetic intensity?
- About 0.1, maybe less
- It depends on application domain, solver, method,...
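The roofline bound behind this question can be sketched in a few lines of Python. This is only an illustration: the function name is ours, and the peak and bandwidth values are placeholders, not figures measured for these slides.

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Attainable performance (GFLOP/s) under the roofline model.

    intensity     -- arithmetic intensity, in FLOP per byte moved
    peak_gflops   -- peak floating point rate (e.g. taken from LINPACK)
    bandwidth_gbs -- sustained memory bandwidth in GB/s (e.g. from STREAM)
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


# Placeholder machine figures (assumptions, NOT measured values).
PEAK = 30.0       # GFLOP/s
BANDWIDTH = 10.0  # GB/s

# The "ridge point" is the intensity at which the two limits meet.
ridge = PEAK / BANDWIDTH

for ai in (0.1, 0.5, 1.0, ridge, 10.0):
    bound = roofline_gflops(ai, PEAK, BANDWIDTH)
    regime = "bandwidth-bound" if ai < ridge else "compute-bound"
    print(f"AI = {ai:5.2f} FLOP/byte -> {bound:6.2f} GFLOP/s ({regime})")
```

With an intensity around 0.1, the bound is set entirely by bandwidth: under these placeholder figures the code could reach only a small fraction of the peak rate.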







### Roofline Model: serial figure

 Using the figures obtained on different HW (LINPACK, STREAM)



GFLOP vs Computational Intensity (single core)







- Conjugate Gradient Benchmark (http://hpcg-benchmark.org/)
- Intel benchmark: Westmere not supported
- 2x speed-up only for Broadwell







# HPCG parallel figure

Best performance with #tasks and #threads









Best result obtained: Marconi (1 MPI, 36 threads)



GFLOPs





# LINPACK parallel figure/2

- Best result obtained
- Efficiency = Parallel\_Flops/(#core\*Serial\_Flops)
  - $1 \rightarrow$  Linear speed-up
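The efficiency definition above can be written as a small helper. This is a minimal sketch; the numbers in the example are hypothetical and are not taken from the plots.

```python
def parallel_efficiency(parallel_rate, n_cores, serial_rate):
    """Efficiency = parallel_rate / (n_cores * serial_rate).

    A value of 1 means linear speed-up. The two rates must use the
    same unit (e.g. GFLOP/s for LINPACK).
    """
    if n_cores <= 0 or serial_rate <= 0:
        raise ValueError("n_cores and serial_rate must be positive")
    return parallel_rate / (n_cores * serial_rate)


# Hypothetical figures, for illustration only.
serial = 20.0     # GFLOP/s measured on one core
parallel = 576.0  # GFLOP/s measured on a full 36-core node
eff = parallel_efficiency(parallel, 36, serial)
print(f"efficiency = {eff:.2f}")
```

The same ratio applies unchanged to bandwidth figures, with Parallel\_BW and Serial\_BW in place of the FLOP rates.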



#### Efficiency





### Marconi – A1 HPL

#### Full system Linpack:

- 1 MPI task per node
- Performance range: 1.6 to 1.7 PFlop/s
- Max performance: 1.72389 PFlop/s with Turbo OFF
- Turbo ON -> throttling



#### June 2016: Number 46 in the TOP500 list

| T/V      | N       | NB  | P  | Q  | Time (s) | Gflops      |
|----------|---------|-----|----|----|----------|-------------|
| WC06C2C4 | 4320000 | 192 | 30 | 50 | 31178.23 | 1.72389e+06 |

- HPL\_pdgesv() end time: Tue May 31 01:22:46 2016
- Residual check: $\|Ax-b\|_\infty / (\mathrm{eps} \cdot (\|A\|_\infty \|x\|_\infty + \|b\|_\infty) \cdot N) = 0.0007856$, PASSED
- Finished 1 test: 1 completed and passed residual checks, 0 completed and failed residual checks
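As a sanity check, the reported rate can be recomputed from the problem size (4320000) and the run time (31178.23 s) in the output above. The sketch assumes the usual HPL operation count for an N x N solve, $\frac{2}{3}N^3 + \frac{3}{2}N^2$; the function name is ours.

```python
def hpl_gflops(n, seconds):
    """GFLOP/s of an N x N HPL solve, assuming the operation count
    2/3*N^3 + 3/2*N^2 (factorisation plus solve)."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1.0e9


N = 4_320_000      # problem size from the HPL output
TIME_S = 31178.23  # wall time in seconds from the HPL output

gflops = hpl_gflops(N, TIME_S)
print(f"{gflops:.5e} GFLOP/s, i.e. {gflops / 1.0e6:.2f} PFLOP/s")
```

The result agrees with the 1.72389e+06 Gflops in the output to within rounding, which also confirms that the two numbers following the test name are the problem size and the block size.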








### STREAM parallel figure



#### MB/s



# STREAM parallel figure: Marconi







# STREAM parallel figure/2

- Best result obtained (intel/gnu compiler)
- Efficiency = Parallel\_BW/(#core\*Serial\_BW)
  - $1 \rightarrow$  Linear Speed-up



#### Efficiency



# STREAM parallel figure/2: Marconi

- Best result obtained (intel/gnu compiler)
- Efficiency = Parallel\_BW/(#core\*Serial\_BW)
  - $1 \rightarrow$  Linear Speed-up







# Roofline: parallel graph

 Using the figures obtained on different HW (LINPACK, STREAM)







# Intel MPI Benchmarks@Marconi

- Preliminary investigation: try to check network performance (OPA)
- Different benchmarks (PingPong, send-recv, collectives...) and message sizes

| PingPong   | MB/s (maximum size) |
|------------|---------------------|
| Same node  | 11305               |
| Close node | 10904               |
| Far node   | 11246               |

- 1 or 2 nodes
- Same node: processes on the same node
- Close node: processes on different nodes but on the same edge switch
- Far node: processes on different nodes and different edge switches (must use the Director OPA switch)





### Intel MPI Benchmarks@Marconi

- Preliminary investigation: try to check network performance (OPA)
- Different benchmarks (PingPong, send-recv, collectives...) and message sizes

| AlltoAll   | T_average (maximum size, microsec.) |
|------------|-------------------------------------|
| Same node  | 962                                 |
| Close node | 803                                 |
| Far node   | 804                                 |

- 1 or 2 nodes
- Same node: processes on the same node
- Close node: processes on different nodes but on the same edge switch
- Far node: processes on different nodes and different edge switches (must use the Director OPA switch)





### **Computational Fluid Dynamics**







# Roofline Model: LBM

- TLBM: hand-made code (3D Multiblock-MPI/OpenMP version)
- Three-step serial optimization (an example):
- 1. Move+Streaming: Computational intensity $\rightarrow$ 0.36
  - Playing with compiler flags (-O1, -O2, -O3, -fast)
- 2. Fused: Computational intensity $\rightarrow$ 0.7
  - Playing with compiler flags (-O1, -O2, -O3, -fast)
- 3. Fused+single precision: Computational intensity $\rightarrow$ 1.4
  - Playing with compiler flags (-O1, -O2, -O3, -fast)
- Test case:
  - 3D driven cavity
  - 128^3
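One way to read the three intensities is as the same floating point work divided by progressively less memory traffic. The sketch below is a back-of-the-envelope model, not the actual code: the stencil size, the flop count per site and the assumption that the unfused version makes two full passes over the data are all hypothetical, chosen only to show why each step roughly doubles the intensity.

```python
def arithmetic_intensity(flops_per_site, values_moved_per_site, bytes_per_value):
    """FLOP per byte for one lattice-site update."""
    return flops_per_site / (values_moved_per_site * bytes_per_value)


# Hypothetical kernel figures (assumptions, for illustration only).
Q = 19                # populations per lattice site (a D3Q19-like stencil)
FLOPS_PER_SITE = 213  # floating point operations per site update

# 1. Two separate passes: each pass reads Q and writes Q values.
two_pass = arithmetic_intensity(FLOPS_PER_SITE, 4 * Q, 8)

# 2. Fused kernel: a single pass reads Q and writes Q values.
fused = arithmetic_intensity(FLOPS_PER_SITE, 2 * Q, 8)

# 3. Fused kernel in single precision: 4 bytes per value instead of 8.
fused_sp = arithmetic_intensity(FLOPS_PER_SITE, 2 * Q, 4)

for name, ai in (("two passes", two_pass), ("fused", fused),
                 ("fused + single precision", fused_sp)):
    print(f"{name:26s} {ai:.2f} FLOP/byte")
```

Under these assumptions, fusing halves the number of values moved and single precision halves the bytes per value, so the intensity doubles at each step while the flop count stays the same.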







# Roofline Model: LBM/2

- 1. Move+Streaming: Computational intensity  $\rightarrow$  0.36 (2.2x)
- 2. Fused: Computational intensity  $\rightarrow$  0.7 (1.8x)
- 3. Fused+single precision: Computational intensity  $\rightarrow$  1.4 (2.8x)







### Concurrent jobs

- LBM code, 3D Driven cavity, Mean value
- From 1 to n equivalent concurrent jobs









# Intel Turbo mode

- i.e. clock increase
- Starting from Haswell, the increase depends on the number of cores involved
- For CINECA Haswell:

| Active cores | Turbo frequency |
|--------------|-----------------|
| 1, 2         | 3.2 GHz         |
| 3            | 3.0 GHz         |
| 4            | 2.9 GHz         |
| 5            | 2.8 GHz         |
| 6            | 2.7 GHz         |
| 7            | 2.6 GHz         |
| 8            | 2.6 GHz         |

Now it's hard to measure an "honest" speed-up!
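To put a number on this, the sketch below uses the turbo frequencies from the table above to compute the best speed-up a perfectly parallel code could show relative to a single-core run. It assumes performance scales only with core count times clock frequency (a purely compute-bound code), which is a simplification; the function name is ours.

```python
# Turbo frequency in GHz by number of active cores (from the table above).
TURBO_GHZ = {1: 3.2, 2: 3.2, 3: 3.0, 4: 2.9, 5: 2.8, 6: 2.7, 7: 2.6, 8: 2.6}


def clock_limited_speedup(n_cores, turbo_ghz=TURBO_GHZ):
    """Best-case speed-up on n_cores relative to 1 core, assuming
    performance is proportional to cores * clock frequency."""
    return n_cores * turbo_ghz[n_cores] / turbo_ghz[1]


for n in sorted(TURBO_GHZ):
    s = clock_limited_speedup(n)
    print(f"{n} cores: ideal {n}x, clock-limited {s:.2f}x ({s / n:.0%})")
```

Even a code with no parallel overhead at all would appear to lose efficiency on the full socket, simply because the single-core reference run was measured at a higher clock.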





# Turbo mode & Concurrent jobs

LBM code, 3D Driven cavity. Mean value, Broadwell











### **Molecular Dynamics**







### Using MD on Marconi – Phase I

#### Phase 1: Broadwell nodes

- Similar to Haswell cores present on Galileo.
- Expect only a small difference in single core performance wrt Galileo, **but a big difference** compared to Fermi.
- Cores/node: 36
- Memory/node: 128 GB
- Life much easier for MD programmers and users.





### MD Broadwell benchmarks

#### Gromacs DPPC (1 core)

| Computer system          | ns/day | Speedup wrt Fermi |
|--------------------------|--------|-------------------|
| Haswell (5.0.4, Galileo) | 1.364  | 13.64             |
| Fermi (5.0.4)            | 0.100  | 1.00              |
| Broadwell (5.1.2)        | 1.977  | 19.77             |

#### NAMD APOA1 (10 6 tasks)

| Computer system         | ns/day | Speedup wrt Fermi |
|-------------------------|--------|-------------------|
| Haswell (2.10, Galileo) | 1.425  | 7.27              |
| Fermi (2.10)            | 0.196  | 1.00              |
| Broadwell (2.11)        | 1.516  | 7.73              |

Based on a 1-node Broadwell partition (40 cores, hyperthreading on).







### Using MD on Marconi – Phase II

- Programmers must utilise vectorisation (SIMD) and OpenMP threads, and possibly the fast memory of KNL.
- For the user, the MD experience will depend on how well software developers are able to exploit the KNL architecture. Some examples:
  - **NAMD.** Already reasonable results with KNC. According to the NAMD mailing list, much effort is being devoted to the KNL version.
  - **GROMACS.** Developers didn't really bother with KNC Xeon Phis (no offload version and poor symmetric mode). But since KNL is standalone and GROMACS can use OpenMP threads (which are advisable on KNL), it should run well on KNL. **Also GROMACS has good SIMD optimisation**.
  - **Amber.** Already supports KNC, and with OpenMP it should probably be OK on KNL.
- Worth noting that up to now KNC MICs haven't been widely supported by software developers. But this should change for KNL.







#### **Material Science**







### Preliminary QE benchmarks



| QE benchmark | Galileo       | Marconi        |
|--------------|---------------|----------------|
| W64@64pe     | 13.50s WALL   | 10.76s WALL    |
| W256@1024    | 37.38s WALL   | 38.83s WALL*   |
| W256@1024    | 37.38s WALL   | 28.23s WALL**  |
| W256@1024    | 37.38s WALL   | 30.81s WALL    |
| W256@2048    |               | 22.79s WALL*** |
| W256@512     |               | 45.05s WALL    |
| W256@256     | 1m 7.78s WALL | 1m11.62s WALL  |

\* Without tuning parallelization parameters

\*\* 32 proc per node

\*\*\* 1024-MPI x 2-OpenMP
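The Marconi column for W256 can be read as a strong scaling curve. The sketch below takes the wall times from the table above (using the unfootnoted 1024 run and converting 1m11.62s to 71.62 s) and computes speed-up and efficiency relative to the smallest run; the function name is ours. Note that the 2048 entry is a hybrid 1024-MPI x 2-OpenMP run, so it is not strictly comparable with the others.

```python
# Marconi wall times in seconds for W256, keyed by the count after "@".
WALL_S = {256: 71.62, 512: 45.05, 1024: 30.81, 2048: 22.79}


def strong_scaling(wall_s, base=None):
    """Return (count, speedup, efficiency) tuples relative to `base`
    (the smallest count by default)."""
    base = min(wall_s) if base is None else base
    t0 = wall_s[base]
    rows = []
    for p in sorted(wall_s):
        speedup = t0 / wall_s[p]
        efficiency = speedup / (p / base)
        rows.append((p, speedup, efficiency))
    return rows


rows = strong_scaling(WALL_S)
for p, speedup, efficiency in rows:
    print(f"W256@{p:<5d} speed-up {speedup:4.2f}x, efficiency {efficiency:.0%}")
```

Efficiency here is measured against the 256 run rather than a serial run, so it says how well the benchmark keeps scaling, not how efficient the 256 run itself is.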









### **Global Seismology**





# Global seismology activity on Marconi – Phase II

Global seismology developers must utilise vectorisation (SIMD) and OpenMP threads, and possibly the fast memory of KNL.

For the user, the global seismology experience will depend on how well software developers are able to exploit the KNL architecture:

**SPECFEM3D\_GLOBE.** Already reasonable results with KNC ("native" and "offload" versions in the framework of the IPCC@CINECA activity). Good amount of vectorisation (enabled with the FORCE\_VECTORIZATION preprocessing flag) and SIMD optimization suitable for KNC and future KNL. Scales to a high number of OpenMP threads (up to more than 60 on [...]).

Worth noting that up to now KNC MICs haven't been widely supported by global seismology software developers and users. A remarkable exception is the SPECFEM3D\_GLOBE software in the CIG repo, where the "native" version is maintained and tested. Again, this should be fine for the KNL startup.







### Global seismology benchmarks



SPECFEM3D\_GLOBE Regional\_MiddleEast test case: forward simulation

| Computer<br>system   | e.t. (sec.) | Speedup wrt<br>Haswell |
|----------------------|-------------|------------------------|
| Haswell<br>(Galileo) | 570.20      | 1.00                   |
| KNC<br>(Galileo)     | 430.35      | 1.32                   |

Based on a 4-node Galileo partition (16 MPI processes, 4 and 60 OpenMP threads on Haswell and KNC respectively).

SPECFEM3D\_GLOBE Regional\_MiddleEast test case: no vectorisation

| Computer<br>system   | e.t. (sec.) | Slowdown factor<br>wrt vectorised |
|----------------------|-------------|-----------------------------------|
| Haswell<br>(Galileo) | 687.14      | 1.20                              |
| KNC<br>(Galileo)     | 848.12      | 1.97 (about a 2x slowdown)        |

The impact of vectorisation on Haswell and KNC, respectively.
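The factors in the two tables follow directly from the elapsed times. A short sketch of the arithmetic, with variable names of our choosing:

```python
# Elapsed times in seconds, taken from the two tables above.
VECTORISED = {"Haswell": 570.20, "KNC": 430.35}
NOT_VECTORISED = {"Haswell": 687.14, "KNC": 848.12}

# Speed-up of KNC over Haswell for the vectorised forward simulation.
speedup_knc = VECTORISED["Haswell"] / VECTORISED["KNC"]

# Slowdown when vectorisation is switched off, per architecture.
slowdown = {name: NOT_VECTORISED[name] / VECTORISED[name]
            for name in VECTORISED}

print(f"KNC speed-up wrt Haswell: {speedup_knc:.3f}")
for name, factor in slowdown.items():
    print(f"{name} slowdown without vectorisation: {factor:.3f}")
```

The computed ratios match the tabulated values to within rounding. They also show that, without vectorisation, KNC becomes slower than Haswell on this test case (848.12 s against 687.14 s).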





### Conclusions



- Marconi A1 single core: moderate improvements over the years... but a big improvement compared to Fermi.
- Target is always LINPACK performance.
- Bandwidth grows more slowly than expected.
- High expectations for Marconi A2 KNL performance.
- KNC paves the way for increasing performance...
- ...try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation) and improve data locality (new chance: use on-package memory)







### Credits

Giorgio Amati, Ivan Spisso (Benchmarks, CFD)
Carlo Cavazzoni (Benchmarks, Material Science)
Andrew Emerson (Molecular Dynamics)
Vittorio Ruggiero (Global Seismology)





### Some Links



- TICK-TOCK: http://www.intel.com/content/www/us/en/silicon-innovations/intel-tick-tock-model-general.html
- WESTMERE: http://ark.intel.com/it/products/family/28144/Intel-Xeon-Processor-5000-Sequence#@Server
- SANDY BRIDGE: http://ark.intel.com/it/products/family/59138/Intel-Xeon-Processor-E5-Family#@Server
- IVY BRIDGE: http://ark.intel.com/it/products/family/78582/Intel-Xeon-Processor-E5-v2-Family#@Server
- HASWELL: http://ark.intel.com/it/products/family/78583/Intel-Xeon-Processor-E5-v3-Family#@Server
- BROADWELL: http://ark.intel.com/it/products/family/91287/Intel-Xeon-Processor-E5-v4-Family#@Server
- LINPACK: https://en.wikipedia.org/wiki/LINPACK
- STREAM: https://www.cs.virginia.edu/stream/ref.html
- HPCG: http://hpcg-benchmark.org/
- ROOFLINE: http://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
- TURBO MODE: http://cdn.wccftech.com/wp-content/uploads/2016/03/Intel-Broadwell-EP-Xeon-E5-2600-V4\_Non\_AVX.png

