#### HPC Cineca Infrastructure:

#### State of the art, towards the exascale and OpenFOAM perspective





#### HPC Methods for Eng Applications 20 June 2017, Milan Italy

Ivan Spisso, Giorgio Amati [i.spisso@cineca.it](mailto:i.spisso@cineca.it), [g.amati@cineca.it](mailto:g.amati@cineca.it)





#### **Contents**

- CINECA in a nutshell and SCAI mission
- HPC ecosystem (up-to-date)
	- Galileo
	- Pico
	- Marconi
	- D.A.V.I.D.E.
- HPC future trends: towards the exascale
- OpenFOAM perspective
	- Figure about performances on CINECA's HPC ecosystem
	- Parallel aspect and actual bottlenecks
	- Suggested future work: CFD4exascale

# Who Am I?

#### Academic Achievements

- 01/12/2013 PhD in Computational Aeroacoustics, University of Leicester, UK.
- 26/09/2005 Second-level Master (MSc) in 'Satellites and Orbiting Platforms', University of Rome La Sapienza

SuperComputing Applications and Innovation

- 17/11/2003 Degree in Aeronautical Engineering (MEng) with full marks, University of Palermo, Italy.

#### Professional activities

- From 07/2010 to date Staff member of SCAI (**SuperComputing Applications and Innovation**) HPC Department at CINECA, as consultant for academic and industrial CFD applications. CINECA, Casalecchio di Reno, Bologna, Italy.
- From 20/09/2009 to 20/12/2009 HPC-Europa2 Transnational Access fellowship at CINECA Casalecchio di Reno, Bologna, Italy.
- From 09/04/2009 to 02/05/2009 Visiting fellow at the IMFT (Institut de Mécanique des Fluides de Toulouse), Toulouse, France.
- From 01/2010 to 04/2010 Teaching assistant for the course of Fluid Dynamics, Introduction to Computing and Vector Calculus and Applications. University of Leicester.
- From 01/08/2006 to 31/07/2009 Marie Curie EST Fellow, Marie Curie multihost EST network Aero-TraNet at the University of Leicester. Project title: Development of a prefactored high-order compact scheme for low-speed aeroacoustics.
- From 01/07/2005 to 31/07/2006 Qualified tutor for the CEPU centre of San Giovanni, Rome, for tuition in Engineering and Applied Sciences.
- From 11/04/2005 to 24/09/2005 Stage at the R&D Department of Aerosekur s.p.a., Latina, Italy. Computational Fluid Dynamics analysis of the SPEM reentry system.


#### Cineca in a nutshell

Cineca is a non-profit consortium made up of 70 Italian universities, research institutions and the Ministry of Research.

- Cineca provides IT services and is the largest Italian supercomputing facility
- Cineca's headquarters are in Bologna (selected for the new ECMWF data centre), with offices in Rome and Milan.





#### SCAI department at Cineca



SuperComputing Applications and Innovation


### SCAI mission

To support Italian researchers to face global scientific challenges




# The Cineca ecosystem

- Cineca acts as a hub for innovation and research, contributing to many scientific and R&D projects at Italian and European level.
- In particular, Cineca is a PRACE hosting member and a member of EUDAT.





# HPC INFRASTRUCTURE: GALILEO

- **IBM Cluster Linux**
- 516 compute nodes
- 2 eight-core Intel Xeon E5-2630 v3 (16 cores per node) @ 2.40 GHz, a.k.a. Haswell
- 128GB RAM per node
- Infiniband with 4x QDR switch (40 Gb/s)
- TPP: 1 PFlop/s


- National and PRACE Tier-1 calls, FORTISSIMO, industrial customers



# HPC INFRASTRUCTURE: MARCONI

- Marconi is the new Tier-0 LENOVO system that replaced the FERMI BG/Q.
- Marconi is planned in two technological stages over a 5-year programme, with the objective of reaching a 50 PFlop/s system by 2019-2020.
- Marconi is a Lenovo NextScale system equipped with Intel Xeon, Intel Xeon Phi and Intel Skylake processors, with an Intel OmniPath network.
- The first stage of MARCONI is made of 3 different partitions (A1, A2 and A3) whose installation started in 2016.
- Marconi is part of the infrastructure provided by Cineca to the EUROFUSION project
- **[UserGuide](https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.1:+MARCONI+UserGuide)**





# MARCONI A1 : Intel Broadwell

- Started in April 2016 and opened to production in July 2016
- 1512 compute nodes
- 2 sockets Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz, 18 cores each
- 128GB RAM per node
- OS: Linux CentOS 7.2
- PBS Pro 13 batch scheduler
- TPP: 2 PFlop/s





# MARCONI A2: Intel KNL

- Opened to production at the end of 2016
- 3600 Knights Landing compute nodes
- Intel Xeon Phi 7250 (68 cores) @1.40 GHz a.k.a. KNL
- 120GB RAM per node
- Default configuration: Cache/Quadrant
- TPP: 11 PFlop/s
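
The TPP (theoretical peak performance) figures quoted for the A1 and A2 partitions can be cross-checked from the node counts, core counts and clock rates listed on the slides. The sketch below assumes 16 double-precision flops per cycle per core for Broadwell and 32 for KNL; these per-cycle values are not stated on the slides, and the function name `peak_pflops` is purely illustrative.

```python
def peak_pflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in PFlop/s: nodes x cores x clock x flops per cycle."""
    gflops = nodes * cores_per_node * clock_ghz * flops_per_cycle
    return gflops / 1.0e6  # GFlop/s -> PFlop/s


# A1: 1512 nodes, 2 sockets x 18 cores, 2.30 GHz; 16 flops/cycle is an assumption
a1 = peak_pflops(1512, 2 * 18, 2.30, 16)
# A2: 3600 nodes, 68 cores, 1.40 GHz; 32 flops/cycle is an assumption
a2 = peak_pflops(3600, 68, 1.40, 32)

print(f"A1 (Broadwell): {a1:.2f} PFlop/s")
print(f"A2 (KNL):       {a2:.2f} PFlop/s")
```

Under these assumptions the results land close to the quoted 2 PFlop/s and 11 PFlop/s.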





# MARCONI's outlook

- In 2017 MARCONI will evolve with the installation of the A3 partition, and the final configuration will have:
	- 3024 Intel Skylake nodes (approx. 120960 cores)
	- 3600 Intel Knights Landing nodes (approx. 244800 cores)
	- Peak performance: about 20 PFlop/s
	- Internal network: Intel OPA



In 2019 we expect the convergence of the HPDA infrastructure and the HPC infrastructure towards the target of 50 PFlop/s





### HPC INFRASTRUCTURE: D.A.V.I.D.E.

- Development of an Added-Value Infrastructure Designed in Europe
- [PCP](http://www.e4sc16.com/E4_is_awarded_PCP-I3P.pdf) (Pre-Commercial Procurement) by PRACE
- OpenPOWER-based HPC cluster
- Power8 processors with [NVLink](https://www.ibm.com/blogs/systems/ibm-power8-cpu-and-nvidia-pascal-gpu-speed-ahead-with-nvlink/) bus + Nvidia Tesla P100 SXM2
- Designed, integrated and tested by E4. Installation in CINECA's data center
- Available for research projects starting from September







#### HPC future trends: towards the exascale

HPC & CPU Intel evolution: 2010-2016

- Westmere (a.k.a. plx.cineca.it)
	- Intel(R) Xeon(R) CPU E5645 @2.40GHz, 6 cores per CPU
- Sandy Bridge (a.k.a. eurora.cineca.it)
	- Intel(R) Xeon(R) CPU E5-2687W 0 @3.10GHz, 8 cores per CPU
- Ivy Bridge (a.k.a. pico.cineca.it)
	- Intel(R) Xeon(R) CPU E5-2670 v2 @2.50GHz, 10 cores per CPU
	- Infiniband FDR
- Haswell (a.k.a. galileo.cineca.it)
	- Intel(R) Xeon(R) CPU E5-2630 v3 @2.40GHz, 8 cores per CPU
	- Infiniband QDR/True Scale (x 2)
- Broadwell (a.k.a. marconi.cineca.it)
	- Intel(R) Xeon(R) CPU E5-2697 v4 @2.30GHz, 18 cores per CPU (x2)
	- OmniPath

Increasing number of cores, same clock



#### Roadmap to Exascale

(architectural trends)

Exascale: a computing system capable of at least one exaFLOPS, i.e. $10^{18}$ floating-point operations per second, or a billion billion calculations per second.






#### Top 500 [\(June 2017\)](https://www.top500.org/list/2017/06/?page=1)




#### Moore's Law - Chips

**Moore's law** is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years (the often-quoted figure of 18 months is due to Intel executive David House).




#### **Performance Development**




#### Moore's Law - Dollars



#### Oh-oh! Houston! we have a problem….



### The silicon lattice





Si lattice

#### 50 atoms!

There are still 4 to 6 cycles (technology generations) left before we reach 11 to 5.5 nm technologies, at which point the downscaling limit will be reached, some year between 2020 and 2030 (H. Iwai, IWJT2008).




#### Dennard scaling law (downscaling)

Dennard scaling, also known as **MOSFET scaling**, states that as transistors get smaller their [power density](https://en.wikipedia.org/wiki/Power_density) (P) stays constant, so that the power use (D) stays in proportion with area: both voltage (V) and current scale downward with length.
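
The statement can be made concrete with the usual estimate of switching power. The derivation below is only a sketch: the scaling factor $\kappa$, the capacitance $C$ and the clock frequency $f$ are symbols introduced here for illustration and do not appear on the slide; $V$ is the voltage as above.

$$
\text{power} \propto C V^{2} f, \qquad
L \to \frac{L}{\kappa},\quad V \to \frac{V}{\kappa},\quad C \to \frac{C}{\kappa},\quad f \to \kappa f
$$

$$
\text{power} \to \frac{C}{\kappa}\left(\frac{V}{\kappa}\right)^{2}(\kappa f) = \frac{C V^{2} f}{\kappa^{2}},
\qquad
\text{area} \to \frac{\text{area}}{\kappa^{2}}
\;\Longrightarrow\;
\frac{\text{power}}{\text{area}} \to \frac{\text{power}}{\text{area}}
$$

Power per transistor and area both shrink by $\kappa^{2}$, so their ratio, the power density, is unchanged.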





#### Exascale: how serious is the situation?



- Exascale is not (only) about scalability and Flops performance!
- In an exascale machine there will be 10^9 FPUs; bringing data in and out will be the main challenge.
- 10^4 nodes, but 10^5 FPUs inside the node!
- heterogeneity is here to stay
- deeper memory hierarchies
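
The FPU counts above multiply out as follows. This is a back-of-the-envelope sketch that assumes each FPU sustains about one operation per nanosecond ($10^{9}$ FLOP/s); that per-FPU rate is an assumption, not a figure from the slide.

$$
10^{4}\ \text{nodes} \times 10^{5}\ \frac{\text{FPUs}}{\text{node}} = 10^{9}\ \text{FPUs},
\qquad
10^{9}\ \text{FPUs} \times 10^{9}\ \frac{\text{FLOP/s}}{\text{FPU}} = 10^{18}\ \text{FLOP/s} = 1\ \text{exaFLOPS}
$$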

#### POWER is the limit!

- At 7 nm, power will be the main limit for chip designers, not the number of transistors
- Not all transistors can be powered at the same time -> dark silicon. How to use it? -> Memory? I/O interface? Different cores? Core & GPU?

Very Big co-design Problem!




#### Amdahl's law

Amdahl's law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours ( $p = 0.95$ ) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times  $(1/(1 - p) = 20)$ . For this reason parallel computing with many processors is useful only for very parallelizable programs. The maximum speedup tends to $1/(1 - p)$ as the number of processors grows.
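
The 20-hour example can be reproduced with a few lines of Python. This is a minimal sketch of the standard form of Amdahl's law, $S(n) = 1/((1 - p) + p/n)$; the function name and the processor counts in the loop are chosen here for illustration.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup on n processors when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


p = 0.95  # parallel fraction from the 20-hour example (19 of 20 hours)
for n in (1, 16, 256, 4096, 65536):
    s = amdahl_speedup(p, n)
    print(f"{n:6d} processors: speedup {s:6.2f}, run time {20.0 / s:5.2f} h")

limit = 1.0 / (1.0 - p)  # asymptotic bound as n grows without limit
print(f"upper bound: {limit:.1f}x")
```

However many processors are added, the run time stays above the one serial hour and the speedup stays below the bound.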






Oh-oh! Houston! we have another problem….

#### Energy trends








# Change of Paradigm: Energy Efficiency

New chips designed for maximum performance in a small set of workloads



Simple functional units, poor single thread performance, but maximum throughput

- HPC centres are vast and greedy consumers of electricity, requiring MW of power (for example, Cineca is the largest consumer of power in the Emilia-Romagna region)
- Energy efficiency is clearly an important topic and there is much interest in renewable energy sources, re-use of waste heat for buildings, and hot water cooling (see the old [Eurora cluster](https://www.cineca.it/it/comunicatistampa/eurora-il-supercomputer-accelerato-da-gpu-nvidia-conquista-il-record-mondiale), top rank in the Green500 in June 2013)
- Many EU projects, in the quest for exascale performance, are studying strategies for reducing energy consumption





#### Architecture toward exascale




#### Towards the exascale: Summary and trends

#### Software (turtle)

- As usual, software lags behind hardware but must learn to exploit accelerators and other innovative technologies such as FPGAs and PGAS
- Reluctance by some software developers to learn new languages such as CUDA or OpenCL is driving interest in compiler-directive approaches such as OpenACC and OpenMP (4.x)
- Continued investment in efficient filesystems, checkpointing, resilience, parallel I/O
- **co-design** is the way to reduce the distance between hardware and software for HPC



#### Hardware (hare)

- Transistor densities are reaching physical limits, and increasing clock frequencies further is too expensive and difficult (energy consumption, heat dissipation)
- Parallelism is the only solution in HPC, but the Blue Gene road is no longer being pursued. Hybrid systems with accelerators such as GPUs or Xeon Phi are becoming the norm
	- Accelerator technologies are advancing to remove current limits (Intel KNL or Nvidia NVLink)
	- A range of novel architectures (e.g. Mont Blanc, DEEP) and technologies are being explored in many areas



#### HPC status and future trends. What impact for OpenFOAM?

- About 6 years of CPU evolution
	- Linpack (floating-point benchmark)
	- Stream (memory bandwidth benchmark)
	- OpenFOAM (3D lid-driven cavity, 80^3)



Figure: Linpack, Stream, OpenFOAM

### HPC status and future trends: roofline model

**The roofline model**

Performance bound (y-axis, GFLOP/s) plotted against arithmetic intensity (x-axis, FLOPs/byte)



#### HPC status and future trends: Arithmetic intensity

Arithmetic intensity is the ratio of total floating-point operations to total data movement (bytes), i.e. flops per byte. What is the arithmetic intensity of OpenFOAM?

- About 0.1, maybe less...
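
To see what an arithmetic intensity of about 0.1 implies, the roofline bound can be evaluated as the minimum of the compute roof and the memory roof. The sketch below uses assumed figures for one dual-socket Broadwell node: the 16 flops per cycle per core and the 120 GB/s memory bandwidth are hypothetical values chosen for illustration, not measurements from these slides.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Attainable performance: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Assumed figures for one dual-socket Broadwell node (illustrative only)
peak = 2 * 18 * 2.30 * 16   # GFlop/s; 16 flops/cycle/core is an assumption
bandwidth = 120.0           # GB/s; hypothetical STREAM-like value

for ai in (0.1, 1.0, 10.0, 100.0):
    perf = roofline_gflops(peak, bandwidth, ai)
    share = 100.0 * perf / peak
    print(f"AI = {ai:6.1f} flop/byte -> {perf:8.1f} GFlop/s ({share:5.1f}% of peak)")
```

With these numbers a code at 0.1 flop/byte is bound by memory bandwidth at roughly 1% of the assumed peak, which is why STREAM rather than Linpack is the relevant reference for OpenFOAM.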


"Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms". Onazi et al, ParCFD14




#### HPC status and future trends. What impact for OpenFOAM?

Using the figures obtained on different HW (LINPACK, STREAM)



#### Figure about performances on CINECA's HPC ecosystem

- Our aim is to stress the Marconi machine (a petascale KNL-based machine) in order to understand the bottlenecks and theoretical limits to efficient performance on future exascale machines
- Caveat: the test performed is a 3D lid-driven cavity; performance can be very different for other test cases
- Test-case info
	- KNL: OF rel. v1612+, compiled with Intel, flag = -xMIC-avx512
	- BDW (reference): OF rel. v1612+, compiled with Intel
	- 3D lid-driven cavity
	- Size=300^3 (27M points): T=0.20, dt=0.005, no output, viscosity=0.01
	- Size=400^3 (64M points): T=0.10, dt=0.0025, no output, viscosity=0.01
	- Size=500^3 (125M points): T=0.10, dt=0.00125, no output, viscosity=0.01
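
The three test cases can be summarised by the number of time steps each run performs and by the number of cells each MPI task holds. The sketch below derives both from the parameters listed above; the helper name `case_summary` and the node count of 16 are hypothetical, and 64 tasks per node is the value used in the runs.

```python
def case_summary(n, t_end, dt, nodes, tasks_per_node=64):
    """Return (total cells, time steps, cells per MPI task) for an n^3 cavity."""
    cells = n ** 3
    steps = round(t_end / dt)  # number of time steps in the run
    cells_per_task = cells / (nodes * tasks_per_node)
    return cells, steps, cells_per_task


# (cells per side, end time T, time step dt) as listed for the test cases
cases = [(300, 0.20, 0.005), (400, 0.10, 0.0025), (500, 0.10, 0.00125)]
nodes = 16  # hypothetical node count, for illustration only

for n, t_end, dt in cases:
    cells, steps, per_task = case_summary(n, t_end, dt, nodes)
    print(f"{n}^3: {cells / 1e6:.0f}M cells, {steps} steps, "
          f"{per_task:,.0f} cells per task on {nodes} nodes")
```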

### 1) Intranode performance

- **Fig. 3** Testing with 100^3 and 200^3 we found that 64 tasks is the maximum intranode decomposition
- Some noisy measurements...








BDW Total time (KNL version = **–axMIC-AVX512**)





BDW: total (KNL version = **–axMIC-AVX512**)




### 4) 500^3

- KNL: Total time (with faster all\_reduce)



BDW: total (KNL version = **–axMIC-AVX512**)




#### Figure about performances on CINECA's HPC ecosystem

- Strong scaling (how the solution time varies with the number of processors for a fixed total problem size)
- Figure: total time and speed-up vs. number of nodes, for Size=300^3 and Size=400^3
- Total time, 64 tasks per node
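
Speed-up and parallel efficiency in a strong-scaling study follow directly from the measured total times. The sketch below shows the calculation; the timings are hypothetical placeholders, not the measurements plotted on the slide, and the function name is illustrative.

```python
def strong_scaling(times):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count."""
    base_nodes = min(times)
    base_time = times[base_nodes]
    table = {}
    for nodes in sorted(times):
        speedup = base_time / times[nodes]
        efficiency = speedup / (nodes / base_nodes)  # 1.0 means ideal scaling
        table[nodes] = (speedup, efficiency)
    return table


# Hypothetical timings in seconds (NOT measurements from the slides)
timings = {16: 1000.0, 32: 520.0, 64: 280.0, 128: 170.0}

for nodes, (s, e) in strong_scaling(timings).items():
    print(f"{nodes:4d} nodes: speed-up {s:5.2f}, efficiency {100 * e:5.1f}%")
```

Taking the smallest run as the baseline is a common choice when the case does not fit on a single node.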



#### Fine Tuning

Total time in seconds

- A. Original time
- B. Different allreduce algorithm
- C. Explicit taskset (bind to CPU)
- D. Multilevel decomposition
- Size: 300^3



#### Parallel aspect

- OpenFOAM is first and foremost a C++ library used to solve systems of Partial Differential Equations (PDEs) in discretized form.
- The "Engine" of OpenFOAM is the Numerical Method. To solve equations for a continuum, OpenFOAM uses a numerical approach with the following features: segregated, iterative solution, finite volume method, co-located variables, equation coupling.
- The method of parallel computing used by OpenFOAM is based on the standard Message Passing Interface (MPI) using the strategy of domain decomposition.



Figure: Finite Volume Discretization



#### Parallel aspect

- The geometry and the associated fields are broken into pieces and allocated to separate processors for solution.
- A convenient interface, Pstream, is used to plug any Message Passing Interface (MPI) library into OpenFOAM. It is a light wrapper around the selected MPI library.
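
Domain decomposition trades computation, which scales with the number of cells in a subdomain, against communication, which scales with the number of faces on its boundary. The sketch below illustrates this for a cubic mesh cut into regular blocks; it is a simplified model under that assumption and does not reproduce any of OpenFOAM's actual decomposition methods. The function name is illustrative.

```python
def block_decomposition(n, px, py, pz):
    """Cells and boundary faces of an interior block when an n^3 mesh is cut into px*py*pz blocks."""
    bx, by, bz = n // px, n // py, n // pz          # block edge lengths (assumes exact division)
    cells = bx * by * bz                            # work per task scales with the volume
    halo_faces = 2 * (bx * by + by * bz + bx * bz)  # data exchanged scales with the surface
    return cells, halo_faces


for p in (4, 5, 10, 20):  # p^3 tasks: 64, 125, 1000, 8000
    cells, faces = block_decomposition(300, p, p, p)
    print(f"{p**3:5d} tasks: {cells:8d} cells/task, {faces:6d} boundary faces, "
          f"ratio {faces / cells:.3f}")
```

As the number of tasks grows for a fixed mesh, the ratio of boundary faces to cells rises, so each task spends relatively more time communicating.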



Figure: Zero Layer Domain Decomposition



#### Actual bottlenecks

An analysis has been done in the framework of PRACE 1IP to study the current bottlenecks in the scalability of OpenFOAM on massively parallel clusters.

- Standard OpenFOAM scales reasonably well up to an upper limit of the order of 1,000 cores.
- An in-depth profiling identified the calls to the MPI\_Allreduce function in the linear algebra core libraries as the main communication bottleneck.
- Sub-optimal on-core performance is due to the sparse matrix storage format, which does not employ any cache blocking.

*M. Culpo, Current Bottlenecks in the Scalability of OpenFOAM on Massively Parallel Clusters, PRACE White Paper, available on-line at [www.prace-ri.eu](http://www.prace-ri.eu/IMG/pdf/Current_Bottlenecks_in_the_Scalability_of_OpenFOAM_on_Massively_Parallel_Clusters-2.pdf)*



#### Some references

- <http://www.prace-ri.eu/application-scalability/>
- P. Dagna, J. Hertzer: Evaluation of Multi-threaded OpenFOAM Hybridization for Massively Parallel Architectures, PRACE White Paper, available on-line at <http://www.prace-ri.eu/IMG/pdf/wp98.pdf>
- M. Moyles, P. Nash, I. Girotto: Performance Analysis of Fluid-Structure Interactions using OpenFOAM, PRACE White Paper, available on-line at <http://www.prace-ri.eu>
- T. Ponweiser, P. Stadelmeyer, T. Karasek: Fluid-Structure Simulations with OpenFOAM for Aircraft Design, PRACE White Paper, available on-line at <http://www.prace-ri.eu/IMG/pdf/wp172.pdf>
- A. Duran, M. S. Celabi, S. Piskin, M. Tuncel: Scalability of OpenFOAM for Bio-medical Flow Simulations, PRACE White Paper, available on-line at <http://www.prace-ri.eu/IMG/pdf/WP162.pdf>
- Pham Van Phuc et al., Shimizu Corporation, Fujitsu Limited, Riken: Evaluation of MPI Optimization of C++ CFD Code on the K Computer, SIG Technical Reports Vol. 2015-HPC-151 No. 19, 2015/10/01 (in Japanese)



### Actual Bottlenecks

What is missing for full enabling on Tier-0 architectures:

Improve the parallelism paradigm, to be able to scale from the current order of 1,000 cores to at least one order of magnitude more (order of 10,000 or 100,000 processes).

Scalability of the linear solvers

- The linear algebra core libraries are the main communication bottleneck for scalability
- The large number of MPI\_Allreduce calls stems from an algorithmic constraint and is unavoidable, increasing with the number of cores, unless an algorithmic rewrite is proposed.
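
The algorithmic origin of these reductions can be seen in a Krylov solver: every dot product needs a contribution from all subdomains, which in a distributed run means one global reduction. The sketch below is a plain-Python, unpreconditioned conjugate gradient on a 1D Laplacian, with a hypothetical `ReductionCounter` class standing in for MPI\_Allreduce. It is an illustration under these assumptions, not OpenFOAM's solver implementation.

```python
class ReductionCounter:
    """Stands in for MPI_Allreduce: every global dot product is one reduction."""

    def __init__(self):
        self.count = 0

    def dot(self, a, b):
        self.count += 1  # in a distributed run this would be one MPI_Allreduce
        return sum(x * y for x, y in zip(a, b))


def laplacian_1d(v):
    """Matrix-vector product with the tridiagonal matrix [-1, 2, -1] (SPD)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def conjugate_gradient(b, tol=1e-10, max_iter=200):
    comm = ReductionCounter()
    x = [0.0] * len(b)
    r = list(b)              # residual for the zero initial guess
    p = list(r)
    rr = comm.dot(r, r)      # one reduction before the loop
    iters = 0
    while rr ** 0.5 > tol and iters < max_iter:
        ap = laplacian_1d(p)               # neighbour data only, no global reduction
        alpha = rr / comm.dot(p, ap)       # reduction 1 of the iteration
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = comm.dot(r, r)            # reduction 2 of the iteration
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iters += 1
    return x, iters, comm.count


x, iters, reductions = conjugate_gradient([1.0] * 32)
print(f"{iters} iterations, {reductions} global reductions")
```

Each iteration costs two global reductions regardless of how the mesh is decomposed, and the latency of each one grows with the number of processes; this is why only a change of algorithm can reduce the count.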

Generally speaking, the fundamental difficulty is the inability to keep all the processors busy when operating on very coarse grids. There is a need for a communication-friendly agglomeration (geometric) linear multigrid solver.



### Actual Bottlenecks

Improve the I/O, which is a bottleneck for big simulations, for example LES/DNS runs on hundreds of cores that need to save to disk very often.

- State of the art: a few million cells is now considered a relatively small test case. Cases of this size will not scale usefully beyond 1K cores and there is not much to be done to improve this.
- What we are looking at is radical scalability => the real issues are in the scaling of cases of hundreds of millions of cells on 10K+ cores.



#### **Suggestions**

Tune your application on the HPC environment

- Strong scaling => how the solution time varies with the number of processors for a fixed total problem size
- The performance results vary depending on different parameters, including the nature of the tests, the solver chosen, the number of cells per processor, the class of cluster used, the choice of MPI distribution, etc.
- Choose the linear system solvers: use the geometric-algebraic multi-grid solver (GAMG) for very large problems [1]. The GAMG solver can often be the optimal choice, particularly for solving the pressure equation
- Compile OpenFOAM in SP (Single Precision), if possible for your application.

[1] W. Briggs, V. Henson, and S. McCormick, A Multigrid Tutorial: Second Edition, Society for Industrial and Applied Mathematics, 2000.



#### Suggested future work: CFD4exascale

- We are in the process of building a consortium to apply for a large H2020 project to enable OpenFOAM to be used on the upcoming generation of Tier-0 clusters.
- FET-HPC call. Topic: Transition to Exascale Computing. Deadline: 26 September 2017
- CINECA will act as HPC core partner during the preparatory phase and will support the co-design, provide the HPC infrastructure and the related competences.



# Transition to Exascale Computing

#### Topic Description:

- Specific Challenge: **Take advantage of the full capabilities of exascale computing**, in particular through high-productivity programming environments, system software and management, exascale I/O and storage in the presence of multiple tiers of data storage, supercomputing for extreme data and emerging HPC use modes, **mathematics and algorithms for extreme scale HPC systems** for existing or visionary applications, including data-intensive and extreme data applications in scientific areas such as physics, chemistry, biology, life sciences, materials, climate, geosciences, etc.
- e) **Mathematics and algorithms for extreme scale HPC systems and applications working with extreme data**: Specific issues are quantification of uncertainties and noise, multi-scale, multi-physics and extreme data. **Mathematical methods, numerical analysis, algorithms and software engineering for extreme parallelism should be addressed**. **Novel and disruptive algorithmic strategies should be explored to minimize data movement as well as the number of communication and synchronization instances in extreme computing**. Parallel-in-time methods may be investigated to boost parallelism of simulation codes across a wide range of application domains. Taking into account data-related uncertainties is essential for the acceptance of numerical simulation in decision making; a unified European VVUQ (Verification Validation and Uncertainty Quantification) package for Exascale computing should be provided by improving methodologies and solving problems limiting usability for very large computations on many-core configurations; access to the VVUQ techniques for the HPC community should be facilitated by providing software that is ready for deployment on supercomputers.

