HPC Architectures – past, present and emerging trends

Author: Andrew Emerson, Cineca
a.emerson@cineca.it
Speaker: Alessandro Marani, Cineca
a.marani@cineca.it
Agenda

- Computational Science
- Trends in HPC technology
- Trends in HPC programming
  - Massive parallelism
  - Accelerators
  - The scaling problem
- Future trends
  - Memory and accelerator advances
  - Monitoring energy efficiency
- Wrap-up
“Computational science is concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems. In practical use, it is typically the application of computer simulation and other forms of computation from numerical analysis and theoretical computer science to problems in various scientific disciplines.” (Wikipedia)

Computational science, together with theory and experimentation, is the “third pillar” of scientific inquiry, enabling researchers to build and test models of complex phenomena.
Computational Sciences

Computational methods allow us to study complex phenomena, giving a powerful impetus to scientific research.

Using computers to study physical systems lets us handle phenomena that are:

- **very large** (meteo-climatology, cosmology, data mining, oil reservoirs)
- **very small** (drug design, silicon chip design, structural biology)
- **very complex** (fundamental physics, fluid dynamics, turbulence)
- **too dangerous or expensive** (fault simulation, nuclear tests, crash analysis)
Which factors limit computer power?

We can try to increase the speed of microprocessors, but Moore’s law now gives only a slow increase in CPU speed. (Moore's law is expected to hold in the near future, but applied to the number of cores per processor rather than to clock speed.)

In addition, the bottleneck between the CPU and memory and other devices is growing. For all systems, CPUs are much faster than the devices providing the data.

HPC Architectures

The main factor driving performance is parallelism. This can be on many levels:

- Instruction level parallelism
- Vector processing
- Cores per processor
- Processors per node
- Processors + accelerators (for hybrid)
- Nodes in a system

Performance can also derive from device technology

- Logic switching speed and device density
- Memory capacity and access time
- Communications bandwidth and latency
HPC systems evolution in CINECA

1969: CDC 6600, 1st system for scientific computing
1975: CDC 7600, 1st supercomputer
1985: Cray X-MP/4 8, 1st vector supercomputer
1989: Cray Y-MP/4 64
1993: Cray C-90/2 128
1994: Cray T3D 64, 1st parallel supercomputer
1995: Cray T3D 128
1998: Cray T3E 256, 1st MPP supercomputer
2002: IBM SP4 512, 1 Teraflops
2005: IBM SP5 512
2006: IBM BCX, 10 Teraflops
2009: IBM SP6, 100 Teraflops
2012: IBM BG/Q, 2 Petaflops

22/02/2016

Introduction to Parallel Computing with MPI and OpenMP - HPC architectures
There are several factors that affect system architectures, including:

1. Power consumption has become a primary headache.
2. Processor speed is never enough.
3. Network complexity/latency is a main hindrance.
4. Access to memory.
HPC architectures/2

Two approaches to increasing supercomputer power, but at the same time limiting power consumption:

1. Massive parallelism (IBM BlueGene range).
2. Hybrids using accelerators (GPUs and Xeon PHIs).
IBM BG/Q

- BlueGene systems link together tens of thousands of low power cores with a fast network.
- In some respects the IBM BlueGene range represents one extreme of parallel computing

Name: Fermi (Cineca)
Architecture: IBM BlueGene/Q
Model: 10 racks
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240, 16 cores each
RAM: 16 GB/node, 1GB/core
Internal Network: custom with 11 links -> 5D Torus
Disk Space: 2.6 PB of scratch space
Peak Performance: 2PFlop/s
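
The quoted peak can be cross-checked from the node count, cores per node and clock frequency listed above. The Python sketch below does this arithmetic. It assumes 8 double-precision floating-point operations per core per cycle (a 4-wide fused multiply-add unit); that figure is not stated on the slide and is an assumption, as are the function and variable names.

```python
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in Flop/s: each core completes flops_per_cycle per clock tick."""
    return cores * clock_hz * flops_per_cycle

# Figures taken from the Fermi description above.
nodes = 10240
cores_per_node = 16
clock_hz = 1.6e9

# Assumption (not on the slide): 4-wide FMA, i.e. 8 flops per cycle per core.
FLOPS_PER_CYCLE = 8

total_cores = nodes * cores_per_node
fermi_peak = peak_flops(total_cores, clock_hz, FLOPS_PER_CYCLE)

print(f"cores = {total_cores}")
print(f"peak  = {fermi_peak / 1e15:.2f} PFlop/s")
```

Under that assumption the result is consistent with the 2 PFlop/s quoted for the machine.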
Hybrid systems

- The second approach is to “accelerate” normal processors by adding more specialised devices to perform some of the calculations.
- The approach is not new (maths co-processors, FPGAs, video cards etc.) but became important in HPC when NVIDIA launched CUDA and GPGPUs.
- Accelerators are capable of more Flops/Watt than traditional CPUs but still rely on parallelism (many threads in the chip).

Model: IBM PLX (iDataPlex DX360M3)
Architecture: Linux Infiniband Cluster
Nodes: 274
Processors: 2 six-core Intel Westmere 2.40 GHz per node
Cores: 12 cores/node, 3288 cores in total
GPU: 2 NVIDIA Tesla M2070 per node (548 in total)
RAM: 48 GB/node, 4GB/core
Internal Network: Infiniband with 4x QDR switches
Disk Space: 300 TB of local scratch
Peak Performance: 300 TFlop/s
Hybrid Systems/2

- In the last few years Intel has introduced the Xeon PHI accelerator based on MIC (Many Integrated Core) technology.
- Intended as an alternative to NVIDIA GPUs in HPC.

**Model:** Eurora prototype  
**Architecture:** Linux Infiniband Cluster  
**Processor Type:**
  - Intel Xeon (Eight-Core SandyBridge) E5-2658 2.10 GHz  
  - Intel Xeon (Eight-Core SandyBridge) E5-2687W 3.10 GHz  
**Number of cores:** 1024 (compute)  
**Number of accelerators:** 64 nVIDIA Tesla K20 (Kepler) + 64 Intel Xeon Phi (MIC)  
**OS:** RedHat CentOS release 6.3, 64 bit

---

The **Eurora supercomputer was ranked 1st in the June 2013 Green500 chart.**

---

**Galileo**

**Model:** IBM NeXtScale  
**Architecture:** Linux Infiniband Cluster  
**Nodes:** 516  
**Processors:** 2 eight-core Intel Haswell 2.40 GHz per node  
**Cores:** 16 cores/node, 8256 cores in total  
**Accelerator:** 2 Intel Phi 7120p per node on 384 nodes (768 in total)  
**RAM:** 128 GB/node, 8 GB/core  
**Internal Network:** Infiniband with 4x QDR switches  
**Disk Space:** 2.5 PB (total)  
**Peak Performance:** 1 PFlop/s
# Top500 – November 2014

<table>
<thead>
<tr>
<th>Rank</th>
<th>Site</th>
<th>System</th>
<th>Cores</th>
<th>Rmax (TFlop/s)</th>
<th>Rpeak (TFlop/s)</th>
<th>Power (kW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>National Super Computer Center in Guangzhou, China</td>
<td>Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT</td>
<td>3,120,000</td>
<td>33,862.7</td>
<td>54,902.4</td>
<td>17,808</td>
</tr>
<tr>
<td>2</td>
<td>DOE/SC/Oak Ridge National Laboratory, United States</td>
<td>Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x, Cray Inc.</td>
<td>560,640</td>
<td>17,590.0</td>
<td>27,112.5</td>
<td>8,209</td>
</tr>
<tr>
<td>3</td>
<td>DOE/NNSA/LLNL, United States</td>
<td>Sequoia - BlueGene/Q, Power BQC 16C 1.60GHz, Custom, IBM</td>
<td>1,572,864</td>
<td>17,173.2</td>
<td>20,132.7</td>
<td>7,890</td>
</tr>
<tr>
<td>4</td>
<td>RIKEN Advanced Institute for Computational Science (AICS), Japan</td>
<td>K computer, SPARC64 VIIIfx, Tofu interconnect, Fujitsu</td>
<td>705,024</td>
<td>10,510.0</td>
<td>11,280.4</td>
<td>12,660</td>
</tr>
<tr>
<td>5</td>
<td>DOE/SC/Argonne National Laboratory, United States</td>
<td>Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom, IBM</td>
<td>786,432</td>
<td>8,586.6</td>
<td>10,066.3</td>
<td>3,945</td>
</tr>
<tr>
<td>6</td>
<td>Swiss National Supercomputing Centre (CSCS), Switzerland</td>
<td>Piz Daint - Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x, Cray Inc.</td>
<td>115,984</td>
<td>6,271.0</td>
<td>7,788.9</td>
<td>2,325</td>
</tr>
<tr>
<td>7</td>
<td>Texas Advanced Computing Center/Univ. of Texas, United States</td>
<td>Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P, Dell</td>
<td>462,462</td>
<td>5,168.1</td>
<td>8,520.1</td>
<td>4,510</td>
</tr>
<tr>
<td>8</td>
<td>Forschungszentrum Juelich (FZJ), Germany</td>
<td>JUQUEEN - BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect, IBM</td>
<td>458,752</td>
<td>5,008.9</td>
<td>5,872.0</td>
<td>2,301</td>
</tr>
<tr>
<td>9</td>
<td>DOE/NNSA/LLNL, United States</td>
<td>Vulcan - BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect, IBM</td>
<td>393,216</td>
<td>4,293.3</td>
<td>5,033.2</td>
<td>1,972</td>
</tr>
<tr>
<td>10</td>
<td>Government, United States</td>
<td>Cray XC30, Intel Xeon E5-2697v2 12C 2.7GHz, Aries interconnect, Cray Inc.</td>
<td>225,984</td>
<td>3,143.5</td>
<td>4,881.3</td>
<td></td>
</tr>
</tbody>
</table>

## Top500 – June 2015

<table>
<thead>
<tr>
<th>RANK</th>
<th>SITE</th>
<th>SYSTEM</th>
<th>CORES</th>
<th>RMAX TFLOP/S</th>
<th>RPEAK TFLOP/S</th>
<th>POWER (KW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>National Super Computer Center in Guangzhou, China</td>
<td>Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.20GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT</td>
<td>3,120,000</td>
<td>33,862.7</td>
<td>54,902.4</td>
<td>17,808</td>
</tr>
<tr>
<td>2</td>
<td>DOE/SC/Oak Ridge National Laboratory, United States</td>
<td>Titan - Cray XK7, Opteron 6274 16C 2.20GHz, Cray Gemini interconnect, NVIDIA K20x</td>
<td>560,640</td>
<td>17,590.0</td>
<td>27,112.5</td>
<td>8,209</td>
</tr>
<tr>
<td>3</td>
<td>DOE/NNSA/LLNL, United States</td>
<td>Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom IBM</td>
<td>1,572,864</td>
<td>17,173.2</td>
<td>20,132.7</td>
<td>7,890</td>
</tr>
<tr>
<td>4</td>
<td>RIKEN Advanced Institute for Computational Science (AICS), Japan</td>
<td>K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect, Fujitsu</td>
<td>705,024</td>
<td>10,510.0</td>
<td>11,280.4</td>
<td>12,660</td>
</tr>
<tr>
<td>5</td>
<td>DOE/SC/Argonne National Laboratory, United States</td>
<td>Mira - BlueGene/Q, Power BQC 16C 1.60GHz, Custom IBM</td>
<td>786,432</td>
<td>8,586.6</td>
<td>10,066.3</td>
<td>3,945</td>
</tr>
<tr>
<td>6</td>
<td>Swiss National Supercomputing Centre (CSCS), Switzerland</td>
<td>Piz Daint - Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x</td>
<td>115,984</td>
<td>6,271.0</td>
<td>7,788.9</td>
<td>2,325</td>
</tr>
<tr>
<td>7</td>
<td>King Abdullah University of Science and Technology, Saudi Arabia</td>
<td>Shaheen II - Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries Interconnect</td>
<td>196,608</td>
<td>5,537.0</td>
<td>7,235.2</td>
<td>2,834</td>
</tr>
<tr>
<td>8</td>
<td>Texas Advanced Computing Center/Univ. of Texas, United States</td>
<td>Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P</td>
<td>462,462</td>
<td>5,168.1</td>
<td>8,520.1</td>
<td>4,510</td>
</tr>
<tr>
<td>9</td>
<td>Forschungszentrum Juelich (FZJ), Germany</td>
<td>JUQUEEN - BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect</td>
<td>458,752</td>
<td>5,008.9</td>
<td>5,872.0</td>
<td>2,301</td>
</tr>
<tr>
<td>10</td>
<td>DOE/NNSA/LLNL, United States</td>
<td>Vulcan - BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect</td>
<td>393,216</td>
<td>4,293.3</td>
<td>5,033.2</td>
<td>1,972</td>
</tr>
</tbody>
</table>
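
Two figures of merit can be derived from the columns of this table: the fraction of peak actually achieved (Rmax/Rpeak) and the energy efficiency (Rmax per unit of power). The Python sketch below computes both for the first three entries. The function names are illustrative, and the numbers are copied from the table.

```python
# (name, Rmax in TFlop/s, Rpeak in TFlop/s, power in kW), copied from the table above.
systems = [
    ("Tianhe-2", 33862.7, 54902.4, 17808),
    ("Titan", 17590.0, 27112.5, 8209),
    ("Sequoia", 17173.2, 20132.7, 7890),
]

def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Fraction of the theoretical peak achieved in the benchmark run."""
    return rmax_tflops / rpeak_tflops

def gflops_per_watt(rmax_tflops: float, power_kw: float) -> float:
    """TFlop/s divided by kW is numerically equal to GFlop/s per W."""
    return rmax_tflops / power_kw

for name, rmax, rpeak, power in systems:
    eff = linpack_efficiency(rmax, rpeak)
    gpw = gflops_per_watt(rmax, power)
    print(f"{name:10s} efficiency = {eff:.2f}  energy efficiency = {gpw:.2f} GFlop/s per W")
```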
Roadmap to Exascale (architectural trends)

<table>
<thead>
<tr>
<th>Systems</th>
<th>2009</th>
<th>2011</th>
<th>2015</th>
<th>2018</th>
</tr>
</thead>
<tbody>
<tr>
<td>System Peak Flops/s</td>
<td>2 Peta</td>
<td>20 Peta</td>
<td>100-200 Peta</td>
<td>1 Exa</td>
</tr>
<tr>
<td>System Memory</td>
<td>0.3 PB</td>
<td>1 PB</td>
<td>5 PB</td>
<td>10 PB</td>
</tr>
<tr>
<td>Node Performance</td>
<td>125 GF</td>
<td>200 GF</td>
<td>400 GF</td>
<td>1-10 TF</td>
</tr>
<tr>
<td>Node Memory BW</td>
<td>25 GB/s</td>
<td>40 GB/s</td>
<td>100 GB/s</td>
<td>200-400 GB/s</td>
</tr>
<tr>
<td>Node Concurrency</td>
<td>12</td>
<td>32</td>
<td>O(100)</td>
<td>O(1000)</td>
</tr>
<tr>
<td>Interconnect BW</td>
<td>1.5 GB/s</td>
<td>10 GB/s</td>
<td>25 GB/s</td>
<td>50 GB/s</td>
</tr>
<tr>
<td>System Size (Nodes)</td>
<td>18,700</td>
<td>100,000</td>
<td>500,000</td>
<td>O(Million)</td>
</tr>
<tr>
<td>Total Concurrency</td>
<td>225,000</td>
<td>3 Million</td>
<td>50 Million</td>
<td>O(Billion)</td>
</tr>
<tr>
<td>Storage</td>
<td>15 PB</td>
<td>30 PB</td>
<td>150 PB</td>
<td>300 PB</td>
</tr>
<tr>
<td>I/O</td>
<td>0.2 TB/s</td>
<td>2 TB/s</td>
<td>10 TB/s</td>
<td>20 TB/s</td>
</tr>
<tr>
<td>MTTI</td>
<td>Days</td>
<td>Days</td>
<td>Days</td>
<td>O(1 Day)</td>
</tr>
<tr>
<td>Power</td>
<td>6 MW</td>
<td>~10 MW</td>
<td>~10 MW</td>
<td>~20 MW</td>
</tr>
</tbody>
</table>
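
The Total Concurrency row appears to be roughly the product of System Size (Nodes) and Node Concurrency. The short Python check below multiplies the two rows for the first three columns; the order-of-magnitude node concurrency for 2015 is taken as exactly 100 for the purpose of the arithmetic, which is an assumption.

```python
# year -> (system size in nodes, node concurrency), from the roadmap table above.
roadmap = {
    2009: (18_700, 12),
    2011: (100_000, 32),
    2015: (500_000, 100),  # assumption: order-of-magnitude entry taken as exactly 100
}

def total_concurrency(nodes: int, per_node: int) -> int:
    """Number of concurrent threads of execution across the whole system."""
    return nodes * per_node

estimates = {year: total_concurrency(n, c) for year, (n, c) in roadmap.items()}
for year, value in estimates.items():
    print(f"{year}: about {value:,} concurrent threads")
```

The products are close to the 225,000, 3 million and 50 million listed in the table.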
Parallel Software Models

- How do we program for supercomputers?
- C/C++ or FORTRAN, together with one or more of
  - Message Passing Interface (MPI)
  - OpenMP, pthreads, hybrid MPI/OpenMP
  - CUDA, OpenCL, OpenACC, compiler directives
- Higher Level languages and libraries
  - Co-array FORTRAN, Unified Parallel C (UPC), Global Arrays
  - Domain specific languages and data models
  - Python or other scripting languages
Message Passing: MPI

Main Characteristics

- Implemented as libraries
- Coarse grain
- Inter-node parallelization (few real alternatives)
- Domain partition
- Distributed Memory
- Long history and almost all HPC parallel applications use it.

Open Issues

- Latency
- OS jitter
- Scalability
- High memory overheads (due to program replication and buffers)

Debatable whether MPI can handle millions of tasks, particularly in collective calls.

call MPI_Init(ierror)
call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
call MPI_Finalize(ierror)
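
The "domain partition" idea can be sketched without any MPI library: each task uses its rank and the total number of tasks to work out which block of the global data it owns. The Python function below is a minimal illustration with hypothetical names; in a real MPI program the rank and the number of ranks would come from MPI_Comm_rank and MPI_Comm_size.

```python
def block_range(n_items: int, n_ranks: int, rank: int) -> range:
    """Contiguous block of global indices owned by `rank` (0-based)."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks take one additional item each.
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return range(start, start + size)

n_items, n_ranks = 10, 4
blocks = [block_range(n_items, n_ranks, r) for r in range(n_ranks)]
for r, b in enumerate(blocks):
    print(f"rank {r}: indices {list(b)}")
```

If there are fewer items than ranks, some ranks receive an empty range and have no work to do.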
Shared Memory: OpenMP

Main Characteristics
- Compiler directives
- Medium grain
- Intra-node parallelization (pthreads)
- Loop or iteration partition
- Shared memory
- For many HPC applications, easier to program than MPI (allows incremental parallelisation)

Open Issues
- Thread creation overhead (often worse performance than equivalent MPI program)
- Memory/core affinity
- Interface with MPI

Threads communicate via variables in shared memory
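
The shared-memory model can be illustrated with ordinary Python threads: the loop iterations are split between threads, and the partial results are combined through a variable that every thread can see. This is only a sketch of the programming model, not OpenMP itself, and the names are hypothetical. Python threads do not speed up a CPU-bound loop like this one, so it says nothing about performance.

```python
import threading

def parallel_sum(data, n_threads=4):
    """Sum `data` by splitting the loop iterations across threads."""
    shared = {"total": 0}       # variable visible to every thread
    lock = threading.Lock()     # protects the update of the shared variable

    def worker(tid):
        # Cyclic split: thread `tid` takes iterations tid, tid + n_threads, ...
        partial = 0
        for i in range(tid, len(data), n_threads):
            partial += data[i]
        with lock:              # only one thread at a time updates the total
            shared["total"] += partial

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]

result = parallel_sum(list(range(1, 101)))
print(result)
```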
Accelerator/GPGPU

Exploit massive stream processing capabilities of GPGPUs which may have thousands of cores

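Element-wise sum of two 1D arrays

```c
// CUDA kernel: each thread adds one pair of elements.
__global__ void GPUCode(int *input1, int *input2, int *output, int length) {
    // Global index of this thread across all blocks.
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    // Guard: the last block may contain threads past the end of the arrays.
    if (idx < length) {
        output[idx] = input1[idx] + input2[idx];
    }
}
```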
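
To see how the thread index maps onto array elements, the kernel can be emulated serially in Python. The `launch` helper below is hypothetical and simply calls the kernel once for every (block, thread) pair; on a GPU these calls would run concurrently.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per (block, thread) pair, one after the other."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def add_kernel(block_idx, block_dim, thread_idx, a, b, out, length):
    idx = block_dim * block_idx + thread_idx  # same index formula as the CUDA kernel
    if idx < length:                          # threads past the end do nothing
        out[idx] = a[idx] + b[idx]

length = 10
a = list(range(length))
b = [10 * x for x in a]
out = [None] * length

block_dim = 4
grid_dim = (length + block_dim - 1) // block_dim  # enough blocks to cover the array
launch(add_kernel, grid_dim, block_dim, a, b, out, length)
print(out)
```

With 10 elements and 4 threads per block, 3 blocks are launched and the last two threads fall outside the array, which is why the bounds check is needed.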
Main Characteristics

- Ad-hoc compiler
- Fine grain
- offload parallelization (GPU)
- Single iteration parallelization
- Ad-hoc memory
- Few HPC Applications

Open Issues

- Memory copy (via slow PCIe link)
- Standards
- Tools, debugging
- Integration with other languages
Accelerator/Xeon PHI (MIC)

The Xeon PHI co-processor based on Intel’s Many Integrated Core (MIC) Architecture combines many cores (>50) in a single chip.

Main Characteristics

- Standard Intel compilers and MKL library functions.
- Uses C/C++ or FORTRAN code.
- Wide (512 bit) vectors
- Offload parallelization like GPU but also “native” or symmetric modes.
- Currently very few HPC Applications

```
ifort -mmic -o exe_mic prog.f90
```

Open Issues

For Knights Corner:

- Memory copy via slow PCIe link (just like GPUs).
- Internal (ring) topology is slow.
- Wide vector units need to be exploited, so code modifications are probable.
- Also performs best with many threads.
Putting it all together - Hybrid parallel programming (example)

Python: Ensemble simulations

MPI: Domain partition

OpenMP: External loop partition

CUDA: assign inner loop iterations to GPU threads

Quantum ESPRESSO

http://www.qe-forge.org/
Software Crisis

Real HPC Crisis is with Software
Supercomputer applications and software usually live much longer than the hardware they run on.
- Hardware life is typically four to five years at most.
- Fortran and C are still the main programming languages.

Programming is stuck
- Arguably hasn’t changed so much since the 70’s

Software is a major cost component of modern technologies.
- The tradition in HPC system procurement is to assume that the software is free.

It’s time for a change
- Complexity is rising dramatically
- Challenges for the applications on Petaflop systems
- Improvement of existing codes will become complex and in part impossible.
- The use of O(100K) cores implies a dramatic optimization effort.
- Supporting a hundred threads in one node is a new paradigm that requires new parallelization strategies.
- Implementing new parallel programming methods in existing large applications can be painful.
Hardware and Software advances comparison

**STORAGE**

8 Mb in 1965, 128 Gb in 2015.

**PERFORMANCE**

400 Mflops and 173 Gflops (GPU, 2015) are the two values shown on the 1965 to 2015 timeline.

**SOFTWARE**


**PROGRAM HELLO**

```
C
REAL A(10,10)
DO 50 I=1,10
   PRINT *,'Hello'
50 CONTINUE
CALL DGEMM(N,10,I,J,A)
```

**PROGRAM HELLO**

```
C
REAL A(10,10)
DO 50 I=1,10
   PRINT *,'Hello'
50 CONTINUE
CALL DGEMM(N,10,I,J,A)
```
The problem with parallelism...

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

For \( N = \) no. of procs and \( P = \) parallel fraction, the speedup \( S(N) \) is given by

\[
S(N) = \frac{1}{(1 - P) + \frac{P}{N}}
\]

and the max. speedup is reached in the limit \( N \to \infty \):

\[
\lim_{N \to \infty} S(N) = \frac{1}{1 - P}
\]

i.e. the max. speedup does not depend on N, only on the serial fraction \( 1 - P \). We must minimise \( 1 - P \) (that is, maximise P) if we want to use many processors.
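
The formula is easy to evaluate numerically. The Python sketch below (function names are illustrative) prints the speedup for a few values of the parallel fraction and processor count, together with the limiting value for an unlimited number of processors.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup on n processors for a program with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def max_speedup(p: float) -> float:
    """Limit of the speedup as the number of processors grows without bound."""
    return 1.0 / (1.0 - p)

for p in (0.90, 0.99):
    for n in (16, 1024, 163840):
        print(f"P = {p:.2f}, N = {n:6d}: S = {amdahl_speedup(p, n):7.2f}")
    print(f"P = {p:.2f}, limit:      S = {max_speedup(p):7.2f}")
```

Even with 99% of the run time parallelised, the speedup can never exceed 100, however many cores are used.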
The scaling limit

• Most application codes do not scale up to thousands of cores.
• Sometimes the algorithm can be improved but frequently there is a hard limit dictated by the size of the input.
• For example, in codes where parallelism is based on domain decomposition (e.g. molecular dynamics), the number of atoms may be smaller than the number of cores available.
## Parallel Scaling

The parallel scaling is important because funding bodies insist on a minimum level of parallelism.

<table>
<thead>
<tr>
<th>Computer System</th>
<th>Minimum Parallel Scaling</th>
<th>Max memory/core (GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Curie</td>
<td>Fat Nodes 128, Thin Nodes 512, Hybrid 32</td>
<td>4, 4, 3</td>
</tr>
<tr>
<td>Fermi</td>
<td>2048 (but typically $\geq 4096$)</td>
<td>1</td>
</tr>
<tr>
<td>SuperMUC</td>
<td>512 (typically $\geq 2048$)</td>
<td>*</td>
</tr>
<tr>
<td>Hornet</td>
<td>2048</td>
<td>*</td>
</tr>
<tr>
<td>Mare Nostrum</td>
<td>1024</td>
<td>2</td>
</tr>
</tbody>
</table>

* should use a substantial fraction of available memory

Minimum scaling requirements for PRACE Tier-0 computers for calls in 2013
Other software difficulties

• Legacy applications (which include most scientific applications) were not designed with good software engineering principles. Programs with many global variables, for example, are difficult to parallelise.
• Memory per core is decreasing.
• I/O has a heavy impact on performance, especially on BlueGene, where I/O is handled by dedicated nodes.
• Checkpointing and resilience.
• Fault tolerance over potentially many thousands of threads.
  – In MPI, if one task fails all tasks are brought down.
Memory and accelerator advances – things to look out for

• **Memory**
  - In HPC memory is generally either fast, small cache (SRAM) close to the CPU or larger, slower, main memory (DRAM). But memory technologies and ways of accessing it are evolving.
    - **Non-volatile RAM (NVRAM).** Retains information when power switched off. Includes flash and PCM (Phase Change Memory).
    - **3D Memory.** DRAM chips assembled in “stacks” to provide a denser memory packing (e.g. Intel, GPU).

• **NVIDIA GPU**
  - **NVLINK,** high-speed link (80 GB/s) to replace PCI-E (16 GB/s).
  - **Unified Memory** between CPU and GPU to avoid separate memory allocations.
  - **GPU + IBM Power8** for new hybrid supercomputer (OpenPower).

• **Intel Xeon PHI (Knights Landing)**
  - Successor to **Knights Corner.** More memory and cores, a faster internal network and the possibility to boot as a standalone host.
Energy Efficiency

- Hardware sensors can be integrated into batch systems to report the energy consumption of a batch job.
- Could be used to charge users according to energy consumed instead of resources reserved.

PowerDAM commands

Measures the energy directly in kWh (1 kWh = 3600 kJ). The current implementation is still very experimental.

```
ets --system=Eurora --job=429942.node129
```

EtS is: 0.173056 kWh
Computation: 99 %
Networking: 0 %
Cooling: 0 %
Infrastructure: 0 %
## Energy Efficiency

Energy consumption of GROMACS on Eurora.

<table>
<thead>
<tr>
<th>PBS Job id</th>
<th>nodes</th>
<th>Clock freq (GHz)</th>
<th>#gpus</th>
<th>Walltime (s)</th>
<th>Energy (kWh)</th>
<th>Perf (ns/day)</th>
<th>Perf-Energy (ns/day per kWh)</th>
</tr>
</thead>
<tbody>
<tr>
<td>429942</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>1113</td>
<td>0.17306</td>
<td>10.9</td>
<td>69.54724</td>
</tr>
<tr>
<td>430337</td>
<td>2</td>
<td>2</td>
<td>0</td>
<td>648</td>
<td>0.29583</td>
<td>18.6</td>
<td>62.87395</td>
</tr>
<tr>
<td>430370</td>
<td>1</td>
<td>3</td>
<td>0</td>
<td>711</td>
<td>0.50593</td>
<td>17.00</td>
<td>33.60182</td>
</tr>
<tr>
<td>431090</td>
<td>1</td>
<td>3</td>
<td>2</td>
<td>389</td>
<td>0.42944</td>
<td>31.10</td>
<td>72.42023</td>
</tr>
</tbody>
</table>

Exercises:
- compare clock freq 2 GHz with 3 GHz
- compare clock freq 3 GHz with and without GPU
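
As a starting point for the exercises, the Python sketch below derives a few quantities from the table: the energy in kJ, the mean power drawn during the job, and the ratio of the Perf and Energy columns. That ratio approximately reproduces the last column of the table for three of the four jobs. Function names are illustrative; the numbers are copied from the table.

```python
KJ_PER_KWH = 3600.0

# (PBS job id, walltime in s, energy in kWh, performance in ns/day), from the table above.
jobs = [
    (429942, 1113, 0.17306, 10.9),
    (430337, 648, 0.29583, 18.6),
    (430370, 711, 0.50593, 17.0),
    (431090, 389, 0.42944, 31.1),
]

def energy_kj(kwh: float) -> float:
    """Convert an energy from kWh to kJ."""
    return kwh * KJ_PER_KWH

def mean_power_kw(kwh: float, walltime_s: float) -> float:
    """Average power over the job: kJ divided by seconds gives kW."""
    return energy_kj(kwh) / walltime_s

def perf_per_kwh(perf_ns_day: float, kwh: float) -> float:
    """Ratio of the Perf column to the Energy column."""
    return perf_ns_day / kwh

for job_id, walltime, kwh, perf in jobs:
    print(f"{job_id}: {energy_kj(kwh):8.1f} kJ, "
          f"{mean_power_kw(kwh, walltime):5.2f} kW mean, "
          f"perf/energy = {perf_per_kwh(perf, kwh):6.2f}")
```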
Wrap-up

• HPC is only possible via parallelism and this must increase to maintain performance gains.

• Parallelism can be achieved at many levels, but because code scalability on traditional cores is limited, accelerators (e.g. GPUs, MICs) are playing an increasing role. The Top500 is now becoming dominated by hybrid systems.

• Hardware trends are forcing code re-writes with OpenMP, OpenCL, CUDA, OpenACC, etc. in order to exploit large numbers of threads.

• Unfortunately, for many applications the parallelism is determined by problem size and not application code.

• Energy efficiency (Flops/Watt) is a crucial issue. Some batch schedulers already report energy consumed and in the near future your job priority may depend on predicted energy consumption.