Introduction to Intel Xeon Phi programming models

F. Affinito
F. Salvadore
SCAI - CINECA
Part I
Introduction to the Intel Xeon Phi architecture
Trends: transistors
Trends: clock rates
Trends: cores and threads

Core and Thread Counts

Threads
Cores

Trends: summarizing...

- The number of transistors increases
- The power consumption must not increase
- The density cannot increase on a single chip

**Solution:**

- Increase the number of cores
GP-GPU and Intel Xeon Phi..

- Coupled to the CPU
- To accelerate highly parallel kernels, facing with the Amdahl Law
What is Intel Xeon Phi?

- 7100 / 5100 / 3100 Series available
- 5110P:
  - Intel Xeon Phi clock: 1053 MHz
  - 60 cores in-order
  - ~ 1 TFlops/s DP peak performance (2 Tflops SP)
  - 4 hardware threads per core
  - 8 GB DDR5 memory
  - 512-bit SIMD vectors (32 registers)
  - Fully-coherent L1 and L2 caches
  - PCIe bus (rev. 2.0)
  - Max Memory bandwidth (theoretical) 320 GB/s
  - Max TDP: 225 W
MIC vs GPU naïve comparison

The comparison is naïve

<table>
<thead>
<tr>
<th>System</th>
<th>K20s</th>
<th>5110P</th>
</tr>
</thead>
<tbody>
<tr>
<td># cores</td>
<td>2496</td>
<td>60 (*4)</td>
</tr>
<tr>
<td>Memory size</td>
<td>5 GB</td>
<td>8 GB</td>
</tr>
<tr>
<td>Peak performance (SP)</td>
<td>3.52 TFlops</td>
<td>2 TFlops</td>
</tr>
<tr>
<td>Peak performance (DP)</td>
<td>1.17 TFlops</td>
<td>1 TFlops</td>
</tr>
<tr>
<td>Clock rate</td>
<td>0.706 GHz</td>
<td>1.053 GHz</td>
</tr>
<tr>
<td>Memory bandwidth</td>
<td>208 GB/s (ECC off)</td>
<td>320 GB/s</td>
</tr>
</tbody>
</table>
Terminology

- **MIC** = Many Integrated Cores is the name of the architecture
- **Xeon Phi** = Commercial name of the Intel product based on the MIC architecture
- **Knight's corner, Knight's landing, Knight's ferry** are development names of MIC architectures
- We will often refer to the CPU as **HOST** and Xeon Phi as **DEVICE**
Is it an accelerator?

- **YES**: It can be used to “accelerate” hot-spots of the code that are highly parallel and computationally extensive.
- In this sense, it works alongside the CPU.
- It can be used as an accelerator using the “offload” programming model.
- An important bottleneck is represented by the communication between host and device (through PCIe).
- Under this respect, it is very similar to a GPU.
Is it an accelerator? / 2

- NOT ONLY: the Intel Xeon Phi can behave as a many-core X86 node.
  - Code can be compiled and run “natively” on the Xeon Phi platform using MPI + OpenMP
- The bottleneck is the scalability of the code
  - Amdahl Law
- Under this respect, the Xeon Phi is completely different from a GPU
  - This is way we often call the Xeon Phi “co-processor” rather than “accelerator”
Many-core performances
Architecture key points/1

► Instruction Pipelining
  ● Two independent pipelines arbitrarily known as the U and V pipelines
  ● (only) 5 stages to cope with a reduced clock rate, e.g. compared to the Pentium 20 stages
  ● In-order instruction execution

► Manycore architecture
  ● Homogeneous
  ● 4 hardware threads per core
Architecture key points/2

- Interconnect: bidirectional ring topology
  - All the cores talk to one another through a bidirectional interconnect
  - The cores also access the data and code residing in the main memory through the ring connecting the cores to memory controller
- Given eight memory controllers with two GDDR5 channels running at 5.5 GT/s
  - Aggregate Memory Bandwidth = 8 memory controllers \times 2 channels \times 5.5 \text{ GT/s} \times 4 \text{ bytes/transfer} = 352 \text{ GB/s}
- System interconnect
  - Xeon Phi are often placed on PCIe slots to work with the host processors
Architecture key points/3

- Cache:
  - L1: 8-ways set-associative 32-kB instruction and 32-kB data
  - L1 access time: 3 cycles
  - L2: 8-way set associative and 512 kB in size (unified)

- Interconnect: bidirectional ring topology

- TLB cache:
  - L1 data TLB supports three page sizes: 4 kB, 64 kB, and 2 MB
  - L2 TLB
  - If one misses L1 and also misses L2 TLB, one has to walk four levels of page table, which is pretty expensive
The VPU (vector processing unit) implements a novel instruction set architecture (ISA), with 218 new instructions compared with those implemented in the Xeon family of SIMD instruction sets.

The VPU is fully pipelined and can execute most instructions with four-cycle latency and single-cycle throughput.

Each vector can contain 16 single-precision floats or 32-bit integer elements or eight 64-bit integer or double-precision floating point elements.
Each VPU instruction passes through one or more of the following five pipelines to completion:

- Double-precision (DP) pipeline: Used to execute float64 arithmetic, conversion from float64 to float32, and DP-compare instructions.
- Single-precision (SP) pipeline: Executes most of the instructions including 64-bit integer loads. This includes float32/int32 arithmetic and logical operations, shuffle/broadcast, loads including loadunpack, type conversions from float32/int32 pipelines, extended math unit (EMU) transcendental instructions, int64 loads, int64/float64 logical, and other instructions.
- Mask pipeline: Executes mask instructions with one-cycle latencies.
- Store pipeline: Executes the vector store operations.
- Scatter/gather pipeline: Executes the vector register read/writes from sparse memory locations.

Mixing SP and DP computations is expensive!
Architecture sketch/1
Architecture sketch/2
Part II
Programming models
Same target, different pathways

Though the problem is very similar (i.e. heterogeneous device connected to the CPU through a low bandwidth channel), there are many programming approaches:

- Based on a low-level language addressing the memory space of the device
  - Proprietary (CUDA)
  - Open (OpenCL)

- Based on directives ("pragma") to the compiler
  - OpenMP 4.0
  - OpenACC
  - Intel Language Extension for Offload (LEO)
Comparing strategies

- Directive based approach are relatively simpler because they permit to manage the data transfer in an explicit way without affecting too much the original code.
- Directive based approaches preserve the portability of the code.

- CUDA and OpenCL are low level languages that permits to obtain better performances but:
  - CUDA is a proprietary language and it is only suitable for GPUs.
  - OpenCL is open and suitable for both GPUs and MICs but it does not permit to reach the same level of performances for both of them.
  - Both CUDA and OpenCL require to make important changes to the code.
  - Using CUDA in many cases forces to create a separate developing branch of the code.

- Intel LEO can be used only on the Intel Xeon Phi platform.
  - Intel LEO approach is much similar to the directives implemented in OpenMP4.0.
  - Implementing LEO forces to better work on vectorization and threadization: this effort is fruitful also when working on CPU only.
Intel Xeon Phi programming models

Intel Xeon Phi has a twofold nature:
- accelerator
- coprocessor

According to this one has the chance to work with different programming models.

The choice of the programming model can be made according to the requirements of the application.
Intel Xeon Phi programming models

**HOST**
- Host-only
  - main()
  - foo()
  - MPI_*()

- Offload with LEO
  - main()
  - foo()
  - MPI_*()

- Symmetric with MPI
  - main()
  - foo()
  - MPI_*()

- Coprocessor only ("native")

**DEVICE**
- OFFLOAD
- MPI

**Diagrams**
- At a glance...
- Intel Xeon Phi programming models
- Models include Host-only, Offload with LEO, Symmetric with MPI, and Coprocessor only ("native").
- The Diagrams show the interactions between Host and Device with the use of MPI and Offload.
- The "-mmic flag" is mentioned.
Intel Xeon Phi programming models

**PRO:**
- it is a cross-compiling mode
- very simple
  - just add -mmic, login and execute
- use well known OpenMP and MPI (or pthreads or OpenCL)

**CONS:**
- very slow I/O
- poor single thread performance
- only suitable for highly parallel codes (cfr Amdahl)
- CPU unused
Intel provides a set of directives to the compiler named “LEO”: Language Extension for Offload.

These directives manage the transfer and execution of portions of code to the device.

C/C++
#pragma offload target (mic:device_id)

Fortran
!dir$ offload target (mic:device_id)
Variable and function definitions

C/C++
__attribute__((target(mic)))

Fortran
!dir$ attributes offload:mic :: <function/var name>

It compiles (allocates) variables on both the host and device

For entire files or large blocks of code (C/C++ only)
#pragma offload_attribute (push, target(mic))
#pragma offload_attribute (pop)
Since host and device don't have physical or virtual shared memory, variable must be copied in an explicit or in an implicit way.

Implicit copy is assumed for
- scalar variables
- static arrays

Explicit copy must be managed by the programmer using clauses defined in the LEO
Intel Xeon Phi programming models

Programmer clauses for explicit copy:
in, out, inout, nocopy

Data transfer with offload region:

C/C++      #pragma offload target(mic) in(data:length(size))
Fortran    !dir$ offload target (mic) in(data:length(size))

Data transfer without offload region:

C/C++      #pragma offload_transfer target(mic)in(data:length(size))
Fortran    !dir$ offload_transfer target(mic) in(data:length(size))
Intel Xeon Phi programming models: examples

C/C++

#pragma offload target (mic) out(a:length(n)) \ in(b:length(n))
for (i=0; i<n; i++){
    a[i] = b[i]+c*d
}

Fortran

!dir$ offload begin target(mic) out(a) in(b)
do i=1,n
    a(i)=b(i)+c*d
end do
!dir$ end offload
Intel Xeon Phi programming models: examples

C/C++

__attribute__((target(mic)))
void foo()
{
    printf("Hello MIC\n");
}

int main()
{
    #pragma offload target(mic)
    foo();
    return 0;
}

Fortran

!dir$ attributes &
!dir$ offload:mic :: hello
subroutine hello
    write(*,*")Hello MIC"
end subroutine

program main
!dir$ attributes &
!dir$ offload:mic :: hello
!dir$ offload begin target (mic)
    call hello()
!dir$ end offload
end program
Memory allocation

- CPU is managed as usual
- on coprocessor is defined by in, out and inout clauses

Input/Output pointers

- by default on coprocessor “new” allocation is performed for each pointer
- by default de-allocation is performed after offload region
- defaults can be modified with alloc_if and free_if qualifiers
Using memory qualifiers

free_if(0)
free_if(.false.) retain target memory

alloc_if(0)
alloc_if(.false.) reuse data in subsequent offload

alloc_if(1)
alloc_if(.true.) allocate new memory

free_if(1)
free_if(.true.) deallocate memory
Intel Xeon Phi programming models

```c
#define ALLOC alloc_if(1)
#define FREE free_if(1)
#define RETAIN free_if(0)
#define REUSE alloc_if(0)

allocate the memory but don't de-allocate
#pragma offload target(mic:0) in(a:length(8)) ALLOC RETAIN)
...

don't allocate or deallocate the memory
#pragma offload target(mic:0) in(a:length(8) REUSE RETAIN)

don't allocate the memory but de-allocate
#pragma offload target(mic:0) in(a:length(8) REUSE FREE)
```

note: specify the device_id when using more than one device
target(mic:0)
target(mic:1) ...

Offload (9)
Intel Xeon Phi programming models
Partial offload of arrays

int *p;
#pragma offload ... in (p[10:100] : alloc(p(5:1000))
{...}

It allocates 1000 elements on coprocessor; first usable element has index 5, last has index 1004; only 100 elements are transferred, starting from index 10.
Intel Xeon Phi programming models

Copy from a variable to another one

It permits to copy data from the host to a different array allocated on the device

integer :: p(1000), p1(2000)
integer :: rank1(1000), rank2(10,100)

!dir$ offload ... (p(1:500) : into (p1(501:1000)))
Using OpenMP in an offload region:

C/C++
#pragma offload target (mic)
#pragma omp parallel for
for (i=0; i<n; i++){
    a[i]=b[i]*c+d;
}

Fortran
!dir$ omp offload target (mic)
!$omp parallel do
do i=1,n
   A(i)=B(i)*C+D
end do
!$omp end parallel

optional, if defined, it must be immediately followed by an OpenMP directive

Setting up the environment:
OMP_NUM_THREADS = 16
MIC_ENV_PREFIX = MIC
MIC_OMP_NUM_THREADS = 120
Tuning up OpenMP

the coprocessor has 4 hardware threads per core; at least 2 should be used; for many-cores systems (and hence for the Xeon Phi) binding threads to the cores and choosing an affinity are crucial factor that affect performance

MIC_ENV_PREFIX = MIC
MIC_KMP_AFFINITY =

- compact
- scattered
- balanced
Intel Xeon Phi programming models

LEO compiler options

- `opt-report-phase:offload` activate reporting
- `no-offload` disable offload
- `offload-attribute-target=mic` build ALL functions for both host and device
Intel Xeon Phi programming models

Static libraries

xiar can be used to create libraries containing offloaded code

specify -qoffload-build that forces xiar to create both a library for the host (xxx.a) and a library for the device (xxxMIC.a)

use the same options that you would use for ar

use normally the linker options (-L.. -lxxx.a) and the compiler will automatically include the coprocessor library
Managing multiple devices

Including offload.h

#include <offload.h>

you can use a few API:

_offload_number_of_device()
_offload_get_device_number()

or use runtime environment variables:

OFFLOAD_DEVICES=0,1

always remember to specify the target device target(mic:1)
Intel Xeon Phi programming models

MIC specific MACROs

```c
#include <offload.h>

#ifdef __INTEL_OFFLOAD

int main()
{
    #pragma offload target(mic)
    {
        #ifdef __MIC__
            printf("Hello MIC number %d\n", _Offload_get_device_number());
        #else
            printf("Hello HOST\n");
        
    }
}
#endif

#ifdef __INTEL_OFFLOAD

    printf("%d MICS available\n", _Offload_number_of_devices());
#endif
```

Offload (18)

Intel Xeon Phi programming models
Intel Xeon Phi programming models

I/O from offloaded regions

Buffered printf are available (use only to debugging purposes!)
Always use fflush(0) on the coprocessor

Files I/O is possible only through a proxy filesystem
Asynchronous computation

By default, offload forces the host to wait for completion

Asynchronous offload starts the offload and continues on the next statement just after the offload region
Use the signal clause to synchronize with a offload_wait statement
Intel Xeon Phi programming models

Example

char signal_var;

do {
    #pragma offload target(mic:0) signal(&signal_var)
    {
        long_running_mic_compute();
    }
    concurrent_cpu_computation();
    #pragma offload_wait target(mic:0) wait(&signal_var)
} while(1);
Intel Xeon Phi programming models

Reporting

Use OFFLOAD_REPORT or the variable _Offload_report with a verbosity from 1 to 3. OFFLOAD_REPORT=1 only provides timing

Conditional offload

Only offload if it is worth

```
#pragma offload target (mic) in (b:length(size)) \ 
   out (a:length(size) \ 
    if(size>100)
```
Intel Xeon Phi programming models: native mode

```bash
icc -mmic mycode.c -o mycode.x
scp mycode.x mic0:.
ssh mic0
export OMP_NUM_THREADS=240
./mycode.x
```

Simple... but not always successful
Intel Xeon Phi programming models: symmetric mode

Using MPI you can make work together the executable running on the host and the one running on the device (compiled with -mmic)

Load balancing can be an issue

Tuning of MPI and OpenMP on both host and device is crucial

Dependent on the cluster implementation (physical network, MPI implementation, job scheduler..)
Example

# compile the program for the coprocessor (-mmic)
mpiicc -mmic -o test.MIC test.c

# compile the program for the host
mpiicc -mmic -o test test.c

# copy the executable to the coprocessor
scp test.MIC mic0:/tmp/test.MIC

# set the I_MPI_MIC variable
export I_MPI_MIC=1

# launch MPI jobs on the host knf1 and on the coprocessor mic0
mpirun -host knf1 -n 1 ./test -n 1 -host mic0 /tmp/test.MIC
Vectorization

Lots of cores but also... large registers!

- **SSE**: 128 bit
  - 2 x DP or 4 x SP

- **AVX**: 256 bit
  - 4 x DP or 8 x SP

- **MIC**: 512 bit
  - 8 x DP or 16 x SP
Vectorization

SIMD arithmetic

\[
\begin{array}{cccccccccc}
\end{array}
\]
Vectorization

SIMD Fused Multiply Add

\[
\begin{align*}
  \mathbf{a}[0] & \quad \mathbf{a}[1] & \quad \mathbf{a}[2] & \quad \mathbf{a}[3] & \quad \mathbf{a}[4] & \quad \mathbf{a}[5] & \quad \mathbf{a}[6] & \quad \mathbf{a}[7] \\
  \mathbf{b}[0] & \quad \mathbf{b}[1] & \quad \mathbf{b}[2] & \quad \mathbf{b}[3] & \quad \mathbf{b}[4] & \quad \mathbf{b}[5] & \quad \mathbf{b}[6] & \quad \mathbf{b}[7] \\
  \mathbf{c}[0] & \quad \mathbf{c}[1] & \quad \mathbf{c}[2] & \quad \mathbf{c}[3] & \quad \mathbf{c}[4] & \quad \mathbf{c}[5] & \quad \mathbf{c}[6] & \quad \mathbf{c}[7] \\
  \mathbf{d}[0] & \quad \mathbf{d}[1] & \quad \mathbf{d}[2] & \quad \mathbf{d}[3] & \quad \mathbf{d}[4] & \quad \mathbf{d}[5] & \quad \mathbf{d}[6] & \quad \mathbf{d}[7]
\end{align*}
\]
Intel released a version for Xeon Phi of the MKL mathematical libraries.

MKL Libraries

MKL have three different usage models:

- Automatic offload (AO)
- Compiler assisted offload (CAO)
- Native execution
Offload is automatic and transparent.

The library decides **when** to offload and **how much** to offload (workdivision).

Users can control parameters through environment variables or API.

You can enable automatic offload with
MKL_MIC_ENABLE=1
or
mkl_mic_enable()
MKL Libraries

Not all the MKL functions are enabled to AO.

In MKL 11.0.1:

*Level 3 BLAS*: xGEMM, xTRSM, xTRMM
*LAPACK* xGETRF, xPOTRF, xGEQRF

Always check the documentation for updates
MKL Libraries

MKL functions can be offloaded as other “ordinary” functions using the LEO pragmas. All MKL functions can take advantage of the CAO, which is a more flexible option in terms of data management (you can use data persistence or mechanisms to hide the latency...).
MKL Libraries: CAO

C/C++

```c
#pragma offload target (mic) \
in (transa, transb, N, alpha, beta) \
in (A:length(matrix_elements)) in (B:length(matrix_elements)) \
inout (C:length(matrix_elements))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
```

Fortran

```fortran
!dir$ attributes offload : mic : sgemm
!dir$ offload target(mic) &
!dir$ in (transa, transb, m, n, k, alpha, beta, lda, ldb, ldc), &
!dir$ in (a:length(ncola*lda)), in (b:length(ncolb*ldb)) &
!dir$ inout (c:length(n*ldc))
CALL sgemm (transa, transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
```
MKL Libraries

MKL libraries are also available when using the native mode.

Tips:

Use all the 240 threads: MIC_OMP_NUM_THREADS=240

Set the thread affinity: MIC_KMP_AFFINITY = ...
Part III
Optimization hints
Performance and parallelism

- In principle the main advantage of using Intel MIC technology with respect to other coprocessors is the simplicity of the porting
  - Programmers may compile their source codes based on common HPC languages (Fortran/ C / C++) specifying MIC as the target architecture (*native mode*)
- Is it enough to achieve good performances? By the way, why offload?
- Usually not, parallel programming is not easy
  - A general need is to *expose parallelism*
GPU vs MIC

- GPU paradigms (e.g. CUDA):
  - Despite the sometimes significant effort required to port the codes...
  - ...are designed to force the programmer to expose (or even create if needed) parallelism

- Programming Intel MIC
  - The optimization techniques are not far from those devised for the common CPUs
  - As in that case, achieving optimal performance is far from being straightforward

- What about device maturity?
Let us recall 3 basic features of current Intel Xeon Phi:

- Peak performance originates from “many slow but vectorizable cores”

$$ \text{clock frequency} \times \text{n. cores} \times \text{n. lanes} \times 2 \quad \text{FMA Flops/cycle} $$

$$ 1.091 \ \text{GHz} \times 61 \ \text{cores} \times 16 \ \text{lanes} \times 2 = 2129.6 \ \text{Gflops/cycle} $$

$$ 1.091 \ \text{GHz} \times 61 \ \text{cores} \times 8 \ \text{lanes} \times 2 = 1064.8 \ \text{Gflops/cycle} $$

- Bandwidth is (of course) limited, caches and alignment matter

- The card is not a replacement for the host processor. It is a coprocessor providing optimal power efficiency
Optimization key points

In general terms, an application must fulfill three requirements to efficiently run on a MIC

1. Highly vectorizable, the cores must be able to exploit the vector units. The penalty when the code cannot be vectorized is very high.
2. High scalability, to exploit all MIC multi-threaded cores: scalability up to 240 processors (processes/threads) running on a single MIC, and even higher running on multiple MIC.
3. Ability of hiding I/O communications with the host processors and, in general, with other hosts or coprocessors.
Auto-vectorization

- In recent Intel compilers, vectorization is enabled by default
  - May be turned off by explicit options
  - The compiler must be able to detect the possibility to do that

- The essential requirement is the possibility to unroll the loop having the different iterations performed simultaneously

- Some critical conditions
  - If the loop is part of a loop nest, it must be the inner loop unless it is completely unrolled or interchange occurs (use -O3)
  - Straight-line code: no jumps or branches but masked assignment allowed
  - Countable loop: number of iterations must be known when starting (even if not at compile time)
  - No loop dependencies: iterations must be performed in parallel
Writing “clean” code is a good starting point to have the code vectorized

- Prefer array indexing instead of explicit pointer arithmetic
- Use `restrict` keyword to tell the compiler that there is no array aliasing

Excerpt from a real code the compiler manages to vectorize:

```c
REAL * __restrict__ anspx=an+spxoff;
REAL * __restrict__ ansmx=an+smxoff;
...
for(ix=istart; ix<iend; ix++) {
    as = anspx[ix]*JpxWO[ix] + anspy[ix]*JpyWO[ix] +
        anspz[ix]*JpzWO[ix] + ansmx[ix]*JmxWO[ix] +
        ansmy[ix]*JmyWO[ix] + ansmz[ix]*JmzWO[ix] +
        ...
    }
```
Using array notation is a good way to guarantee the compiler that the iterations are independent.

- In Fortran this is consistent with the language array syntax:
  \[ a(1:N) = b(1:N) + c(1:N) \]

- In C the array notation is provided by Intel Cilk Plus:
  \[ a[1:N] = b[1:N] + c[1:N] \]

Beware:

- The first value represents the lower bound for both languages.
- But the second value is the upper bound in Fortran whereas it is the length in C.
- An optional third value is the stride both in Fortran and in C.
- Multidimensional arrays supported, too.
Vectorization: directives

Another opportunity is forcing vectorization by means of directives

- The programmer guarantees the possibility to vectorize
- Until a few years ago, only compiler dependent directives available

#pragma ivdep
- Instructs the compiler to ignore assumed vector dependencies (proven dependencies area not ignored)

#pragma vector always
- Instructs the compiler to override any efficiency heuristic during the decision to vectorize or not
Vectorization: OpenMP 4.0 *simd*

Intel took leadership in defining OpenMP 4.0 SIMD extensions
- Several tuning options available

**Applied to a loop:**
- `#pragma omp simd`

<table>
<thead>
<tr>
<th>Thread Level Parallelism</th>
<th>SIMD parallelism</th>
</tr>
</thead>
<tbody>
<tr>
<td>Auto Parallel</td>
<td>Auto vectorization</td>
</tr>
<tr>
<td>OpenMP threading</td>
<td>OpenMP 4.0 simd</td>
</tr>
<tr>
<td>Posix threads</td>
<td>Vectorization intrinsics</td>
</tr>
</tbody>
</table>

**Applied to a function to enable the creation of a version that can process arguments using SIMD instructions from a single invocation from a SIMD loop:**
- `#pragma omp declared simd`
Vectorization: Intel Xeon Phi intrinsics

- IMCI intrinsics
  - The coding become hard
  - And the code is no more portable to common CPUs

```c
for(i=0; i<N; i++)
```

```c
for(i=0; i<N; i+=16) {
    __mm512 Avec = mm512load_ps(A+i);
    __mm512 Bvec = mm512load_ps(B+i);
    Avec = mm512add_ps(Avec, Bvec);
    _mm512_store_ps(A+i, Avec);
}
```

- The arrays float A[N] and float B[N] are aligned on a 64-byte boundary
- Variables Avec and Bvec are 512=16 x sizeof(float) bits
Exploiting cores

- MPI and OpenMP are the most common choices
  - Up to 60 MPI processes are reasonable for a single MIC
  - And 1 MPI process per MIC may be an interesting choice
  - The optimal choice between MPI and OpenMP depends on the application

- MPI Programming models, basically three configurations
  - Co-processor only (native mode)
  - MPI+Offload
  - Symmetric
MPI communications are heterogeneous. Performances strongly vary!
From some tests on the Eurora cluster at Cineca

<table>
<thead>
<tr>
<th></th>
<th>PingPong</th>
<th>SendRecv</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU-CPU same node</td>
<td>5-11</td>
<td>5-22</td>
</tr>
<tr>
<td>CPU-CPU diff node</td>
<td>2.9</td>
<td>5</td>
</tr>
<tr>
<td>MIC-MIC same node</td>
<td>0.9</td>
<td>1.8</td>
</tr>
<tr>
<td>MIC-MIC diff node</td>
<td>0.9</td>
<td>1.6</td>
</tr>
<tr>
<td>CPU-MIC same node</td>
<td>5.9</td>
<td>11</td>
</tr>
<tr>
<td>CPU-MIC diff node</td>
<td>1.45</td>
<td>1.65</td>
</tr>
</tbody>
</table>
Symmetric mode: load balancing

- When running in symmetric mode, load balancing is a critical issue
  - Usual MPI decompositions assume homogeneous computing units

- Mixing MPI and OpenMP may help
  - Assign a different number of MPI processes to host and coprocessor
  - Exploit the full machine potential by means of OpenMP threads
  - E.g.
    - Host: 4 MPI ranks + 4 OpenMP threads
    - MIC: 8 MPI ranks + 30 OpenMP threads
Threading models

- Several threading models available
  - OpenMP
  - Fortran (2008) DO concurrent
  - Intel Cilk Plus
  - Intel Threading Building Block
  - Intel Math Kernel Library

- OpenMP has clear advantages wrt portability
- In offload mode, it is possible (required) to tune both the host and coprocessor parameters (e.g. number of threads)
Thread Affinity

- Placement of threads on MIC cores and hardware threads
  - The basic configuration is controlled by the variable `KMP_AFFINITY`
  - Additional advanced settings are possible too
  - Scatter
  - Balanced
  - Compact
Affinity and performances

- The impact of affinity on performance may be very significant
- From a realworld example (3d-stencil code)
Collapse loops

As recalled, the number of threads for each MPI process may become large (up to 240).

From different tests, it turns out that collapsing OpenMP loops results in improved performances.

From a realworld example (3d reacting Navier-Stokes equations)

<table>
<thead>
<tr>
<th>MIC OMP threads</th>
<th>no-collapse</th>
<th>collapse</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>108.7</td>
<td>109.26</td>
</tr>
<tr>
<td>16</td>
<td>7.67</td>
<td>7.52</td>
</tr>
<tr>
<td>30</td>
<td>5.24</td>
<td>4.51</td>
</tr>
<tr>
<td>60</td>
<td>3.08</td>
<td>2.51</td>
</tr>
<tr>
<td>120</td>
<td>2.60</td>
<td>1.87</td>
</tr>
<tr>
<td>180</td>
<td>1.89</td>
<td>1.77</td>
</tr>
<tr>
<td>240</td>
<td>2.20</td>
<td>1.67</td>
</tr>
</tbody>
</table>
Tiling

“Dividing a loop into a set of parallel tasks of a suitable granularity. In general, tiling consists of applying multiple steps on a small part of a problem instead of running each step on the whole problem one after the other. The purpose of tiling is to increase reuse of data in caches”

```c
#pragma omp for collapse(2)
for (int z = 0; z < nz; z++) {
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {

#define YBF 16
#pragma omp for collapse(2)
for (int yy = 0; yy < ny; yy += YBF) {
    for (int z = 0; z < nz; z++) {
        int ymax = yy + YBF;
        if (ymax >= ny) ymax = ny;
        for (int y = yy; y < ymax; y++) {
```
TLB cache thrashing

- Depending on the memory patterns, possible TLB cache thrashing must be considered with care
  - Padding between allocated arrays may be a good solution
  - The problem may be difficult to analyze for non-HPC experts

- From a spin glass simulation code, the spin updating time has been measured against the padding pages between arrays

<table>
<thead>
<tr>
<th>Padding pages</th>
<th>Time per spin</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1.458</td>
</tr>
<tr>
<td>1</td>
<td>0.737</td>
</tr>
<tr>
<td>4</td>
<td>0.764</td>
</tr>
<tr>
<td>8</td>
<td>1.222</td>
</tr>
<tr>
<td>16</td>
<td>1.537</td>
</tr>
<tr>
<td>32</td>
<td>1.543</td>
</tr>
</tbody>
</table>
When getting unexpected performance results or whenever there is the need to have a deep understanding of the measured times, using Intel Vtune profiler is a good idea.

From the previous TLB thrashing example:

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Non-padded</th>
<th>Padded</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU Time</td>
<td>33459.268</td>
<td>31783.926</td>
</tr>
<tr>
<td>Clockticks</td>
<td>351991500000000.000</td>
<td>334366900000000.000</td>
</tr>
<tr>
<td>CPU_CLK_UNHALTED</td>
<td>351991500000000.000</td>
<td>334366900000000.000</td>
</tr>
<tr>
<td>Instructions Retired</td>
<td>724220000000</td>
<td>417970000000</td>
</tr>
<tr>
<td>CPI Rate</td>
<td>48.503</td>
<td>79.998</td>
</tr>
<tr>
<td>Cache Usage</td>
<td></td>
<td></td>
</tr>
<tr>
<td>L1 Misses</td>
<td>13680450000</td>
<td>19016200000</td>
</tr>
<tr>
<td>L1 Hit Ratio</td>
<td>0.954</td>
<td>0.900</td>
</tr>
<tr>
<td>Estimated Latency Impact</td>
<td>2516.588</td>
<td>1732.644</td>
</tr>
<tr>
<td>Vectorization Usage</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Vectorization Intensity</td>
<td>10.913</td>
<td>10.248</td>
</tr>
<tr>
<td>L1 Compute to Data Access Ratio</td>
<td>3.138</td>
<td>4.783</td>
</tr>
<tr>
<td>L2 Compute to Data Access Ratio</td>
<td>68.417</td>
<td>47.795</td>
</tr>
<tr>
<td>TLB Usage</td>
<td></td>
<td></td>
</tr>
<tr>
<td>L1 TLB Miss Ratio</td>
<td>0.633</td>
<td>0.064</td>
</tr>
<tr>
<td>L2 TLB Miss Ratio</td>
<td>0.026</td>
<td>0.000</td>
</tr>
<tr>
<td>L1 TLB Misses per L2 TLB Miss</td>
<td>1.252</td>
<td>1052.421</td>
</tr>
<tr>
<td>Hardware Event Count</td>
<td></td>
<td></td>
</tr>
<tr>
<td>L2_DATA_READ_MISS_CACHE_FILL</td>
<td>4648000000</td>
<td>5481000000</td>
</tr>
<tr>
<td>L2_DATA_WRITE_MISS_CACHE_FILL</td>
<td>3619000000</td>
<td>357000000</td>
</tr>
<tr>
<td>L2_DATA_READ_MISS_MEM_FILL</td>
<td>7220500000</td>
<td>7918400000</td>
</tr>
<tr>
<td>L2_DATA_WRITE_MISS_MEM_FILL</td>
<td>1764000000</td>
<td>180600000</td>
</tr>
</tbody>
</table>
DA is a method to force the compiler to create data objects in memory on specific byte boundaries. This is done to increase efficiency of data loads and stores to and from the processor.

For MIC memory movement is optimal when the data starting address lies on 64 byte boundaries.

Two steps are needed:

1. Align the data

```c
float A[1000] __attribute__((aligned(64)));
buf = (char*) __mm_malloc(bufsizes[i], 64);
real, allocatable :: a(:)
!dir$ attributes align:64 :: a
```
(2) Use pragma/directives and clauses to tell the compiler that the accesses are aligned

For an i-loop that has a memory access of the form a[i+n1], the loop has to be structured in such a way that the starting-indices have good alignment properties.

```c
__assume_aligned(a, 64);
__assume(n1%16==0);
__assume(n2%16==0);
for(i=0;i<n;i++) {
    // Compiler vectorizes loop with all aligned accesses
    X[i] += a[i] + a[i+n1] + a[i-n1]+ a[i+n2] + a[i-n2];
}
```
Streaming store and prefetch

- Starting with Composer XE 2013 Update 1 compiler, streaming stores instructions are generated under certain conditions.
  - Instructions intended to speed up the performance in the case of vector-aligned unmasked stores in streaming kernels where we want to avoid wasting memory bandwidth by being forced to read the original content of an entire cache line from memory when we overwrite their whole content completely.
- Heuristics may be not sufficient: user can provide hints to the compiler, e.g.
  - `#pragma vector nontemporal A`
  - where `A[i]=...` is the store inside the loop.
Native vs Offload

Why offload mode?

Cons
- The porting is much more complex than to native mode
- And the programmer must take care of host-coprocessors data exchanges which may be disastrous wrt performances
- The symmetric mode allows to use both host and MIC at the same time

Pros
- It is also reasonable to assume that, the host being in charge of MPI calls (as it happens in offload mode), the MIC is free to execute, at its best, the computing intensive part of the code without wasting time in managing the communications
Consider a finite difference time domain code parallelized by standard domain decomposition. At each step:

- (a) update boundary and bulk values
- (b) exchange ghosts with neighboring processes

MPI optimizations: FDTD
MPI optimizations: FDTD/2

- MPI patterns allow to overlap computations with communications (hiding the communication cost)
- Standard CPU pattern using MPI non blocking functions (available for MIC native mode as well)
  - Update boundary
  - Exchange ghost – MPI non blocking
  - Update bulk
  - Wait exchanges – MPI wait
- To achieve full overlapping, the bulk updating time must be larger than the communication time
- Using MIC (native), sometimes the final performances are far from optimal
MIC-Offload pattern (similar to multi-GPU approach)
- Update boundary
- Update bulk – asynchronous (non blocking)
- Exchange ghost – MPI blocking
- Wait bulk update

```c
#pragma offload target(mic:0) ... async(a)
{
<code to be offloaded>
}
```

CPU operations (e.g. MPI calls)
```c
#pragma offload_wait(a)
```
## MPI optimizations: HSG

- Scaling results from Heisenberg Spin Glass code
- Strong scaling for native/offload and sync/async versions

### Scaling Results

<table>
<thead>
<tr>
<th>#MICS</th>
<th>Native-Sync</th>
<th>Native-Async</th>
<th>Offload-Sync</th>
<th>Offload-Async</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0.709</td>
<td>0.717</td>
<td>1.049</td>
<td>1.078</td>
</tr>
<tr>
<td>2</td>
<td>0.484</td>
<td>0.431</td>
<td>0.558</td>
<td>0.527</td>
</tr>
<tr>
<td>4</td>
<td>0.445</td>
<td>0.325</td>
<td>0.335</td>
<td>0.281</td>
</tr>
<tr>
<td>8</td>
<td>0.376</td>
<td>0.246</td>
<td>0.219</td>
<td>0.167</td>
</tr>
<tr>
<td>16</td>
<td>0.343</td>
<td>0.197</td>
<td>0.154</td>
<td>0.113</td>
</tr>
</tbody>
</table>

### Weak scaling comparison with other architectures

<table>
<thead>
<tr>
<th>#Procs</th>
<th>Size</th>
<th>CPU</th>
<th>GPU</th>
<th>MIC-n</th>
<th>MIC-o</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>256</td>
<td>3.73</td>
<td>0.67</td>
<td>0.78</td>
<td>1.34</td>
</tr>
<tr>
<td>8</td>
<td>512</td>
<td>0.48</td>
<td>0.068</td>
<td>0.25</td>
<td>0.17</td>
</tr>
</tbody>
</table>

Efficiency: 96.2% 123% 39.9% 100%