

# Introduction to accelerators

\*SuperComputing Applications and InnovationDepartment





# **Outline**

- · GPU
- · MIC
- · OpenACC





### GPU vs CPU: different philosophies



- Design of CPUs optimized for sequential code performance:
- multi-core
- sophisticated control logic unit
- large cache memories to reduce access latencies

Design of GPUs optimized for the execution of large number of threads dedicated to floating-points calculations:

- many-cores (several hundreds)
- minimized the control logic in order to manage leightweight threads and maximize execution throughput
- taking advantage of large number of threads to overcome long-latency memory accesses





Summer School on PARALLEL COMPUTING

- 512 cores (16 SM x 32 SP)
- first GPU architecture to support a true cache hierarchy:
   L1 cache per SM unified L2 caches (768 KB)
- Memory Bandwidth (GDDR5)
   148 GB/s (ECC off)
- · 6 GB of global memory
- 48KB of shared memory
- Concurrent Kernels execution
- support C++ (only in host code)







# CUDA core architecture

New IEEE 754-2008 floating point standard

**Fused multiply-add** (FMA) instruction for both single and double precision CUDA Core

**Newly designed** integer ALU optimized for 64-bit and extended precision operations









GPUs are designed as numeric computing engines, therefore they will not perform well on other tasks.

Applications should use both CPUs and GPUs, where the latter is exploited as a coprocessor in order to speed up numerically intensive sections of the code by a massive fine grained parallelism.

CUDA programming model introduced by NVIDIA in 2007, is designed to support joint CPU/GPU execution of an application.











- Compute Unified Device Architecture:
- extends ANSI C language with minimal extensions
- provides application programming interface (API) to manage host and device components

### **CUDA** program:

- · Serial sections of the code are performed by CPU (host)
- · The parallel ones (that exhibit rich amount of *data* parallelism) are performed by GPU (device) in the SIMD mode as CUDA kernels.
- host and device have separate memory spaces: programmers need to transfer data between CPU and GPU in a manner similar to "one-sided" message passing.





# CUDA threads organization

A kernel is executed as a **grid** of many parallel threads.

- They are organized into a two-level hierarchy:
- a grid is organized as up to 3-dim array of thread blocks
- each block is organized into up to 3-dim array of threads
- all blocks have the same number of threads
- organized in the same manner.

#### **Block of threads:**

set of concurrently executing threads that can *cooperate* among themselves through

- barrier synchronization, by
   using the function \_\_syncthreads();
- shared memory.







#### Summer School on PARALLEL COMPUTING

# CUDA threads organization

Because all threads in a grid execute the same code, they rely on unique coordinates assigned to them by the CUDA runtime system as built-in preinitialized variables

· Block ID up to 3 dimensions:

(blockldx.x, blockldx.y, blockldx.z)

Thread ID within the block up to 3 dimensions:

(threadIdx.x, threadIdx.y, threadIdx.z)

The exact organization of a grid is determined by the execution configuration provided at kernel launch.

Two additional variables of type dim3 (C struct with 3 unsigned integer fields) are declared:

- gridDim dimensions of the grid in terms of number of blocks
- blockDim —— dimensions of the block in terms of number of threads





The built-in variables are used to compute the global ID of the thread, in order to determine the area of data that it is designed to work on.



```
·1D:
```

int id = blockDim.x \* blockIdx.x + threadIdx.x;

#### ·2D:

- int iy = blockDim.y \* blockIdx.y + threadIdx.y;
- int ix = blockDim.x \* blockldx.x + threadIdx.x;
- int id = iy \* dimx + ix;



# Threads execution model













CUDA's hierarchy of threads/memories maps to the hierarchy of processors on the GPU. a GPU executes one or more kernel

- grids;
- a streaming multiprocessor (SM) executes one or more thread blocks;
- a streaming processor (SP) in the SM executes threads.

Multiprocesso

**CUD** 

Thre

**Thread** 

ad A maximum number of blocks can be assigned to each SM (8 for Fermi, 16 for Kepler) The runtime system maintains a list of blocks that need to execute and assigns new blocks to SMs as they complete the execution of blocks previously assigned to them.





# **Transparent scalability**

By not allowing threads in different blocks to synchronize with each other, CUDA runtime system can execute blocks in any order relative to each other.

This flexibility enables to execute the same application code on hardware with different numbers of SM (*transparent scalability*).





# Launching a kernel

A kernel must be called from the host with the following syntax:

```
global void KernelFunc(...);
   dim3 gridDim(100, 50); // 5000 thread blocks
   dim3 blockDim(8, 8, 4); // 256 threads per block
   //call the kernel
   KernelFunc<<< gridDim, blockDim >>>(<arguments>);
                                      Codice (CPU)
Typical CUDA grids contain
                                                                  Grid 0
thousands to millions of
                                    Kernel Parallelo (GPU)
threads.
                                 KernelA<<< nBa, nTa >>>(args);
                                      Codice (CPU)
All kernel calls are
                                                                  Grid 1
asynchronous!
                                    Kernel Parallelo (GPU)
                                 KernelB<<< nBb, nTb >>>(args);
```

# Kernel example



```
CPU code:
void increment cpu(float* a, float b, int n) {
  for (idx=0; idx<n; ++idx)
   a[idx]+=b;
int main(void) {
  //...
  increment cpu(h a,h b,16);
GPU code:
  global increment gpu(float* a, float b, int n) {
  int idx = threadIdx.x + blockIdx.x*blockDim.x;
  if (idx < n)
    a[idx]+=b;
int main(void) {
  //...
  increment gpu<<<ble>docks, threads>>>(d a,d b,16);
```





### **CUDA Function modifiers**

CUDA extends C function declarations with three qualifier keywords.

| Function declaration         | Executed on the | Only callable from the |
|------------------------------|-----------------|------------------------|
| device<br>(device functions) | device          | device                 |
| global<br>(kernel function)  | device          | host                   |
| host<br>(host functions)     | host            | host                   |





# **CUDA** variable qualifiers

| Variable declaration                 | memory   | lifetime    | scope  |
|--------------------------------------|----------|-------------|--------|
| Automatic scalar variables           | register | kernel      | thread |
| Automatic array variablesdevicelocal | local    | kernel      | thread |
| deviceshared                         | shared   | kernel      | block  |
| device                               | global   | application | grid   |
| deviceconstant                       | constant | application | grid   |

Global variables are often used to pass information from one kernel to another.

Constant variables are often used for providing input values to kernel functions.

### Hierarchy of device memories

CUDA's hierarchy of threads maps to a hierarchy of memories on the GPU:

- Each thread has some registers, used to hold automatic scalar variables declared in kernel and device functions, and a per-thread private memory space used for register spills, function calls, and C automatic array variables
- Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms
- Grids of thread blocks share results

  CINE In global memory space



Summer



#### on-chip memories:

- registers (~8KB) → SP
- shared memory (~16KB) → SM
- they can be accessed at very high speed in a highly parallel manner.

#### per-grid memories:

- ¶ global memory (~4GB)
  - long access latencies (hundreds of clock cycles)
  - finite access bandwidth
- - read only
  - short-latency (cached) and high bandwidth when all threads simultaneously access the same location
- texture memory (read only)
- CPU can transfer data to/from all per-grid memories.





Local memory is implemented as part of the global memory, therefore has a long access latencies too.



#### Summer School on PARALLEL COMPUTING

# **Shared memory allocation**

Static modality

```
inside the kernel:
__shared__ float f[100];
```

Dynamic modality

in the execution configuration of the kernel, define the number of bytes to be allocated per block in the shared memory:

```
kernel<<<DimGrid, DimBlock, SharedMemBytes>>>(...);
```

```
while inside the kernel: extern __shared__ float f[];
```





# Global memory allocation

CUDA API functions to manage data allocation on the device global memory:

```
cudaMalloc(void** bufferPtr, size_t n)
```

- It allocates a buffer into the device global memory
- The first parameter is the address of a generic pointer variable that must point to the allocated buffer
  - it should be cast to (void\*\*)!
- The second parameter is the size of the buffer to be allocated, in terms of bytes

```
cudaFree (void* bufferPtr)
```

The storage space of the object





### Global memory inizialization

```
cudaMemset(void* devPtr, int value, size_t count)
```

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte of the int value converted to unsigned char.

CUDA version of the C memset() function.

devPtr - Pointer to device memory

value - Value to set for each byte of specified memory

count - Size in bytes to set





#### **Data transfer CPU-GPU**



API blocking functions for data transfer between memories:

cudaMemcpy(dM, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, dM, size, cudaMemcpyDeviceToHost);

Destination source number of data bytes in

imber of symbolic constant bytes indicating the direction





# Data transfer to constant memory

```
cudaMemcpyToSymbol(const char * symbol,
const void * src,
size_t count,
size_t offset,
enum cudaMemcpyKind kind)
```

symbol - symbol destination on device, it can either be a variable that resides in global or constant memory space, or it can be a character string, naming a variable that resides in global or constant memory space.

src - source memory address

count - size in bytes to copy

offset - offset from start of symbol in bytes

kind - type of transfer, it can be either cudaMemcpyHostToDevice or cudaMemcpyDeviceToDevice





# **Device management**

- Application can query and select GPUs
  - FcudaGetDeviceCount(int \*count)
  - FcudaSetDevice(int device)
  - FcudaGetDevice(int \*device)
  - **\*CudaGetDeviceProperties**(cudaDeviceProp \*prop, int device)
- Multiple threads can share a device
- A single thread can manage multiple devices
  - **FcudaSetDevice(i)** to select current device
  - **TcudaMemcpy**(...) for peer-to-peer copies





# Device management (sample code)

```
int cudadevice;
struct cudaDeviceProp prop;
cudaGetDevice( &cudadevice );
cudaGetDeviceProperties (&prop, cudadevice);
mpc=prop.multiProcessorCount;
mtpb=prop.maxThreadsPerBlock;
shmsize=prop.sharedMemPerBlock;
printf("Device %d: number of multiprocessors
%d\n, max number of threads per block %d\n,
shared memory per block %d\n", cudadevice,
mpc, mtpb, shmsize);
```





# Error checking

All runtime functions return an error code of type: cudaError t.

No error is indicated as cudaSuccess.

char\* cudaGetErrorString(cudaError\_t code)
returns a string describing the error:

For asynchronous functions (i.e. kernels, asynchronous copies) the only way to check for errors just after the call is to synchronize: cudaDeviceSynchronize()

Then the following function returns the code of the last error:

```
cudaError_t cudaGetLastError()
cineca
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
```





### nvcc front-end for compilation:

- separates GPU code from CPU code
- CPU code -> C/C++ compiler (Microsoft Visual C/C++, GCC, ecc.)
- GPU code is converted in an intermediate assembly language: PTX, then in binary form (the *cubin* object)
- link all executables







```
Summer
School on
PARALLEL
COMPUTING
```

```
void MatrixMulOnHost(float* M, float* N, float* P,
                                               int Width) {
  for (int i = 0; i < Width; ++i) {
    for (int j = 0; j < Width; ++j) {
                                             N
      float pvalue = 0;
                                                        k
      for (int k = 0; k < Width; ++k) {
        float a = M[i * Width + k];
                                                           NIDTH
        float b = N[k * Width + j];
        pvalue += a * b;
      P[i * Width + j] = pvalue;
                            M
                P = M*N
                              k
                                   WIDTH
                                                     WIDTH
```



### Matrix-Matrix multiplication device code

```
global void MNKernel(float* Md, float *Nd, float *Pd, int
width)
{
  // 2D thread ID
  int col = threadIdx.x;
  int row = threadIdx.y;
  // Pvalue stores the Pd element that is computed by the
  // thread
  float Pvalue = 0;
  for (int k=0; k < width; k++)
     Pvalue += Md[row * width + k] * Nd[k * width + col];
  // write the matrix to device memory
  // (each thread writes one element)
  Pd[row * width + col] = Pvalue;
```



# •Matrix-Matrix multiplication host code

```
void MatrixMultiplication(float* M, float *N, float *P, int width)
  size t size = width*width*sizeof(float);
  float* Md, Nd, Pd;
  // allocate M, N and P on the device
  cudaMalloc((void**)&Md, size);
  cudaMalloc((void**)&Nd, size);
  cudaMalloc((void**)&Pd, size);
  // transfer M and N to the device memory
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
  // kernel invocation
  dim3 gridDim(1,1);
  dim3 blockDim(width, width);
  MNKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
  // transfer P from the device to the host
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
cive cafree device matrices
  cudaFree (Md); cudaFree (Nd); cudaFree (Pd);
```



<u>Limitation:</u> a block can have up to 1024 threads (for Fermi and Kepler). Therefore the previous implementation can compute square matrices of order less or equal to 32.

### **Improvement:**

- use more blocks by breaking matrix Pd into square tiles
- · all elements of a tile are computed by a block of threads
- each thread still calculates one Pd element but it uses its blockldx values to identify the tile that contains its element.







```
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;
```



index = j \* gridDim.x \* blockDim.x + i;



```
global void MNKernel(float* Md, float *Nd, float *Pd, int
width)
   // 2D thread ID
   int col = blockIdx.x*blockDim.x + threadIdx.x;
   int row = blockIdx.y*blockDim.y + threadIdx.y;
   // Pvalue stores the Pd element that is computed by the thread
   float Pvalue = 0;
   for (int k=0; k < width; k++)
     Pvalue += Md[row * width + k] * Nd[k * width + col];
   Pd[row * width + col] = Pvalue;
         Kernel invocation:
```



MNKernel << dimGrid, dimBlock >>> (Md, Nd, Pd, width);

dim3 gridDim(width/TILE WIDTH, width/TILE WIDTH);

dim3 blockDim(TILE WIDTH, TILE WIDTH);



Which is the optimal dimension of the block (i.e. TILE\_WIDTH)?

Knowing that each SM of a Fermi can have up to 1536 threads, we have

- $\cdot$  8x8 = 64 threads  $\longrightarrow$  1536/64 = 24 blocks to fully occupy an SM; but we are limited to 8 blocks in each SM therefore we will end up with only 64x8 = 512 threads in each SM.
- 16x16 = 256 threads 1536/256 = 6 blocks we will have full thread capacity in each SM.
- $\cdot$  32x32 = 1024 threads > 1536/1024 = 1.5 > 1 block.







Which is the optimal dimension of the block (i.e. TILE\_WIDTH)?

Knowing that each SM of a <u>Kepler</u> can have up to 2048 threads, we have

- $\cdot$  8x8 = 64 threads  $\triangleright$  2048/64 = 32 blocks to fully occupy an SM; but we are limited to 16 blocks in each SM therefore we will end up with only 64x16 = 1024 threads in each SM.
- 16x16 = 256 threads 2048/256 = 8 blocks we will have full thread capacity in each SM.
- $\cdot$  32x32 = 1024 threads  $\ge$  2048/1024 = 2 blocks.







# Global memory access efficiency

Although having many threads available for execution can theoretically tolerate long memory access latency, one can easily run into a situation where traffic congestion prevents all but few threads from making progress, thus making some SM idle!

A common strategy for reducing global memory traffic (i.e. increasing the number of floating-point operations performed for each access to the global memory) is to partition the data into subsets called *tiles* such that each tile fits into the shared memory and the kernel computations on these tiles can be done independently of each other.

In the simplest form, the tile dimensions equal those of the block.



## In the previous kernel:

thread(x,y) of block(0,0) access the elements of Md row x and Nd column y from the global memory.

thread(0,0) and thread(0,1) access the same Md row 0



| Pd <sub>0,0</sub><br>Thread(0,0)      | Pd <sub>1,0</sub><br>Thread(1,0)      | Pd <sub>0,1</sub><br>Thread(0,1)      | Pd <sub>1,1</sub><br>Thread(1,1)      |
|---------------------------------------|---------------------------------------|---------------------------------------|---------------------------------------|
| Md <sub>0,0</sub> * Nd <sub>0,0</sub> | Md <sub>0,0</sub> *(Nd),0             | Md <sub>0,1</sub> * Nd <sub>0,0</sub> | Md <sub>0,1</sub> * (Nd) <sub>0</sub> |
| Ma),0 * Nd <sub>0,1</sub>             | MO_0 * Nd <sub>1,1</sub>              | Md <sub>1,1</sub> * Nd <sub>0,1</sub> | Md <sub>1,1</sub> * Nd <sub>1,1</sub> |
| Md <sub>2,0</sub> * Nd <sub>0,2</sub> | Md <sub>2,0</sub> * Nd <sub>1,2</sub> | Md <sub>2,1</sub> * Nd <sub>0,2</sub> | Md <sub>2,1</sub> * Nd <sub>1,2</sub> |
| Md <sub>3,0</sub> * Nd <sub>0,3</sub> | Md <sub>3,0</sub> * Nd <sub>1,3</sub> | Md <sub>3,1</sub> * Nd <sub>0,3</sub> | Md <sub>3,1</sub> * Nd <sub>1,3</sub> |





What if these threads collaborate so that the elements of this row are only loaded from the global memory once? We can reduce the total number of accesses to the global memory by N, using NxN blocks!

### Basic idea:

to have the threads within a block collaboratively load Md and Nd elements into the shared memory before they individually use these elements in their dot product calculation.







|                  | Phase 1            |                          |                                                                                                                 | Phase 2            |                                              |                                                                                                                 |
|------------------|--------------------|--------------------------|-----------------------------------------------------------------------------------------------------------------|--------------------|----------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| T <sub>0,0</sub> | Md <sub>0,0</sub>  | <b>Nd</b> <sub>0,0</sub> | PValue <sub>0,0</sub> +=                                                                                        | Md <sub>2,0</sub>  | Nd <sub>0,2</sub>                            | PValue <sub>0,0</sub> +=                                                                                        |
|                  | ↓                  | ↓                        | Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +                                                                        | ↓                  | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +                                                                        |
|                  | Mds <sub>0,0</sub> | Nds <sub>0,0</sub>       | Mds <sub>1,0</sub> *Nds <sub>0,1</sub>                                                                          | Mds <sub>0,0</sub> | Nds <sub>0,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>0,1</sub>                                                                          |
| T <sub>5,0</sub> | Md <sub>1,0</sub>  | Nd <sub>1,0</sub>        | PValue <sub>1,0</sub> +=                                                                                        | Md <sub>3,0</sub>  | Nd <sub>1,2</sub>                            | PValue <sub>1,0</sub> +=                                                                                        |
|                  | ↓                  | ↓                        | Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +                                                                        | ↓                  | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +                                                                        |
|                  | Mds <sub>1,0</sub> | Nds <sub>1,0</sub>       | Mds <sub>1,0</sub> *Nds <sub>1,1</sub>                                                                          | Mds <sub>1,0</sub> | Nds <sub>1,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>1,1</sub>                                                                          |
| T <sub>0,1</sub> | Md <sub>0,1</sub>  | Nd <sub>0,1</sub>        | PdValue <sub>0,1</sub> +=                                                                                       | Md <sub>2,f</sub>  | Nd <sub>0,3</sub>                            | PdValue <sub>0,1</sub> +=                                                                                       |
|                  | ↓                  | ↓                        | Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +                                                                        | ↓                  | ↓                                            | Mds <sub>0,1</sub> *Nds <sub>0,0</sub> +                                                                        |
|                  | Mds <sub>0,1</sub> | Nds <sub>0,1</sub>       | Mds <sub>11</sub> *Nds <sub>0,1</sub>                                                                           | Mds <sub>0,1</sub> | Nds <sub>0,1</sub>                           | Mds <sub>1,1</sub> *Nds <sub>0,1</sub>                                                                          |
| Τ <sub>1,1</sub> | Md <sub>1,1</sub>  | Nd <sub>1,1</sub>        | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> | Md <sub>3.1</sub>  | Nd <sub>1,3</sub><br>↓<br>Nds <sub>1,1</sub> | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,3</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> |

The dot product performed by each thread is now divided into phases: in each phase all threads in a block collaborate to load a tile of Md and a tile of Nd into the shared memory and use these values to compute a partial product. The dot product would be performed in width/TILE\_WIDTH phases.

the reduction of the accesses to the global memory is by a factor of TILE\_WIDTH.









```
global void MNKernel(float* Md, float *Nd, float *Pd, int width)
 shared float Mds[TILE WIDTH][TILE WIDTH];
 shared float Nds[TILE WIDTH][TILE WIDTH];
// 2D thread ID
int tx = threadIdx.x; int ty = threadIdx.y;
int col = blockIdx.x*BlockDim.x + tx;
int row = blockIdx.y*BlockDim.y + ty;
float Pvalue = 0;
// Loop over the Md and Nd tiles required to compute the Pd element
// m is the number of phases
for (int m=0; m < width/TILE WIDTH; m++) {</pre>
{//collaborative loading of Md and Nd tiles into shared memory
   Mds[ty][tx] = Md[row*width + (m*TILE WIDTH + tx)];
   Nds[ty][tx] = Nd[(m*TILE WIDTH + ty)*width + col];
    syncthreads();
   for (int k=0; k < TILE WIDTH; <math>k++) {
       Pvalue += Mds[ty][k] * Nds[k][tx];
     syncthreads();
Pd[row * width + col] = Pvalue;
```



# Memory as a limiting factor to parallelism

The limited amount of CUDA memory limits the number of threads that can simultaneously reside in the SM!

For the matrix multiplication example, shared memory can become a limiting factor:

TILE WIDTH = 16

each block requires 16x16x4 = 1KB of storage for Mds

+ 1KB for Nds

2KB of shared memory per block

The 48KB shared memory allows 24 blocks to simultaneously reside in an SM. OK!

But the maximum number of threads per SM is 1536 (for Fermi)

only 1536/256 = 8 blocks are allowed in each SM

only  $8 \times 2KB = 16KB$  of the shared memory will be used.

Hint: Use occupancy calculator



## Thread scheduling

Once a block is assigned to a SM, it is further partitioned into 32-thread units called warps.

Warps are the *scheduling units in SM*: all threads in a same warp execute the same instruction when the warp

the same instruction when the warp is selected for execution (Single-Instruction, Multiple-Thread) Threads often execute *long-latency operations*:

- global memory access
- pipelined floating point arithmetics
- branch instructions

It is convenient to assign a large number of warps to each SM, because the long waiting time of some warp instructions is hidden by executing instructions from other warps. Therefore the selection of ready warps for execution does not introduce any idle time into the execution timeline (zero-overhead thread scheduling).









- The hardware executes an instruction for all threads in the same warp before moving to the next instruction (SIMT).
- It works well when all threads within a warp follow the same control flow path when working their data.
- When threads in the same warp follow different paths of control flow, we say that these threads *diverge* in their execution.
- For an *if-then-else* construct the execution of the warp will require multiple passes through the divergent paths.

Try to avoid warp divergence





# Many Integrated Core (MIC): the Intel answer to GPGPU computing

- The Intel Xeon Phi KNC (Knight Corner\*) is a (60+1)-x86-core SMP chip. 60 cores are available for computation, 1 is reserved for the system.
- each core has a 512-bit wide SSE vector unit
- all cores are connected by a bi-directional
   512 bit ring bus
- 512 KB of L2 cache and 32KB of I/D L1 (each core)
- A Xeon Phi KNC is packaged into a PCle add-on card together with 8 GB of GDDR5 dedicated ram (theoretical peak perf.: 352 GB/s. actual peak with ECC: 200 GB/s)
- Intel claims: 'Up to 1 teraflops of double precision perfomance'





\*Knight Corner is the codename of the first-gen Intel MIC architecture processor. Second-gen MIC codename will be Knight Landing.



# Intel Xeon Phi: programming mode

- · Familiar OpenMP (or pthreads), MPI programming model
  - no new language or new parallel programming paradigm to learn: what you already know about parallel programming is basically all that you need to start programming a Xeon Phi processor.
- OpenCL support to the MIC architecture is on its way
  - an help the porting of CUDA application. CUDA→OpenCL MIC with the CU2CL source-to-source translator, for example.
- Porting an existing OpenMP/MPI application onto a Xeon Phi processor can be as easy as to recompile the application with a couple of new MICspecific pragmas and compiler flag activated,
  - \*but\* to make that application running at full speed on the Xeon Phi chip a little more effort is probably needed.







### Offload mode:

- using pragmas to augment existing codes so they offload work from the host processor to the Intel Xeon Phi coprocessors(s)
- Accessing the coprocessor as an accelerator through optimized libraries such as the Intel MKL (Math Kernel Library)

#### · Native mode:

- Recompiling source code to run the entire application directly on coprocessor as a separate many-core Linux SMP compute node
- Using each coprocessor as a node in an MPI cluster or, alternatively, as a device containing a cluster of MPI nodes





# Intel Xeon Phi Vs Nvidia K20. MAGMA LU factorization





these two images are taken from the presentations:

http://icl.cs.utk.edu/projectsfiles/magma/pubs/25-MAGMA\_1.3\_SC12.pdf

http://icl.cs.utk.edu/projectsfiles/magma/pubs/24-MAGMA\_MIC\_03.pdf

authors: ICL-group@University of Tennesee. The ICL-group is actively developing the MAGMA library which is a world-class performance open source Linear Algebra library for multicore+accelerator computer architecture: http://icl.cs.utk.edu/magma/index.html



## **OpenACC**

- OpenACC is a open parallel programming standard designed to easily take advantage of the heterogeneous CPU/GPU computing systems.
- OpenACC allows parallel programmers to provide simple hints, known as "directives," to the compiler, identifying whic areas of code to accelerate, without requiring programmers modify or adapt the underlying code itself.
- OpenACC 1.0 (http://www.openacc-standard.org)
- · Implementations available from PGI, Cray, and CAPS
- Will be rolled into OpenMP 4.0





Summer

School on PARALLEL

#### Key Advantages:

- THigh-Level: No involvement of OpenCL, CUDA, etc.
- Single source: Compile the same program for accelerators or serial (NO separate GPU code).
- Portable: Supports GPU accelerators and co-processors from multiple vendors, current and future versions.



### **OpenACC: A Simple Example**



```
pgcc -acc -ta=nvidia -Minfo=accel saxpy.c
(-ta stands for target architecture)
saxpy:
    3, Generating present_or_copyin(x[0:n])
    Generating present_or_copy(y[0:n])
    Generating compute capability 1.0 binary
    Generating compute capability 2.0 binary
```

4, Loop is parallelizable Accelerator kernel generated

Compiler was able to parallelize

```
4, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
CC 1.0: 8 registers; 48 shared, 0 constant, 0 local memory bytes
CC 2.0: 12 registers; 0 shared, 64 constant, 0 local memory bytes
```

```
int main() {
  int N = 1<<10;
  float *x, *y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

for (int i = 0; i < N; ++i) {
    x[i] = 2.0f; y[i] = 1.0f;
  }

  saxpy(N, 1.0f, x, y);
  return 0;
}</pre>
```

# -Reference



# http://developer.nvidia.com/cuda

- CUDA Programming Guide
- CUDA Zone tools, training, webinars and more

### **NVIDIA Books:**

- "Programming Massively Parallel Processors", D.Kirk - W.W. Hwu
- · "CUDA by example", J.Sanders E. Kandrot

