

## Introduction to Accelerators: CUDA+OpenCL

**Piero Lanucara** - p.lanucara@cineca.it SuperComputing Applications and Innovation Department









GPU

Intel MIC



### **35 YEARS OF MICROPROCESSOR TREND DATA**

#### Summer School on PARALLEL COMPUTING



Source: C. Moore's talk at Salishan, April 2011





| Rank | Site                                                                  | System                                                                                                                        | Cores   | Rmax<br>(TFlop/s) | Rpeak<br>(TFlop/s) | Power<br>(kW) |
|------|-----------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|---------|-------------------|--------------------|---------------|
| 1    | National Super Computer Center in<br>Guangzhou<br>China               | Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel<br>Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel<br>Xeon Phi 31S1P<br>NUDT | 3120000 | 33862.7           | 54902.4            | 17808         |
| 2    | DOE/SC/Oak Ridge National Laboratory<br>United States                 | Titan - Cray XK7 , Opteron 6274 16C 2.200GHz, Cray<br>Gemini interconnect, NVIDIA K20x<br>Cray Inc.                           | 560640  | 17590.0           | 27112.5            | 8209          |
| 3    | DOE/NNSA/LLNL<br>United States                                        | Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz,<br>Custom<br>IBM                                                                | 1572864 | 17173.2           | 20132.7            | 7890          |
| 4    | RIKEN Advanced Institute for<br>Computational Science (AICS)<br>Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu<br>interconnect<br>Fujitsu                                                            | 705024  | 10510.0           | 11280.4            | 12660         |
| 5    | DOE/SC/Argonne National Laboratory<br>United States                   | Mira - BlueGene/Q, Power BQC 16C 1.60GHz,<br>Custom<br>IBM                                                                    | 786432  | 8586.6            | 10066.3            | 3945          |
| 6    | Swiss National Supercomputing Centre<br>(CSCS)<br>Switzerland         | Piz Daint - Cray XC30, Xeon E5-2670 8C 2.600GHz,<br>Aries interconnect , NVIDIA K20x<br>Cray Inc.                             | 115984  | 6271.0            | 7788.9             | 2325          |
| 7    | Texas Advanced Computing Center/Univ.<br>of Texas<br>United States    | Stampede - PowerEdge C8220, Xeon E5-2680 8C<br>2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P<br>Dell                         | 462462  | 5168.1            | 8520.1             | 4510          |
| 8    | Forschungszentrum Juelich (FZJ)<br>Germany                            | JUQUEEN - BlueGene/Q, Power BQC 16C<br>1.600GHz, Custom Interconnect<br>IBM                                                   | 458752  | 5008.9            | 5872.0             | 2301          |
| 9    | DOE/NNSA/LLNL<br>United States                                        | Vulcan - BlueGene/Q, Power BQC 16C 1.600GHz,<br>Custom Interconnect<br>IBM                                                    | 393216  | 4293.3            | 5033.2             | 1972          |
| 10   | Leibniz Rechenzentrum<br>Germany                                      | SuperMUC - iDataPlex DX360M4, Xeon E5-2680 8C<br>2.70GHz, Infiniband FDR<br>IBM                                               | 147456  | 2897.0            | 3185.1             | 3423          |





**500**°

SUPERCOMPUTER SITES

# Figreen 500€

| Green500<br>Rank | MFLOPS/W | Site*                                                       | Computer*                                                                                                                  |          |   |
|------------------|----------|-------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|----------|---|
| 1                | 4,503.17 | GSIC Center, Tokyo Institute of<br>Technology               | TSUBAME-KFC - LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-<br>2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x               |          |   |
| 2                | 3,631.86 | Cambridge University                                        | Wilkes - Dell T620 Cluster, Intel Xeon E5-2630v2 6C 2.600GHz,<br>Infiniband FDR, NVIDIA K20                                | 52.62    |   |
| 3                | 3,517.84 | Center for Computational<br>Sciences, University of Tsukuba | HA-PACS TCA - Cray 3623G4-SM Cluster, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x                       | 78.77    |   |
| 4                | 3,185.91 | Swiss National Supercomputing<br>Centre (CSCS)              | Piz Daint - Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect<br>, NVIDIA K20x<br>Level 3 measurement data available | 1,753.66 |   |
| 5                | 3,130.95 | ROMEO HPC Center -<br>Champagne-Ardenne                     | romeo - Bull R421-E3 Cluster, Intel Xeon E5-2650v2 8C 2.600GHz,<br>Infiniband FDR, NVIDIA K20x                             | 81.41    |   |
| 6                | 3,068.71 | GSIC Center, Tokyo Institute of<br>Technology               | TSUBAME 2.5 - Cluster Platform SL390s G7, Xeon X5670 6C<br>2.930GHz, Infiniband QDR, NVIDIA K20x                           | 922.54   |   |
| 7                | 2,702.16 | University of Arizona                                       | iDataPlex DX360M4, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR14, NVIDIA K20x                                         | 53.62    |   |
| 8                | 2,629.10 | Max-Planck-Gesellschaft MPI/IPP                             | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x                                              | 269.94   |   |
| 9                | 2,629.10 | Financial Institution                                       | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x                                              | 55.62    |   |
| 10               | 2,358.69 | CSIRO                                                       | CSIRO GPU Cluster - Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz,<br>Infiniband FDR, Nvidia K20m                               | 71.01    | 4 |

CINECA



Number of Processing Elements, Single Precision High-End Hardware





Theoretical Peak Performance, Double Precision

CINECA

#### CPUs, Intel GPUs, NVIDIA GPUs, AMD MIC, Intel 10<sup>1</sup> Tesla;K20X Tesla K20 Radeon HD 4870Radeon HD 5870 Radeon HD 3870 Radeon HD 6970 Radeon HD 6970 GFLOP/sec per Watt Radeon HD 7970 GHz Radeon, HD 8970 Xeon Phi X7120X Tesla C2090 Tesla C2050 10<sup>0</sup> Xeon E5-2690 Xeon X5690 Xeon X5680 Tesla C1060 Tesla C1060 Xeon W5590 Xeon X5492 Xeon X5482 10<sup>-1</sup> 2007 2008 2010 2012 2013 2009 2011 End of Year

 Peak Floating Point Operations per Watt, Double Precision

from http://www.karlrupp.net/

### GPU and many core computing: a view from the top

### Basic principle (today's GPUs, many-core coprocessors):

- accelerator "cards" for standard cluster nodes (PCIe)
- many (~50...500) "lightweight" cores (~ 1 GHz)
- high thread concurrency, fast (local) memories

### System architecture:

- currently: x86 "Linux-clusters" with nodes comprising
- 2 CPUs (2x 8 cores)
- max. 2...3 accelerator cards (GPU, MIC) per node
- future: smaller CPU component (extreme: "host-less", many-core chips)

### Programming paradigms:

- use CPU for program control, communication and maximum single-thread performance
- "offload" data-parallel parts of computation to accelerator for maximum throughput performance
- requires heterogeneous programming & load-balancing, careful assessment of "speedups"





Summer

School on PARALLEL

COMPUTIN

### **Motivation**

### **Compute performance**

• GPU/many-core computing is promising huge application-performance gains

Summer School on PARALLEL

COMPUTIN

- caveat: sustained performance on "real-world", scientific applications
- observations:
- apparent GPU success stories: PetaFlops performance (Gordon-Bell Price nominations)
- from aggressive marketing for Intel MIC, NVIDIA GPUs...
- ... towards more realistic attitudes: factor 2x..3x speedups (GPU vs. multi-core CPU)

### **Energy efficiency**

- GPU/many-core computing is promising substantial energy-efficiency gains (a must for exascale)
- caveat: sustained efficiency on "real-world" CPU-GPU clusters

### **Existing resources**

- there is significant GPU/many-core-based compute-power around in the world
- by many, the technology is considered inevitable for the future
- caveat: the price to pay for application development ?



### **NVIDIA GPU TECHNOLOGY**

### Hardware overview (NVIDIA Tesla series)

- since 2011: "Fermi": first product with HW support for double-precision and ECC memory
- up to 512 cores, 6 GB RAM
- high internal memory bandwidth ~180 GB/s
- 0.5 TFlops (DP, floating point)
- data exchange with host via PCIe (~8 GB/s)
- enhancements: MPI optimization, intra-node comm. ("GPU direct", "HyperQ", ...)

#### Q1/2013: "Kepler K20":

- GK110 GPU: up to 2880 cores, 6...12 GB RAM
- internal memory bandwidth: ~200 GB/s
- nominal peak performance: ~ 1.3 TFlops (DP)

plans for a "hostless" chip (for Exascale)



Summer

School on PARALLEL

COMPUTIN



### **NVIDIA GPU TECHNOLOGY**

### Software & programming models

- paradigm: split program into host code (CPU) and device code (GPU)
- GPU hardware architecture requires highly homogeneous program flow (SIMT, no ifbranches!)
- PCIe bottleneck for communication of data between CPU and GPU:
- O(n2)...O(n3) computations for communication of n data
- overlapping of communication and computation phases

#### **Programming languages**

- CUDA (NVIDIA), OpenCL (open standard)
- host program (C, executes on CPU) and device kernels (C, launch on GPU)
- numerical libraries: CUBLAS, CUFFT, higher LA: CULA, MAGMA
- tools: debuggers, profiling, system monitoring,...
- CUDA-FORTRAN (PGI)

66777

- directive-based approaches (PGI, CRAY, CAPS, OpenACC, OpenMP-4)
- high-level, comparable to OpenMP
- proprietary (CRAY, PGI, HMPP, ...)  $\rightarrow$  OpenACC  $\rightarrow$  OpenMP



Summer

School on PARALLEL COMPUTING

### **NVIDIA GPU TECHNOLOGY**

### OpenACC

- joint effort of vendors to shortcut/guide OpenMP 4.0 standardization effort
- functional (not performance) portability
- minimally invasive to existing code
- facilitates incremental porting
- compilers: PGI, CRAY, CAPS
- no free lunch!

99111



Summer

School on PARALLEL COMPUTING

### INTEL MIC TECHNOLOGY

### Hardware overview

- since 2011: "Knights Ferry": software development platform
- Q4/2012: "Knights Corner": first product of the new Intel Xeon Phi processor line (MIC arch)
- approx 60 x86 cores (~ 1GHz), 8 GB RAM
- internal memory bandwidth: 175 GB/s
- nominal peak performance: 1 TFlops (DP)
- more than a device: runs Linux OS, IP addressable
- data exchange with host via PCIe (~8 GB/s)
- towards a true many-core chip ("Knights Landing", 2014)

### Software & programming models

• paradigms:

1) offload model (like GPU: split program into host code (CPU) and device code (MIC))

2) cluster models (MPI ranks distributed across CPUs and/or MICs)

- tools & libraries: the familiar Intel tool chain: compilers, MPI/OpenMP, MKL, ...
- syntax: "data offload" directives + OpenMP (and/or MPI)
- OpenCL





### **GPU TECHNOLOGY**

Summer School on PARALLEL COMPUTING

- Graphics Processor Unit
  - a device equipped with an highly parallel microprocessor (*many-core*) and a private memory with very high bandwidth
- born in response to the growing demand for high definition 3D rendering graphic applications



16

# **CPU vs GPU Architectures**

# GPU hardware is specialized for problems which can be classified as *intense data- parallel computations*

the same set of operation is executed many times in parallel on different data designed such that more transistors are devoted to data processing rather than data caching and flow control





- Compute Unified Device Architecture (CUDA)
- a general purpose parallel computing platform and programming model that easy GPU programming, which provides:
  - a new hierarchical multi-threaded programming paradigm
  - a new architecture instruction set called PTX (Parallel Thread eXecution)
  - a small set of syntax extensions to higher level programming languages (C, Fortran) to express thread parallelism within a familiar programming environment
  - A complete collection of development tools to compile, debug and profile CUDA programs.



# **CUDA Programming Model**

Summer School on PARALLEL COMPUTING

- GPU is seen as an auxiliary coprocessor with its own memory space
- data-parallel, computational-intensive portions of a program can be executed on the GPU

each *data-parallel* computational portion can be isolated into a function, called CUDA kernel, that is executed on the GPU

CUDA kernels are executed by many different threads in parallel

each thread can compute different data elements independently

the GPU parallelism is very close to the SPMD (Single Program Multiple Data) paradigm. Single Instruction Multiple Threads (SIMT) according to the Nvidia definition.

### GPU threads are extremely light weight

no penalty in case of a *context-switch* (each thread has its own registers)

the more are the threads *in flight*, the more the GPU hardware is able to hide memory or computational latencies, i.e better overall performances at executing the kernel function

# **CUDA Execution Model**

- serial portions of a program, or those with low level of parallelism, keep running on the CPU (host)
- Data-parallel, computational intensive portions of the program are isolated into CUDA kernel function. The CUDA kernel are executed onto the GPU (device)



## More on the CUDA Execution Model

#### Summer School on PARALLEL COMPUTING



### when a CUDA kernel is invoked: each thread block is assigned to a SM in a round-robin mode

a maximum number of blocks can be assigned to each SM, depending on hardware generation and on how many resorses each block needs to be executed (registers, shared memory, etc) the runtime system maintains a list of blocks that need to execute and assigns new blocks to SMs as they complete the execution of blocks previously assigned to them.

once a block is assigned to a SM, it remains on that SM until the work for all threads in the block is completed each block execution is independent from the other (no synchronization is possible among them)

(no synchronization is possible among them) thread of each block are partitioned into warps of 32 *thread* each, so to map each thread with a unique consecutive thread index in the block, starting from index 0. the scheduler select for execution a warp from one of the residing blocks in each SM.

A warp execute one common instruction at a time

each CUDA core take care of one thread in the warp fully efficiency when all threads agree on their execution path



CUDA runtime system can execute blocks in any order relative to each other.

This flexibility enables to execute the same application code on hardware with different numbers of SM



CUDA syntax extensions to the C language



- CUDA defines a small set of extensions to the high level language as the C in order to define the kernels and to configure the kernel execution.
- A CUDA kernel function is defined using the <u>global</u> declaration
- when a CUDA kernel is called, it will be executed N times in parallel by N different CUDA threads on the device
- the number of CUDA threads that execute that kernel is specified using a new syntax, called kernel execution configuration
  - cudaKernelFunction <<<...>>> (arg\_1, arg\_2, ..., arg\_n)
- each thread has a unique thread ID
  - the thread ID is accessible within the CUDA kernel through the built-in threadIdx variable
- the built-in variables threadIdx are a 3-component vector

use .x, .y, .z to access its components

# A simple CUDA program

Summer School on PARALLEL

LOWPOTING

```
int main(int argc, char *argv[]) {
  int i;
  const int N = 1000;
  double u[N], v[N], z[N];
  initVector (u, N, 1.0);
  initVector (v, N, 2.0);
  initVector (z, N, 0.0);
  printVector (u, N);
  printVector (v, N);
  //z = u + v
  for (i=0; i<N; i++)
      z[i] = u[i] + v[i];
 printVector (z, N);
  return 0;
```

```
global
void gpuVectAdd( const double *u,
  const double *v, double *z)
{ // use GPU thread id as index
   i = threadIdx.x;
   z[i] = u[i] + v[i];
int main(int argc, char *argv[]) {
  // z = u + v
    // run on GPU using
    // 1 block of N threads in 1D
    gpuVectAdd <<<1, N>>> (u, v, z);
```

# **CUDA** Threads

- threads are organized into blocks of threads
  - blocks can be 1D, 2D, 3D sized in threads
  - blocks can be organized into a 1D, 2D, 3D grid of blocks
  - each block of threads will be executed independently
  - no assumption is made on the blocks execution order
- each block has a unique block ID
  - the block ID is accessible within the CUDA kernel through the built-in **blockIdx** variable
- The built-in variable **blockIdx** is a 3-component vector
  - use .x, .y, .z to access its components

blockIdx:
block coordinates inside the grid

blockDim: block dimensions in thread units

gridDim: grid dimensions in block units





|        | -      |        |        |        |
|--------|--------|--------|--------|--------|
| Thread | Thread | Thread | Thread | Thread |
| (0,0)  | (1,0)  | (2,0)  | (3,0)  | (4,0)  |
| Thread | Thread | Thread | Thread | Thread |
| (0,1)  | (1,1)  | (2,1)  | (3,1)  | (4,1)  |
| Thread | Thread | Thread | Thread | Thread |
| (0,2)  | (1,2)  | (2,2)  | (3,2)  | (4,2)  |
| Thread | Thread | Thread | Thread | Thread |
| (0,3)  | (1,3)  | (2,3)  | (3,3)  | (4,3)  |

CINECA

## Simple 1D CUDA vector add

```
global
void qpuVectAdd( int N, const double *u, const double *v, double *z)
{
  // use GPU thread id as index
   index = blockIdx.x * blockDim.x + threadIdx.x;
  // check out of border access
  if (index < N) {
      z[index] = u[index] + v[index];
}
int main(int argc, char *argv[]) {
  . . .
 // use 1D block threads
 dim3 blockSize = 512;
 // use 1D grid blocks
 dim3 gridSize = (N + blockSize-1) / blockSize.x;
  qpuVectAdd <<< gridSize,blockSize >>> (N, u, v, z);
```

CINECA

. . .

## **Composing 2D CUDA Thread Indexing**



## 2D array element-wise add (matrix add)

As an example, the following code adds two matrices A and B of size NxN and stores the result into the matrix C

Summer

School on PARALLEL COMPUTING

```
global void matrixAdd(int N, const float *A, const float *B, float *C) {
  int i = blockIdx * blockDim.x + threadIdx.x;
  int j = blockIdx * blockDim.y + threadIdx.y;
 // matrix elements are organized in row major order in memory
 int index = i * N + j;
  C[index] = A[index] + B[index];
}
int main(int argc, char *argv[]) {
 // add NxN matrices on GPU using 1 block of NxN threads
 matrixAdd <<< 1, N >>> (N, A, B, C);
  . . .
```

## Memory allocation on GPU device



CUDA API provides functions to manage data allocation on the device global memory:

### cudaMalloc(void\*\* bufferPtr, size\_t n)

It allocates a buffer into the device global memory

The first parameter is the address of a generic pointer variable that must point to the allocated buffer

it must be cast to (void\*\*)!

The second parameter is the size in bytes of the buffer to be allocated

### cudaFree(void\* bufferPtr)

It frees the storage space of the object

## Memory Initialization on GPU device

Summer School on PARALLEL COMPUTING

- cudaMemset(void\* devPtr, int value, size\_t count)
- It fills the first count <u>bytes</u> of the memory area pointed to by devPtr with the constant byte of the int value converted to unsigned char.

it's like the standard library C memset() function devPtr - Pointer to device memory value - Value to set for each byte of specified memory count - Size in bytes to set

• To initialize an array of double (float, int, ...) to a specific value you need to execute a CUDA kernel.



## Memory copy between CPU and GPU

#### Summer School on PARALLEL COMPUTING

### cudaMemcpy(void \*dst, void \*src, size\_t size, direction)

dst: destination buffer pointer src: source buffer pointer size: number of bytes to copy direction: macro name which defines the direction of data copy from CPU to GPU: cudaMemcpyHostToDevice (H2D) from GPU to CPU: cudaMemcpyDeviceToHost (D2H) on the same GPU: cudaMemcpyDeviceToDevice

the copy begins only after all previous kernel have finished the copy is blocking: it prevents CPU control to proceed further in the program until last byte has been transfered returns only after copy is complete



## Three steps for a CUDA porting

## 1. identify data-parallel, computational intensive portions

isolate them into functions (CUDA kernels candidates) identify involved data to be moved between CPU and GPU

# 2. translate identified CUDA kernel candidates into real CUDA kernels

Summer School on PARALLEL COMPUTING

choose the appropriate thread index map to access data change code so that each thead acts on its own data

# 3. modify code in order to manage memory and kernel calls

allocate memory on the device transfer needed data from host to device memory insert calls to CUDA kernel with execution configuration syntax transfer resulting data from device to host memory



```
program vectoradd
int main(int argc, char *argv[]) {
                                        integer :: i
  int i;
                                        integer, parameter :: N=1000
  const int N = 1000;
                                        real(kind(0.0d0)), dimension(N):: u, v, z
  double u[N], v[N], z[N];
                                        call initVector (u, N, 1.0)
  initVector (u, N, 1.0);
                                        call initVector (v, N, 2.0)
  initVector (v, N, 2.0);
                                        call initVector (z, N, 0.0)
  initVector (z, N, 0.0);
                                        call printVector (u, N)
  printVector (u, N);
                                        call printVector (v, N)
  printVector (v, N);
                                        ! z = u + v
  //z = u + v
                                        do i = 1, N
  for (i=0; i<N; i++)
                                            z(i) = u(i) + v(i)
      z[i] = u[i] + v[i];
                                        end do
                                        call printVector (z, N)
  printVector (z, N);
                                        end program
  return 0;
}
66777
```

### • each thread execute the same kernel, but act on different data:

- turn the loop into a CUDA kernel function
- map each CUDA thread onto a unique index to access data
- let each *thread* retrieve, compute and store its own data using the unique address
- prevent out of border access to data if data is not a multiple of thread block size



```
global__ void gpuVectAdd (int N, const double *u, const double *v, double *z)
{
    // index is a unique identifier for each GPU thread
    int index = blockIdx * blockDim.x + threadIdx.x;
    if (index < N)
        z[index] = u[index] + v[index];
}
CINECA</pre>
```





Insert calls to CUDA kernels using the execution configuration syntax.

### kernelCUDA<<<numBlocks,numThreads>>>(...)

specifying the thread/block hierarchy you want to apply:

- **numBlocks:** specify grid size in terms of thread blocks along each dimension
- numThreads: specify the block size in terms of threads along each dimension

```
dim3 numThreads(32);
dim3 numBlocks( ( N + numThreads - 1 ) / numThreads.x );
gpuVectAdd<<<<numBlocks, numThreads>>> ( N, u_dev, v_dev, z_dev );
type(dim3) :: numBlocks, numThreads
numThreads = dim3( 32, 1, 1 )
numBlocks = dim3( (N + numThreads.x - 1) / numThreads.x, 1, 1 )
call gpuVectAdd<<<numBlocks,numThreads>>> ( N, u_dev, v_dev, z_dev )
```

#### Summer School on PARALLEL COMPUTING

# Heterogeneous High Performance Programming framework

 <u>http://www.hpcwire.com/hpcwire/2012-02-</u> 28/opencl\_gains\_ground\_on\_cuda.html

"

As the two major programming frameworks for GPU computing, OpenCL and CUDA have been competing for mindshare in the developer community for the past few years. Until recently, CUDA has attracted most of the attention from developers, especially in the high performance computing realm. But OpenCL software has now matured to the point where HPC practitioners are taking a second look.

Both OpenCL and CUDA provide a general-purpose model for data parallelism as well as low-level access to hardware, but only OpenCL provides an open, industry-standard framework. As such, it has garnered support from nearly all processor manufacturers including AMD, Intel, and NVIDIA, as well as others that serve the mobile and embedded computing markets. As a result, applications developed in OpenCL are now portable across a variety of GPUs

and CPUs.

"





# Heterogeneous High Performance Programming framework (2)

A modern computing platform includes:

- One or more CPUs
- One of more GPUs
- DSP processors
- Accelerators
- ... other?



- E.g. Samsung® Exynos 5:
- Dual core ARM A15
   1.7GHz, Mali T604 GPU

## OpenCL lets Programmers write a single portable program that uses <u>ALL</u> resources in the heterogeneous platform

Individual processors have many (possibly heterogeneous) cores.





ATI™ RV770



Summer

School on PARALLEL

COMPUTI

NVIDIA® Tesla® C2090

Intel® Xeon Phi™ coprocessor

The Heterogeneous many-core challenge:

How are we to build a software ecosystem for the

Heterogeneous many core platform?

Third party names are the property of their owners.



# OpenCL Working Group within Khronos



- Diverse industry participation
  - Processor vendors, system OEMs, middleware vendors, application developers.
- OpenCL became an important standard upon release by virtue of the market coverage of the companies behind it.





- One *Host* and one or more *OpenCL Devices* 
  - Each OpenCL Device is composed of one or more Compute Units
    - Each Compute Unit is divided into one or more *Processing Elements*
- Memory divided into host memory and device memory



OpenCL Platform Example (One node, two CPU sockets, two GPUs)

#### Summer School on PARALLEL COMPUTING

# **CPUs:**

- Treated as one OpenCL device
  - One CU per core
  - 1 PE per CU, or if PEs mapped to SIMD lanes, *n* PEs per CU, where *n* matches the SIMD width
- Remember:
  - the CPU will also have to be its own host!

# **GPUs:**

- Each GPU is a separate OpenCL device
- One CU per Streaming Multiprocessor
- Can use CPU and all GPU devices concurrently through OpenCL

**CU = Compute Unit; PE = Processing Element** 



 The "hello world" program of data parallel programming is a program to add two vectors

### C[i] = A[i] + B[i] for i=0 to N-1

- For the OpenCL solution, there are two parts
  - Kernel code
  - Host code





- The host program is the code that runs on the host to:
  - Setup the environment for the OpenCL program
  - Create and manage kernels
- 5 simple steps in a basic host program:
  - 1. Define the *platform* ... platform = devices+context+queues
  - 2. Create and Build the *program* (dynamic library for kernels)
  - 3. Setup *memory* objects
  - 4. Define the *kernel* (attach arguments to kernel functions)
  - 5. Submit *commands* ... transfer memory objects and execute kernels



Please, refer to he reference card. This will help you get used to the reference card and how to pull information from the card and express it in code.

### Vector Addition – Host Program



# **OpenCL C for Compute Kernels**



- Derived from ISO C99
  - A few *restrictions*: no recursion, function pointers, functions in C99 standard headers ...
  - Preprocessing directives defined by C99 are supported (#include etc.)
- Built-in data types
  - Scalar and vector data types, pointers
  - Data-type conversion functions:
    - convert\_type<\_sat><\_roundingmode>
  - Image types:
    - image2d\_t, image3d\_t and sampler\_t



- Built-in functions *mandatory* 
  - Work-Item functions, math.h, read and write image
  - Relational, geometric functions, synchronization functions
  - printf (v1.2 only, so not currently for NVIDIA GPUs)
- Built-in functions *optional* (called "extensions")
  - Double precision, atomics to global and local memory
  - Selection of rounding mode, writes to image3d\_t surface





- Function qualifiers
  - \_\_kernel qualifier declares a function as a kernel
    - I.e. makes it visible to host code so it can be enqueued
  - Kernels can call other kernel-side functions
- Address space qualifiers
  - \_\_global, \_\_local, \_\_constant, \_\_private
  - Pointer kernel arguments must be declared with an address space qualifier
- Work-item functions
  - get\_work\_dim(), get\_global\_id(), get\_local\_id(), get\_group\_id()
- Synchronization functions
  - Barriers all work-items within a work-group must execute the barrier function before any work-item can continue
  - **Memory fences** provides ordering between memory operations



### Host programs can be "ugly

"

- OpenCL's goal is extreme portability, so it exposes everything

   (i.e. it is quite verbose!).
- But most of the host code is the same from one application to the next – the re-use makes the verbosity a non-issue.
- You can package common API combinations into functions or even C++ or Python classes to make the reuse more convenient.





# **PORTING CUDA TO OPENCL**





Summer School on

PARALI

- I.e. working out how to split up the problem to run effectively on a many-core device
- Switching between CUDA and OpenCL is mainly changing the host code syntax
  - Apart from indexing and naming conventions in the kernel code (simple to change!)



### Allocating and copying memory

#### CUDA C

#### Allocate

float\* d\_x; cudaMalloc(&d\_x, sizeof(float)\*size);

### OpenCL C

Summer

School on PARALLEL COMPUTING

cl\_mem d\_x =
 clCreateBuffer(context,
 CL\_MEM\_READ\_WRITE,
 sizeof(float)\*size,
 NULL, NULL);

Host to Device

cudaMemcpy(d\_x, h\_x, sizeof(float)\*size, cudaMemcpyHostToDevice); clEnqueueWriteBuffer(queue, d\_x, CL\_TRUE, 0, sizeof(float)\*size, h\_x, 0, NULL, NULL);

#### Device to Host

cudaMemcpy(h\_x, d\_x, sizeof(float)\*size, cudaMemcpyDeviceToHost); clEnqueueReadBuffer(queue, d\_x, CL\_TRUE, 0, sizeof(float)\*size, h\_x, 0, NULL, NULL);



### Allocating and copying memory

### CUDA C

Allocate

float\* d\_x; cudaMalloc(&d\_x, sizeof(float)\*size);

### OpenCL C++

Summer

School on PARALLEL COMPUTING

cl::Buffer d\_x(begin(h\_x), end(h\_x), true);

Host to Device cudaMemcpy(d\_x, h\_x, sizeof(float)\*size, cudaMemcpyHostToDevice); Device to Host cudaMemcpy(h\_x, d\_x, sizeof(float)\*size, cudaMemcpyDeviceToHost); cl::copy(d\_x, begin(h\_x), end(h\_x));

CINECA

Declaring dynamic local/shared memory

#### Summer School on PARALLEL COMPUTING

# CUDA C

- Define an array in the kernel source as extern
   \_\_shared\_\_\_int array[];
- 2. When executing the kernel, specify the third parameter as size in bytes of shared memory

func<<<<num\_blocks,

num\_threads\_per\_block,
shared\_mem\_size>>>(args);

### OpenCL C++

1. Have the kernel accept a local array as an argument

\_kernel void func(

\_local int \*array)

{}

- Define a local memory kernel kernel argument of the right size
   cl::LocalSpaceArg localmem = cl::Local(shared\_mem\_size);
- 3. Pass the argument to the kernel invocation

func(EnqueueArgs(...),localmem);



Declaring dynamic local/shared memory

#### Summer School on PARALLEL COMPUTING

# CUDA C

- Define an array in the kernel source as extern
   \_\_shared\_\_ int array[];
- 2. When executing the kernel, specify the third parameter as size in bytes of shared memory

func<<<num\_blocks,
num\_threads\_per\_block,
shared\_mem\_size>>>(args);

### OpenCL C

- Have the kernel accept a local array as an argument
   \_kernel void func( \_local int \*array) {}
- 2. Specify the size by setting the kernel argument

clSetKernelArg(kernel, 0, sizeof(int)\*num\_elements, NULL);





- To enqueue the kernel
  - CUDA specify the number of thread blocks and threads per block
  - OpenCL specify the problem size and (optionally) number of work-items per work-group



Enqueue a kernel (C)

### CUDA C

dim3 threads\_per\_block(30,20);

dim3 num\_blocks(10,10);

kernel<<<num\_blocks,
 threads\_per\_block>>>();

### **OpenCL C**

Summer School on PARALLEL COMPUTING

const size\_t global[2] = {300, 200};

const size\_t local[2] = {30, 20};

#### clEnqueueNDRangeKernel

queue, &kernel, 2, 0, &global, &local, 0, NULL, NULL); Enqueue a kernel (C++)

### CUDA C

dim3 threads\_per\_block(30,20);

**OpenCL C++** 

Summer School on PARALLEL COMPUTING

const cl::NDRange global(300, 200);

dim3 num\_blocks(10,10);

const cl::NDRange local(30, 20);

kernel<<<num\_blocks,
 threads\_per\_block>>>(...);

kernel(
 EnqueueArgs(global, local),
 ...);

CINECA



gridDim blockIdx blockDim gridDim \* blockDim threadIdx blockIdx \* blockdim + threadIdx

get\_num\_groups() get\_group\_id() get\_local\_size() get\_global\_size() get\_local\_id() get\_global\_id()





- Where do you find the kernel?
  - OpenCL either a string (const char \*), or read from a file
  - CUDA a function in the host code
- Denoting a kernel
  - OpenCL \_\_kernel
  - CUDA \_\_global\_\_\_
- When are my kernels compiled?
  - OpenCL at runtime
  - CUDA with compilation of host code





Summer

School on

PARALL

- If you needed anything more complicated (multidevice etc.) you must do so manually
- OpenCL always requires explicit device initialization
  - It runs not just on NVIDIA® GPUs and so you must tell it which device(s) to use



**Thread Synchronization** 



\_\_syncthreads()

barrier()

\_threadfenceblock()

No equivalent

No equivalent

\_threadfence()

mem\_fence( CLK\_GLOBAL\_MEM\_FENCE | CLK\_LOCAL\_MEM\_FENCE)

read\_mem\_fence()

write\_mem\_fence()

Finish one kernel and start another



### Translation from CUDA to OpenCL

CINECA



| CUDA                      | OpenCL                       |
|---------------------------|------------------------------|
| GPU                       | Device (CPU, GPU etc)        |
| Multiprocessor            | Compute Unit, or CU          |
| Scalar or CUDA core       | Processing Element, or PE    |
| Global or Device Memory   | Global Memory                |
| Shared Memory (per block) | Local Memory (per workgroup) |
| Local Memory (registers)  | Private Memory               |
| Thread Block              | Work-group                   |
| Thread                    | Work-item                    |
| Warp                      | No equivalent term (yet)     |
| Grid                      | NDRange                      |

### **OpenCL** live@Eurora





### Eurora

- Eurora CINECA-Eurotech
   prototype
- 1 rack
- Two Intel SandyBridge and
- two NVIDIA K20 cards per node or:
- Two Intel MIC card per node
- Hot water cooling
- Energy efficiency record (up to 3210 MFLOPs/w)
- 100 TFLOPs sustained





### **Running environment**

#### Summer School on PARALLEL COMPUTING

### NVIDIA Tesla K20

- 13 Multiprocessors
- 2496 CUDA Cores
- 5 GB of global memory
- GPU clock rate 760MHz

### Intel MIC Xeon Phi

- 236 compute units
- 8 GB of global memory
- CPU clock rate 1052 MHz





## Setting up OpenCL on Eurora



- Login on front-end. Then:
- >module load profile/advanced
  > module load intel\_opencl/none--intel--cs-xe-2013--binary

It defines:

#### INTEL\_OPENCL\_INCLUDE

and

#### INTEL\_OPENCL\_LIB

environmental variables that can be used:

>cc -I\$INTEL\_OPENCL\_INCLUDE -L\$INTEL\_OPENCL\_LIB -IOpenCL vadd.c -o vadd



### **Running on Intel**

PARALLEL

Summer

School on

PROFILE=FULL PROFILE VERSION=OpenCL 1.2 LINUX NAME=Intel(R) OpenCL VENDOR=Intel(R) Corporation EXTENSIONS=cl\_khr\_fp64 cl\_khr\_global\_int32\_base\_atomics cl khr global int32 extended atomics cl khr local int32 base atomics cl khr local int32 extended atomics cl khr byte addressable store --0--DEVICE NAME= Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz DEVICE VENDOR=Intel(R) Corporation DEVICE VERSION=OpenCL 1.2 (Build 67279) DEVICE MAX COMPUTE UNITS=16 DEVICE MAX WORK GROUP SIZE=1024 DEVICE MAX WORK ITEM DIMENSIONS=3 DEVICE\_MAX\_WORK\_ITEM\_SIZES=1024 1024 1024 DEVICE GLOBAL MEM SIZE=16685436928 --1--DEVICE NAME=Intel(R) Many Integrated Core Acceleration Card DEVICE VENDOR=Intel(R) Corporation DEVICE VERSION=OpenCL 1.2 (Build 67279) DEVICE\_MAX\_COMPUTE\_UNITS=236 DEVICE\_MAX\_WORK\_GROUP\_SIZE=1024 DEVICE MAX WORK ITEM DIMENSIONS=3 DEVICE MAX WORK ITEM SIZES=1024 1024 1024 DEVICE\_GLOBAL\_MEM\_SIZE=6053646336 --2--DEVICE NAME=Intel(R) Many Integrated Core Acceleration Card DEVICE VENDOR=Intel(R) Corporation DEVICE VERSION=OpenCL 1.2 (Build 67279) DEVICE MAX COMPUTE UNITS=236 DEVICE\_MAX\_WORK\_GROUP\_SIZE=1024 DEVICE\_MAX\_WORK\_ITEM\_DIMENSIONS=3 DEVICE MAX WORK ITEM SIZES=1024 1024 1024 DEVICE GLOBAL MEM SIZE=6053646336 Computed sum = 549754961920.0. Check passed.

Intel OpenCL platform found and 3 devices (cpu and Intel MIC card)

#### Intel MIC device was selected

Results are OK no matter what performances





- Goal:
  - To inspect and verify that you can run an OpenCL kernel on Eurora machines
- Procedure:
  - Take the provided C vadd.c and vadd.cl source programs from VADD directory
  - Compile and link vadd.c
  - Run on NVIDIA or Intel platform.
- Expected output:
  - A message verifying that the vector addition completed successfully
  - Some useful info about OpenCL environment (Intel and NVIDIA)



### Matrix-Matrix product: HOST

```
void MatrixMulOnHost (float* M, float* N, float* P, int Width)
ł
  // loop on rows
  for (int row = 0; row < Width; ++row) {</pre>
    // loop on columns
    for (int col = 0; col < Width; ++col) {
      // accumulate element-wise products
      float pval = 0;
      for (int k = 0; k < Width; ++k) {
        float a = M[row * Width + k];
        float b = N[k * Width + col];
        pval += a * b;
      }
      // store final results
      P[row * Width + col] = pval;
}
                                                      Μ
                                                         k
```



### Matrix-Matrix product: launch grid Summer School on PARALLEL COMPUTING (2,0) (0,0) MatrixWidth j (2,1) Matrix (2,2) index (2,3) gridDim.x \* blockDim.x i = blockIdx.x \* blockDim.x + threadIdx.x; j = blockIdx.y \* blockDim.y + threadIdx.y; index = j \* MatrixWidth + i; 99777

Summer School on PARALLEL COMPUTING



}

#### Summer School on PARALLEL COMPUTING

## Private Memory

- Per work-item
- Local Memory
  - Shared within a work-group
- Global/Constant Memory
  - Visible to all work-groups
- Host memory
  - On the CPU



Memory management is **explicit**: You are responsible for moving data from host  $\rightarrow$  global  $\rightarrow$  local *and* back



## **OpenCL** Memory model

#### Summer School on PARALLEL COMPUTING

- Private Memory
  - Fastest & smallest: O(10) words/WI
- Local Memory
  - Shared by all WI's in a work-group
  - But not shared between work-groups!
  - O(1-10) Kbytes per work-group
- Global/Constant Memory
  - O(1-10) Gbytes of Global memory
  - O(10-100) Kbytes of Constant memory
- Host memory
  - On the CPU GBytes

Memory management is **explicit**: O(1-10) Gbytes/s bandwidth to discrete GPUs for Host <-> Global transfers



## OpenCL mapping

CINECA







#### Kernels: Work-item and Work-group Example





Matrix multiplication: OpenCL kernel

```
Summer
School on
PARALLEL
COMPUTING
```

```
kernel void mat_mul(
const int Mdim, const int Ndim, const int Pdim,
  global float *A, __global float *B, __global float *C)
 int i, j, k;
 for (i = 0; i < Ndim; i++) {
   for (j = 0; j < Mdim; j++) {
      for (k = 0; k < Pdim; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
```

Remove outer loops and set work-item coordinates





```
kernel void mat_mul(
const int Mdim, const int Ndim, const int Pdim,
  global float *A, global float *B, global float *C)
 int i, j, k;
 j = get_global_id(0);
 i = get_global_id(1);
 // C(i, j) = sum(over k) A(i,k) * B(k,j)
 for (k = 0; k < Pdim; k++) {
   C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
```



### Matrix multiplication: OpenCL kernel improved

#### Summer School on PARALLEL COMPUTING

Rearrange and use a local scalar for intermediate C element values (a common optimization in Matrix Multiplication functions)

### \_kernel void mat\_mul(

const int Mdim, const int Ndim, const int Pdim, \_\_global float \*A, \_\_global float \*B, \_\_global float \*C)

```
{
    int k;
    int j = get_global_id(0);
    int i = get_global_id(1);
    float tmp = 0.0f;
    for (k = 0; k < Pdim; k++)
     tmp += A[i*Ndim+k]*B[k*Pdim+j];
    }
    C[i*Ndim+j] += tmp;
}</pre>
```



### Matrix multiplication: OpenCL kernel improved

#### Summer School on PARALLEL COMPUTING

Rearrange and use a local scalar for intermediate C element values (a common optimization in Matrix Multiplication functions)

| Matrix<br>Size | Platfor<br>m   | Kernel<br>time<br>(sec.) | GFLOP/<br>s | {<br>int k;<br>int j = get_global_id(0);<br>int i = get_global_id(1);                                                                             |
|----------------|----------------|--------------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| 2048           | NVIDIA<br>K20s | 0.24                     | 71          | <pre>int i = get_global_id(1); float tmp = 0.0f; for (k = 0; k &lt; Pdim; k++)    tmp += A[i*Ndim+k]*B[k*Pdin    }    C[i*Ndim+j] += tmp; }</pre> |
| 2048           | Intel MIC      | 0.47                     | 37          |                                                                                                                                                   |

66777

Matrix-Matrix product: selecting optimum thread block size



Which is the best thread block /work-group size to select (i.e. TILE\_WIDTH)? On <u>Fermi</u> architectures: each SM can handle up to **1536** total threads TILE\_WIDTH = 8

8x8 = 64 threads >>> 1536/64 = 24 blocks needed to fully load a SM ... yet there is a limit of maximum 8 resident blocks per SM for cc 2.x so we end up with just 64x8 = 512 threads per SM on a maximum of 1536 (only 33% occupancy)

TILE WIDTH = 16

**16x16** = 256 threads >>> 1536/256 = 6 blocks to fully load a SM 6x256 = 1536 threads per SM ... reaching **full occupancy** per SM!

TILE WIDTH = 32

**32x32** = 1024 threads >>> 1536/1024 = 1.5 = 1 block fully loads SM 1024 threads per SM (only 66% occupancy)

TILE WIDTH = 16



Matrix-Matrix product: selecting optimum thread block size

**16x16** = 256 threads >>> 2048/256 = 8 blocks to fully load a SM 8x256 = 2048 threads per SM ... reaching **full occupancy** per SM!

32x32 = 1024 threads >>> 2048/1024 = 2 blocks fully load a SM 2x1024 = 2048 threads per SM ... reaching full occupancy per SM!

Which is the best thread block size/work-group size to select (i.e. **TILE\_WIDTH**)? On <u>Kepler</u> architectures: each SM can handle up to **2048** total threads **TILE WIDTH = 8** 

8x8 = 64 threads >>> 2048/64 = 32 blocks needed to fully load a SM
... yet there is a limit of maximum 16 resident blocks per SM for cc 3.x
so we end up with just 64x16 = 1024 threads per SM on a maximum of 2048 (only 50% occupancy)

TILE WIDTH = 16

Summer

School on PARALLEL COMPUTING

TILE WIDTH = 32

TILE WIDTH = 16 or 32



Matrix-Matrix product: selecting optimum thread block size



8x8 = 64 threads >>> 2048/64 = 32 blocks needed to fully load a SM ... yet there is a limit of maximum 16 resident blocks per SM for cc 3.x so we end up with just 64x16 = 1024 threads per SM on a maximum of 2048 (only 50% occupancy)

TILE WIDTH = 16

TILE WIDTH = 32

Summer

School on PARALLEL COMPUTING

**16x16** = 256 threads >>> 2048/256 = 8 blocks to fully load a SM 8x256 = 2048 threads per SM ... reaching **full occupancy** per SM!

**32x32** = 1024 threads >>> 2048/1024 = 2 blocks fully load a SM 2x1024 = 2048 threads per SM ... reaching **full occupancy** per SM!

| TILE_WIDTH | Kernel time<br>(sec.) | GFLOP/s<br>(NVIDIA K20) |
|------------|-----------------------|-------------------------|
| 8          | 0.33                  | 52                      |
| 16         | 0.20                  | 82                      |
| 32         | 0.16                  | 104                     |



# Matrix-Matrix product: check inside matrix borders



| kernel chek (Yes/No) | Matrices Size | Kernel Error                              | GFLOP/s (Intel MIC) |
|----------------------|---------------|-------------------------------------------|---------------------|
| Yes                  | 2047          | 1                                         | 20                  |
| Yes                  | 2048          | 1                                         | 35                  |
| No                   | 2047          | Failed (different results from reference) | 21                  |
| No                   | 2048          | /                                         | 37                  |



- Tens of KBytes per Compute Unit
  - As multiple Work-Groups will be running on each CU, this means only a fraction of the total Local Memory size is available to each Work-Group
- Assume O(1-10) KBytes of Local Memory per Work-Group
  - Your kernels are responsible for transferring data between Local and Global/Constant memories ... there are optimized library functions to help
- Use Local Memory to hold data that can be reused by all the work-items in a work-group
- Access patterns to Local Memory affect performance in a similar way to accessing Global Memory
  - Have to think about things like coalescence & bank conflicts

\* Typical figures for a 2013 GPU



## Local Memory

#### Summer School on PARALLEL COMPUTING

- Local Memory doesn't always help...
  - CPUs, MICs don't have special hardware for it
  - This can mean excessive use of Local Memory might slow down kernels on CPUs
  - GPUs now have effective on-chip caches which can provide much of the benefit of Local Memory but without programmer intervention
  - So, your mileage may vary!



## Using Local/Shared Memory for Thread Cooperation

#### Summer School on PARALLEL COMPUTING

- Threads belonging to the same block can cooperate togheter using the shared memory to share data
- if a thread needs some data which has been already retrived by another thread in the same block, this data can be shared using the shared memory Typical Shared Memory usage:
- declare a buffer residing on shared memory (this buffer is per block)
- Joad data into shared memory buffer
   synchronize threads so to make sure
- synchronize threads so to make sure all needed data is present in the buffer
- 4. performe operation on data
- Synchronize threads so all operations have been performed
  write back regults to global memory.
- **b.** write back results to global memory

| (Device) Grid      |               |  |  |  |
|--------------------|---------------|--|--|--|
| Block (0, 0)       | Block (1, 0)  |  |  |  |
| Shared Memory      | Shared Memory |  |  |  |
| Registers          | Registers     |  |  |  |
| Global<br>Memory   | ↓             |  |  |  |
| Constant<br>Memory |               |  |  |  |
| Texture<br>Memory  |               |  |  |  |



### Matrix-matrix using Shared Memory: CUDA Kernel

#### Summer School on PARALLEL COMPUTING

// Matrix multiplication kernel called by MatMul\_gpu()
\_\_global\_\_ void MatMul\_kernel (float \*A, float \*B, float \*C, int N)
{

// Shared memory used to store Asub and Bsub respectively
\_\_shared\_\_ float Asub[NB][NB];
\_\_shared\_\_ float Bsub[NB][NB];

// Block row and column
int ib = blockIdx.y;
int jb = blockIdx.x;

// Thread row and column within Csub
int it = threadIdx.y;
int jt = threadIdx.x;

int a\_offset , b\_offset, c\_offset;

444

// Each thread computes one element of Csub
// by accumulating results into Cvalue
float Cvalue = 0;

// Loop over all the sub-matrices of A and B that are
// required to compute Csub
// Multiply each pair of sub-matrices together
// and accumulate the results

for (int kb = 0; kb < (A.width / NB); ++kb) {

// Get the starting address of Asub and Bsub
a\_offset = get\_offset (ib, kb, N);
b\_offset = get\_offset (kb, jb, N);

// Load Asub and Bsub from device memory to shared memory // Each thread loads one element of each sub-matrix Asub[it][jt] = A[a\_offset + it\*N + jt]; Bsub[it][jt] = B[b\_offset + it\*N + jt];

// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
\_\_syncthreads();

// Multiply Asub and Bsub together
for (int k = 0; k < NB; ++k) {
 Cvalue += Asub[it][k] \* Bsub[k][jt];
}
// Synchronize to make sure that the preceding
// computation is done
\_\_syncthreads();</pre>

// Get the starting address (c\_offset) of Csub c\_offset = get\_offset (ib, jb, N); // Each thread block computes one sub-matrix Csub of C C[c\_offset + it\*N + jt] = Cvalue;

}

## Matrix-matrix using Shared Memory: OpenCL Kernel

#### Summer School on PARALLEL COMPUTING

// Matrix multiplication kernel called by MatMul\_gpu()
\_\_kernel\_ void MatMul\_kernel (float \*A, float \*B, float \*C, int N)
{

 ${\ensuremath{/\!/}}$  Shared memory used to store Asub and Bsub respectively

\_\_local float Asub[NB][NB];
\_\_local float Bsub[NB][NB];

// Block row and column
int ib = get\_group\_id(1);
int jb = get\_group\_id(0);

// Thread row and column within Csub
int it = get\_local\_id(1);
int jt = get\_local\_id(0);

int a\_offset , b\_offset, c\_offset;

// Each thread computes one element of Csub
// by accumulating results into Cvalue
float Cvalue = 0;

// Loop over all the sub-matrices of A and B that are // required to compute Csub // Multiply each pair of sub-matrices together

// and accumulate the results

\*\*\*

for (int kb = 0; kb < (A.width / NB); ++kb) {

// Get the starting address of Asub and Bsub a\_offset = get\_offset (ib, kb, N); b\_offset = get\_offset (kb, jb, N);

// Load Asub and Bsub from device memory to shared memory // Each thread loads one element of each sub-matrix Asub[it][jt] = A[a\_offset + it\*N + jt]; Bsub[it][jt] = B[b\_offset + it\*N + jt];

// Synchronize to make sure the sub-matrices are loaded // before starting the computation barrier(CLK\_LOCAL\_MEM\_FENCE);

// Multiply Asub and Bsub together
for (int k = 0; k < NB; ++k) {
 Cvalue += Asub[it][k] \* Bsub[k][jt];
}
// Synchronize to make sure that the preceding
// computation is done
barrier(CLK\_LOCAL\_MEM\_FENCE);</pre>

// Get the starting address (c\_offset) of Csub c\_offset = get\_offset (ib, jb, N); // Each thread block computes one sub-matrix Csub of C C[c\_offset + it\*N + jt] = Cvalue;

}

## Matrix-matrix using Shared Memory: OpenCL Kernel

#### Summer School on PARALLEL COMPUTING

| <pre>// Matrix multiplication kernel called by MatMul_gpug<br/>kernel_ void MatMul_kernel (float *A, float *B, float *</pre> |             |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |         |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| {                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |             | for (int kb = 0; kb < (A.width / NB); ++kb) {                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |         |
| <pre>// Shared memory used to store Asub and Bsub respe<br/>local float Asub[NB][NB];<br/>local float Bsub[NB][NB];<br/>// Block row and column<br/>int ib = get_group_id(1);<br/>int jb = get_group_id(0);<br/>// Thread row and column within Csub<br/>int it = get_local_id(1);<br/>int jt = get_local_id(0);<br/>int a_offset , b_offset, c_offset;<br/>// Each thread computes one element of Csub<br/>// by accumulating results into Cvalue<br/>float Cvalue = 0;</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |             | <pre>// Get the starting address of Asub and Bsub<br/>a_offset = get_offset (ib, kb, N);<br/>b_offset = get_offset (kb, jb, N);<br/>// Load Asub and Bsub from device memory to<br/>// Each thread loads one element of each sub-matched<br/>Asub[it][jt] = A[a_offset + it*N + jt];<br/>Bsub[it][jt] = B[b_offset + it*N + jt];<br/>// Synchronize to make sure the sub-matrices ar<br/>// before starting the computation<br/>barrier(CLK_LOCAL_MEM_FENCE);<br/>// Multiply Asub and Bsub together<br/>for (int k = 0; k &lt; NB; ++k) {<br/>Cvalue += Asub[it][k] * Bsub[k][jt];<br/>}<br/>// Synchronize to make sure that the preceding<br/>// Synchronize to make sure that the preceding</pre> | atrix   |
| // Loop over all the sub-matrices of A and B that are                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |             | // computation is done                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |         |
| Matrix Size                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | Platform    | Kernel time (sec.)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | GFLOP/s |
| 2048                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | NVIDIA K20s | 0.10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 166     |
| 2048                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Intel MIC   | 0.15                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 115     |



- Intel MIC combines many core onto a single chip. Each core runs exactly 4 hardware threads. In particular:
- 1. All cores/threads are a single OpenCL device
- 2. Separate hardware threads are OpenCL CU.
- In the end, you'll have parallelism at the work-group level (vectorization) and parallelism between work-groups (threading).



#### Summer School on PARALLEL COMPUTING

- To reach performances, the number of work-groups should be not less than *CL\_DEVICE\_MAX\_COMPUTE\_UNITS* parameter (more is better)
- Again, automatic vectorization module should be fully utilized. This module:
   packs adiacent work-items (from dimension 0 of NDRange)
  - executes them with SIMD instructions
- Use the recommended work-group size as multiple of 16 (SIMD width for float, int, ...data type).



Matrix-matrix on Intel MIC (skeleton)

```
for i from 0 to NUM_OF_TILES_M-1
     for j from O to NUM OF TILES N-1
         C_BLOCK = ZERO_MATRIX(TILE_SIZE_M, TILE_SIZE_N)
         for k from O to size-1
              for ib = from 0 to TILE SIZE M-1
                  for jb = from 0 to TILE_SIZE_N-1
                        C_BLOCK(jb, ib) = C_BLOCK(ib, jb) + A(k, i*TILE_SIZE_M + ib)*B(j*TILE_SIZE_N + jb, k)
                   end for jb
               end for ib
         end for k
                                                                             TILE SIZE K = size
         for ib = from 0 to TILE SIZE M-1
                                                                             of block for
              for jb = from O to TILE SIZE N-1
                                                                             internal
                   C(j^*TILE\_SIZE\_M + jb, i^*TILE\_SIZE\_N + ib) = C\_BLOCK(jb, ib)
                                                                             computation of
              end for ib
                                                                             C BLOCK
         end for ib
     end for j
                      TILE GROUP M x TILE GROUP N =
end for i
                      number of WI within each WG
                                TILE SIZE M \mathbf{x} TILE SIZE N =
                                number of elements of C computed
                                by one WI
```

Summer

School on PARALLEL COMPUTING

```
for i from 0 to NUM_OF_TILES_M-1
     for j from 0 to NUM_OF_TILES_N-1
          C_{BLOCK} = ZERO_{MATRIX}(TILE_{SIZE}M, TILE_{SIZE}N)
          for k from O to size-1
                for ib = from 0 to TILE SIZE M-1
                     for jb = from \ O \ to \ TILE_SIZE_N-1
                           C_BLOCK(jb, ib) = C_BLOCK(ib, jb) + A(k, i*TILE_SIZE_M + ib)*B(j*TILE_SIZE_N + jb, k)
                      end for jb
                 end for ib
          end for k
          for ib = from 0 to TILE SIZE M-1
                for jb = from 0 to TILE_SIZE_N-1
                      C(j^*TILE\_SIZE\_M + jb, i^*TILE\_SIZE\_N + ib) = C\_BLOCK(jb, ib)
                end for jb
          end for ib
     end for j
end for i
```

Summer School on PARALLEL COMPUTING

|       | Matrices Size | Kernel time (sec.) | GFLOP/s (Intel<br>MIC) |
|-------|---------------|--------------------|------------------------|
|       | 3968          | 0.3                | 415                    |
| TTILL |               |                    |                        |



## The future of Accelerator Programming

Most of the latest supercomputers are based on accelerators platform. This huge adoption is the result of:

- High (peak) performances
- Good energy efficiency
- Low price





Accelerators should be used everywhere and all the time. So, why aren't there?



Summer School on PARALLEL COMPUTING

## The future of Accelerator Programming

There are two main difficulties with accelerators:

- They can only execute certain type of programs efficiently (high parallelism, data reuse, regular control flow and data access)
- Architectural disparity with respect to CPU (cumbersome programming, portability is an issue)



an issue) Accelerators should be used everywhere and all the time. So, why aren't there?



Summer School on PARALLEL COMPUTING

## The future of Accelerator Programming

GPUs are now more generalpurpose computing devices thanks to CUDA adoption. On the other hand, the fact that CUDA is a proprietary tool and its complexity triggered the creation of other programming approaches:

- OpenCL
- OpenAcc
- ..
- ...



Accelerators should be used everywhere and all the time. So, why aren't there?



#### Summer School on PARALLEL COMPUTING

## The future of Accelerator Programming

- OpenCL is the non-proprietary counterpart of CUDA (also supports AMD GPUs, CPUs, MIC, FPGAs....really portable!) but just like CUDA, is very low level and require a lot of programming skills to be used.
- OpenACC is a very high-level approach. Similar to OpenMP (they should be merged in a near(?) future) but still at its infancy and currently supported by a few compilers
- Other approaches like C++AMP only tied to exhotic HPC environment (Windows) and impractical for standard HPC applications





Accelerators should be used everywhere and all the time. So, why aren't there?



#### Summer School on PARALLEL COMPUTING

## The future of Accelerator Programming

- So, how to (efficiently) program actual and future devices?
- A possible answer could be surprisingly simple and similar to how today's multicore (CPUs) are used (including SIMD extensions, accelerators,...)
- Basically, there are three levels:
- libraries
- automated tools
- do-it-yourself
- Programmers will employ library approach whenever possible. In absence of efficient libraries, tools could be used.
- For the remaining cases, the do-it-yourself approach will have to be used (OpenCL or a derivative of it should be preferred to proprietary CUDA)



# Accelerators will be used everywhere and all the time. So, start to use them!





## Credits

Summer School on PARALLEL COMPUTING

Among the others:

- Simon McIntosh Smith for OpenCL
- CUDA Team in CINECA (Luca Ferraro, Sergio Orlandini, Stefano Tagliaventi)
- MontBlanc project (EU) Team

