# Overview of the Intel<sup>®</sup> Xeon and Xeon Phi technologies Broadwell and Knights Landing

Fabio Affinito (SCAI - Cineca)



## Intel<sup>®</sup> Xeon Processor Architecture



#### Intel® Xeon® Processor E5-2600 v4 Product Family - TICK



#### Intel® Xeon® E5-2600 v4 Product Family Overview

New features:

- Broadwell microarchitecture
- Built on 14nm process technology
- Socket-compatible replacement/upgrade on Grantley-EP platforms

New performance technologies:

- Optimized Intel<sup>®</sup> AVX Turbo mode
- Intel TSX instructions

Other enhancements:

- Virtualization speedup
- Orchestration control
- Security improvements

| Features                  | Xeon E5-2600 v3 (Haswell-EP)                                                           | Xeon E5-2600 v4 (Broadwell-EP) |
|---------------------------|----------------------------------------------------------------------------------------|--------------------------------|
| Cores Per Socket          | Up to 18                                                                               | Up to 22                       |
| Threads Per Socket        | Up to 36 threads                                                                       | Up to 44 threads               |
| Last-level Cache (LLC)    | Up to 45 MB                                                                            | Up to 55 MB                    |
| QPI Speed (GT/s)          | 2x QPI 1.1 channels: 6.4, 8.0, 9.6 GT/s                                                | Same as v3                     |
| PCIe Lanes / Speed (GT/s) | 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s)                                                    | Same as v3                     |
| Memory Population         | 4 channels of up to 3 RDIMMs or 3 LRDIMMs                                              | + 3DS LRDIMM                   |
| Memory RAS                | ECC, Patrol Scrubbing, Demand Scrubbing, Sparing, Mirroring, Lockstep Mode, x4/x8 SDDC | + DDR4 Write CRC               |
| Max Memory Speed (MT/s)   | Up to 2133                                                                             | Up to 2400                     |
| TDP (W)                   | 160 (Workstation only), 145, 135, 120, 105, 90, 85, 65, 55                             | Same as v3                     |



#### Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5-2600 v4 Product Family MCC/LCC



#### Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5-2600 v4 Product Family HCC

High core count (HCC) die configuration

- Used by SKUs with 16 to 22 cores
  - E5-2699 v4
  - E5-2698 v4
  - E5-2697 v4
  - E5-2697A v4
  - E5-2695 v4
  - E5-2683 v4
- For each core
  - 2.5 MB last level cache (LLC)
  - Caching agent (CBO)
- For each ring
  - Home agent (HA)
  - Memory controller with 2 DDR4 channels



## What's next ...

- Broadwell (code name) E7 (4-socket server processor models)
- Skylake (code name) server (E5 and E7)
  - Micro-architecture launched in client processors Sep. 2015
  - Intel<sup>®</sup> AVX-512 (server only)
  - Expect a lot of additional, key changes
- FPGA and Xeon server integration
- NVM (non-volatile memory) 3D XPoint<sup>™</sup> Technology



# Intel<sup>®</sup> Many Integrated Core Architecture (Intel<sup>®</sup> MIC) Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor



#### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Product Family

based on Intel<sup>®</sup> Many Integrated Core (MIC) Architecture





\*Per Intel's announced products or planning process for future products

#### Knights Landing: Next-Generation Intel® Xeon Phi™





## **KNL Mesh Interconnect**



#### **Mesh of Rings**

- Every row and column is a (half) ring
- YX routing: Go in Y → Turn → Go in X
- Messages arbitrate at injection and on turn
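The YX rule above can be sketched in a few lines of C. This is a minimal illustration, not the hardware algorithm: the `tile` type, the `(x, y)` numbering and the function name `yx_route` are hypothetical, and the sketch ignores arbitration and the half-ring wiring of each row and column.

```c
#include <assert.h>
#include <stddef.h>

/* A tile position on the mesh: column x, row y (hypothetical numbering). */
typedef struct {
    int x;
    int y;
} tile;

/*
 * Record the tiles visited when going from src to dst with YX routing:
 * travel along Y first, turn once, then travel along X.
 * path[] receives every tile after src, up to and including dst.
 * Returns the number of hops, or -1 if path[] (capacity cap) is too small.
 */
int yx_route(tile src, tile dst, tile *path, size_t cap)
{
    assert(path != NULL);

    size_t hops = 0;
    tile cur = src;
    int step_y = (dst.y > cur.y) - (dst.y < cur.y); /* -1, 0 or +1 */
    int step_x = (dst.x > cur.x) - (dst.x < cur.x);

    /* Phase 1: move in Y until the destination row is reached. */
    while (cur.y != dst.y) {
        if (hops == cap)
            return -1;
        cur.y += step_y;
        path[hops++] = cur;
    }

    /* Phase 2: single turn, then move in X to the destination column. */
    while (cur.x != dst.x) {
        if (hops == cap)
            return -1;
        cur.x += step_x;
        path[hops++] = cur;
    }

    return (int)hops;
}
```

Because every message turns at most once, the hop count is simply the Manhattan distance between the two tiles.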

#### **Cache Coherent Interconnect**

- MESIF protocol (F = Forward)
- Distributed directory to filter snoops

#### **Three Cluster Modes**

- (1) All-to-All
- (2) Quadrant
- (3) Sub-NUMA Clustering (SNC)







# Cluster Mode: All-to-All

Address uniformly hashed across all distributed directories

No affinity between Tile, Directory and Memory

Lower performance than the other modes; mainly a fall-back

#### Typical Read L2 miss

1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory. Forward to memory
4. Memory sends the data to the requestor







# Cluster Mode: Quadrant

Chip divided into four virtual Quadrants

Address hashed to a Directory in the same quadrant as the Memory

Affinity between the Directory and Memory

Lower latency and higher bandwidth than all-to-all. Software transparent.

Typical read L2 miss: 1. L2 miss, 2. Directory access, 3. Memory access, 4. Data return
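The difference between hashing an address over all directories and hashing it only within the quadrant that owns its memory can be sketched as follows. Everything here is an assumption for illustration: the slides do not describe the real hash, so `mix64`, the directory counts and the page-interleaved `memory_quadrant` mapping are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes: 4 quadrants with 8 directories each. */
enum {
    NUM_QUADRANTS = 4,
    DIRS_PER_QUADRANT = 8,
    NUM_DIRS = NUM_QUADRANTS * DIRS_PER_QUADRANT
};

typedef enum { ALL_TO_ALL, QUADRANT } cluster_mode;

/* Toy 64-bit mixer standing in for the undocumented hardware hash. */
uint64_t mix64(uint64_t v)
{
    v ^= v >> 33;
    v *= 0xff51afd7ed558ccdULL;
    v ^= v >> 33;
    return v;
}

/* Assumed placement: 4 KiB pages interleaved across the quadrants. */
unsigned memory_quadrant(uint64_t addr)
{
    return (unsigned)((addr >> 12) % NUM_QUADRANTS);
}

/* Pick the directory that tracks the cache line containing addr. */
unsigned directory_for(uint64_t addr, cluster_mode mode)
{
    assert(mode == ALL_TO_ALL || mode == QUADRANT);

    uint64_t h = mix64(addr >> 6); /* hash the 64-byte line index */

    if (mode == ALL_TO_ALL)
        return (unsigned)(h % NUM_DIRS); /* any directory on the chip */

    /* Quadrant mode: stay inside the quadrant that owns the memory. */
    return memory_quadrant(addr) * DIRS_PER_QUADRANT
         + (unsigned)(h % DIRS_PER_QUADRANT);
}
```

In this sketch the quadrant variant guarantees that directory and memory sit in the same quadrant, which is the affinity the slide credits for the lower latency.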

# Cluster Mode: Sub-NUMA Clustering (SNC)




Each Quadrant (Cluster) exposed as a separate NUMA domain to OS

Looks analogous to 4-Socket Xeon

Affinity between Tile, Directory and Memory

Local communication. Lowest latency of all modes

Software needs to be NUMA-aware to get benefit

Typical read L2 miss: 1. L2 miss, 2. Directory access, 3. Memory access, 4. Data return

# **KNL Core and VPU**

- Out-of-order core with 4 SMT threads
- VPU tightly integrated with core pipeline
- 2-wide decode/rename/retire
- 2x 64B load and 1x 64B store port for the data cache (D\$)
- L1 prefetcher and L2 prefetcher
- Fast unaligned and cache-line split support
- Fast gather/scatter support





### Software Adaptation for KNL – Key New Features

Large impact: Intel<sup>®</sup> AVX-512 instruction set

- Slightly different from future Intel<sup>®</sup> Xeon<sup>™</sup> architecture AVX-512 extensions
- Includes SSE, AVX, AVX2
- Apps built for HSW and earlier can run on KNL (with a few exceptions, such as TSX)
- Incompatible with 1st Generation Intel<sup>®</sup> Xeon<sup>™</sup> Phi (KNC)

Medium impact: New, on-chip high bandwidth memory (MCDRAM) creates heterogeneous (NUMA) memory access

- MCDRAM can also be used transparently, however

Minor impact: Differences in floating point execution / rounding due to FMA and new HW-accelerated transcendental functions - like exp()
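A small worked case shows how FMA alone can change a result. It assumes IEEE 754 double precision with round-to-nearest-even; the values of $a$, $b$ and $c$ are chosen here for illustration and do not come from the slides. Writing $\mathrm{fl}(\cdot)$ for rounding to double:

$$
\begin{aligned}
a &= 1 + 2^{-27}, \qquad b = 1 - 2^{-27}, \qquad c = -1 \\
ab &= 1 - 2^{-54} \quad \text{(exact product, not representable as a double)} \\
\text{separate multiply, then add:} \quad \mathrm{fl}(\mathrm{fl}(ab) + c) &= \mathrm{fl}(1 + (-1)) = 0 \\
\text{fused multiply-add:} \quad \mathrm{fl}(ab + c) &= -2^{-54}
\end{aligned}
$$

The exact product lies halfway between two neighbouring doubles and rounds to 1, so the unfused version loses the whole result, while the FMA rounds only once and keeps it.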



#### AVX-512 - Greatly increased Register File
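As a rough count of the architectural vector state in 64-bit mode (register counts and widths are stated here from the instruction set definitions, not recovered from the slide's figure):

$$
\begin{aligned}
\text{AVX / AVX2:} \quad & 16 \times 256\ \text{bit} = 4096\ \text{bit} = 512\ \text{B} \\
\text{AVX-512:} \quad & 32 \times 512\ \text{bit} = 16384\ \text{bit} = 2048\ \text{B}
\end{aligned}
$$

That is four times the vector register state, in addition to the eight mask registers k0 to k7 that AVX-512 introduces.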





#### The Intel<sup>®</sup> AVX-512 Subsets [2]

AVX-512DQ

- All of the (packed) 32-bit/64-bit operations that AVX-512F doesn't provide
- Closes 64-bit gaps, like VPMULLQ: packed 64x64 → 64
- Extends mask architecture to word and byte (to handle vectors)
- Packed/scalar converts of signed/unsigned to SP/DP

AVX-512BW

- AVX-512 Byte and Word instructions

AVX-512VL

- Support for 128 and 256 bits instead of full 512 bit
- Not a new instruction set but an attribute of existing 512-bit instructions

#### **Other New Instructions**

Intel<sup>®</sup> MPX: Intel Memory Protection Extensions

- Set of instructions to implement checking a pointer against its bounds
- Pointer Checker support in HW (today a SW-only solution of e.g. Intel compilers)
- Debug and security features

Intel<sup>®</sup> SGX: Intel<sup>®</sup> Software Guard Extensions

- Enables applications to execute code and protect secrets from …

CLFLUSHOPT

- Single instruction to flush a cache line
- Needed for future memory technologies

XSAVE{S,C}

- Save and restore extended processor state



### Intel<sup>®</sup> Compiler Processor Switches

| Switch                | Description                                      |
|-----------------------|--------------------------------------------------|
| -xmic-avx512          | KNL only; already in 14.0                        |
| -xcore-avx512         | Future Xeon only; already in 15.0.1              |
| -xcommon-avx512       | AVX-512 subset common to both; already in 15.0.2 |
| -m, -march, /arch     | Not yet!                                         |
| `-ax<avx512>`         | Same as for `-x<avx512>`                         |
| -mmic                 | No, not for KNL                                  |





#### MCDRAM: Cache vs Flat Mode



## High Bandwidth On-Chip Memory API

- API is open-sourced (BSD license)
  - https://github.com/memkind
  - Uses jemalloc API underneath
    - http://www.canonware.com/jemalloc/
    - https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919

Malloc replacement:

```
#include <memkind.h>      /* memkind_* interface (takes a memkind_t kind) */
#include <hbwmalloc.h>    /* hbw_* convenience interface */

hbw_check_available()     /* returns 0 if high-bandwidth memory is available */
hbw_malloc, _calloc, _realloc,... (memkind_t kind, ...)
hbw_free()
hbw_posix_memalign()
hbw_get_size(), _psize()

/* link with: */
ld ... -ljemalloc -lnuma -lmemkind -lpthread
```
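A minimal sketch of how the malloc replacement might be used, with a fall-back to ordinary memory when no high-bandwidth memory is present. It assumes the memkind `hbwmalloc.h` interface (`hbw_check_available()` returning 0 on success, `hbw_malloc()`, `hbw_free()`); the `USE_MEMKIND` build macro, the `fast_buf` type and both function names are hypothetical, so the block also compiles on a machine without memkind installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_MEMKIND          /* hypothetical build flag: -DUSE_MEMKIND */
#include <hbwmalloc.h>
#endif

/* A buffer that remembers where it was allocated (hypothetical type). */
typedef struct {
    double *data;
    size_t  n;
    int     in_hbw;         /* 1 if data lives in high-bandwidth memory */
} fast_buf;

/* Try MCDRAM first, then fall back to malloc(). Returns 0 on success. */
int fast_buf_alloc(fast_buf *b, size_t n)
{
    assert(b != NULL);
    b->data = NULL;
    b->n = 0;
    b->in_hbw = 0;

    if (n > SIZE_MAX / sizeof *b->data)
        return -1;          /* size computation would overflow */

#ifdef USE_MEMKIND
    if (hbw_check_available() == 0) {
        b->data = hbw_malloc(n * sizeof *b->data);
        b->in_hbw = (b->data != NULL);
    }
#endif
    if (b->data == NULL)
        b->data = malloc(n * sizeof *b->data);
    if (b->data == NULL)
        return -1;

    b->n = n;
    return 0;
}

/* Release with the allocator that produced the memory. */
void fast_buf_free(fast_buf *b)
{
    assert(b != NULL);
#ifdef USE_MEMKIND
    if (b->in_hbw)
        hbw_free(b->data);
    else
#endif
        free(b->data);
    b->data = NULL;
    b->n = 0;
    b->in_hbw = 0;
}
```

Tracking `in_hbw` matters because memory obtained from `hbw_malloc()` has to be returned through `hbw_free()`, not `free()`. When built with `-DUSE_MEMKIND`, the link line listed above applies.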


### HBW API for Fortran, C++

Fortran:

`!DIR$ ATTRIBUTES FASTMEM :: data_object1, data_object2`

- All Fortran data types supported
- Global, local, stack or heap; scalar, array, ...
- Support in compiler 15.0 update 1 and later versions

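C++:

Standard allocator replacement, e.g. for STL containers: `#include <hbwmalloc.h>`, then

`std::vector<int, hbwmalloc::hbw_allocator>`

(Released memkind versions provide this allocator as `hbw::allocator<T>` in `<hbw_allocator.h>`.)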



#### Porting codes on Knights Landing



# Trends that are here to stay

Data parallelism

- Lots of threads, spent on MPI ranks or OpenMP/TBB/pthreads
- Improving support for both peak throughput and modest/single-thread performance

Bigger, better, faster memory

- High capacity, high bandwidth, low latency DRAM
- Effective caching and paging
- Increasing support for irregular memory references, modest tuning

ISA innovation

- Increasing support for vectorization, new usages



# **Evolution or revolution?**

Incremental changes, significant gains

Parallelization – consistent strategy

- MPI vs. OpenMP balance already needs tuning and tweaking
- Less thread-level parallelism required
- Vectorization: more opportunity, more profitable

Enable new features with memory tuning

- Access MCDRAM with special allocation
- Blocking for MCDRAM vs. just cache
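Blocking itself looks the same whichever memory level is targeted; only the block size changes. The sketch below is a generic tiled matrix transpose, not code from the slides: the function name and the `bs` parameter are hypothetical, and `bs` would be chosen so that a working set fits in cache or, with much larger blocks, in MCDRAM.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Transpose the n x n row-major matrix src into dst, one bs x bs tile
 * at a time, so that each tile is read and written while it is still
 * resident in the targeted memory level.
 */
void transpose_blocked(const double *src, double *dst, size_t n, size_t bs)
{
    assert(src != NULL && dst != NULL && bs > 0);

    for (size_t ib = 0; ib < n; ib += bs) {
        size_t i_end = (ib + bs < n) ? ib + bs : n;   /* clip last tile */
        for (size_t jb = 0; jb < n; jb += bs) {
            size_t j_end = (jb + bs < n) ? jb + bs : n;
            for (size_t i = ib; i < i_end; ++i)
                for (size_t j = jb; j < j_end; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The result is independent of `bs`, which makes the block size a pure tuning knob that can be swept without touching correctness.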



# Compatibility





# KNL specific enabling

- Recompilation, with `-xMIC-AVX512`
- Threading: more MPI ranks, 1 thread/core
- Vectorization: increased efficiency
- MCDRAM and memory tuning: tile, 1GB pages
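As an illustration of the kind of loop that benefits from recompilation alone, the following is a plain DAXPY kernel written so that a compiler can vectorize it. It is a generic sketch rather than code from the slides; the function name is hypothetical, and `restrict` is used to promise the compiler that the two arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/*
 * y[i] = alpha * x[i] + y[i] for i in [0, n).
 * Unit-stride accesses and non-aliasing pointers leave the compiler
 * free to map the loop onto wide vector (and FMA) instructions.
 */
void daxpy(size_t n, double alpha,
           const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);

    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

With 512-bit registers, one vector instruction covers eight doubles, so a loop like this is where the switch listed above is expected to pay off first.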



# What is needed?

- Building
  - Change compiler switches in makefiles
- Coding
  - Parallelization: vectorization, offload
  - Memory management: MCDRAM enumeration and memory allocation
- Tuning
  - Potentially fewer threads: more cores but less need for SMT
  - More memory: more MPI ranks



### Takeaways

Keep doing what you were doing for KNC and Xeon

Some goodness comes for free with a recompile

With some extra enabling, use new MCDRAM feature



# Acknowledgements

Most of the material (slides, figures, etc.) shown here is courtesy of Intel

In particular, thanks to Georg Zitzlsberger, Heinrich Bockhorst, Han Benedict and CJ Newburn for providing material and support

