Intel Hardware and Software
7 Levels of Parallelism
- Node level
- Socket level
- NUMA
- 2-, 4-, or 8-socket
- Hyperthreading: 2 logical threads on one core
- less cache per thread
- useful for jobs that exercise different parts of the core - traditionally disabled on HPC systems
- nowadays not much of a penalty when enabled
- Mesh
- GPU-CPU
- subslice = core (8 on Intel Graphics Gen 11)
- exec-unit = thread (8)
- Instruction level: out-of-order execution on different ports/execution units (max 4 IPC [instructions per cycle])
- Data parallelism
Vectorization
![ParStudio](2020_02_06-10_23-07-iPhone 8-1924.jpg "Inside Parallel Studio XE")
- Performance libraries
- Compiler (fully automatic)
- Compiler with vectorization hints (`#pragma`): user-mandated vectorization (SIMD directive)
- SIMD intrinsic class (`F32vec4` add)
- Vector intrinsic (`_mm_add_ps()`)
- Assembler code (`addps`)
oneAPI: New Foundation for Exascale Computing
- unified memory (CPU/GPU)
- all-to-all connectivity
Summary
- Code modernization not always easy (analyze & optimize)
- data / task parallelism

![IntelPy](2020_02_06-10_24-57-iPhone 8-1925.jpg "Intel Distribution for Python")
![MKL](2020_02_06-10_28-11-iPhone 8-1927.jpg "Math Kernel Library")
![DAAL](2020_02_06-10_36-47-iPhone 8-1928.jpg "Data Analytics Acceleration Library")
![DAAL Algo](2020_02_06-10_37-46-iPhone 8-1929.jpg "DAAL Algorithms")
![Diagnostic Tool](2020_02_06-10_46-13-iPhone 8-1930.jpg "Diagnostic Toolset for High Performance Compute Clusters")
![Demo](2020_02_06-10_55-10-iPhone 8-1931.jpg "Demo")
![noFP](2020_02_06-11_00-12-iPhone 8-1932.jpg "Removing FP converts")
Parallel Studio
transition to oneAPI in 2020
in 2020 version
- VNNI (Vector Neural Network Instructions) for AI inference speedup
- persistent memory (Optane) compatible with RAM
- expanded standard support
- Fortran 2018
- C++17 (C++20 support in an initial stage)
- gradual move to LLVM, backend switchable
- Extended Coarse Grain Profiling
- HPC cloud support
- New OS support (e.g., Amazon Linux)
Compiler (v19.0)
Python
Take advantage of Intel's Python distribution: in general, add the Intel channel to Anaconda
Speedup achieved by optimizing
- numba (utilizing MKL)
- scikit-learn (utilizing DAAL)
URLs
Performance tools
- VTune: HPC tuning - now even broader coverage (demo in the afternoon)
- performance snapshot for high level view
- Advisor provides information on
- threads
- vectorization
- GPU (Offload Advisor, upcoming; so far only Intel Gen9 or Gen11 graphics)
MPI library
Intel cluster checker
Vectorization issues
- Default: compilers target an early-2000s baseline CPU with only vector length 2, so compiler flags (e.g., `-xHost` on the Intel compiler) are needed to enable newer instruction sets
- technical bit: when AVX instructions are used, the clock frequency is reduced to compensate for the extra power consumption
Example N-body problem
Convert Array of Structures (AoS) -> Structure of Arrays (SoA) for better-aligned, unit-stride memory access
oneAPI
open standard for unified programming model across hardware platforms ![oneAPI base](2020_02_06-14_11-18-iPhone 8-1935.jpg "oneAPI Base Toolkit")
C++ (11) + SYCL + Extensions
API libraries
- Math
- Analytics/ML
- DNN
- ...
oneAPI Toolkits
- Base
- currently beta
- direct programming (DPC++)
- HPC
- DL
- Rendering
- OpenVINO
- AI Analytics
![DPC++](2020_02_06-14_16-21-iPhone 8-1936.jpg "DPC++ Compatibility Tool") ![OA.py](2020_02_06-14_47-42-iPhone 8-1937.jpg "Behind run_oa.py")
GPU target currently OpenCL-based; NVIDIA support unclear
Impressions on the way home
![pano](2020_02_06-18_11-49-iPhone 8-1939.jpg "Messesee Panorama") ![across](2020_02_06-18_12-18-iPhone 8-1941.jpg "across the lake") ![Entrance](2020_02_06-18_12-49-iPhone 8-1942.jpg "Entrance") ![across2](2020_02_06-18_12-58-iPhone 8-1943.jpg "across2")