Intel Hardware and Software
7 Levels of Parallelism
- Node level
- Socket level
- NUMA
- 2-, 4-, or 8-socket
- Hyperthreading: 2 logical threads on one core
- less cache per thread
- useful for jobs that exercise different parts of the core - traditionally disabled on HPC systems
- nowadays not much of a penalty when enabled
- Mesh
- GPU-CPU
- subslice = core (8 on Intel Graphics Gen 11)
- exec-unit = thread (8)
- Instruction level: out-of-order execution on different ports/execution units (max 4 IPC [instructions per cycle])
- Data parallelism
Vectorization
![ParStudio](2020_02_06-10_23-07-iPhone 8-1924.jpg "Inside Parallel Studio XE")
- Performance libraries
- Compiler (fully automatic)
- Compiler with vectorization hints (`#pragma`): user-mandated vectorization (SIMD directive)
- SIMD intrinsic class (`F32vec4` add)
- Vector intrinsic (`_mm_add_ps()`)
- Assembler code (`addps`)
oneAPI: New Foundation for Exascale Computing
- unified memory (CPU/GPU)
- all-to-all connectivity
Summary
- Code modernization not always easy (analyze & optimize)
- data / task parallelism

![IntelPy](2020_02_06-10_24-57-iPhone 8-1925.jpg "Intel Distribution for Python")
![MKL](2020_02_06-10_28-11-iPhone 8-1927.jpg "Math Kernel Library")
![DAAL](2020_02_06-10_36-47-iPhone 8-1928.jpg "Data Analytics Acceleration Library")
![DAAL Algo](2020_02_06-10_37-46-iPhone 8-1929.jpg "DAAL Algorithms")
![Diagnostic Tool](2020_02_06-10_46-13-iPhone 8-1930.jpg "Diagnostic Toolset for High Performance Compute Clusters")
![Demo](2020_02_06-10_55-10-iPhone 8-1931.jpg "Demo")
![noFP](2020_02_06-11_00-12-iPhone 8-1932.jpg "Removing FP converts")
Parallel Studio
transition to oneAPI in 2020
in 2020 version
- VNNI (Vector Neural Network Instructions) for AI inference speedup
- persistent memory (Optane) compatible with RAM
- expanded standard support
- Fortran 2018
- C++17 (C++20 support in an initial stage)
- gradual move to LLVM, backend switchable
- Extended Coarse Grain Profiling
- HPC cloud support
- New OS support (e.g., Amazon Linux)
Compiler (v19.0)
Python
Take advantage of Intel's Python distribution: in general, add the Intel channel to Anaconda
Speedup achieved by optimizing
- numba (utilizing MKL)
- scikit-learn (utilizing DAAL)
URLs
Performance tools
- VTune: HPC tuning - now even broader coverage (demo in the afternoon)
- performance snapshot for high level view
- Advisor provides information on
- threads
- vectorization
- GPU (Offload Advisor, upcoming; so far only Intel Gen9 or Gen11 graphics)
MPI library
Intel cluster checker
Vectorization issues
- Default: compilers target an early-2000s baseline CPU with only vector length 2, so compiler flags (e.g., `-xHost` on the Intel compiler) are needed to enable newer instruction sets
- technical bit: when AVX instructions are used, the clock frequency is reduced to compensate for the extra power consumption
Example N-body problem
Convert Array of Structures (AoS) -> Structure of Arrays (SoA) for better-aligned, unit-stride memory access
oneAPI
open standard for unified programming model across hardware platforms ![oneAPI base](2020_02_06-14_11-18-iPhone 8-1935.jpg "oneAPI Base Toolkit")
C++ (11) + SYCL + Extensions
API libraries
- Math
- Analytics/ML
- DNN
- ...
oneAPI Toolkits
- Base
- currently beta
- direct programming (DPC++)
- HPC
- DL
- Rendering
- OpenVINO
- AI Analytics
![DPC++](2020_02_06-14_16-21-iPhone 8-1936.jpg "DPC++ Compatibility Tool") ![OA.py](2020_02_06-14_47-42-iPhone 8-1937.jpg "Behind run_oa.py")
GPU target currently OpenCL-based; NVIDIA support unclear
Impressions on the way home
![pano](2020_02_06-18_11-49-iPhone 8-1939.jpg "Messesee Panorama") ![across](2020_02_06-18_12-18-iPhone 8-1941.jpg "across the lake") ![Entrance](2020_02_06-18_12-49-iPhone 8-1942.jpg "Entrance") ![across2](2020_02_06-18_12-58-iPhone 8-1943.jpg "across2")