# Intel Hardware and Software
### 7 Levels of Parallelism
1. Node level
2. Socket level
   - NUMA
   - 2-, 4-, or 8-socket
3. Hyperthreading: 2 logical threads on one core
   - less cache per thread
   - useful when jobs utilize different parts of the core; often disabled on HPC systems
   - nowadays not much of a penalty when enabled
4. Mesh
5. GPU-CPU
   - subslice = core (8 on Intel Graphics Gen 11)
   - exec-unit = thread (8)
6. Instruction level: out-of-order execution on different ports/execution units (max 4 IPC [instructions per cycle])
7. Data parallelism
<!--- ![](2020_02_06-13_48-01-iPhone 8-1933.jpg "")
![](2020_02_06-13_48-59-iPhone 8-1934.jpg "")
--->
### Vectorization
![ParStudio](2020_02_06-10_23-07-iPhone 8-1924.jpg "Inside Parallel Studio XE")
- Performance libraries
- Compiler (fully automatic)
- Compiler with vectorization hints (`#pragma`)
- user-mandated vectorization (SIMD directive)
- SIMD intrinsic class (`F32vec4 add`)
- Vector intrinsic (`_mm_add_ps()`)
- assembler code (`addps`)
### oneAPI: New Foundation for Exascale Computing
- unified memory (CPU/GPU)
- all-to-all connectivity
### Summary
- Code modernization is not always easy (analyze & optimize)
- data / task parallelism
![IntelPy](2020_02_06-10_24-57-iPhone 8-1925.jpg "Intel Distribution for Python")
![MKL](2020_02_06-10_28-11-iPhone 8-1927.jpg "Math Kernel Library")
![DAAL](2020_02_06-10_36-47-iPhone 8-1928.jpg "Data Analytics Acceleration Library")
![DAAL Algo](2020_02_06-10_37-46-iPhone 8-1929.jpg "DAAL Algorithms")
![Diagnostic Tool](2020_02_06-10_46-13-iPhone 8-1930.jpg "Diagnostic Toolset for High Performance Compute Clusters")
![Demo](2020_02_06-10_55-10-iPhone 8-1931.jpg "Demo")
![noFP](2020_02_06-11_00-12-iPhone 8-1932.jpg "Removing FP converts")
### Parallel Studio
Transition to **oneAPI** in 2020.
New in the 2020 version:
- VNNI (Vector Neural Network Instructions) for AI-inference speedup
- persistent-memory (Optane) support, compatible with RAM
- expanded standards support
  - Fortran 2018
  - C++17 (C++20 in initial stage)
- move to LLVM (gradually), backend switchable
- Extended Coarse-Grain Profiling
- HPC cloud support
- new OS support (e.g., Amazon)
#### Compiler (v19.0)
#### Python
Take advantage of Intel's Python distribution:
[add the Intel channel to *anaconda*](https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda) ([in general](https://software.intel.com/en-us/distribution-for-python))
Speedup is achieved by optimized
- *numba* (utilizing MKL)
- *scikit-learn* (utilizing [DAAL](http://software.intel.com/daal))
##### URLs
- [Installation](https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda)
- [Anaconda packages](https://anaconda.org/intel/repo)
### Performance tools
- VTune: HPC tuning; now even broader coverage (demo in the afternoon)
- [Performance Snapshot](intel.com/performance-snapshot) for a high-level view
- [Advisor](software.intel.com/advisor) provides information on
  - threads
  - vectorization
  - GPU (Offload Advisor coming; so far only Intel *Gen9* or *Gen11*)
### [MPI library](software.intel.com/intel-mpi-library)
### Intel Cluster Checker
### Vectorization issues
- By default the compiler targets early-2000s CPUs — only vector length 2 => compiler flag(s) needed for better optimization
- technical bit: when using AVX instructions, the clock frequency is reduced to compensate for the extra power consumption
#### Example: *N*-body problem
Convert Array of Structures (AoS) to Structure of Arrays (SoA) for better-aligned memory access.
---
## [oneAPI](https://software.intel.com/en-us/oneapi)
An open standard for a unified programming model across hardware platforms.
![oneAPI base](2020_02_06-14_11-18-iPhone 8-1935.jpg "oneAPI Base Toolkit")
**C++** (11) + **SYCL** + **Extensions**
### API libraries
- Math
- Analytics/ML
- DNN
- ...
### oneAPI Toolkits
- Base
  - currently beta
  - direct programming (DPC++)
- HPC
- DL
- Rendering
- OpenVINO
- AI Analytics
![DPC++](2020_02_06-14_16-21-iPhone 8-1936.jpg "DPC++ Compatibility Tool")
![OA.py](2020_02_06-14_47-42-iPhone 8-1937.jpg "Behind run_oa.py")
[Developer Access](http://software.intel.com/devcloud/oneapi)
The GPU target is currently **OpenCL-based**; NVidia support is unclear.
- https://software.intel.com/en-us/oneapi
- https://github.com/intel/llvm
## Impressions on the way home
![pano](2020_02_06-18_11-49-iPhone 8-1939.jpg "Messesee Panorama")
![across](2020_02_06-18_12-18-iPhone 8-1941.jpg "across the lake")
![Entrance](2020_02_06-18_12-49-iPhone 8-1942.jpg "Entrance")
![across2](2020_02_06-18_12-58-iPhone 8-1943.jpg "across2")