# Intel Hardware and Software
### 7 Levels of Parallelism
1. Node level
2. Socket level
   - NUMA
   - 2-, 4-, or 8-socket
3. Hyperthreading: 2 logical threads on one core
   - less cache per thread
   - useful when jobs utilize different parts of the core; often disabled on HPC systems
   - nowadays not much of a penalty when enabled
4. Mesh
5. GPU-CPU
   - subslice = core (8 on Intel Graphics Gen 11)
   - exec-unit = thread (8)
6. Instruction level: out-of-order execution on different ports/execution units (max 4 IPC [instructions per cycle])
7. Data parallelism
<!--- ![](2020_02_06-13_48-01-iPhone 8-1933.jpg "")
![](2020_02_06-13_48-59-iPhone 8-1934.jpg "")
--->
### Vectorization
![ParStudio](2020_02_06-10_23-07-iPhone 8-1924.jpg "Inside Parallel Studio XE")
- Performance libraries
- Compiler (fully automatic)
- Compiler with vectorization hints (`#pragma`)
- user-mandated vectorization (SIMD directive)
- SIMD intrinsic class (`F32vec4 add`)
- Vector intrinsic (`_mm_add_ps()`)
- assembler code (`addps`)
### oneAPI: New Foundation for Exascale Computing
- unified memory (CPU/GPU)
- all-to-all connectivity
### Summary
- Code modernization is not always easy (analyze & optimize)
- data / task parallelism
![IntelPy](2020_02_06-10_24-57-iPhone 8-1925.jpg "Intel Distribution for Python")
![MKL](2020_02_06-10_28-11-iPhone 8-1927.jpg "Math Kernel Library")
![DAAL](2020_02_06-10_36-47-iPhone 8-1928.jpg "Data Analytics Acceleration Library")
![DAAL Algo](2020_02_06-10_37-46-iPhone 8-1929.jpg "DAAL Algorithms")
![Diagnostic Tool](2020_02_06-10_46-13-iPhone 8-1930.jpg "Diagnostic Toolset for High Performance Compute Clusters")
![Demo](2020_02_06-10_55-10-iPhone 8-1931.jpg "Demo")
![noFP](2020_02_06-11_00-12-iPhone 8-1932.jpg "Removing FP converts")
### Parallel Studio
Transition to **oneAPI** in 2020.
New in the 2020 version:
- VNNI (Vector Neural Network Instructions) for AI-inference speedup
- persistent-memory (Optane) support, compatible with RAM
- expanded standards support
  - Fortran 2018
  - C++17 (C++20 in initial stage)
- move to LLVM (gradually), backend switchable
- Extended Coarse-Grain Profiling
- HPC cloud support
- new OS support (e.g., Amazon)
#### Compiler (v19.0)
#### Python
Take advantage of Intel's Python distribution:
[add the Intel channel to *anaconda*](https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda) ([in general](https://software.intel.com/en-us/distribution-for-python))
Speedup is achieved by optimized
- *numba* (utilizing MKL)
- *scikit-learn* (utilizing [DAAL](http://software.intel.com/daal))
##### URLs
- [Installation](https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda)
- [Anaconda packages](https://anaconda.org/intel/repo)
### Performance tools
- VTune: HPC tuning; now even broader coverage (demo in the afternoon)
- [Performance Snapshot](intel.com/performance-snapshot) for a high-level view
- [Advisor](software.intel.com/advisor) provides information on
  - threads
  - vectorization
  - GPU (Offload Advisor coming; so far only Intel *Gen9* or *Gen11*)
### [MPI library](software.intel.com/intel-mpi-library)
### Intel Cluster Checker
### Vectorization issues
- By default the compiler targets early-2000s CPUs — only vector length 2 => compiler flag(s) needed for better optimization
- technical bit: when using AVX instructions, the clock frequency is reduced to compensate for the extra power consumption
#### Example: *N*-body problem
Convert Array of Structures (AoS) to Structure of Arrays (SoA) for better-aligned memory access.
---
## [oneAPI](https://software.intel.com/en-us/oneapi)
An open standard for a unified programming model across hardware platforms.
![oneAPI base](2020_02_06-14_11-18-iPhone 8-1935.jpg "oneAPI Base Toolkit")
**C++** (11) + **SYCL** + **Extensions**
### API libraries
- Math
- Analytics/ML
- DNN
- ...
### oneAPI Toolkits
- Base
  - currently beta
  - direct programming (DPC++)
- HPC
- DL
- Rendering
- OpenVINO
- AI Analytics
![DPC++](2020_02_06-14_16-21-iPhone 8-1936.jpg "DPC++ Compatibility Tool")
![OA.py](2020_02_06-14_47-42-iPhone 8-1937.jpg "Behind run_oa.py")
[Developer Access](http://software.intel.com/devcloud/oneapi)
The GPU target is currently **OpenCL-based**; NVidia support is unclear.
- https://software.intel.com/en-us/oneapi
- https://github.com/intel/llvm
## Impressions on the way home
![pano](2020_02_06-18_11-49-iPhone 8-1939.jpg "Messesee Panorama")
![across](2020_02_06-18_12-18-iPhone 8-1941.jpg "across the lake")
![Entrance](2020_02_06-18_12-49-iPhone 8-1942.jpg "Entrance")
![across2](2020_02_06-18_12-58-iPhone 8-1943.jpg "across2")