Tags: Parallel Computing
Your English is quite good.
The CPU performs SIMD arithmetic on data that is sourced from one, or a combination, of:
- RAM
- L3 cache
- L2 cache
- L1 cache
- registers

RAM is the slowest and registers are the fastest.
It is the goal of the programmer to construct the algorithm such that it favors reuse of data already brought into the faster end of the hierarchy.
Some computational problems touch the source data only once per iteration (some problems have a single iteration, others many).
Most computational problems touch the source data several times per iteration. For example, a finite element analysis may analyze a point together with its 26 nearest neighbors, or its 124 nearest neighbors. For these types of problems, you can often structure the algorithm so that it experiences more reuse (re-referencing) of data in the order of register, L1, L2, L3, then RAM, as in the sketch below.
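A minimal sketch of this kind of blocking, using a hypothetical 2-D 5-point stencil in C (the kernel, function name, and block size are illustrative assumptions, not taken from the references below):

```c
#include <stddef.h>

enum { BLOCK = 256 };  /* illustrative: 3 rows x 256 doubles ~ 6 KB, fits in L1 */

/* Blocked 5-point stencil: each output re-references its four nearest
   neighbors. Sweeping a block of columns down all rows means the rows
   touched at step y are re-referenced at step y+1 while still cached. */
void stencil_blocked(const double *in, double *out, size_t nx, size_t ny)
{
    for (size_t xb = 1; xb + 1 < nx; xb += BLOCK) {
        size_t xe = (xb + BLOCK < nx - 1) ? (xb + BLOCK) : (nx - 1);
        for (size_t y = 1; y + 1 < ny; ++y) {
            for (size_t x = xb; x < xe; ++x) {
                size_t i = y * nx + x;
                /* Row y and row y-1 were loaded on the previous y
                   iteration of this block: L1/register reuse, not RAM. */
                out[i] = 0.25 * (in[i - 1] + in[i + 1]
                               + in[i - nx] + in[i + nx]);
            }
        }
    }
}
```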
Some good educational references:
https://software.intel.com/sites/products/papers/tpt_ieee.pdf
https://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors
https://software.intel.com/en-us/articles/using-simd-technologies-on-intel-architecture-to-speed-up-game-code
Jim Dempsey
Data misalignment is a typical reason for ineffectiveness of SIMD parallel moves. Most vectorizing compilers look for opportunities to adjust alignment (for example, by peeling a few scalar iterations until the pointer reaches an aligned boundary), assuming a long enough stream. Details vary with the CPU. For example, misaligned 128-bit moves are OK on Sandy Bridge, whereas misaligned 256-bit moves are not.
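A minimal sketch of giving the compiler (or your own intrinsics) an aligned stream to work with, assuming C11's aligned_alloc and AVX; the function names are made up for the example:

```c
#include <immintrin.h>
#include <stdlib.h>

/* 32-byte alignment means the full 256-bit loads below are always
   aligned, avoiding the misaligned-256-bit-move penalty on Sandy
   Bridge. aligned_alloc requires the size to be a multiple of the
   alignment, hence the rounding. */
double *make_aligned_array(size_t n)
{
    return aligned_alloc(32, ((n * sizeof(double) + 31) / 32) * 32);
}

void scale(double *a, size_t n, double s)
{
    __m256d vs = _mm256_set1_pd(s);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {               /* 4 doubles = 256 bits */
        __m256d v = _mm256_load_pd(&a[i]);     /* aligned 256-bit load */
        _mm256_store_pd(&a[i], _mm256_mul_pd(v, vs));
    }
    for (; i < n; ++i)                         /* scalar remainder */
        a[i] *= s;
}
```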
Intel processors have supported aligned SIMD loads since they supported SIMD. These can be either MOV instructions (e.g., MOVAPD) or memory arguments to SIMD arithmetic instructions.
Different types of SIMD memory operations have different alignment restrictions, and different performance penalties for SIMD access to data that is not SIMD-aligned. AVX is much easier to work with than SSE2/3/4 (its VEX encodings drop the alignment requirement for most memory operands), so compilers tend to use the SIMD memory access instructions much more often than they used to.
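For example (a sketch; whether the load is folded into the arithmetic instruction's memory operand is the compiler's choice), the two flavors look like this with SSE intrinsics:

```c
#include <immintrin.h>

/* _mm_load_pd compiles to an aligned move (MOVAPD) and faults if p is
   not 16-byte aligned; _mm_loadu_pd compiles to MOVUPD and accepts any
   address. Under the SSE encodings, a memory argument folded into
   ADDPD must also be aligned; the VEX (AVX) encodings relax that rule. */
__m128d add_pair_aligned(__m128d acc, const double *p)
{
    return _mm_add_pd(acc, _mm_load_pd(p));    /* MOVAPD, or folded operand */
}

__m128d add_pair_unaligned(__m128d acc, const double *p)
{
    return _mm_add_pd(acc, _mm_loadu_pd(p));   /* MOVUPD, any alignment */
}
```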
SIMD memory operations don't necessarily provide any performance benefit. For data outside the L1 cache, all data motion takes place in 64-byte cache lines. In some cases, Sandy Bridge processors actually get slightly better memory performance with scalar SSE or scalar AVX loads than with SIMD loads. For Haswell processors, the AVX instructions give better bandwidth in all the cases I have tested, but there may still be counter-examples.
As the vector width of the SIMD units increases, it does become increasingly difficult to deal with the cases where data rearrangement is required. For packed doubles in SSE, you only needed to be able to load the low part, load the high part, or swap the two parts. For packed doubles in AVX, there are many more rearrangement permutations that sometimes need to be dealt with. By the time you get to 8-element vectors of doubles in AVX-512, it can be extremely challenging to figure out how to rearrange data in the registers without losing the speedup that you are trying to get from the wide SIMD architecture.
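A sketch of the contrast (my own illustration, not from the post above): reversing the elements of a register is a single shuffle in SSE but already a two-step, lane-aware operation in AVX:

```c
#include <immintrin.h>

/* SSE: reversing 2 packed doubles is a single shuffle. */
__m128d reverse2(__m128d v)
{
    return _mm_shuffle_pd(v, v, 0x1);          /* [a,b] -> [b,a] */
}

/* AVX: most AVX1 permutes work within 128-bit lanes, so reversing
   4 packed doubles takes two steps: swap within each lane, then
   swap the two lanes themselves. */
__m256d reverse4(__m256d v)
{
    __m256d t = _mm256_permute_pd(v, 0x5);     /* [a,b,c,d] -> [b,a,d,c] */
    return _mm256_permute2f128_pd(t, t, 0x1);  /* -> [d,c,b,a] */
}
```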
