Intel® Moderncode for Parallel Architectures

Memory to CPU (mov) bandwidth limitations

albus_d_ — Sun, 01 Feb 2015 13:20:00 GMT

(Sorry for my weak English; I am not a native speaker. I am not sure this is the right forum, as this is my first time here. This is a general question about a hardware limit whose technical reason I do not understand and would very much like to know.)

We now have parallel SIMD arithmetic (for example, 8 float multiplications or divisions in one step), so the theoretical (and nearly practical) arithmetic throughput per core is something like 4 GHz * 8 floats = about 30 GFLOPS.

But AFAIK we still have quite low RAM-to-CPU bandwidth, on the order of reading or writing 1 or 2 ints or floats per nanosecond. When I test it, I get only about 2 billion loads or stores per second per core, or something like that. (Both values are rough, but the difference seems to be a physical fact, at least in my experience.)

I mean that arithmetic can be parallelised (8-way vectorised, for example) but load/store movs are not, so SIMD parallelisation delivers only a fraction of its potential power.

Increasing this memory bandwidth is extremely crucial (much more important than increasing arithmetic throughput), but for some technical reason that I do not know it has not been improved.

The question is: what is the real reason for this? Why can't SIMD movs be parallelised, and why are they not?
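To make the gap described above concrete, here is a small back-of-envelope sketch in Python. It uses only the rough figures from this post (4 GHz, 8 floats per SIMD operation, about 2 billion elements per second from RAM); the function names are made up for illustration, and it assumes one SIMD operation per cycle, which is a simplification.

```python
# Back-of-envelope comparison using the rough numbers from the post.
# All values are the poster's estimates, not measurements.

def peak_flops(clock_hz, simd_lanes):
    """Peak arithmetic rate, assuming one SIMD operation per cycle."""
    return clock_hz * simd_lanes

def flops_per_element_to_balance(peak, elements_per_second):
    """Arithmetic operations needed per element fetched from RAM
    before the core, rather than memory, becomes the limit."""
    return peak / elements_per_second

peak = peak_flops(4e9, 8)                           # 4 GHz * 8 floats
balance = flops_per_element_to_balance(peak, 2e9)   # ~2 G elements/s from RAM

print(f"peak arithmetic: {peak / 1e9:.0f} GFLOPS")
print(f"operations needed per element loaded: {balance:.0f}")
```

With these figures, a loop would have to perform roughly 16 arithmetic operations on every element it fetches from RAM before arithmetic, rather than memory, became the bottleneck.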

jimdempseyatthecove — Sun, 01 Feb 2015 15:15:34 GMT

Your English is quite good.

The CPU performs SIMD arithmetic on data sourced from one, or a combination, of the following:

RAM
L3 Cache
L2 Cache
L1 Cache
register

where RAM is the slowest and the registers are the fastest.

It is the goal of the programmer to construct the algorithm such that it favors reuse of data brought into the faster end of the hierarchy.

Some computational problems touch the source data only once per iteration (some problems have one iteration, others many).

Most computational problems touch the source data several times per iteration. For example, a Finite Element analysis may analyze a point and its 26 nearest neighbors, or its 124 nearest neighbors. For these types of problems, you can often structure the algorithm so that data being re-referenced is found as close to the CPU as possible: in a register first, then L1, L2, L3, and only as a last resort RAM.
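As a rough illustration of where the neighbor counts come from and how much reuse is available, here is a short Python sketch. It assumes the neighborhood is a full cube around the point (a hypothetical but common layout) and that the point is in the interior of the grid; the function names are invented for this example.

```python
def stencil_neighbors(radius):
    """Neighbors of a point inside a cube of half-width `radius`,
    excluding the point itself."""
    side = 2 * radius + 1
    return side ** 3 - 1

def ideal_reuse(radius):
    """Number of stencil evaluations that read a given interior value:
    once as the center point and once as a neighbor of each neighbor."""
    return stencil_neighbors(radius) + 1

for r in (1, 2):
    print(r, stencil_neighbors(r), ideal_reuse(r))
```

A radius of 1 gives 26 neighbors and a radius of 2 gives 124. If every value is read 27 (or 125) times, an algorithm that serves all but the first read from cache or registers needs far less RAM traffic than one that goes back to RAM each time.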

For background, it would be worth consulting the following references:

https://software.intel.com/sites/products/papers/tpt_ieee.pdf
https://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors
https://software.intel.com/en-us/articles/using-simd-technologies-on-intel-architecture-to-speed-up-game-code

Jim Dempsey

TimP — Mon, 02 Feb 2015 14:04:43 GMT

Data misalignment is a typical reason for ineffective SIMD parallel moves. Most vectorizing compilers look for opportunities to adjust alignment, assuming a long enough stream. Details vary with CPU. For example, misaligned 128-bit moves are OK on Sandy Bridge, whereas misaligned 256-bit moves are not.
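One way to picture misalignment is to check whether a load of a given width starts on a multiple of its own size and whether it straddles two cache lines. The Python sketch below is only an address calculation, assuming 64-byte cache lines (the line size mentioned later in this thread); the helper names are hypothetical, and the actual penalty for a misaligned or line-crossing access depends on the CPU.

```python
CACHE_LINE = 64  # bytes; assumed line size

def is_aligned(address, width):
    """True if the access starts on a multiple of its own width."""
    return address % width == 0

def crosses_cache_line(address, width, line=CACHE_LINE):
    """True if the bytes [address, address + width) span two cache lines."""
    first = address // line
    last = (address + width - 1) // line
    return first != last

# 16-byte (128-bit) and 32-byte (256-bit) accesses at a few byte offsets
for addr, width in [(48, 16), (8, 16), (56, 16), (48, 32)]:
    print(addr, width, is_aligned(addr, width), crosses_cache_line(addr, width))
```

Note that a misaligned access does not always cross a line (offset 8, width 16), but a wider access has more starting offsets at which it does.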

McCalpinJohn — Tue, 03 Feb 2015 00:40:04 GMT

Intel processors have supported aligned SIMD loads since they supported SIMD. These can be either MOV instructions (e.g., MOVAPD) or memory arguments to SIMD arithmetic instructions.

Different types of SIMD memory operations have different alignment restrictions, and different performance penalties for SIMD access to data that is not SIMD-aligned. AVX is much easier to work with than SSE2/3/4, so compilers tend to use the SIMD memory access instructions much more often than they used to.

SIMD memory operations don't necessarily provide any performance benefit. For data that is outside of the L1 cache, all data motion takes place in 64-Byte cache lines. In some cases, Sandy Bridge processors actually get slightly better memory performance with scalar SSE or scalar AVX loads than with SIMD loads. For Haswell processors the AVX instructions give better bandwidth in all the cases I have tested, but there may still be counter-examples.
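The point about 64-byte cache lines can be illustrated with a toy count in Python. This is a sketch under stated assumptions, not a performance model: the buffer is assumed to start on a line boundary at address 0, its size is a multiple of the load width, and the function names are invented here.

```python
def lines_transferred(n_bytes, load_width, line=64):
    """Distinct cache lines touched when streaming n_bytes from address 0
    with consecutive loads of load_width bytes each."""
    touched = set()
    for addr in range(0, n_bytes, load_width):
        touched.add(addr // line)                      # line holding the first byte
        touched.add((addr + load_width - 1) // line)   # line holding the last byte
    return len(touched)

def loads_issued(n_bytes, load_width):
    """Number of load instructions needed to read n_bytes."""
    return n_bytes // load_width

for width in (8, 16, 32):   # scalar double, 128-bit SSE, 256-bit AVX
    print(width, loads_issued(4096, width), lines_transferred(4096, width))
```

Wider loads reduce the number of load instructions, but the number of cache lines that have to move through the memory hierarchy stays the same, which is why SIMD loads alone do not raise bandwidth for data outside the L1 cache.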

As the vector width of the SIMD units increases, it does become increasingly difficult to deal with the cases where data rearrangement is required. For packed doubles in SSE you only needed to be able to load the low part, the high part, or swap the two parts. For packed doubles in AVX there are many more possible rearrangements that sometimes need to be dealt with. By the time you get to 8-element vectors of doubles in AVX-512, it can be extremely challenging to figure out how to rearrange data in the registers without losing the speedup that you are trying to get from the wide SIMD architecture.
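To give a feel for how quickly the rearrangement space grows, the Python sketch below counts the pure reorderings of the lanes in one register of doubles. This is only a counting illustration: it ignores broadcasts, blends, and two-source shuffles, and it says nothing about how many instructions any particular rearrangement costs on a given CPU.

```python
from math import factorial

def lane_permutations(lanes):
    """Number of distinct orderings of `lanes` elements in one register."""
    return factorial(lanes)

for name, lanes in [("SSE (128-bit)", 2), ("AVX (256-bit)", 4), ("AVX-512", 8)]:
    print(f"{name}: {lanes} doubles, {lane_permutations(lanes)} orderings")
```

Two doubles can only be left alone or swapped, while eight doubles can be ordered in 40320 ways.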