Your English is quite good.
The CPU performs SIMD arithmetic on data sourced from one, or a combination, of the following: registers, L1 cache, L2 cache, L3 cache, and RAM,
where RAM is the slowest and registers are the fastest.
It is the goal of the programmer to construct the algorithm such that it favors reuse of data brought into the faster end of the hierarchy.
Some computational problems touch the source data only once per iteration (some problems have one iteration, others many).
Most computational problems touch the source data several times per iteration. For example, a finite element analysis may analyze a point and its 26 nearest neighbors, or its 124 nearest neighbors. For these types of problems, you can often structure the algorithm so that data is re-referenced while it still sits as high in the hierarchy as possible, in the order register, L1, L2, L3, then RAM.
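To make the reuse idea concrete, here is a minimal sketch in C of a stencil that sums each interior point with its 26 nearest neighbors. The names stencil27_sum and idx3 and the grid layout (i contiguous in memory) are my own assumptions for illustration, not anything from a particular code. The point is the loop order: with i innermost, consecutive output points share 18 of their 27 inputs, so most of the loads are likely to hit in registers or L1 rather than going back to RAM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: linear index of (i, j, k) in an nx*ny*nz grid
   stored with i as the fastest-varying (contiguous) dimension. */
static size_t idx3(size_t i, size_t j, size_t k, size_t nx, size_t ny)
{
    return (k * ny + j) * nx + i;
}

/* For every interior point, store the sum of the point and its 26 nearest
   neighbors. Boundary points of out are left untouched. */
void stencil27_sum(const double *in, double *out,
                   size_t nx, size_t ny, size_t nz)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    for (size_t k = 1; k + 1 < nz; ++k) {
        for (size_t j = 1; j + 1 < ny; ++j) {
            /* i is innermost, so the loop walks contiguous memory and the
               neighbors loaded for point i are reused for point i + 1. */
            for (size_t i = 1; i + 1 < nx; ++i) {
                double s = 0.0;
                for (size_t dk = 0; dk < 3; ++dk)
                    for (size_t dj = 0; dj < 3; ++dj)
                        for (size_t di = 0; di < 3; ++di)
                            s += in[idx3(i + di - 1, j + dj - 1,
                                         k + dk - 1, nx, ny)];
                out[idx3(i, j, k, nx, ny)] = s;
            }
        }
    }
}
```

If the i and k loops were swapped, the same arithmetic would be done, but each step would jump nx*ny elements through memory and far less of the neighborhood would still be in the fast levels when it is needed again.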
For background, it would be worth consulting some references:
Data misalignment is a typical reason for SIMD parallel moves being ineffective. Most vectorizing compilers look for opportunities to adjust alignment, assuming a long enough stream. Details vary with CPU. For example, misaligned 128-bit moves are OK on Sandy Bridge, whereas misaligned 256-bit moves are not.
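One common way to adjust alignment is to peel a few scalar iterations off the front of the loop until the pointer reaches a vector boundary. The sketch below shows that idea by hand in C. The names peel_count and scale_in_place are hypothetical, 32 bytes is assumed as the target (one 256-bit vector), and the 4-wide body is written as plain scalar code that a compiler could vectorize. It is not meant to describe what any particular compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading doubles to process one at a time before p reaches an
   align-byte boundary. Assumes align is a power of two and that p is at
   least naturally (8-byte) aligned, as any valid double* should be. */
size_t peel_count(const double *p, size_t align)
{
    assert(align >= sizeof(double) && (align & (align - 1)) == 0);
    uintptr_t misalign = (uintptr_t)p & (align - 1);
    if (misalign == 0)
        return 0;
    return (size_t)(align - misalign) / sizeof(double);
}

/* Multiply x[0..n) by a: scalar prologue, aligned 4-wide body, scalar tail. */
void scale_in_place(double *x, size_t n, double a)
{
    size_t peel = peel_count(x, 32);
    if (peel > n)
        peel = n;

    size_t i = 0;
    for (; i < peel; ++i)              /* prologue: reach a 32-byte boundary */
        x[i] *= a;

    size_t body_end = peel + ((n - peel) / 4) * 4;
    for (; i < body_end; i += 4) {     /* x + i is 32-byte aligned here */
        x[i]     *= a;
        x[i + 1] *= a;
        x[i + 2] *= a;
        x[i + 3] *= a;
    }

    for (; i < n; ++i)                 /* tail: fewer than 4 elements left */
        x[i] *= a;
}
```

The prologue and tail cost a handful of scalar iterations, which is why this only pays off when the stream is long enough.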
Intel processors have supported aligned SIMD loads since they supported SIMD. These can be either MOV instructions (e.g., MOVAPD) or memory arguments to SIMD arithmetic instructions.
Different types of SIMD memory operations have different alignment restrictions, and different performance penalties for SIMD access to data that is not SIMD-aligned. AVX is much easier to work with than SSE2/3/4, so compilers tend to use the SIMD memory access instructions much more often than they used to.
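As a rough illustration of the two kinds of SSE2 load, the sketch below sums an array of doubles two at a time. It uses _mm_load_pd (which requires a 16-byte-aligned address) when the pointer allows it and _mm_loadu_pd (no alignment requirement) otherwise. The function name sum_pairs is made up for this example, and the scalar fallback for non-SSE2 builds is only there so the code compiles everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sum x[0..n), using packed-double loads where SSE2 is available. */
double sum_pairs(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double total = 0.0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();
    if (((uintptr_t)x & 15u) == 0) {
        /* Aligned path: x + i stays 16-byte aligned because i is even. */
        for (; i + 2 <= n; i += 2)
            acc = _mm_add_pd(acc, _mm_load_pd(x + i));
    } else {
        /* Unaligned path: same arithmetic, no alignment requirement. */
        for (; i + 2 <= n; i += 2)
            acc = _mm_add_pd(acc, _mm_loadu_pd(x + i));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];
#endif
    for (; i < n; ++i)                 /* leftover element, or scalar build */
        total += x[i];
    return total;
}
```

Calling it on buf and on buf + 1 for the same array exercises both paths, since exactly one of the two addresses is 16-byte aligned. Whether the aligned path is actually faster depends on the processor generation, as discussed above.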
SIMD memory operations don't necessarily provide any performance benefit. For data that is outside of the L1 cache, all data motion takes place in 64-byte cache lines. In some cases, Sandy Bridge processors actually get slightly better memory performance with scalar SSE or scalar AVX loads than with SIMD loads. For Haswell processors, the AVX instructions give better bandwidth in all the cases I have tested, but there may still be counter-examples.
As the vector width of the SIMD units increases, it becomes increasingly difficult to deal with the cases where data rearrangement is required. For packed doubles in SSE, you only needed to be able to load the low part, load the high part, or swap the two parts. For packed doubles in AVX, there are many more possible rearrangements that sometimes need to be dealt with. By the time you get to 8-element vectors of doubles in AVX-512, it can be extremely challenging to figure out how to rearrange data in the registers without losing the speedup that you are trying to get from the wide SIMD architecture.
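For reference, the three SSE packed-double cases mentioned above look roughly like this with SSE2 intrinsics. The wrapper names swap_pair and gather_pair are hypothetical. The intrinsics themselves (_mm_loadl_pd, _mm_loadh_pd, _mm_shuffle_pd) come from emmintrin.h, and the scalar branch is only a fallback for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Swap the two halves of a packed pair: {lo, hi} -> {hi, lo}. */
void swap_pair(const double in[2], double out[2])
{
    assert(in != NULL && out != NULL);
#if defined(__SSE2__)
    __m128d v = _mm_loadu_pd(in);
    /* Immediate 1: result low = v[1], result high = v[0]. */
    v = _mm_shuffle_pd(v, v, 1);
    _mm_storeu_pd(out, v);
#else
    double lo = in[0], hi = in[1];
    out[0] = hi;
    out[1] = lo;
#endif
}

/* Build one packed pair from two unrelated addresses by loading the low
   part and the high part separately. */
void gather_pair(const double *lo, const double *hi, double out[2])
{
    assert(lo != NULL && hi != NULL && out != NULL);
#if defined(__SSE2__)
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, lo);           /* replace low half, keep high half */
    v = _mm_loadh_pd(v, hi);           /* replace high half, keep low half */
    _mm_storeu_pd(out, v);
#else
    double a = *lo, b = *hi;
    out[0] = a;
    out[1] = b;
#endif
}
```

With two lanes these three operations cover every possible arrangement. With four or eight lanes the number of arrangements grows quickly, and shuffles that cross 128-bit boundaries tend to need different instructions, which is where the difficulty comes from.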