Hi all! I am parallelizing a certain dynamic programming problem using AVX2.
In the main iteration of my calculation, I compute a column of a matrix where each cell is an AVX2 register (__m256i). I use the values from the previous column as input for calculating the current one. Columns can be big, so I keep an array of __m256i values, like this: __m256i prevColumn[N]
I know that __m256i basically represents an AVX2 register, so I am wondering how I should think about this array and how it behaves, since N is much larger than 16 (which is the number of AVX registers). Is it good practice to create such an array, or is there a better approach that I should use when storing a lot of __m256i values that are going to be reused very soon?
If you create a static array of __m256x type where the array size is <= 16, I suppose the compiler will try to load every array member into a corresponding AVX register. Of course, the aforementioned array will need to be allocated either on the stack or in global memory (pointed to by the DS segment on the Windows platform). I am not sure whether the compiler, when compiling such an array, will load it directly into AVX registers. Anyway, the best option is to compile a simple test case and step through it with a debugger.
> I know that __m256i basically represents an AVX2 register
That's not exactly true. The compiler will do its best to keep relevant __m256i values in registers as much as possible, just as it does for ints, pointers and other fundamental types. When it runs out of registers, it will happily spill the least-used values to the stack. Note that there is no way to force the compiler to keep a value in a register; in fact, in debug mode you will notice that __m256i values are constantly loaded from and stored back to their storage on the stack. The only way to enforce register placement is to write in assembler.
As for arrays, they pose additional difficulties to the optimizer. For example, unless the array is placed on the stack, the compiler would have a hard time proving that the array cannot be modified through different pointers. You would have to "promise" that to the compiler with the __restrict keyword. Then there is the array size vs register count consideration.
In general, it is perfectly possible to have an array of __m256i. I wouldn't expect array elements to be mapped onto registers, though. The array will consume memory which, by the way, must be properly aligned (and sometimes you must ensure this manually), and the compiler will produce code that loads elements from that memory and stores them back to it.
> Is it good practice to create such an array, or is there a better approach that I should use when storing a lot of __m256i values that are going to be reused very soon?
IMHO, there's not much point in creating an array of vector types, especially since the original data is likely to be of an elementary type (i.e. int32_t and not __m256i). With an array of vector types your code will be a bit cleaner (no explicit intrinsics are needed for loads and stores to the array), but at the assembly level there's no difference. Personally, I would prefer explicit memory loads and stores via intrinsics and operating on the intermediate __m256i values. This way you separate array accesses from the data modifications, which can hopefully be performed without spilling temporary values to memory. This is also useful if your input/output data can be unaligned.
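To illustrate the "explicit loads and stores" style, here is a minimal sketch. The column update itself (a saturating 8-bit add of a hypothetical delta array) and all function names are made up for illustration; the real DP recurrence will differ. It also assumes GCC or Clang on x86, since it uses the target attribute and __builtin_cpu_supports so that it builds without -mavx2; with -mavx2 on the command line you could drop both. The columns are plain int8_t arrays, the __m256i values live only inside the loop body, and the unaligned load/store intrinsics are used so the arrays need no special alignment.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar fallback and tail handler: saturating signed 8-bit add per cell.
static void update_column_scalar(const std::int8_t* __restrict prev,
                                 const std::int8_t* __restrict delta,
                                 std::int8_t* __restrict cur,
                                 std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i) {
        int s = int(prev[i]) + int(delta[i]);
        cur[i] = static_cast<std::int8_t>(s > 127 ? 127 : (s < -128 ? -128 : s));
    }
}

// AVX2 path: processes 32 cells per iteration and returns how many cells it
// handled. __restrict promises the compiler that the arrays do not overlap.
__attribute__((target("avx2")))
static std::size_t update_column_avx2(const std::int8_t* __restrict prev,
                                      const std::int8_t* __restrict delta,
                                      std::int8_t* __restrict cur,
                                      std::size_t n)
{
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        // Explicit (unaligned) loads from plain memory into vector values.
        __m256i p = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(prev + i));
        __m256i d = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(delta + i));
        // All arithmetic happens on the intermediate __m256i values.
        __m256i r = _mm256_adds_epi8(p, d);
        // Explicit store back to the plain array.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(cur + i), r);
    }
    return i;
}

// Hypothetical entry point: cur[i] = saturating_add(prev[i], delta[i]).
void update_column(const std::int8_t* prev, const std::int8_t* delta,
                   std::int8_t* cur, std::size_t n)
{
    assert(n == 0 || (prev && delta && cur));
    std::size_t done = 0;
    if (__builtin_cpu_supports("avx2"))
        done = update_column_avx2(prev, delta, cur, n);
    update_column_scalar(prev, delta, cur, done, n);  // tail, or everything
}
```

If the arrays are known to be 32-byte aligned, _mm256_load_si256 and _mm256_store_si256 can replace the unaligned variants.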
IIRC, the __m256x type is a union of static arrays (at least in MSVC's headers).
>>> Personally, I would prefer explicit memory loads and stores via intrinsics and operating on the intermediate __m256i values.>>>
This is my preferred method of working with __m256x data types.
Thank you both a lot, now I understand better what is happening!
I agree with your advice: I will define the array explicitly; that way I will have control over loading/storing, and I can do the modifications in registers, as you said.
I am not completely sure how to align this array correctly, though. If I use chars and I have 16 of them in an AVX register, is this OK then?
Martin S. wrote:
I am not completely sure how to align this array correctly, though. If I use chars and I have 16 of them in an AVX register, is this OK then? char myArray
YMM registers are 32 bytes, not 16.
The way you align the array depends on where the array is stored. __attribute__((aligned(32))) would only suffice if the array has static storage (e.g. is declared in namespace scope). If the array is on the stack, you will probably have to align the pointer manually, since the stack frame typically has a smaller alignment (16 bytes on Linux x86_64, for example). You can create a larger storage area and use std::align() to align the pointer to the beginning of the array. If the array is on the heap, you will have to use specialized memory allocation routines like std::aligned_alloc() or posix_memalign().
And if you use std::vector, allocation is done by an allocator, so you need to wrap _mm_malloc in a custom allocator accordingly.
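A rough sketch of the three cases described above, in case it helps. The names (kAlign, kCells, align_within, heap_column and so on) are hypothetical, the static case uses the standard alignas spelling instead of the GCC attribute, and the heap case assumes a POSIX system because it calls posix_memalign().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <stdlib.h>   // posix_memalign (POSIX only)

constexpr std::size_t kAlign = 32;   // size of a YMM register in bytes
constexpr std::size_t kCells = 256;  // hypothetical column length in bytes

inline bool is_aligned(const void* p, std::size_t a = kAlign)
{
    return reinterpret_cast<std::uintptr_t>(p) % a == 0;
}

// 1. Static storage: an alignment specifier on the declaration is enough.
namespace {
alignas(kAlign) std::int8_t g_column[kCells];
}
std::int8_t* static_column() { return g_column; }

// 2. Stack (or any raw buffer): over-allocate by kAlign - 1 bytes in the
//    caller and let std::align find the first suitably aligned address.
//    Returns nullptr if the buffer is too small.
std::int8_t* align_within(void* raw, std::size_t raw_size, std::size_t bytes)
{
    void* p = raw;
    std::size_t space = raw_size;
    void* aligned = std::align(kAlign, bytes, p, space);
    assert(aligned == nullptr || is_aligned(aligned));
    return static_cast<std::int8_t*>(aligned);
}

// 3. Heap: use an aligned allocation routine; release with std::free().
std::int8_t* heap_column(std::size_t bytes)
{
    void* p = nullptr;
    if (posix_memalign(&p, kAlign, bytes) != 0)
        return nullptr;
    return static_cast<std::int8_t*>(p);
}
```

For the stack case the caller would declare something like `unsigned char raw[kCells + kAlign - 1];` and pass `raw` and `sizeof raw` to align_within().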
For this purpose I recommend: https://gist.github.com/donny-dont/1471329
There are other implementations of an aligned allocator, but be aware that not all of them are complete, and some might have strange side effects (like vector assignments not working anymore).
I have been using the one linked above for several months without any problems. Basically, I prefer using vectors over doing allocation and freeing on my own, if I have C++. And with an aligned allocator they are fully compatible with SSE and AVX aligned loads and stores.
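For readers who just want to see the shape of such an allocator, below is a minimal C++11 sketch that wraps _mm_malloc/_mm_free. To be clear, this is not the code from the gist linked above; AlignedAllocator and AlignedVector are hypothetical names, and it assumes an x86 compiler that provides _mm_malloc through <immintrin.h>. The explicit rebind member is there because the alignment is a non-type template parameter, which the default rebinding in std::allocator_traits cannot handle.

```cpp
#include <immintrin.h>   // _mm_malloc / _mm_free
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

template <class T, std::size_t Alignment = 32>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    template <class U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n)
    {
        if (n == 0)
            return nullptr;
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_alloc();               // n * sizeof(T) would overflow
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p)
            throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(p) % Alignment == 0);
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

// The allocator is stateless, so any two instances are interchangeable.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept
{
    return true;
}
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept
{
    return false;
}

// Convenience alias: a std::vector whose buffer is 32-byte aligned.
template <class T>
using AlignedVector = std::vector<T, AlignedAllocator<T, 32>>;
```

With this, `AlignedVector<std::int8_t> column(n);` gives a buffer whose data() pointer can be passed to the aligned load/store intrinsics, and copy assignment between such vectors keeps working.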