I have a AVX vector register reg containing 4 double values, let's call them (in order): 0 - 2 - 3 - 4
These values have to be added to distinct locations of an array A, namely to positions A, A, A, A
In other words:
A += reg, A += reg and so on
This is a quite recurrent situation in my program, i.e. sequences of load-add-stores that are "almost" unit-stride - but actually they are not.
At the beginning I thought I could have used some sort of shuffle instructions to shift values in reg, i.e. getting 0 - x - 2 - 3 (and maybe treating reg as a scalar value), and then perfom standard 256-bit instructions. However, as far as I know, I can't reduce that kind of shifting to a single instruction, right?
Related to this question, let's say that now reg is 0 - 2 - 3 - 5. Should I treat all 4 values as scalar values or is there a way of efficiently (1/2 instructions?) extracting the two values in the middles (i.e. those crossing the two 128-bits lanes) into a 128-bit register ?
Depending on the actual use case you could go three ways:
1. Use neutral data to fill the gaps and always operate on full 128/256 bit strides. E.g. if you need to add numbers to some elements of the array, just put zeros in the not-to-be-affected elements and perform regular additions.
2. A slight modification of point one, if no such neutral data exist. Just do the same and then perform blend to retain the untouched elements. You may have to construct the mask for the blend operation.
3. If the data you have to operate on is too disperse and especially when the indices of the elements are dynamically obtained, the above ways may become ineffective. In this case you can try gathering data (with a single instruction or manually), performing the operation and then scattering the results (this can only be done manually for now, e.g. with extract operations). However, if this routine is a hot spot, I would consider rearranging the input and output data because gather/scatter are expensive, no matter how implemented, and this overhead can easily overweight the win of the vectorization of the actual computation (this depends on the computation complexity, of course).
>>A += reg, A += reg and so on
I imagine that at potentialy a next time interval you will have another A += reg, A += reg and so on
This being the case then consider implementing something equivilent to a reduction variable (an internal variable use to avoid a time penalty of using the actual external/intended variable). In this case the internal variable, conceptualizes itself into a small vector A0234. This variable contains the accumulation of the above statement. Only then after your major computation loop do you apply them (reduction operation) to the external variables (non-unit stride array elements).