
Arbitrary interleaver (shuffle) using IPP

egrayver — Tue, 11 Sep 2012 18:11:46 GMT

I read somewhere that the new processors include special instructions for small lookup tables. Is there a way to optimize the following simple operation:

float data[10] = {0, ..., 9};

unsigned int idx[10] = {2, 3, 5, 0, ..., 9}; // Arbitrary permutation of 0..9

float result[10];

result[i] = data[idx[i]] // for every i in 0..9

I have to do this operation often, and it takes quite a bit of time in a 'for' loop. Currently I use:
for (int i = 0; i < 10; i++) result[i] = data[idx[i]];
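For reference, the loop above can be written as a self-contained function with the element indices explicit. This is only a minimal scalar sketch, not an IPP routine: the name permute_f32 is hypothetical, and the restrict qualifiers assume that result does not overlap data or idx, which gives a compiler more freedom when optimizing the loop.

```c
#include <stddef.h>

/* Hypothetical helper (not part of IPP): result[i] = data[idx[i]] for i in [0, n).
   Assumptions: every idx[i] is a valid index into data, and result does not
   overlap data or idx (that is what restrict promises to the compiler). */
static void permute_f32(const float *restrict data,
                        const unsigned int *restrict idx,
                        float *restrict result,
                        size_t n)
{
    for (size_t i = 0; i < n; i++) {
        result[i] = data[idx[i]];
    }
}
```

Called as permute_f32(data, idx, result, 10), it fills result with the elements of data in the order given by idx.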


Chuck_De_Sylva — Tue, 11 Sep 2012 20:42:23 GMT

Have you checked whether your compiler's auto-vectorizer is turned on? That will probably help a lot. Since the loop has only 10 elements, the overhead of threading the function may make performance worse.


egrayver — Wed, 12 Sep 2012 17:05:20 GMT

The 10-element array was just an example. Actual arrays may have 1000 elements. I believe icpc will auto-vectorize when the -O3 switch (/O3 on Windows) is used.
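For arrays of this size, one option on AVX2-capable processors is the hardware gather instruction, reached through the _mm256_i32gather_ps intrinsic from immintrin.h. The following is a hedged sketch rather than a recommended implementation. The function names are hypothetical, the target attribute and __builtin_cpu_supports are GCC/Clang extensions (an assumption about the toolchain), and the gather treats indices as signed 32-bit values, so every index is assumed to be below 2^31. Whether it beats the plain loop depends on the processor and should be measured.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define GATHER_HAVE_X86 1
#endif

/* Scalar fallback: result[i] = data[idx[i]]. */
static void gather_scalar(const float *data, const unsigned int *idx,
                          float *result, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        result[i] = data[idx[i]];
    }
}

#ifdef GATHER_HAVE_X86
/* AVX2 path: gathers 8 floats per iteration.
   The target attribute is a GCC/Clang extension that enables AVX2
   for this one function without a global compiler flag. */
__attribute__((target("avx2")))
static void gather_avx2(const float *data, const unsigned int *idx,
                        float *result, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Load 8 indices (unaligned), interpreted as signed 32-bit. */
        __m256i vidx = _mm256_loadu_si256((const __m256i *)(idx + i));
        /* Scale 4 = sizeof(float): address is data + 4 * index. */
        __m256 v = _mm256_i32gather_ps(data, vidx, 4);
        _mm256_storeu_ps(result + i, v);
    }
    /* Remaining elements when n is not a multiple of 8. */
    for (; i < n; i++) {
        result[i] = data[idx[i]];
    }
}
#endif

/* Hypothetical entry point: picks the AVX2 path when the CPU reports
   support at run time, otherwise uses the scalar loop. */
static void gather_f32(const float *data, const unsigned int *idx,
                       float *result, size_t n)
{
#ifdef GATHER_HAVE_X86
    if (__builtin_cpu_supports("avx2")) {
        gather_avx2(data, idx, result, n);
        return;
    }
#endif
    gather_scalar(data, idx, result, n);
}
```

Both paths assume that every index is in range for data. Neither performs bounds checking.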