Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## loop optimization for non-uniform access to an array

Beginner
286 Views

Hi,

I have the following kind of loop that I am looking to optimize for Intel compiler 18.0:

SomeData* sourcePtr = GetMySourceSomeData();
SomeData* saPtr = GetMySomeDataPointer();
int size = GetSizeSomeData(saPtr);
// indexArray has series of indices based on some business logic, can be considered random.
int* indexArray = GetRandomIndexArray();
SomeData zero = SomeData(0);

//below loop needs to be optimized.
for (int point = 0; point < size; indexArray++, saPtr++, point++)
{
*(saPtr) = (GetDecisionMaker(point))?sourcePtr[*(indexArray)]:zero;
}

// GetDecisionMaker(point) returns a boolean value based on some business logic, can be considered random.

With Intel compiler 13.0 we had good performance, but with 18.0 we no longer do.

All help is welcome!
Thanks.
4 Replies
Black Belt

Try this first:

for (int point = 0; point < size; point++)
{
saPtr[point] = (GetDecisionMaker(point))?sourcePtr[indexArray[point]]:zero;
}

In other words, index with point instead of advancing the pointers (which may have a lifespan after the loop).

Be aware that newer CPU architectures have scatter/gather instructions. If SomeData is a "standard" type (char, short, int, float, double, ...) and GetDecisionMaker can be evaluated vector-wise to generate a mask, then the compiler may vectorize the loop (assuming appropriate compiler optimization options and/or #pragma directives are used).
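
As a minimal sketch of the loop shape described above, assuming the element type is a plain int and the decision has already been reduced to an array of flags: the function name masked_gather and all parameter names are hypothetical and not from the original code. The #pragma omp simd line is only a hint, and it also asserts that dst does not overlap src, index, or decision.

```cpp
#include <cstddef>

// Hypothetical stand-in for the loop in this thread, with int elements.
// decision[i] != 0 selects the gathered value; otherwise zero is stored.
void masked_gather(int* dst, const int* src, const int* index,
                   const unsigned char* decision, std::size_t n)
{
    // Hint only: it takes effect when OpenMP SIMD support is enabled
    // and is ignored otherwise. The loop is correct either way.
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        // Indexed load (gather candidate) guarded by a per-element mask.
        dst[i] = decision[i] ? src[index[i]] : 0;
    }
}
```

Whether this actually becomes a gather depends on the target CPU and the compiler switches. With the Intel compiler, the relevant options are, as far as I know, along the lines of -qopenmp-simd plus a target option such as -xCORE-AVX2, with -qopt-report to check what the vectorizer did; verify these against the documentation for your compiler version.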

Jim Dempsey

Beginner

Thank you Jim for the reply!

I vectorized GetDecisionMaker(point) into an array, decisionArray, and now have the code below. saPtr points to an array of structures, each containing two integers.

typedef struct SomeArrayStruct { int x; int y; } SomeArray;

for (int point = 0; point < size; point++)
{
saPtr[point] = decisionArray[point] ? sourcePtr[indexArray[point]]:zero;
}

The performance is still the same as with the previous code; there is no improvement.

Thanks,

~Abhishek

Black Belt
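// Treat each 8-byte { int x; int y; } element as a single 64-bit value.
__int64* saPtrX = (__int64*)saPtr;
__int64* sourcePtrX = (__int64*)sourcePtr;
__int64 zeroX = 0; // renamed so it does not clash with the SomeData zero above
// Index with point only; do not also advance the pointers.
for (int point = 0; point < size; point++)
{
saPtrX[point] = decisionArray[point] ? sourcePtrX[indexArray[point]]:zeroX;
}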
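
The idea in the snippet above appears to be that an 8-byte struct of two ints can be moved as one 64-bit value, which gives the compiler a "standard" element type to work with. Below is a self-contained sketch of the same idea under stated assumptions: it uses std::int64_t instead of the compiler-specific __int64, and it swaps the pointer cast for std::memcpy to stay clear of strict-aliasing problems. The function name select_as_int64 and its parameters are hypothetical, and whether this form vectorizes as well as the cast version is something to measure.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct SomeArray { int x; int y; };

// The sketch only makes sense if the struct is exactly 64 bits wide.
static_assert(sizeof(SomeArray) == sizeof(std::int64_t),
              "sketch assumes an 8-byte struct with no padding");

// Hypothetical helper: copy src[index[i]] into dst[i] where decision[i]
// is set, and store an all-zero element otherwise.
void select_as_int64(SomeArray* dst, const SomeArray* src, const int* index,
                     const unsigned char* decision, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        std::int64_t bits = 0;  // all-zero bits correspond to {0, 0}
        if (decision[i]) {
            // One 64-bit load of the whole element.
            std::memcpy(&bits, &src[index[i]], sizeof bits);
        }
        // One 64-bit store of the whole element.
        std::memcpy(&dst[i], &bits, sizeof bits);
    }
}
```

Compilers generally turn fixed-size memcpy calls like these into plain loads and stores, so the copy itself should not add overhead.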

Jim Dempsey

Black Belt

Note: the scatter/gather approach from my earlier post requires a CPU that supports those instructions, plus a compiler option and/or #pragma hint that tells the compiler to use them.

Jim Dempsey