Unaccounted for split loads (Sandy Bridge) with 32 byte aligned data GENERAL EXPLORATION profile

andy-nisbet · 11-03-2011

Hello,
I'm using Amplifier XE on Ubuntu. I am using AVX intrinsics and am compiling with -O2 -g -xavx (icpc 12.04).
The code section below sits inside an OMP parallel section, in a dynamically scheduled for loop with a chunk size of 128; i is the loop iterator (not shown). For this code I am getting unaccounted-for split loads, which suggests unaligned loads, and they are tied to instructions in the loop body. I am also getting some loads blocked by store forwarding inside the loop. This occurs for the 1-, 2-, 3- and 4-thread versions; NTHREADS and OMP_NUM_THREADS are set appropriately, along with KMP_AFFINITY. I have not tested it as fully sequential (i.e. no OMP) code.

Ipp64f tempResults[4] __attribute__ ((aligned(32)));
avxResult = _mm256_setzero_pd ();
for (jpar = lowerBound; jpar < upperBound; jpar++) {
    //assert(((unsigned long)vecStart[jpar])%4 ==0);
    avxVector = _mm256_load_pd (vecStart[jpar]);
    avxMatrix = _mm256_load_pd (vecNNZ+4*jpar);
    avxProducts = _mm256_mul_pd(avxMatrix,avxVector);
    avxResult = _mm256_add_pd(avxResult,avxProducts);
}
_mm256_store_pd (tempResults,avxResult);
val = tempResults[0] +tempResults[1] + tempResults[2] + tempResults[3];
myResults += alpha*val;

vecStart, vecNNZ and myResults are allocated using ippMalloc and are therefore 32-byte aligned.
val and alpha are Ipp64f data types as well; jpar is an Ipp32u. Am I correct that I do not need an alignment attribute for Ipp data types?
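
For reference, a split load is one whose bytes straddle a cache-line boundary. Assuming 64-byte lines (as on Sandy Bridge), a 32-byte load from a 32-byte aligned address cannot split. Note also that 32-byte alignment of the vecStart array itself says nothing about the pointers stored in it; it is the value of vecStart[jpar] that matters. A minimal sketch of the check follows; crosses_cache_line and CACHE_LINE_BYTES are hypothetical names, not part of IPP or Amplifier XE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes (64 on Sandy Bridge). */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if an access of `size` bytes (size > 0) starting at p
   touches more than one cache line, i.e. would be a split access. */
static int crosses_cache_line(const void *p, size_t size)
{
    uintptr_t first_line = (uintptr_t)p / CACHE_LINE_BYTES;
    uintptr_t last_line  = ((uintptr_t)p + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

Logging crosses_cache_line(vecStart[jpar], 32) and crosses_cache_line(vecNNZ + 4*jpar, 32) for a few iterations would show whether any of the 32-byte loads can really split.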

I have run the code with the assert uncommented and the addresses passed to the vector load are aligned.
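
One caveat: the commented-out assert as posted tests the address modulo 4, which only establishes 4-byte alignment. A 32-byte check needs modulo 32 (or, equivalently, a mask of 31). A minimal sketch, where is_aligned is a hypothetical helper and not an IPP or compiler function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `alignment` bytes.
   `alignment` must be a power of two (e.g. 32 for AVX aligned loads). */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```

In the loop this would read assert(is_aligned(vecStart[jpar], 32)); and, for the second load, assert(is_aligned(vecNNZ + 4*jpar, 32));.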

Have I missed something obvious concerning the code or the compilation options? I am running with hyperthreads and turbo mode disabled and the performance governor (max clock rate set) for each of the 4 cores.

The for loop above may run for only a few iterations, as lowerBound and upperBound depend on the number of nonzero values in a row of a sparse matrix.
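
To make the arithmetic of the loop concrete, and to have something to cross-check the AVX result against, here is a scalar sketch of the same computation. It assumes plain double in place of Ipp64f, and row_dot_reference and its parameter names are hypothetical; the layout (one pointer to four doubles per jpar, and a contiguous array indexed by 4*jpar) is read off the posted code.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the posted loop: for each j in [lower, upper),
   multiply the four doubles at matrix[4*j .. 4*j+3] by the four doubles
   that vec_blocks[j] points to, accumulating per lane as the AVX code does. */
static double row_dot_reference(const double *const *vec_blocks,
                                const double *matrix,
                                size_t lower, size_t upper)
{
    double lanes[4] = {0.0, 0.0, 0.0, 0.0};
    for (size_t j = lower; j < upper; ++j) {
        for (int k = 0; k < 4; ++k) {
            lanes[k] += matrix[4 * j + k] * vec_blocks[j][k];
        }
    }
    /* Same left-to-right order as the tempResults sum in the post. */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

An empty range (lower == upper) returns 0.0, which matches the AVX version starting from _mm256_setzero_pd.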

Any helpful comments would be appreciated. I believe I must have misunderstood something, and/or that the events are not accurately tied to the source/assembler as indicated in Amplifier XE.

If there is a more accurate tool then please let me know. I have used Pin directly before, and am contemplating hooking the function and examining it directly.

Thanks,

Andy

TimP · 11-03-2011

In the past we have had issues with Amplifier family tools reporting unaligned loads in a situation like this where only aligned data are valid, so the application would terminate on unaligned data, and assertions didn't catch any misalignment. I spent a lot of time with Intel AVX compilers assuring myself that data objects were aligned as requested, when some evidence raised doubts.
I would guess the indications aren't valid, and you would be justified in filing a problem report on premier.intel.com, if you are able to include a reproducer. I would expect some noise, i.e. I would not pay attention to counts which amounted to less than 1% of instructions.
If you don't get an answer here to the IPP-specific questions, I would suggest the IPP forums. I would hope that such an IPP data type would translate to 32-byte aligned storage. Do the header files give a clue?
