Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Unaccounted for split loads (Sandy Bridge) with 32 byte aligned data GENERAL EXPLORATION profile

andy-nisbet
Beginner
Hello,
I'm using Amplifier XE on Ubuntu. I am using AVX intrinsics and am compiling with -O2 -g -xavx (icpc 12.04).
For the code section below, I am getting unaccounted-for split loads, which suggest unaligned loads, tied to instructions in the loop body. (The code is contained in an OMP parallel section, inside a dynamically scheduled for loop with a chunk size of 128; i is the loop iterator, NOT SHOWN.) I am also getting some loads blocked by store forwarding inside the loop. This occurs for the 1-, 2-, 3-, and 4-thread parallel versions; NTHREADS and OMP_NUM_THREADS are set appropriately, along with KMP_AFFINITY. I have not tested it as fully sequential (i.e. no OMP) code.

Ipp64f tempResults[4] __attribute__((aligned(32)));
avxResult = _mm256_setzero_pd();
for (jpar = lowerBound; jpar < upperBound; jpar++) {
    //assert(((unsigned long)vecStart[jpar])%4 ==0);
    avxVector = _mm256_load_pd(vecStart[jpar]);
    avxMatrix = _mm256_load_pd(vecNNZ + 4*jpar);
    avxProducts = _mm256_mul_pd(avxMatrix, avxVector);
    avxResult = _mm256_add_pd(avxResult, avxProducts);
}
// spill the four partial sums to memory and add them as scalars
_mm256_store_pd(tempResults, avxResult);
val = tempResults[0] + tempResults[1] + tempResults[2] + tempResults[3];
myResults += alpha*val;
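
Editorial aside on the final reduction: the code stores the 256-bit accumulator to tempResults and then reloads it as four scalars. One possible alternative is to add the four lanes in registers, as sketched below. This is only an illustration; hsum256 and sum4 are hypothetical names, and nothing in this thread establishes that the change would affect the store-forwarding counts that Amplifier XE reports. The scalar branch is there only so the sketch builds on non-x86 targets.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: add the four lanes of v without going through memory. */
__attribute__((target("avx")))
static inline double hsum256(__m256d v)
{
    __m128d lo  = _mm256_castpd256_pd128(v);    /* lanes 0 and 1 */
    __m128d hi  = _mm256_extractf128_pd(v, 1);  /* lanes 2 and 3 */
    __m128d s   = _mm_add_pd(lo, hi);           /* {v0+v2, v1+v3} */
    __m128d top = _mm_unpackhi_pd(s, s);        /* {v1+v3, v1+v3} */
    return _mm_cvtsd_f64(_mm_add_sd(s, top));   /* (v0+v2) + (v1+v3) */
}

/* Hypothetical wrapper so the sketch can be called from plain C code. */
__attribute__((target("avx")))
double sum4(const double *p)
{
    return hsum256(_mm256_loadu_pd(p));         /* unaligned load is fine here */
}
#else
/* Scalar fallback with the same summation order (assumption: non-x86 build). */
double sum4(const double *p)
{
    return (p[0] + p[2]) + (p[1] + p[3]);
}
#endif
```

In the loop above, the call would be val = hsum256(avxResult); in place of the store and the four scalar loads. The summation order differs from tempResults[0] + ... + tempResults[3], so the last bits of the result may differ.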

vecStart, vecNNZ, and myResults are allocated using ippMalloc and are therefore 32-byte aligned.
val and alpha are Ipp64f data types as well. jpar is an Ipp32u. Am I correct that I do not need an alignment attribute for Ipp data types?

I have run the code with the assert uncommented, and the addresses passed to the vector load are aligned.
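
Editorial aside: the commented-out assertion tests the address modulo 4, which only establishes 4-byte alignment, whereas _mm256_load_pd is documented to require a 32-byte aligned address. A minimal sketch of a stricter check follows; is_aligned and check_avx_operand are hypothetical helper names, not part of IPP or of the code in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if the address p is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: abort (unless NDEBUG is defined) if p is not
 * 32-byte aligned, i.e. not a valid operand for an aligned 256-bit load. */
void check_avx_operand(const double *p)
{
    assert(is_aligned(p, 32));
}
```

In the loop above this would be called as check_avx_operand(vecStart[jpar]); and check_avx_operand(vecNNZ + 4*jpar); before the two loads. An address such as base + 8 passes the modulo-4 test but fails this one.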

Have I missed something obvious concerning the code or the compilation options? I am running with hyperthreads and turbo mode disabled and the performance governor (max clock rate set) for each of the 4 cores.

The for loop above may run for only a small number of iterations, as lowerBound and upperBound depend on the number of nonzero values in a row of a sparse matrix.

Any helpful comments would be appreciated. I believe I must have misunderstood something, and/or that the events are not accurately tied to the source/assembler as indicated in Amplifier XE.

If there is a more accurate tool, then please let me know. I have used Pin directly before, and am contemplating hooking the function and examining it directly.

Thanks,

Andy

1 Reply
TimP
Honored Contributor III
In the past we have had issues with Amplifier family tools reporting unaligned loads in situations like this, where only aligned data are valid: the application would terminate on unaligned data, and assertions caught no misalignment. I spent a lot of time with the Intel AVX compilers assuring myself that data objects were aligned as requested when some evidence raised doubts.
I would guess the indications aren't valid, and you would be justified in filing a problem report on premier.intel.com if you are able to include a reproducer. I would expect some noise; i.e., I would not pay attention to counts amounting to less than 1% of instructions.
If you don't get an answer here to the IPP-specific questions, I would suggest the IPP forums. I would hope that such an IPP data type would translate to 32-byte aligned storage. Do the header files give a clue?