Core Optimizations (1/8)

srimks · ‎04-17-2009

Hello,

I would like to discuss single-core optimization. For one of my codes, the vec-report3 output says that most of the vector data hazards (ANTI and FLOW dependences) exist within the bottom (innermost) FOR loop.
--

Adding the distribute, unroll_and_jam, and swp pragmas at the places identified within the bottom (innermost) FOR loop does not improve performance.

I need to optimize this code for a single core only, within a node of an SMP system that has 8 cores (Intel Xeon 5345 processor).

As I understand it, OpenMP pragmas cannot improve performance or distribute work among threads on a single core; please correct me if I am wrong. (If I could use all 8 cores in the node, OpenMP would certainly give better thread-level parallelism (TLP), and vectorization within each thread for selected loops or sections would improve performance further.)

The remaining option is to rewrite some sections of the code using SIMD intrinsics. Is that the right decision for single-core performance?

If writing SIMD intrinsics would give some performance gain on a single core for the above code, then which areas of the code should I focus on?

This seems to be a great challenge and learning experience. Could you please help with some suggestions?

~BR

jimdempseyatthecove · ‎04-17-2009

BR

In the above, abbtemp is not defined. Is it a static array? If so, you will/may eventually have problems if/when you convert to parallel. abbtemp is not used within the loops.

Conversion operations are expensive with respect to vectorization (when additional calculations are involved).

Consider maintaining two equivalent arrays (vectors) for the appropriate data, one as int for indexing and one as double for use in expressions that require double. You will have fewer conversions this way.

abbInt[2048][XYZ]
abbDouble[2048][XYZ]

where abbDouble = (double)abbInt

(same for crdabb and potentially elsewhere)

If you do not wish to maintain one of these for every team, then use a separate loop to perform the conversions to double. In this case, two loops will be faster than using one loop with conversions in the expressions. Also 2048*8*4 (or 3 if you wish) = 65,536 bytes and will fit nicely within most L1 caches. So the time spent to fetch it back out may be negligible (as compared to non-vector operations in your inner loop).
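The two-loop idea above might be sketched as follows. The original code was not posted, so the function names `convert_to_double` and `dot`, and the flat array layout, are hypothetical; this is an illustration under those assumptions, not the poster's actual routine.

```c
#include <stddef.h>

/* Pass 1: convert the int data to double once, outside the hot loop. */
void convert_to_double(const int *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (double)src[i];
}

/* Pass 2: the compute loop touches only double data, so no
 * int-to-double conversion appears inside the expression. */
double dot(const double *v, const double *w, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i] * w[i];
    return sum;
}
```

The int array stays available for indexing, while the double copy feeds the arithmetic. Whether this is actually faster should be confirmed with timing runs on the real code.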

On current-generation cores, SSEn.nn code runs faster and optimizes better when the data is known to be aligned.

Consider changing your vectors from XYZ to XYZW (4 elements) but where W is not used. This way X and Y are known to be aligned. (Remember to declare your arrays (VEC) as being 16-byte aligned.)

for (a = 3; a < raleigh; a++ ) {
chicago = bye;
c = abbDouble[chicago] - crdabb;
c = abbDouble[chicago] - crdabb;
c = abbDouble[chicago] - crdabb;

crd[chicago] = crdabbDouble + c * p + c * p + c * p;
crd[chicago] = crdabbDouble + c * p + c * p + c * p;
crd[chicago] = crdabbDouble + c * p + c * p + c * p;
}
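A padded, aligned XYZW layout could look like the sketch below. The type name `vec4d`, the array `crd4`, and the helper `is_aligned16` are hypothetical, and C11 `alignas` is an assumption here; a compiler of that era would more likely need a compiler-specific alignment attribute instead.

```c
#include <stdalign.h>
#include <stdint.h>

/* x, y, z plus an unused w so every element is 32 bytes.
 * alignas(16) on the first member raises the alignment of the whole struct. */
typedef struct {
    alignas(16) double x;
    double y;
    double z;
    double w;   /* padding only, never read */
} vec4d;

_Static_assert(sizeof(vec4d) == 32, "vec4d should be exactly four doubles");

/* Hypothetical coordinate table using the 2048 rows mentioned in the post. */
static vec4d crd4[2048];

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With this layout the (x, y) pair and the (z, w) pair of every element each start on a 16-byte boundary, which is what aligned SSE loads of two doubles require.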

Jim Dempsey

srimks · ‎04-17-2009

Hello Jim,

I have a few questions:

(a) How do I perform SIMD operations on a single core (1/8) of an SMP node?
(b) How do I know that the corrected code with SIMD operations runs only on a single core (1/8)?
(c) How do I make the compiler perform SIMD vectorization on a single core?
(d) How do I confirm that the remaining 7 cores of the Intel Xeon 5345 processor perform no vectorization/SIMD? (I checked with Intel VTune; earlier the code was also distributed across the remaining cores, which my application does not want.)

I am doing all of this with ICC v11.0 on Linux x86_64 on an Intel Xeon 5345 processor.

~BR

jimdempseyatthecove · ‎04-18-2009

(a),(b) Each core has its own set of SIMD registers, so core-to-core interaction of SIMD is not a concern. Cache locality can be a concern. Most operating systems permit you to pin a thread to one core/processor or to a set of them. Look in your documentation under "thread affinity".

(c) The compiler is not core-specific; use a system call to specify thread affinity.

(d) You would prefer that the remaining 7 cores use vectorization as much as possible. This is because it will reduce the number of memory accesses by those threads (making more memory bus time available for your 1 thread).
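On Linux, the system call mentioned in (c) could be `sched_setaffinity`. The sketch below is a minimal illustration, assuming glibc on Linux; the function names `pin_to_first_allowed_cpu` and `allowed_cpu_count` are hypothetical helpers, not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc needs this for sched_setaffinity and the CPU_* macros */
#endif
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to use.
 * Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}

/* Number of CPUs the calling thread may run on, or -1 on failure.
 * After a successful pin this should be 1, which addresses question (b). */
int allowed_cpu_count(void)
{
    cpu_set_t now;
    if (sched_getaffinity(0, sizeof now, &now) != 0)
        return -1;
    return CPU_COUNT(&now);
}
```

Calling `pin_to_first_allowed_cpu()` at the start of `main` would keep a single-threaded program on one core. It does nothing to keep other processes off that core.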

In the above code you will note the CVTSD2SS and CVTPS2DD etc. These are used to convert double-precision FP to single precision (float) and single precision to double. SIMD works best when all FP values are of the same size. Use either all float or all double. When conversions are required you use more SISD; when conversions are avoided you get SIMD. You want the MD variant.

Also, check your options; SIN and COS are not using the intrinsics (a small part of the problem).

Keep the ASM listing files (named appropriately and with comments), and make timing runs (place the info in comment files).

Check results as you experiment with coding. You will see how the code collapses as you approach the better solutions.

Jim Dempsey

alef_dos · ‎04-26-2009

c) system calls

Can you explain a bit more about how to implement affinity with system calls? (What instructions are needed?)
Thank you

(Sorry about my English)

TimP · ‎04-26-2009

The swp pragma is enabled only for ia64, and is so documented. For Intel compilers for Xeon, the vector pragmas are analogous in some ways. In either architecture, these pragmas don't affect the analysis of data hazards.
If the compiler reports a data hazard where you could assert there is none, you would use C99 restrict (C++ equivalent in icc or gcc) or possibly #pragma ivdep, or write in scalar replacements. Some of us were around before C introduced some of these complications and gave rise to a bunch of academic careers trying to invent optimizing compiler languages without reinventing Fortran.
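A minimal sketch of the C99 restrict approach follows. The function `add_scaled` and its parameters are hypothetical, chosen only to show the qualifier; the promise that the arrays never overlap is the caller's responsibility, and breaking it gives undefined behavior.

```c
#include <stddef.h>

/* restrict asserts that out, a and b refer to non-overlapping storage,
 * so the compiler need not assume a store to out[i] can change a[] or b[]. */
void add_scaled(double *restrict out,
                const double *restrict a,
                const double *restrict b,
                double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + k * b[i];
}
```

Without the qualifiers, a compiler that cannot prove the pointers are distinct may report a possible dependence and decline to vectorize the loop.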
Inserting #pragma distribute_point at the top of the loop to prevent distribution inherently prevents the compiler from using that means to localize a data hazard into a single fragment of the loop. This may be what you want; given that the hazard still prevents full loop optimization, splitting the loop may reduce performance overall. Inserting the pragma in the middle of a loop to force a split does assert that doing so will not violate data hazards. This aspect of distribute_point might benefit from better documentation.
The compiler team has invited suggestions about where unroll_and_jam might be effective. You would have to demonstrate it by writing the suggested optimized form explicitly and showing improved performance. It doesn't look like you have given this serious thought; it's certainly not a one-pragma-fits-all situation.
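As an illustration of what "writing the optimized form explicitly" can mean, here is a hand-written unroll-and-jam of a matrix-vector product. This is a generic textbook example, not the code from this thread; the function names and the row-major layout are assumptions.

```c
#include <stddef.h>

/* Baseline: one row of the row-major matrix a per outer iteration. */
void matvec(const double *a, const double *x, double *y,
            size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        double s = 0.0;
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j] * x[j];
        y[i] = s;
    }
}

/* Outer loop unrolled by 2 and the two inner loops jammed into one,
 * so each x[j] is loaded once and reused for two rows. */
void matvec_uaj2(const double *a, const double *x, double *y,
                 size_t rows, size_t cols)
{
    size_t i = 0;
    for (; i + 1 < rows; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (size_t j = 0; j < cols; j++) {
            double xj = x[j];
            s0 += a[i * cols + j] * xj;
            s1 += a[(i + 1) * cols + j] * xj;
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
    for (; i < rows; i++) {          /* leftover row when rows is odd */
        double s = 0.0;
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j] * x[j];
        y[i] = s;
    }
}
```

Each row is still summed in the same order, so both versions give identical results; only timing runs can show whether the jammed form is faster for a given loop.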
OpenMP definitely would not be a way to optimize performance of a single threaded application, unless the thought process led you to a useful optimization. The autoparallel options sometimes find optimizations which are useful in a single thread, such as improving data locality by loop nest changes.
You haven't expressed any motivation for use of intrinsics. If you are required to vectorize without using an auto-vectorizing compiler, or you have the rare situation which needs parallel instructions but is not vectorizable, you might consider them. If Microsoft hadn't advocated this as the only route to vectorization, they might have kept a lock on the Windows C++ market. Again, the thought process about how to use parallel intrinsics would help you understand how your code should be organized for optimization.
Are you referring to examples which you may have posted somewhere else, or to an attachment which no longer is present?

TimP · ‎04-26-2009

At the lowest application programming level, affinity setting is done by calls to the threading library (Windows threading, or pthreads for linux, in Intel compilers). You don't have visibility into which instructions are used unless you are modifying these libraries.
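For the pthreads route on Linux, a sketch might use the glibc extension `pthread_setaffinity_np`. The helper names `pin_thread_to_cpu` and `thread_is_pinned_to` are hypothetical, and the `_np` suffix marks the underlying calls as non-portable, so this assumes Linux with glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc needs this for the *_np affinity calls and CPU_* macros */
#endif
#include <pthread.h>
#include <sched.h>

/* Restrict thread t to a single CPU.
 * Returns 0 on success, or an error number on failure. */
int pin_thread_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof set, &set);
}

/* Returns 1 if t may run on cpu and nowhere else, 0 if not, -1 on error. */
int thread_is_pinned_to(pthread_t t, int cpu)
{
    cpu_set_t set;
    if (pthread_getaffinity_np(t, sizeof set, &set) != 0)
        return -1;
    return CPU_COUNT(&set) == 1 && CPU_ISSET(cpu, &set);
}
```

A thread can pin itself with `pin_thread_to_cpu(pthread_self(), cpu)`; older glibc versions need the program linked with `-pthread`.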
Most compilers have environment variable settings to invoke affinity calls automatically, such as KMP_AFFINITY for the Intel OpenMP runtime and GOMP_CPU_AFFINITY for GNU libgomp.
http://www.intel.com/software/products/compilers/docs/flin/main_for/mergedprojects/optaps_for/common/optaps_openmp_thread_affinity.htm
Even these settings don't avoid problems caused by multiple applications affinitized to the same core; you must avoid these conflicts yourself, and that is usually impossible internal to an application.
New beta versions of Windows may support similar affinity environment variables in a compiler-independent way. 3rd party utilities claim to do this for certain current production Windows versions; otherwise, you must capture a running thread and set its affinity in Task Manager.
Linux has the taskset and numactl commands.