Hi All,
I would appreciate your kind help with this basic question.
I found the following nested loop in the paper "Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel":
for (int i = 0; i < LEN2; i++) {
    for (int k = 0; k < LEN2; k++) {
        for (int j = 0; j < LEN2; j++) {
            cc
        }
    }
}
I modified it as:
for (int i = 0; i < LEN2; i++) {
    for (int k = 0; k < LEN2; k++) {
        t = bb
        for (int j = 0; j < LEN2; j++) {
            cc
        }
    }
}
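To make the discussion concrete, here is a minimal self-contained sketch of the two versions; the array subscripts are only illustrative (the attached code has the exact statement):

#define LEN2 256
static float aa[LEN2][LEN2], bb[LEN2][LEN2], cc[LEN2][LEN2];
static float t;   /* file-scope temporary, as in my modified version */

/* Version 1: nesting as quoted from the paper (illustrative subscripts). */
void tensor_kernel_1(void)
{
    for (int i = 0; i < LEN2; i++)
        for (int k = 0; k < LEN2; k++)
            for (int j = 0; j < LEN2; j++)
                cc[i][j] += aa[j][k] * bb[i][k];
}

/* Version 2: the bb element hoisted into t between the k and j loops. */
void tensor_kernel_2(void)
{
    for (int i = 0; i < LEN2; i++)
        for (int k = 0; k < LEN2; k++) {
            t = bb[i][k];
            for (int j = 0; j < LEN2; j++)
                cc[i][j] += aa[j][k] * t;
        }
}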
I used ICC to test these two pieces of code, checking execution time and the number of scalar and vector operations.
- The CPU is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
- The ICC version is icc (ICC) 12.1.5 20120612.
- OS: GNU/Linux
The output is:
Loop             Time(Sec)  Checksum
tensor_kernel    2.25       0.210001
  PAPI_FP_OPS    8
  PAPI_VEC_SP    26369718676
tensor_kernel    20.55      0.210001
  PAPI_FP_OPS    9
  PAPI_VEC_SP    26264735280
The execution times are quite different: one is around 2 seconds, the other around 20 seconds.
On the other hand, the PAPI result shows that the counts of scalar and vector operations are similar. To confirm the PAPI result, I checked the .s code and didn't find any obvious difference. I don't understand what causes this difference in execution time.
The code is attached; it uses the TSVC framework. (By default the PAPI version will also be built; you can remove it from the list.)
Thank you very much for reading this question.
Best Regards,
Susan
If you would give t local scope, e.g.

for (int k = 0; k < LEN2; k++) {
    float t = bb
    ...

the compiler would be freer to optimize. However, the temporary still looks likely to prevent the loop-nesting optimization (performed at least at -O3 -ansi-alias) that produces a unit-stride inner loop. Then possibly you might also get automatic unroll and jam, so that perhaps only every other iteration on k writes back, if the optimized loop nesting didn't already accomplish that.
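By unroll and jam I mean a transformation roughly like the following (a hand sketch, reusing the illustrative subscripts from your post; assumes LEN2 is even):

for (int i = 0; i < LEN2; i++)
    for (int k = 0; k < LEN2; k += 2) {          /* k unrolled by 2 */
        float t0 = bb[i][k];
        float t1 = bb[i][k + 1];
        for (int j = 0; j < LEN2; j++)           /* jammed inner loop: */
            cc[i][j] += aa[j][k]     * t0        /* one write-back of cc */
                      + aa[j][k + 1] * t1;       /* per two k iterations */
    }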
If you compare the opt-report output for the two versions, you may get a textual indication of which optimization is suppressed by giving t scope outside the nested loops.
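For icc 12.x the reports can be requested with something along these lines (flag spellings of that era; they changed in later compiler versions):

    icc -O3 -ansi-alias -vec-report2 -opt-report 2 tsvc.c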
It would be better to write the loops with the optimized nesting yourself before inserting a temporary that might block optimization.
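With the illustrative subscripts from earlier in the thread, the hand-optimized nesting makes the inner loop unit-stride on both aa and bb, and the accumulator takes the role of the temporary:

for (int i = 0; i < LEN2; i++)
    for (int j = 0; j < LEN2; j++) {
        float sum = cc[i][j];
        for (int k = 0; k < LEN2; k++)           /* unit stride on aa and bb */
            sum += aa[j][k] * bb[i][k];
        cc[i][j] = sum;                          /* single write-back of cc */
    }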
I don't know whether PAPI lets you distinguish "simulated gather", where the compiler vectorizes with scalar loads, from stride-1 parallel data access; it would seem not.
Hi Tim,
Thank you very much for the reply.
I am still confused and need your further help. Since the assembly code of the two versions has almost the same instruction sequence, I don't understand why the execution times differ. (BTW, I didn't see any runtime-conditional code paths in the .s files.) PAPI uses hardware counters to collect these numbers at runtime, and the numbers look reasonable to me, since the instruction sequences in the .s files are the same.
I tried the modification you mentioned; the performance shows no obvious improvement.
Thank you again!
Susan
TimP (Intel) wrote:
    If you would give t local scope e.g.
    for(int k = 0; k < LEN2; k++) {
        float t = bb; ...
I would go even further: when I use this kind of nesting, I often write

for (int k = 0; k < LEN2; k++) {
    const float t = bb

to give the compiler a hint that it should optimize, since later modification of t is not possible.
Thank you very much for the further suggestion. I tried const and register; neither of them helps.
From the .s files and the counter results I can tell no obvious difference, but the actual execution times differ a lot, which is a little confusing...
Susan
Sergey,
Thank you very much for your detailed reply. In the code I used __attribute__((aligned(x))) when declaring the global arrays.
I have another question about alignment: I have the impression that icc can generate conditional code that peels off the unaligned part of a loop. Is that true?
BTW, initializing the variables doesn't noticeably help performance in this case.
Susan
Yes, in most cases where a loop is vectorizable but alignment is not assured, the compiler peels some iterations so that the vectorized body begins at an aligned point in an array for which alignment is recognized as useful.
Peeling for alignment is less important with 128-bit parallel instructions on Core i7-2 (Sandy Bridge) or newer, as those CPUs have excellent support for unaligned 128-bit parallel access. So it is possible to see AVX-128 code that doesn't peel for alignment even though it handles misaligned data; gcc chooses that option more frequently than icc does.
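Conceptually, the generated code looks like the sketch below (a hand-written illustration of the idea, not actual icc output; assumes SSE on floats with a 16-byte alignment target):

#include <stdint.h>

/* Conceptual sketch of peeling for alignment. */
void add_scalar(float *a, int n, float s)
{
    int peel = (int)(((16u - ((uintptr_t)a & 15u)) & 15u) / sizeof(float));
    if (peel > n) peel = n;
    int i = 0;
    for (; i < peel; i++)          /* scalar peel until a+i is 16B aligned */
        a[i] += s;
    for (; i + 4 <= n; i += 4) {   /* main body in groups of 4, which the  */
        a[i]     += s;             /* compiler maps to aligned vector      */
        a[i + 1] += s;             /* loads and stores                     */
        a[i + 2] += s;
        a[i + 3] += s;
    }
    for (; i < n; i++)             /* scalar remainder loop */
        a[i] += s;
}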
The matrix operation you pose at the beginning of this thread raises additional alignment issues. For __attribute__((aligned())) to support inner vector loops without peeling for alignment, not only do the loops need to be nested appropriately, but the corresponding extents of the arrays must be multiples of 16 bytes (32 for AVX-256).
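For instance (a sketch; with LEN2 = 256 each row is 1024 bytes, already a multiple of 32, so every row stays aligned):

#define LEN2 256
/* 32-byte base alignment for AVX-256; each 1024-byte row then stays aligned. */
static float aa[LEN2][LEN2] __attribute__((aligned(32)));
static float bb[LEN2][LEN2] __attribute__((aligned(32)));
static float cc[LEN2][LEN2] __attribute__((aligned(32)));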
Tim,
Thank you very much for the information.
"but the corresponding extents of the arrays must be multiples of 16 bytes (32 for AVX-256...)" I used 32 byte aligned allocation, the array size is multiple of 32 bytes. Did I miss anything or misunderstand your point?
Since the instruction related counter shows similar result, I will try to check the counter for L1/L2 by using PAPI later. Beside SIMD, I wish to check whether memory hierache is the reason.
Best Regards,
Susan
Sergey,
Thank you very much.
I got the following results for the L1 and L2 caches (L1 data misses: PAPI_L1_DCM; L1 instruction misses: PAPI_L1_ICM; L2 data misses: PAPI_L2_DCM):
Loop               Time(Sec)  Checksum
tensor_kernel_1    2.22       0.210001
  PAPI_FP_OPS      8
  PAPI_VEC_SP      26351151240
  PAPI_L1_DCM      953799685
  PAPI_L1_ICM      788
  PAPI_L2_DCM      120717240
tensor_kernel_2    19.44      0.210001
  PAPI_FP_OPS      8
  PAPI_VEC_SP      26288592672
  PAPI_L1_DCM      13229545630
  PAPI_L1_ICM      3169
  PAPI_L2_DCM      7702311276
It seems the L1 data misses increased by about 13.8x, which might be the reason. But I don't know why this happens; does it mean that every execution of t=bb[][] causes an L1 miss? I will try to modify this statement and have a look.
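For reference, counters like the ones above can be collected roughly as follows with the low-level PAPI API (a minimal sketch; error checking omitted):

#include <papi.h>
#include <stdio.h>

static long long values[2];
static int eventset = PAPI_NULL;

void counters_start(void)
{
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L1_DCM);   /* L1 data cache misses */
    PAPI_add_event(eventset, PAPI_L2_DCM);   /* L2 data cache misses */
    PAPI_start(eventset);
}

void counters_stop(void)
{
    PAPI_stop(eventset, values);
    printf("PAPI_L1_DCM %lld\nPAPI_L2_DCM %lld\n", values[0], values[1]);
}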
Susan
We've tried to explain how you've defeated compiler optimizations for data locality. It's interesting how your statistics show the effects of backward nested loops.
Thank you all very much for your help and reply.
About the increase in L1/L2 misses, I would still like your opinion on the following.
In the code, LEN2 = 256.
So for the innermost loop (cache line size 64B): LEN2*4B are written/read for the cc array (one row), and LEN2*64B are touched for the aa array (LEN2 cache lines are read in total); in total 256*4 + 256*64 = 17KB. The L1 data cache is 32KB, so I don't understand why the L1/L2 misses still increase so much. I changed LEN2 to 128, and the L1 misses still increase by about 15x; the L2 misses increase as well. If I did anything incorrect, please pardon me and do me a favor to correct me.
BTW, a personal feeling: a normal (non-expert) programmer like me might well think that hoisting t = bb out of the inner loop is a good idea.
I have an additional question: I know icc can recognize a typical matrix-multiply routine; can icc also recognize a typical 4x4 (or 8x8 for AVX) floating-point matrix transpose and use pack/unpack/shuffle instructions for it?
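For comparison, the hand-written SSE idiom I have in mind is quite short; as far as I know xmmintrin.h even provides a macro for the 4x4 case (a sketch, assuming 16-byte-aligned rows):

#include <xmmintrin.h>

/* Transpose a 4x4 float block in place using unpack/shuffle operations. */
void transpose4x4(float a[4][4])
{
    __m128 r0 = _mm_load_ps(a[0]);
    __m128 r1 = _mm_load_ps(a[1]);
    __m128 r2 = _mm_load_ps(a[2]);
    __m128 r3 = _mm_load_ps(a[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* expands to a short shuffle/unpack sequence */
    _mm_store_ps(a[0], r0);
    _mm_store_ps(a[1], r1);
    _mm_store_ps(a[2], r2);
    _mm_store_ps(a[3], r3);
}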
When you force the compiler to follow the bad organization of loop nesting you quoted at the top of this thread, each element of aa[] comes from a different cache line, so you may expect to read the entire array each time through the inner loop. Your papi statistics appear to verify this.
The compiler apparently vectorized the loop using "simulated gather", where elements read separately from different cache lines are packed into a vector register. Thus you execute roughly the same number of vector floating-point operations as in the stride-1 vectorized case, but the parallel reads from aa[] are replaced by scalar reads. You say you looked at the asm code; there you would see the parallel data moves replaced by groups of insertps instructions.
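In intrinsics form, the simulated gather amounts to something like the sketch below (illustrative subscripts as before; the compiler emits insertps sequences rather than _mm_set_ps, but the memory behavior is the same):

#include <xmmintrin.h>

/* Inner loop of the slow version as "simulated gather": four scalar
 * loads from four different rows of aa (four different cache lines)
 * are packed into one vector per iteration, while cc stays stride-1. */
void inner_simulated_gather(float *ccrow, const float (*aa)[256], int k, float t)
{
    __m128 vt = _mm_set1_ps(t);
    for (int j = 0; j < 256; j += 4) {
        __m128 va = _mm_set_ps(aa[j + 3][k], aa[j + 2][k],
                               aa[j + 1][k], aa[j][k]);   /* 4 scalar loads */
        __m128 vc = _mm_loadu_ps(&ccrow[j]);              /* stride-1 load  */
        vc = _mm_add_ps(vc, _mm_mul_ps(va, vt));
        _mm_storeu_ps(&ccrow[j], vc);                     /* stride-1 store */
    }
}

As a rough cross-check of my arithmetic: a 64-byte line holds 16 floats, so one miss per element instead of one miss per line predicts roughly a 16x increase in L1 data misses, and you measured about 13.8x (13229545630 / 953799685). It would also explain why the misses appear even though the nominal 17KB inner-loop footprint fits in L1: with aa rows 1024 bytes apart, the column accesses land on only a few sets of the cache (assuming the usual Sandy Bridge L1 organization of 32KB, 8-way, 64-byte lines, i.e. 64 sets, lines 1024 bytes apart map to just 4 distinct sets), so the lines thrash before they can be reused.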
Sergey, thank you very much for the useful and detailed info. I will try to implement one.
Thank you for the reply, Tim.