Intel® ISA Extensions

Random slow downs with AVX2 code.

Anil_M_
Beginner

I wrote a subroutine mostly using AVX2 and AVX compiler intrinsics. I used some SSE instructions too, but I set the Enhanced Instruction Set option to AVX2 in the Visual Studio project settings.

My program has many other routines, and every time I run it, it gives me the same result. But when I profile it, or print out the times taken by different functions, the function I wrote using AVX2 instructions shows weird behavior: in certain runs it slows down by a factor of 10 to 15, whereas the other functions vary by at most a factor of 2.

Can anybody explain this? Any help would be greatly appreciated. For certain reasons I cannot paste my code here.


Thanks, 

Anil Mahmud. 

1 Solution
jimdempseyatthecove
Black Belt

Depending on the CPU, 10x is approximately the difference between L1 cache and L3 cache latency. Is your thread pinned?

Jim Dempsey


31 Replies
andysem
New Contributor III

Could the address where the memory is allocated be the reason for the slowdown? (Page faults, perhaps, but then again Task Manager did not report any page faults during the slowdowns.)

If you're using a multi-socket machine, memory location could be affecting performance if the memory belongs to a "remote" NUMA node. The penalty may appear random, too, as the OS may migrate the memory region between nodes if it decides the memory is needed more on a different node.

AFAIK, memory is migrated at page granularity, so if two threads running on different physical CPUs happen to access a single page (even different regions of it), that page may bounce between nodes unless the OS is able to move the threads to a single node.

 

jimdempseyatthecove
Black Belt

The cache system is multi-level beyond L1, L2, and L3.

By this I mean the cache system also includes the Translation Lookaside Buffer (TLB), which caches page mappings, as well as the cache lines within each page a TLB entry maps. The number of TLB entries is limited and depends on the CPU design (8, 16, 32, 64, ...). For performance-critical programs, you must pay attention to the number of TLB entries required, as well as to whether the data fits within your chosen cache level. The number of TLB entries needed can be reduced by using larger page sizes; additionally, placing all your commonly used references, pointers, and structs into a single page can often reclaim a few TLB entries.

Jim Dempsey

Anil_M_
Beginner

FOUND THE SOLUTION !!!!!!!! 

It is denormal numbers! Every time I got the exact result, I thought there could not be any denormal numbers, but my code was still producing denormal numbers in the lanes of the __m256 variables that I did not use (I was using 3 or 6 of the 8 possible floats). I had forgotten to initialize them, and so they produced denormal numbers from time to time.

The AMD CodeXL profiler was showing that some simple fmadd instructions took a lot of time. When I looked at the assembly I found only register operations, so I thought the profiler was wrong. Actually it was right: it was in those instructions that the denormal values took effect and made things slow.

In the old version of my code, which did not show this behavior, I explicitly put floating-point values in all 8 float lanes of the __m256 variables. Although that old code was slower, its timing did not vary randomly.

I have run my code more than 100 times without any slowdown, not even with scale factors around 2. It is now more consistent than the other functions I wrote earlier.

Thank you all very much for your help; I learned a lot from this discussion. :)
 

areid2
New Contributor I

It's certainly true that calculations on denormal numbers can be ~100x slower! It sounds like you were able to fix it by initializing the unused values. You can also have the processor flush denormals to zero by setting some bits in the MXCSR register using _mm_setcsr/_mm_getcsr. I've found that to be a huge optimization in other situations where denormals can arise. The Intel compiler also has a command-line option to enable this; I think it is part of -O1 and higher optimization levels.

Anil_M_
Beginner

Thanks, areid.

I did try solving the issue by setting the MXCSR register first, but it seemed to carry a slight penalty, and initializing the variables performed best. Still, I will set the register as well, just to be safe.

Alexander_L_1
Beginner

Congratulations, Anil!

Here is some anecdotal experience of my own, from just two days ago:
I rewrote one of my older algorithms and had to compare results, just to be sure the result was still correct.
To do this, I simply allocated memory locally with malloc for one of the versions.
I immediately noticed that the results were not the same. I looked, and looked, and looked at the algorithms...
Then I saw it: the allocated memory was not initialized. So I wrote: if (pt == NULL) ALLOC else INITIALIZE.
Because that didn't help either, two hours later I wrote: if (pt == NULL) { ALLOC; INITIALIZE; }
And two hours after that I finally had it correct: if (pt == NULL) ALLOC; INITIALIZE; ... because the initialization to zero must be done on every run!

Isn't it crazy? This three-fold mistake stole more of my time than writing the new algorithm ;)

I don't know why, but from my long experience as a programmer I can definitely say: uninitialized, partially initialized, or wrongly initialized memory is still bug number one in software development. Probably because the more complicated things are closer to a developer's focus than such simple things.

Alex

jimdempseyatthecove
Black Belt

Good catch. I will have to keep that in mind when a quirk like this shows up. I usually look to protect against a possible divide by zero in the unused lanes, but not against denormals from +, -, or * in those lanes.

Alex, it is a common mistake to assume that malloc (and its kin) will initialize the allocation to zeros. While zeroing may be a security protection in a given system, you may find that it only occurs on first touch of a page in the application's virtual memory. IOW, an allocation encompassing addresses never touched since the start of the process may be zeroed; however, after a node is freed, a subsequent allocation that overlaps regions of the prior node is not necessarily zeroed. Note that the initial zeroing on first touch is O/S dependent, and any zeroing of all allocations depends on the language and/or memory allocator (e.g. CRTL, C#RTL, Fortran, Python, ...). You would be surprised how many posts on this forum relate to uninitialized variables/allocations.

Jim Dempsey

Alexander_L_1
Beginner

Hello Jim,

I didn't assume zeroing of memory; I simply forgot it, and then did it in the wrong place, twice. The whole story was only meant to demonstrate that the error is often much simpler than expected, so simple that it is just not taken into account most of the time.

Therefore I'm not surprised at all :)

I see a big problem when programmers start with a high-level language, because a lot of low-level aspects remain unknown to them. For example, .NET developers may assume that array members are always initialized to their default values; on the other hand, the same developers are often surprised that memory allocation takes a lot of time...

Another example from my real practice: someone (I will not specify who) with a university degree in computer science wrote the following:

for (int x...)
{
    for (int y...)
    {
        // do something with memory at address ptr[y * stride + x]
    }
}

Then this guy spent several months on a lot of micro-optimizations trying to get the required performance. Isn't that crazy?

After this guy was fired, the task was reassigned to me. I swapped the inner and outer loops and got a performance boost of nearly a factor of 8; the whole fix took 2 minutes to find and apply. Sometimes things are much, much simpler.

As an optimization example: I always try to use the datatype with the lowest acceptable precision, to save memory traffic and to process as many variables as possible with one SIMD instruction.

Alex

Jonathon_S_Intel
Employee

How would one run sde64 on a binary that uses a shared library, and then monitor the avx transitions on the shared library?

Alexander_L_1
Beginner

@Jonathon: I highly recommend opening a new topic for your very specific question. By the way, it would be helpful if you could specify what you mean by "monitor": should it only detect whether SSE/AVX transitions are possible? That alone says little, because the penalties are distributed across library boundaries. If one library method uses SSE and another uses AVX, you cannot determine a transition possibility statically; you only stand to gain if SSE and AVX are intermixed inside the same method without complex conditional jumps.

AdyT_Intel
Moderator

There is some documentation on the SDE main page under the section "Intel AVX/SSE transition checker" (and there is a quick link to it).
